Xuancheng Ren

CL
h-index27
60papers
39,591citations
Novelty53%
AI Score58

60 Papers

CLSep 28, 2023Code
Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu et al. · pku, tsinghua

Large language models (LLMs) have revolutionized the field of artificial intelligence, enabling natural language processing tasks that were previously thought to be exclusive to humans. In this work, we introduce Qwen, the first installment of our large language model series. Qwen is a comprehensive language model series that encompasses distinct models with varying parameter counts. It includes Qwen, the base pretrained language models, and Qwen-Chat, the chat models finetuned with human alignment techniques. The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive. The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter. Furthermore, we have developed coding-specialized models, Code-Qwen and Code-Qwen-Chat, as well as mathematics-focused models, Math-Qwen-Chat, which are built upon base language models. These models demonstrate significantly improved performance in comparison with open-source models, and slightly fall behind the proprietary models.

CVDec 8, 2022Code
OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Jinze Bai, Rui Men, Hao Yang et al. · pku

Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys

CVSep 18, 2024Code
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan et al.

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size-with versions at 2B, 8B, and 72B parameters-and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL .

CVDec 19, 2022Code
Transferring General Multimodal Pretrained Models to Text Recognition

Junyang Lin, Xuancheng Ren, Yichang Zhang et al. · pku

This paper proposes a new method, OFA-OCR, to transfer multimodal pretrained models to text recognition. Specifically, we recast text recognition as image captioning and directly transfer a unified vision-language pretrained model to the end task. Without pretraining on large-scale annotated or synthetic text recognition data, OFA-OCR outperforms the baselines and achieves state-of-the-art performance in the Chinese text recognition benchmark. Additionally, we construct an OCR pipeline with OFA-OCR, and we demonstrate that it can achieve competitive performance with the product-level API. The code (https://github.com/OFA-Sys/OFA) and demo (https://modelscope.cn/studios/damo/ofa_ocr_pipeline/summary) are publicly available.

CVOct 19, 2022
Prophet Attention: Predicting Attention with Future Attention for Image Captioning

Fenglin Liu, Xuancheng Ren, Xian Wu et al. · oxford

Recently, attention based models have been used extensively in many sequence-to-sequence learning systems. Especially for image captioning, the attention based models are expected to ground correct image regions with proper generated words. However, for each time step in the decoding process, the attention based models usually use the hidden state of the current input to attend to the image regions. Under this setting, these attention models have a "deviated focus" problem that they calculate the attention weights based on previous words instead of the one to be generated, impairing the performance of both grounding and captioning. In this paper, we propose the Prophet Attention, similar to the form of self-supervision. In the training stage, this module utilizes the future information to calculate the "ideal" attention weights towards image regions. These calculated "ideal" weights are further used to regularize the "deviated" attention. In this manner, image regions are grounded with the correct words. The proposed Prophet Attention can be easily incorporated into existing image captioning models to improve their performance of both grounding and captioning. The experiments on the Flickr30k Entities and the MSCOCO datasets show that the proposed Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. It is worth noticing that we set new state-of-the-arts on the two benchmark datasets and achieve the 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.

CLJul 15, 2024
Qwen2 Technical Report

An Yang, Baosong Yang, Binyuan Hui et al.

This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning. The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach. To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.

CVJun 4, 2022
Delving into the Openness of CLIP

Shuhuai Ren, Lei Li, Xuancheng Ren et al. · pku

Contrastive Language-Image Pre-training (CLIP) formulates image classification as an image-to-text matching task, i.e., matching images to the corresponding natural language descriptions instead of discrete category IDs. This allows for open-vocabulary visual recognition, where the model can recognize images from an open class set (also known as an open vocabulary) in a zero-shot manner. However, evaluating the openness of CLIP-like models is challenging, as the models are open to arbitrary vocabulary in theory, but their accuracy varies in practice. To address this, we resort to an incremental perspective to assess the openness through vocabulary expansions, and define extensibility to measure a model's ability to handle novel classes. Our evaluation shows that CLIP-like models are not truly open, and their performance deteriorates as the vocabulary expands. We further dissect the feature space of CLIP from the perspectives of representation alignment and uniformity. Our investigation reveals that the overestimation of openness is due to confusion among competing text features, rather than a failure to capture the similarity between image features and text features of novel classes. We hope that our investigation and analysis will facilitate future research on the CLIP openness issue.

CLSep 18, 2024
Qwen2.5-Coder Technical Report

Binyuan Hui, Jian Yang, Zeyu Cui et al.

In this report, we introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. This series includes six models: Qwen2.5-Coder-(0.5B/1.5B/3B/7B/14B/32B). As a code-specific model, Qwen2.5-Coder is built upon the Qwen2.5 architecture and continues pretrained on a vast corpus of over 5.5 trillion tokens. Through meticulous data cleaning, scalable synthetic data generation, and balanced data mixing, Qwen2.5-Coder demonstrates impressive code generation capabilities while retaining general and math skills. These models have been evaluated on a wide range of code-related tasks, achieving state-of-the-art (SOTA) performance across more than 10 benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming larger models of the same model size. We believe that the release of the Qwen2.5-Coder series will advance research in code intelligence and, with its permissive licensing, support wider adoption by developers in real-world applications.

CLOct 11, 2022
From Mimicking to Integrating: Knowledge Integration for Pre-Trained Language Models

Lei Li, Yankai Lin, Xuancheng Ren et al. · pku, tencent-ai

Investigating better ways to reuse the released pre-trained language models (PLMs) can significantly reduce the computational cost and the potential environmental side-effects. This paper explores a novel PLM reuse paradigm, Knowledge Integration (KI). Without human annotations available, KI aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. To achieve this, we first derive the correlation between virtual golden supervision and teacher predictions. We then design a Model Uncertainty--aware Knowledge Integration (MUKI) framework to recover the golden supervision for the student. Specifically, MUKI adopts Monte-Carlo Dropout to estimate model uncertainty for the supervision integration. An instance-wise re-weighting mechanism based on the margin of uncertainty scores is further incorporated, to deal with the potential conflicting supervision from teachers. Experimental results demonstrate that MUKI achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKI can generalize well for merging teacher models with heterogeneous architectures, and even teachers major in cross-lingual datasets.

CVOct 28, 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

Fenglin Liu, Xian Wu, Shen Ge et al. · oxford

Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, thus learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models are proposed to learn V-L representations and achieve improved results in many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which is a novel framework that applies separated attention spaces for vision and language, and the representations of multi-modalities can thus be disentangled explicitly. To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large amount of image-sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-train, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts.

CLMar 20, 2022
Hierarchical Inductive Transfer for Continual Dialogue Learning

Shaoxiong Feng, Xuancheng Ren, Kan Li et al.

Pre-trained models have achieved excellent performance on the dialogue task. However, for the continual increase of online chit-chat scenarios, directly fine-tuning these models for each of the new tasks not only explodes the capacity of the dialogue system on the embedded devices but also causes knowledge forgetting on pre-trained models and knowledge interference among diverse dialogue tasks. In this work, we propose a hierarchical inductive transfer framework to learn and deploy the dialogue skills continually and efficiently. First, we introduce the adapter module into pre-trained models for learning new dialogue tasks. As the only trainable module, it is beneficial for the dialogue system on the embedded devices to acquire new dialogue skills with negligible additional parameters. Then, for alleviating knowledge interference between tasks yet benefiting the regularization between them, we further design hierarchical inductive transfer that enables new tasks to use general knowledge in the base adapter without being misled by diverse knowledge in task-specific adapters. Empirical evaluation and analysis indicate that our framework obtains comparable performance under deployment-friendly model capacity.

CLMay 12Code
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

Boyi Deng, Xu Wang, Yaoning Wang et al.

Large language models have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque, limiting our ability to inspect, control, and systematically improve them. This opacity motivates a growing body of research in mechanistic interpretability, with sparse autoencoders (SAEs) emerging as one of the most promising tools for decomposing model activations into sparse, interpretable feature representations. We introduce Qwen-Scope, an open-source suite of SAEs built on the Qwen model family, comprising 14 groups of SAEs across 7 model variants from the Qwen3 and Qwen3.5 series, covering both dense and mixture-of-expert architectures. Built on top of these SAEs, we show that SAEs can go beyond post-hoc analysis to serve as practical interfaces for model development along four directions: (i) inference-time steering, where SAE feature directions control language, concepts, and preferences without modifying model weights; (ii) evaluation analysis, where activated SAE features provide a representation-level proxy for benchmark redundancy and capability coverage; (iii) data-centric workflows, where SAE features support multilingual toxicity classification and safety-oriented data synthesis; and (iv) post-training optimization, where SAE-derived signals are incorporated into supervised fine-tuning and reinforcement learning objectives to mitigate undesirable behaviors such as code-switching and repetition. Together, these results demonstrate that SAEs can serve not only as post-hoc analysis tools, but also as reusable representation-level interfaces for diagnosing, controlling, evaluating, and improving large language models. By open-sourcing Qwen-Scope, we aim to support mechanistic research and accelerate practical workflows that connect model internals to downstream behavior.

CLJan 2, 2025Code
CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

Shanghaoran Quan, Jiaxi Yang, Bowen Yu et al.

With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that effectively test their sophisticated competition-level coding abilities. Existing benchmarks, like LiveCodeBench and USACO, fall short due to the unavailability of private test cases, lack of support for special judges, and misaligned execution environments. To bridge this gap, we introduce CodeElo, a standardized competition-level code generation benchmark that effectively addresses all these challenges for the first time. CodeElo benchmark is mainly based on the official CodeForces platform and tries to align with the platform as much as possible. We compile the recent six months of contest problems on CodeForces with detailed information such as contest divisions, problem difficulty ratings, and problem algorithm tags. We introduce a unique judging method in which problems are submitted directly to the platform and develop a reliable Elo rating calculation system that aligns with the platform and is comparable with human participants but has lower variance. By testing on our CodeElo, we provide the Elo ratings of 30 existing popular open-source and 3 proprietary LLMs for the first time. The results show that o1-mini and QwQ-32B-Preview stand out significantly, achieving Elo ratings of 1578 and 1261, respectively, while other models struggle even with the easiest problems, placing in the lowest 25 percent among all human participants. Detailed analysis experiments are also conducted to provide insights into performance across algorithms and comparisons between using C++ and Python, which can suggest directions for future studies.

CLMay 14, 2025
Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang et al. · tsinghua

In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to 235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models--such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)--and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including tasks in code generation, mathematical reasoning, agent tasks, etc., competitive against larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under Apache 2.0.

CLDec 19, 2024
Qwen2.5 Technical Report

Qwen, An Yang, Baosong Yang et al.

In this report, we introduce Qwen2.5, a comprehensive series of large language models (LLMs) designed to meet diverse needs. Compared to previous iterations, Qwen 2.5 has been significantly improved during both the pre-training and post-training stages. In terms of pre-training, we have scaled the high-quality pre-training datasets from the previous 7 trillion tokens to 18 trillion tokens. This provides a strong foundation for common sense, expert knowledge, and reasoning capabilities. In terms of post-training, we implement intricate supervised finetuning with over 1 million samples, as well as multistage reinforcement learning. Post-training techniques enhance human preference, and notably improve long text generation, structural data analysis, and instruction following. To handle diverse and varied use cases effectively, we present Qwen2.5 LLM series in rich sizes. Open-weight offerings include base and instruction-tuned models, with quantized versions available. In addition, for hosted solutions, the proprietary models currently include two mixture-of-experts (MoE) variants: Qwen2.5-Turbo and Qwen2.5-Plus, both available from Alibaba Cloud Model Studio. Qwen2.5 has demonstrated top-tier performance on a wide range of benchmarks evaluating language understanding, reasoning, mathematics, coding, human preference alignment, etc. Specifically, the open-weight flagship Qwen2.5-72B-Instruct outperforms a number of open and proprietary models and demonstrates competitive performance to the state-of-the-art open-weight model, Llama-3-405B-Instruct, which is around 5 times larger. Qwen2.5-Turbo and Qwen2.5-Plus offer superior cost-effectiveness while performing competitively against GPT-4o-mini and GPT-4o respectively. Additionally, as the foundation, Qwen2.5 models have been instrumental in training specialized models such as Qwen2.5-Math, Qwen2.5-Coder, QwQ, and multimodal models.

AIJan 15
LatentRefusal: Latent-Signal Refusal for Unanswerable Text-to-SQL Queries

Xuancheng Ren, Shijing Hu, Zhihui Lu et al.

In LLM-based text-to-SQL systems, unanswerable and underspecified user queries may generate not only incorrect text but also executable programs that yield misleading results or violate safety constraints, posing a major barrier to safe deployment. Existing refusal strategies for such queries either rely on output-level instruction following, which is brittle due to model hallucinations, or estimate output uncertainty, which adds complexity and overhead. To address this challenge, we formalize safe refusal in text-to-SQL systems as an answerability-gating problem and propose LatentRefusal, a latent-signal refusal mechanism that predicts query answerability from intermediate hidden activations of a large language model. We introduce the Tri-Residual Gated Encoder, a lightweight probing architecture, to suppress schema noise and amplify sparse, localized cues of question-schema mismatch that indicate unanswerability. Extensive empirical evaluations across diverse ambiguous and unanswerable settings, together with ablation studies and interpretability analyses, demonstrate the effectiveness of the proposed approach and show that LatentRefusal provides an attachable and efficient safety layer for text-to-SQL systems. Across four benchmarks, LatentRefusal improves average F1 to 88.5 percent on both backbones while adding approximately 2 milliseconds of probe overhead.

LGOct 13, 2021Code
Well-classified Examples are Underestimated in Classification with Deep Neural Networks

Guangxiang Zhao, Wenkai Yang, Xuancheng Ren et al.

The conventional wisdom behind learning deep classification models is to focus on bad-classified examples and ignore well-classified examples that are far from the decision boundary. For instance, when training with cross-entropy loss, examples with higher likelihoods (i.e., well-classified examples) contribute smaller gradients in back-propagation. However, we theoretically show that this common practice hinders representation learning, energy optimization, and margin growth. To counteract this deficiency, we propose to reward well-classified examples with additive bonuses to revive their contribution to the learning process. This counterexample theoretically addresses these three issues. We empirically support this claim by directly verifying the theoretical results or significant performance improvement with our counterexample on diverse tasks, including image classification, graph classification, and machine translation. Furthermore, this paper shows that we can deal with complex scenarios, such as imbalanced classification, OOD detection, and applications under adversarial attacks because our idea can solve these three issues. Code is available at: https://github.com/lancopku/well-classified-examples-are-underestimated.

CLMar 29, 2021Code
Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models

Wenkai Yang, Lei Li, Zhiyuan Zhang et al.

Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack. Victim models can maintain competitive performance on clean samples while behaving abnormally on samples with a specific trigger word inserted. Previous backdoor attacking methods usually assume that attackers have a certain degree of data knowledge, either the dataset which users would use or proxy datasets for a similar task, for implementing the data poisoning procedure. However, in this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector, with almost no accuracy sacrificed on clean samples. Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier. We hope this work can raise the awareness of such a critical security risk hidden in the embedding layers of NLP models. Our code is available at https://github.com/lancopku/Embedding-Poisoning.

CLDec 25, 2019Code
Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang et al.

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of irrelevant information in the context. To tackle the problem, we propose a novel model called \textbf{Explicit Sparse Transformer}. Explicit Sparse Transformer is able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments. Extensive experimental results on a series of natural language processing and computer vision tasks, including neural machine translation, image captioning, and language modeling, all demonstrate the advantages of Explicit Sparse Transformer in model performance. We also show that our proposed sparse attention method achieves comparable or better results than the previous sparse attention method, but significantly reduces training and testing time. For example, the inference speed is twice that of sparsemax in Transformer model. Code will be available at \url{https://github.com/lancopku/Explicit-Sparse-Transformer}

LGOct 27, 2019Code
An Adaptive and Momental Bound Method for Stochastic Learning

Jianbang Ding, Xuancheng Ren, Ruixuan Luo et al.

Training deep neural networks requires intricate initialization and careful selection of learning rates. The emergence of stochastic gradient optimization methods that use adaptive learning rates based on squared past gradients, e.g., AdaGrad, AdaDelta, and Adam, eases the job slightly. However, such methods have also been proven problematic in recent studies with their own pitfalls including non-convergence issues and so on. Alternative variants have been proposed for enhancement, such as AMSGrad, AdaShift and AdaBound. In this work, we identify a new problem of adaptive learning rate methods that exhibits at the beginning of learning where Adam produces extremely large learning rates that inhibit the start of learning. We propose the Adaptive and Momental Bound (AdaMod) method to restrict the adaptive learning rates with adaptive and momental upper bounds. The dynamic learning rate bounds are based on the exponential moving averages of the adaptive learning rates themselves, which smooth out unexpected large learning rates and stabilize the training of deep neural networks. Our experiments verify that AdaMod eliminates the extremely large learning rates throughout the training and brings significant improvements especially on complex networks such as DenseNet and Transformer, compared to Adam. Our implementation is available at: https://github.com/lancopku/AdaMod

CLMay 15, 2019Code
Aligning Visual Regions and Textual Concepts for Semantic-Grounded Image Representations

Fenglin Liu, Yuanxin Liu, Xuancheng Ren et al.

In vision-and-language grounding problems, fine-grained representations of the image are considered to be of paramount importance. Most of the current systems incorporate visual features and textual concepts as a sketch of an image. However, plainly inferred representations are usually undesirable in that they are composed of separate components, the relations of which are elusive. In this work, we aim at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. To this end, we build the Mutual Iterative Attention (MIA) module, which integrates correlated visual features and textual concepts, respectively, by aligning the two modalities. We evaluate the proposed approach on two representative vision-and-language grounding tasks, i.e., image captioning and visual question answering. In both tasks, the semantic-grounded image representations consistently boost the performance of the baseline models under all metrics across the board. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image-related applications. (The code is available at https://github.com/fenglinliu98/MIA)

CLAug 27, 2018Code
simNet: Stepwise Image-Topic Merging Network for Generating Detailed and Comprehensive Image Captions

Fenglin Liu, Xuancheng Ren, Yuanxin Liu et al.

The encode-decoder framework has shown recent success in image captioning. Visual attention, which is good at detailedness, and semantic attention, which is good at comprehensiveness, have been separately proposed to ground the caption on the image. In this paper, we propose the Stepwise Image-Topic Merging Network (simNet) that makes use of the two kinds of attention at the same time. At each time step when generating the caption, the decoder adaptively merges the attentive information in the extracted topics and the image according to the generated context, so that the visual information and the semantic information can be effectively combined. The proposed approach is evaluated on two benchmark datasets and reaches the state-of-the-art performances.(The code is available at https://github.com/lancopku/simNet)

CLAug 21, 2018Code
A Skeleton-Based Model for Promoting Coherence Among Sentences in Narrative Story Generation

Jingjing Xu, Xuancheng Ren, Yi Zhang et al.

Narrative story generation is a challenging problem because it demands the generated sentences with tight semantic connections, which has not been well studied by most existing generative models. To address this problem, we propose a skeleton-based model to promote the coherence of generated stories. Different from traditional models that generate a complete sentence at a stroke, the proposed model first generates the most critical phrases, called skeleton, and then expands the skeleton to a complete and fluent sentence. The skeleton is not manually defined, but learned by a reinforcement learning method. Compared to the state-of-the-art models, our skeleton-based model can generate significantly more coherent text according to human evaluation and automatic evaluation. The G-score is improved by 20.1% in the human evaluation. The code is available at https://github.com/lancopku/Skeleton-Based-Generation-Model

CLFeb 5, 2018Code
DP-GAN: Diversity-Promoting Generative Adversarial Network for Generating Informative and Diversified Text

Jingjing Xu, Xuancheng Ren, Junyang Lin et al.

Existing text generation methods tend to produce repeated and "boring" expressions. To tackle this problem, we propose a new text generation model, called Diversity-Promoting Generative Adversarial Network (DP-GAN). The proposed model assigns low reward for repeatedly generated text and high reward for "novel" and fluent text, encouraging the generator to produce diverse and informative text. Moreover, we propose a novel language-model based discriminator, which can better distinguish novel text from repeated text without the saturation problem compared with existing classifier-based discriminators. The experimental results on review generation and dialogue generation tasks demonstrate that our model can generate substantially more diverse and informative text than existing baselines. The code is available at https://github.com/lancopku/DPGAN

CLNov 28, 2017Code
Hybrid Oracle: Making Use of Ambiguity in Transition-based Chinese Dependency Parsing

Xuancheng Ren, Xu Sun

In the training of transition-based dependency parsers, an oracle is used to predict a transition sequence for a sentence and its gold tree. However, the transition system may exhibit ambiguity, that is, there can be multiple correct transition sequences that form the gold tree. We propose to make use of the property in the training of neural dependency parsers, and present the Hybrid Oracle. The new oracle gives all the correct transitions for a parsing state, which are used in the cross entropy loss function to provide better supervisory signal. It is also used to generate different transition sequences for a sentence to better explore the training data and improve the generalization ability of the parser. Evaluations show that the parsers trained using the hybrid oracle outperform the parsers using the traditional oracle in Chinese dependency parsing. We provide analysis from a linguistic view. The code is available at https://github.com/lancopku/nndep .

LGNov 17, 2017Code
Training Simplification and Model Simplification for Deep Learning: A Minimal Effort Back Propagation Method

Xu Sun, Xuancheng Ren, Shuming Ma et al.

We propose a simple yet effective technique to simplify the training and the resulting model of neural networks. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction in the computational cost. Based on the sparsified gradients, we further simplify the model by eliminating the rows or columns that are seldom updated, which will reduce the computational cost both in the training and decoding, and potentially accelerate decoding in real-world applications. Surprisingly, experimental results demonstrate that most of time we only need to update fewer than 5% of the weights at each back propagation pass. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The model simplification results show that we could adaptively simplify the model which could often be reduced by around 9x, without any loss on accuracy or even with improved accuracy. The codes, including the extension, are available at https://github.com/lancopku/meSimp

LGOct 28, 2017Code
Label Embedding Network: Learning Label Representation for Soft Training of Deep Networks

Xu Sun, Bingzhen Wei, Xuancheng Ren et al.

We propose a method, called Label Embedding Network, which can learn label representation (label embedding) during the training process of deep networks. With the proposed method, the label embedding is adaptively and automatically learned through back propagation. The original one-hot represented loss function is converted into a new loss function with soft distributions, such that the originally unrelated labels have continuous interactions with each other during the training process. As a result, the trained model can achieve substantially higher accuracy and with faster convergence speed. Experimental results based on competitive tasks demonstrate the effectiveness of the proposed method, and the learned label embedding is reasonable and interpretable. The proposed method achieves comparable or even better results than the state-of-the-art systems. The source code is available at \url{https://github.com/lancopku/LabelEmb}.

LGJun 19, 2017Code
meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting

Xu Sun, Xuancheng Ren, Shuming Ma et al.

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1-4% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given. The code is available at https://github.com/lancopku/meProp

LGMar 29, 2015Code
Towards Easier and Faster Sequence Labeling for Natural Language Processing: A Search-based Probabilistic Online Learning Framework (SAPO)

Xu Sun, Shuming Ma, Yi Zhang et al.

There are two major approaches for sequence labeling. One is the probabilistic gradient-based methods such as conditional random fields (CRF) and neural networks (e.g., RNN), which have high accuracy but drawbacks: slow training, and no support of search-based optimization (which is important in many cases). The other is the search-based learning methods such as structured perceptron and margin infused relaxed algorithm (MIRA), which have fast training but also drawbacks: low accuracy, no probabilistic information, and non-convergence in real-world tasks. We propose a novel and "easy" solution, a search-based probabilistic online learning method, to address most of those issues. The method is "easy", because the optimization algorithm at the training stage is as simple as the decoding algorithm at the test stage. This method searches the output candidates, derives probabilities, and conducts efficient online learning. We show that this method with fast training and theoretical guarantee of convergence, which is easy to implement, can support search-based optimization and obtain top accuracy. Experiments on well-known tasks show that our method has better accuracy than CRF and BiLSTM\footnote{The SAPO code is released at \url{https://github.com/lancopku/SAPO}.}.

CVNov 26, 2025
Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen et al.

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

CLDec 14, 2021
Model Uncertainty-Aware Knowledge Amalgamation for Pre-Trained Language Models

Lei Li, Yankai Lin, Xuancheng Ren et al.

As many fine-tuned pre-trained language models~(PLMs) with promising performance are generously released, investigating better ways to reuse these models is vital as it can greatly reduce the retraining computational cost and the potential environmental side-effects. In this paper, we explore a novel model reuse paradigm, Knowledge Amalgamation~(KA) for PLMs. Without human annotations available, KA aims to merge the knowledge from different teacher-PLMs, each of which specializes in a different classification problem, into a versatile student model. The achieve this, we design a Model Uncertainty--aware Knowledge Amalgamation~(MUKA) framework, which identifies the potential adequate teacher using Monte-Carlo Dropout for approximating the golden supervision to guide the student. Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKA can generalize well under several complicate settings with multiple teacher models, heterogeneous teachers, and even cross-dataset teachers.

LGOct 8, 2021
Topology-Imbalance Learning for Semi-Supervised Node Classification

Deli Chen, Yankai Lin, Guangxiang Zhao et al.

The class imbalance problem, as an important issue in learning node representations, has drawn increasing attention from the community. Although the imbalance considered by existing studies roots from the unequal quantity of labeled examples in different classes (quantity imbalance), we argue that graph data expose a unique source of imbalance from the asymmetric topological properties of the labeled nodes, i.e., labeled nodes are not equal in terms of their structural role in the graph (topology imbalance). In this work, we first probe the previously unknown topology-imbalance issue, including its characteristics, causes, and threats to semi-supervised node classification learning. We then provide a unified view to jointly analyzing the quantity- and topology- imbalance issues by considering the node influence shift phenomenon with the Label Propagation algorithm. In light of our analysis, we devise an influence conflict detection -- based metric Totoro to measure the degree of graph topology imbalance and propose a model-agnostic method ReNode to address the topology-imbalance issue by re-weighting the influence of labeled nodes adaptively based on their relative positions to class boundaries. Systematic experiments demonstrate the effectiveness and generalizability of our method in relieving topology-imbalance issue and promoting semi-supervised node classification. The further analysis unveils varied sensitivity of different graph neural networks (GNNs) to topology imbalance, which may serve as a new perspective in evaluating GNN architectures.

LGSep 7, 2021
Adversarial Parameter Defense by Multi-Step Risk Minimization

Zhiyuan Zhang, Ruixuan Luo, Xuancheng Ren et al.

Previous studies demonstrate DNNs' vulnerability to adversarial examples and adversarial training can establish a defense to adversarial examples. In addition, recent studies show that deep neural networks also exhibit vulnerability to parameter corruptions. The vulnerability of model parameters is of crucial value to the study of model robustness and generalization. In this work, we introduce the concept of parameter corruption and propose to leverage the loss change indicators for measuring the flatness of the loss basin and the parameter robustness of neural network parameters. On such basis, we analyze parameter corruptions and propose the multi-step adversarial corruption algorithm. To enhance neural networks, we propose the adversarial parameter defense algorithm that minimizes the average risk of multiple adversarial parameter corruptions. Experimental results show that the proposed algorithm can improve both the parameter robustness and accuracy of neural networks.

CLAug 5, 2021
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Fenglin Liu, Xuancheng Ren, Xian Wu et al.

Video captioning combines video understanding and language generation. Different from image captioning that describes a static image with details of almost every object, video captioning usually considers a sequence of frames and biases towards focused objects, e.g., the objects that stay in focus regardless of the changing background. Therefore, detecting and properly accommodating focused objects is critical in video captioning. To enforce the description of focused objects and achieve controllable video captioning, we propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption. Since the focused objects are generated and located ahead of other words, it is difficult to apply the word-by-word autoregressive generation process; instead, we adopt a non-autoregressive approach. The experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate the effectiveness of O2NA, which achieves results competitive with the state-of-the-arts but with both higher diversity and higher inference speed.

LGMay 15, 2021
Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang et al.

Skip connection, is a widely-used technique to improve the performance and the convergence of deep neural networks, which is believed to relieve the difficulty in optimization due to non-linearity by propagating a linear component through the neural network layers. However, from another point of view, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value one. In this work, we investigate how the scale factors in the effectiveness of the skip connection and reveal that a trivial adjustment of the scale will lead to spurious gradient exploding or vanishing in line with the deepness of the models, which could be addressed by normalization, in particular, layer normalization, which induces consistent improvements over the plain skip connection. Inspired by the findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which promotes the performance substantially and generalizes well across diverse tasks including both machine translation and image classification datasets.

CLFeb 22, 2021
Multi-View Feature Representation for Dialogue Generation with Bidirectional Distillation

Shaoxiong Feng, Xuancheng Ren, Kan Li et al.

Neural dialogue models suffer from low-quality responses when interacted in practice, demonstrating difficulty in generalization beyond training data. Recently, knowledge distillation has been used to successfully regularize the student by transferring knowledge from the teacher. However, the teacher and the student are trained on the same dataset and tend to learn similar feature representations, whereas the most general knowledge should be found through differences. The finding of general knowledge is further hindered by the unidirectional distillation, as the student should obey the teacher and may discard some knowledge that is truly general but refuted by the teacher. To this end, we propose a novel training framework, where the learning of general knowledge is more in line with the idea of reaching consensus, i.e., finding common knowledge that is beneficial to different yet all datasets through diversified learning partners. Concretely, the training task is divided into a group of subtasks with the same number of students. Each student assigned to one subtask not only is optimized on the allocated subtask but also imitates multi-view feature representation aggregated from other students (i.e., student peers), which induces students to capture common knowledge among different subtasks and alleviates the over-fitting of students on the allocated subtasks. To further enhance generalization, we extend the unidirectional distillation to the bidirectional distillation that encourages the student and its student peers to co-evolve by exchanging complementary knowledge with each other. Empirical results and analysis demonstrate that our training framework effectively improves the model generalization without sacrificing training efficiency.

LGDec 14, 2020
Rethinking the Promotion Brought by Contrastive Learning to Semi-Supervised Node Classification

Deli Chen, Yankai Lin, Lei Li et al.

Graph Contrastive Learning (GCL) has proven highly effective in promoting the performance of Semi-Supervised Node Classification (SSNC). However, existing GCL methods are generally transferred from other fields like CV or NLP, whose underlying working mechanism remains under-explored. In this work, we first deeply probe the working mechanism of GCL in SSNC, and find that the promotion brought by GCL is severely unevenly distributed: the improvement mainly comes from subgraphs with less annotated information, which is fundamentally different from contrastive learning in other fields. However, existing GCL methods generally ignore this uneven distribution of annotated information and apply GCL evenly to the whole graph. To remedy this issue and further improve GCL in SSNC, we propose the Topology InFormation gain-Aware Graph Contrastive Learning (TIFA-GCL) framework that considers the annotated information distribution across graph in GCL. Extensive experiments on six benchmark graph datasets, including the enormous OGB-Products graph, show that TIFA-GCL can bring a larger improvement than existing GCL methods in both transductive and inductive settings. Further experiments demonstrate the generalizability and interpretability of TIFA-GCL.

CLOct 13, 2020
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations

Fuli Luo, Pengcheng Yang, Shicheng Li et al.

Pre-trained self-supervised models such as BERT have achieved striking success in learning sequence representations, especially for natural language processing. These models typically corrupt the given sequences with certain types of noise, such as masking, shuffling, or substitution, and then try to recover the original input. However, such pre-training approaches are prone to learning representations that are covariant with the noise, leading to the discrepancy between the pre-training and fine-tuning stage. To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise invariant sequence representations. The proposed CAPT encourages the consistency between representations of the original sequence and its corrupted version via unsupervised instance-wise training signals. In this way, it not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing global semantics of the input via more effective sentence-level supervision. Different from most prior work that focuses on a particular modality, comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable for both language and vision-language tasks, and obtains surprisingly consistent improvement, including 0.6\% absolute gain on GLUE benchmarks and 0.8\% absolute increment on $\text{NLVR}^2$.

CLOct 5, 2020
Regularizing Dialogue Generation by Imitating Implicit Scenarios

Shaoxiong Feng, Xuancheng Ren, Hongshen Chen et al.

Human dialogues are scenario-based and appropriate responses generally relate to the latent context knowledge entailed by the specific scenario. To enable responses that are more meaningful and context-specific, we propose to improve generative dialogue systems from the scenario perspective, where both dialogue history and future conversation are taken into account to implicitly reconstruct the scenario knowledge. More importantly, the conversation scenarios are further internalized using imitation learning framework, where the conventional dialogue model that has no access to future conversations is effectively regularized by transferring the scenario knowledge contained in hierarchical supervising signals from the scenario-based dialogue model, so that the future conversation is not required in actual inference. Extensive evaluations show that our approach significantly outperforms state-of-the-art baselines on diversity and relevance, and expresses scenario-specific knowledge.

LGSep 16, 2020
Collaborative Group Learning

Shaoxiong Feng, Hongshen Chen, Xuancheng Ren et al.

Collaborative learning has successfully applied knowledge transfer to guide a pool of small student networks towards robust local minima. However, previous approaches typically struggle with drastically aggravated student homogenization when the number of students rises. In this paper, we propose Collaborative Group Learning, an efficient framework that aims to diversify the feature representation and conduct an effective regularization. Intuitively, similar to the human group study mechanism, we induce students to learn and exchange different parts of course knowledge as collaborative groups. First, each student is established by randomly routing on a modular neural network, which facilitates flexible knowledge communication between students due to random levels of representation sharing and branching. Second, to resist the student homogenization, students first compose diverse feature sets by exploiting the inductive bias from sub-sets of training data, and then aggregate and distill different complementary knowledge by imitating a random sub-group of students at each time step. Overall, the above mechanisms are beneficial for maximizing the student population to further improve the model generalization without sacrificing computational efficiency. Empirical evaluations on both image and text tasks indicate that our method significantly outperforms various state-of-the-art collaborative approaches whilst enhancing computational efficiency.

LGJun 10, 2020
Exploring the Vulnerability of Deep Neural Networks: A Study of Parameter Corruption

Xu Sun, Zhiyuan Zhang, Xuancheng Ren et al.

We argue that the vulnerability of model parameters is of crucial value to the study of model robustness and generalization but little research has been devoted to understanding this matter. In this work, we propose an indicator to measure the robustness of neural network parameters by exploiting their vulnerability via parameter corruption. The proposed indicator describes the maximum loss variation in the non-trivial worst-case scenario under parameter corruption. For practical purposes, we give a gradient-based estimation, which is far more effective than random corruption trials that can hardly induce the worst accuracy degradation. Equipped with theoretical support and empirical validation, we are able to systematically investigate the robustness of different model parameters and reveal vulnerability of deep neural networks that has been rarely paid attention to before. Moreover, we can enhance the models accordingly with the proposed adversarial corruption-resistant training, which not only improves the parameter robustness but also translates into accuracy elevation.

CLMay 16, 2020
Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

Fenglin Liu, Xuancheng Ren, Guangxiang Zhao et al.

In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed to use representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise multi-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences. Systematic experiments and analyses show that we successfully address the hierarchy bypassing problem, require almost negligible parameter increase, and substantially improve the performance of sequence-to-sequence learning with deep representations on five diverse tasks, i.e., machine translation, abstractive summarization, image captioning, video captioning, medical report generation, and paraphrase generation. In particular, our approach achieves new state-of-the-art results on ten benchmark datasets, including a low-resource machine translation dataset and two low-resource medical report generation datasets.

CVFeb 28, 2020
Exploring and Distilling Cross-Modal Information for Image Captioning

Fenglin Liu, Xuancheng Ren, Yuanxin Liu et al.

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet there is still great difficulty for the current methods to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. Based on the Transformer, to perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach that explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts the fine-grained regions and attributes in reference to the aspect vector for word selection. Our Transformer-based model achieves a CIDEr score of 129.3 in offline COCO evaluation on the COCO testing set with remarkable efficiency in terms of accuracy, speed, and parameter budget.

CLJun 27, 2019
PKUSEG: A Toolkit for Multi-Domain Chinese Word Segmentation

Ruixuan Luo, Jingjing Xu, Yi Zhang et al.

Chinese word segmentation (CWS) is a fundamental step of Chinese natural language processing. In this paper, we build a new toolkit, named PKUSEG, for multi-domain word segmentation. Unlike existing single-model toolkits, PKUSEG targets multi-domain word segmentation and provides separate models for different domains, such as web, medicine, and tourism. Besides, due to the lack of labeled data in many domains, we propose a domain adaptation paradigm to introduce cross-domain semantic knowledge via a translation system. Through this method, we generate synthetic data using a large amount of unlabeled data in the target domain and then obtain a word segmentation model for the target domain. We also further refine the performance of the default model with the help of synthetic data. Experiments show that PKUSEG achieves high performance on multiple domains. The new toolkit also supports POS tagging and model training to adapt to various application scenarios. The toolkit is now freely and publicly available for the usage of research and industry.

CLJun 5, 2019
A Hierarchical Reinforced Sequence Operation Method for Unsupervised Text Style Transfer

Chen Wu, Xuancheng Ren, Fuli Luo et al.

Unsupervised text style transfer aims to alter text styles while preserving the content, without aligned data for supervision. Existing seq2seq methods face three challenges: 1) the transfer is weakly interpretable, 2) generated outputs struggle in content preservation, and 3) the trade-off between content and style is intractable. To address these challenges, we propose a hierarchical reinforced sequence operation method, named Point-Then-Operate (PTO), which consists of a high-level agent that proposes operation positions and a low-level agent that alters the sentence. We provide comprehensive training objectives to control the fluency, style, and content of the outputs and a mask-based inference algorithm that allows for multi-step revision based on the single-step trained agents. Experimental results on two text style transfer datasets show that our method significantly outperforms recent methods and effectively addresses the aforementioned challenges.

LGMay 24, 2019
Memorized Sparse Backpropagation

Zhiyuan Zhang, Pengcheng Yang, Xuancheng Ren et al.

Neural network learning is usually time-consuming since backpropagation needs to compute full gradients and backpropagate them across multiple layers. Despite its success of existing works in accelerating propagation through sparseness, the relevant theoretical characteristics remain under-researched and empirical studies found that they suffer from the loss of information contained in unpropagated gradients. To tackle these problems, this paper presents a unified sparse backpropagation framework and provides a detailed analysis of its theoretical characteristics. Analysis reveals that when applied to a multilayer perceptron, our framework essentially performs gradient descent using an estimated gradient similar enough to the true gradient, resulting in convergence in probability under certain conditions. Furthermore, a simple yet effective algorithm named memorized sparse backpropagation (MSBP) is proposed to remedy the problem of information loss by storing unpropagated gradients in memory for learning in the next steps. Experimental results demonstrate that the proposed MSBP is effective to alleviate the information loss in traditional sparse backpropagation while achieving comparable acceleration.

CLSep 11, 2018
Evaluating Semantic Rationality of a Sentence: A Sememe-Word-Matching Neural Network based on HowNet

Shu Liu, Jingjing Xu, Xuancheng Ren et al.

Automatic evaluation of semantic rationality is an important yet challenging task, and current automatic techniques cannot well identify whether a sentence is semantically rational. The methods based on the language model do not measure the sentence by rationality but by commonness. The methods based on the similarity with human written sentences will fail if human-written references are not available. In this paper, we propose a novel model called Sememe-Word-Matching Neural Network (SWM-NN) to tackle semantic rationality evaluation by taking advantage of sememe knowledge base HowNet. The advantage is that our model can utilize a proper combination of sememes to represent the fine-grained semantic meanings of a word within the specific contexts. We use the fine-grained semantic representation to help the model learn the semantic dependency among words. To evaluate the effectiveness of the proposed model, we build a large-scale rationality evaluation dataset. Experimental results on this dataset show that the proposed model outperforms the competitive baselines with a 5.4\% improvement in accuracy.

CLAug 23, 2018
Review-Driven Multi-Label Music Style Classification by Exploiting Style Correlations

Guangxiang Zhao, Jingjing Xu, Qi Zeng et al.

This paper explores a new natural language processing task, review-driven multi-label music style classification. This task requires the system to identify multiple styles of music based on its reviews on websites. The biggest challenge lies in the complicated relations of music styles. It has brought failure to many multi-label classification methods. To tackle this problem, we propose a novel deep learning approach to automatically learn and exploit style correlations. The proposed method consists of two parts: a label-graph based neural network, and a soft training mechanism with correlation-based continuous label representation. Experimental results show that our approach achieves large improvements over the baselines on the proposed dataset. Especially, the micro F1 is improved from 53.9 to 64.5, and the one-error is reduced from 30.5 to 22.6. Furthermore, the visualized analysis shows that our approach performs well in capturing style correlations.

CLAug 22, 2018
Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation

Junyang Lin, Xu Sun, Xuancheng Ren et al.

Most of the Neural Machine Translation (NMT) models are based on the sequence-to-sequence (Seq2Seq) model with an encoder-decoder framework equipped with the attention mechanism. However, the conventional attention mechanism treats the decoding at each time step equally with the same matrix, which is problematic since the softness of the attention for different types of words (e.g. content words and function words) should differ. Therefore, we propose a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an attention temperature. Experimental results on the Chinese-English translation and English-Vietnamese translation demonstrate that our model outperforms the baseline models, and the analysis and the case study show that our model can attend to the most relevant elements in the source-side contexts and generate the translation of high quality.

CLAug 16, 2018
Sememe Prediction: Learning Semantic Knowledge from Unstructured Textual Wiki Descriptions

Wei Li, Xuancheng Ren, Damai Dai et al.

Huge numbers of new words emerge every day, leading to a great need for representing them with semantic meaning that is understandable to NLP systems. Sememes are defined as the minimum semantic units of human languages, the combination of which can represent the meaning of a word. Manual construction of sememe based knowledge bases is time-consuming and labor-intensive. Fortunately, communities are devoted to composing the descriptions of words in the wiki websites. In this paper, we explore to automatically predict lexical sememes based on the descriptions of the words in the wiki websites. We view this problem as a weakly ordered multi-label task and propose a Label Distributed seq2seq model (LD-seq2seq) with a novel soft loss function to solve the problem. In the experiments, we take a real-world sememe knowledge base HowNet and the corresponding descriptions of the words in Baidu Wiki for training and evaluation. The results show that our LD-seq2seq model not only beats all the baselines significantly on the test set, but also outperforms amateur human annotators in a random subset of the test set.