Kun Xu

h-index24

89papers

18,454citations

Novelty56%

AI Score62

Ranked #3,031 of 201,326 authors (top 2%)#850 in CL (top 3%)

89 Papers

CVSep 9, 2023Code

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Yang Jin, Kun Xu, Kun Xu et al. · pku

Recently, the remarkable advance of the Large Language Model (LLM) has inspired researchers to transfer its extraordinary reasoning capability to both vision and language data. However, the prevailing approaches primarily regard the visual input as a prompt and focus exclusively on optimizing the text generation process conditioned upon vision content by a frozen LLM. Such an inequitable treatment of vision and language heavily constrains the model's potential. In this paper, we break through this limitation by representing both vision and language in a unified form. Specifically, we introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language that LLM can read. The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image. Coped with this tokenizer, the presented foundation model called LaVIT can handle both image and text indiscriminately under the same generative learning paradigm. This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously. Extensive experiments further showcase that it outperforms the existing models by a large margin on massive vision-language tasks. Our code and models are available at https://github.com/jy0205/LaVIT.

CVJul 17, 2022Code

Neural Color Operators for Sequential Image Retouching

Yili Wang, Xin Li, Kun Xu et al.

We propose a novel image retouching method by modeling the retouching process as performing a sequence of newly introduced trainable neural color operators. The neural color operator mimics the behavior of traditional color operators and learns pixelwise color transformation while its strength is controlled by a scalar. To reflect the homomorphism property of color operators, we employ equivariant mapping and adopt an encoder-decoder structure which maps the non-linear color transformation to a much simpler transformation (i.e., translation) in a high dimensional space. The scalar strength of each neural color operator is predicted using CNN based strength predictors by analyzing global image statistics. Overall, our method is rather lightweight and offers flexible controls. Experiments and user studies on public datasets show that our method consistently achieves the best results compared with SOTA methods in both quantitative measures and visual qualities. The code and pretrained models are provided at https://github.com/amberwangyili/neurop

CLOct 22, 2022

Learning a Grammar Inducer from Massive Uncurated Instructional Videos

Songyang Zhang, Linfeng Song, Lifeng Jin et al. · tencent-ai

Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text. While previous work focuses on building systems for inducing grammars on text that are well-aligned with video content, we investigate the scenario, in which text and video are only in loose correspondence. Such data can be found in abundance online, and the weak correspondence is similar to the indeterminacy problem studied in language acquisition. Furthermore, we build a new model that can better learn video-span correlation without manually designed features adopted by previous work. Experiments show that our model trained only on large-scale YouTube data with no text-video alignment reports strong and robust performances across three unseen datasets, despite domain shift and noisy label issues. Furthermore our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.

NAJan 10, 2016

A Unified Gas-kinetic Scheme for Continuum and Rarefied Flows IV: full Boltzmann and Model Equations

Chang Liu, Kun Xu, Quanhua Sun et al.

Fluid dynamic equations are valid in their respective modeling scales. With a variation of the modeling scales, theoretically there should have a continuous spectrum of fluid dynamic equations. In order to study multiscale flow evolution efficiently, the dynamics in the computational fluid has to be changed with the scales. A direct modeling of flow physics with a changeable scale may become an appropriate approach. The unified gas-kinetic scheme (UGKS) is a direct modeling method in the mesh size scale, and its underlying flow physics depends on the resolution of the cell size relative to the particle mean free path. The cell size of UGKS is not limited by the particle mean free path. With the variation of the ratio between the numerical cell size and local particle mean free path, the UGKS recovers the flow dynamics from the particle transport and collision in the kinetic scale to the wave propagation in the hydrodynamic scale. The previous UGKS is mostly constructed from the evolution solution of kinetic model equations. This work is about the further development of the UGKS with the implementation of the full Boltzmann collision term in the region where it is needed. The central ingredient of the UGKS is the coupled treatment of particle transport and collision in the flux evaluation across a cell interface, where a continuous flow dynamics from kinetic to hydrodynamic scales is modeled. The newly developed UGKS has the asymptotic preserving (AP) property of recovering the NS solutions in the continuum flow regime, and the full Boltzmann solution in the rarefied regime. In the mostly unexplored transition regime, the UGKS itself provides a valuable tool for the flow study in this regime. The mathematical properties of the scheme, such as stability, accuracy, and the asymptotic preserving, will be analyzed in this paper as well.

COMP-PHFeb 22, 2017

A Unified Gas Kinetic Scheme for Multi-scale and Multi-component Plasma Transport

Chang Liu, Kun Xu

A unified gas kinetic scheme (UGKS) for multi-scale and multi-component plasma transport is constructed. The current scheme is a direct modeling method, where the time evolution solutions from the Vlasov-BGK equations of electron and ion and the Maxwell equations are used to construct a scale-dependent plasma simulation model. As a result, with the changing of modeling scales of mesh size and time step and with a variation of Knudsen number and Larmor radius, the discretized governing equations for a wide range of plasma evolution regimes can be obtained. The physics recovered in UGKS ranges from the Vlasov equation in the kinetic scale to different-type magneto-hydrodynamic (MHD) equations in the hydrodynamic scale. The key dynamics in UGKS is the un-splitting treatment of particle collision, acceleration, and transport in the construction of numerical flux across a cell interface. At the same time, the plasma evolution is coupled with the Maxwell equations in an implicit way, which automatically provides a smooth transition between the Ampere's law and the Ohm's law for the calculation of electric field. The time step of UGKS is not limited by the relaxation time, the cyclotron period, and the speed of light in the MHD regime. The UGKS is validated by numerical test cases, such as the Landau damping and two stream instability in the kinetic regime, and the Brio-Wu shock tube problem and the Orszag- Tang MHD turbulence problem in the hydrodynamic regime. The scheme is also used to study the geospace environment modeling (GEM), such as the challenging magnetic reconnection problem in the transition regime. Overall, the UGKS is a physically reliable multi-scale plasma simulation method. It provides a powerful and unified approach for the study of plasma physics.

CLNov 13, 2023Code

A Step Closer to Comprehensive Answers: Constrained Multi-Stage Question Decomposition with Large Language Models

Hejing Cao, Zhenwei An, Jiazhan Feng et al. · pku

While large language models exhibit remarkable performance in the Question Answering task, they are susceptible to hallucinations. Challenges arise when these models grapple with understanding multi-hop relations in complex questions or lack the necessary knowledge for a comprehensive response. To address this issue, we introduce the "Decompose-and-Query" framework (D&Q). This framework guides the model to think and utilize external knowledge similar to ReAct, while also restricting its thinking to reliable information, effectively mitigating the risk of hallucinations. Experiments confirm the effectiveness of D&Q: On our ChitChatQA dataset, D&Q does not lose to ChatGPT in 67% of cases; on the HotPotQA question-only setting, D&Q achieved an F1 score of 59.6%. Our code is available at https://github.com/alkaidpku/DQ-ToolQA.

NAFeb 18, 2016

An Efficient and Accurate Two-Stage Fourth-order Gas-kinetic Scheme for the Navier-Stokes Equations

Liang Pan, Kun Xu, Qibing Li et al.

For computational fluid dynamics (CFD), the generalized Riemann problem (GRP) solver and the gas-kinetic kinetic scheme (GKS) provide a time-accurate flux function starting from a discontinuous piecewise linear flow distributions around each cell interface. With the use of time derivative of the flux function, a two-stage Lax-Wendroff-type (L-W for short) time stepping method has been recently proposed in the design of a fourth-order time accurate method [18]. In this paper, based on the same time-stepping method and the second-order GKS flux function [34], a fourth-order gas-kinetic scheme is constructed for the Euler and Navier-Stokes equations. In comparison with the formal one-stage time-stepping third-order gas-kinetic solver [21], the current fourth-order method not only reduces the complexity of the flux function, but also improves the accuracy of the scheme, even though the third- and fourth-order schemes have similar computation cost. Most importantly, the robustness of the fourth-order GKS is as good as the second-order one. Perfect numerical solutions can be obtained from the high Reynolds number boundary layer solutions to the hypersonic viscous heat conducting flow computations. Many numerical tests, including many difficult ones for the Navier-Stokes solvers, have been used to validate the current fourth-order method. Following the two-stage time-stepping framework, the one-stage third-order GKS can be easily extended to a fifth-order method with the usage of both first-order and second-order time derivatives of the flux function. The use of time-accurate flux function may have great impact on the development of higher-order CFD methods.

NANov 17, 2018

Unified Gas-kinetic Wave-Particle methods I: Continuum and Rarefied Gas Flow

Chang Liu, Yajun Zhu, Kun Xu

The unified gas-kinetic scheme (UGKS) provides a framework for simulating multiscale transport with the updates of both gas distribution function and macroscopic flow variables on the cell size and time step scales. The multiscale dynamics in UGKS is achieved through the coupled particle transport and collision in the particle evolution process within a time step. In this paper, under the UGKS framework, we propose an efficient multiscale unified gas-kinetic wave-particle (UGKWP) method. The gas dynamics in UGKWP method is described by the individual particle movement coupled with the evolution of the probability density function (PDF). During a time step, the trajectories of simulation particles are tracked until collision happens, and the post-collision particles are evolved collectively through the evolution of the corresponding distribution function. The evolution of simulation particles and distribution function is guided by evolution of macroscopic variables. The two descriptions on a gas particle, i.e. wave and particle, switch dynamically with time. A new concept of multiscale multi-efficiency preserving (MMP) method is introduced, and the UGKWP method is shown to be an MMP scheme. The UGKWP method is specially efficient for hypersonic flow simulation in all regimes in comparison with the wave-type discrete ordinate methods, and presents a much lower stochastic noise in the continuum flow regime in comparison with the particle-based Monte Carlo methods. Numerical tests for flows over a wide range of Mach and Knudsen numbers are presented. The examples include mainly the hypersonic flow passing a circular cylinder at Mach numbers $20$ and $30$ and Knudsen numbers $1$ and $10^{-4}$, low speed lid-driven cavity flow, and laminar boundary layer. These results validate the accuracy, efficiency, and multiscale property of UGKWP method.

NAFeb 1, 2016

A Third-order Compact Gas-kinetic Scheme on Unstructured Meshes for Compressible Navier-Stokes Solutions

Liang Pan, Kun Xu

In this paper, for the first time a compact third-order gas-kinetic scheme is proposed on unstructured meshes for the compressible viscous flow computations. The possibility to de sign such a third-order compact scheme is due to the high-order gas evolution model, where a time-dependent gas distribution function at a cell interface not only provides the fluxes across a cell interface, but also the time evolution of the flow variables at the cell interface as well. As a result, both cell averaged and cell interface flow variables can be used for the initial data reconstruction at the beginning of next time step. A weighted least-square reconstruction has been used for the construction of a third-order initial condition. Therefore, a compact third-order gas-kinetic scheme with the involvement of neighboring cells only can be developed on unstructured meshes. In comparison with other conventional high-order schemes, the current method avoids the use of Gaussian points for the flux integration along a cell interface and the multi-stage Runge-Kutta time stepping technique. The third-order compact scheme is numerically stable under CFL condition above 0.5. Due to the multidimensional gas-kinetic formulation and the coupling of inviscid and viscous terms, even with unstructured meshes the boundary layer solution and the vortex structure can be accurately captured in the current scheme. At the same time, the compact scheme can capture strong shocks as well.

NAJan 25, 2018

Two-Stage Fourth-order Gas-kinetic Scheme for Three-dimensional Euler and Navier-Stokes Solutions

Liang Pan, Kun Xu

For the one-stage third-order gas-kinetic scheme (GKS), success applications have been achieved for the three-dimensional compressible flow computations [33]. The high-order accuracy of the scheme is obtained directly by integrating a multidimensional time-accurate gas distribution function over the cell interface within a time step without implementing Gaussian quadrature points and Runge-Kutta time-stepping technique. However, for the further increasing the order of the scheme, such as the fourth-order one, the formulation becomes very complicated for the multidimensional flow. Recently, a two-stage fourth-order GKS with high efficiency has been constructed for two-dimensional inviscid and viscous flow computations [22,32], and the scheme uses the time accurate flux function and its time derivatives. In this paper, a fourth-order GKS is developed for the threedimensional flows under the two-stage framework. Based on the three-dimensional WENO reconstruction and flux evaluation at Gaussian quadrature points on a cell interface, the high-order accuracy in space is achieved first. Then, the two-stage time stepping method provides the high accuracy in time. In comparison with the formal third-order GKS [33], the current fourth-order method not only improves the accuracy of the scheme, but also reduces the complexity of the gas-kinetic solver greatly. More importantly, the fourth-order GKS has the same robustness as the second-order shock capturing scheme. This scheme is applied to both inviscid and viscous, and low and high speed flow computations. Numerical results validate the outstanding reliability and applicability of the scheme for three-dimensional flows, such as turbulent one.

CVMar 31Code

ResAdapt: Adaptive Resolution for Efficient Multimodal Reasoning

Huanxuan Liao, Zhongtao Jiang, Yupu Hao et al.

Multimodal Large Language Models (MLLMs) achieve stronger visual understanding by scaling input fidelity, yet the resulting visual token growth makes jointly sustaining high spatial resolution and long temporal context prohibitive. We argue that the bottleneck lies not in how post-encoding representations are compressed but in the volume of pixels the encoder receives, and address it with ResAdapt, an Input-side adaptation framework that learns how much visual budget each frame should receive before encoding. ResAdapt couples a lightweight Allocator with an unchanged MLLM backbone, so the backbone retains its native visual-token interface while receiving an operator-transformed input. We formulate allocation as a contextual bandit and train the Allocator with Cost-Aware Policy Optimization (CAPO), which converts sparse rollout feedback into a stable accuracy-cost learning signal. Across budget-controlled video QA, temporal grounding, and image reasoning tasks, ResAdapt improves low-budget operating points and often lies on or near the efficiency-accuracy frontier, with the clearest gains on reasoning-intensive benchmarks under aggressive compression. Notably, ResAdapt supports up to 16x more frames at the same visual budget while delivering over 15% performance gain. Code is available at https://github.com/Xnhyacinth/ResAdapt.

CVDec 25, 2022

PaletteNeRF: Palette-based Color Editing for NeRFs

Qiling Wu, Jianchao Tan, Kun Xu

Neural Radiance Field (NeRF) is a powerful tool to faithfully generate novel views for scenes with only sparse captured images. Despite its strong capability for representing 3D scenes and their appearance, its editing ability is very limited. In this paper, we propose a simple but effective extension of vanilla NeRF, named PaletteNeRF, to enable efficient color editing on NeRF-represented scenes. Motivated by recent palette-based image decomposition works, we approximate each pixel color as a sum of palette colors modulated by additive weights. Instead of predicting pixel colors as in vanilla NeRFs, our method predicts additive weights. The underlying NeRF backbone could also be replaced with more recent NeRF models such as KiloNeRF to achieve real-time editing. Experimental results demonstrate that our method achieves efficient, view-consistent, and artifact-free color editing on a wide range of NeRF-represented scenes.

LGMay 2, 2022

FastGCL: Fast Self-Supervised Learning on Graphs via Contrastive Neighborhood Aggregation

Yuansheng Wang, Wangbin Sun, Kun Xu et al.

Graph contrastive learning (GCL), as a popular approach to graph self-supervised learning, has recently achieved a non-negligible effect. To achieve superior performance, the majority of existing GCL methods elaborate on graph data augmentation to construct appropriate contrastive pairs. However, existing methods place more emphasis on the complex graph data augmentation which requires extra time overhead, and pay less attention to developing contrastive schemes specific to encoder characteristics. We argue that a better contrastive scheme should be tailored to the characteristics of graph neural networks (e.g., neighborhood aggregation) and propose a simple yet effective method named FastGCL. Specifically, by constructing weighted-aggregated and non-aggregated neighborhood information as positive and negative samples respectively, FastGCL identifies the potential semantic information of data without disturbing the graph topology and node attributes, resulting in faster training and convergence speeds. Extensive experiments have been conducted on node classification and graph classification tasks, showing that FastGCL has competitive classification performance and significant training speedup compared to existing state-of-the-art methods.

CLApr 11, 2022

Zero-shot Cross-lingual Conversational Semantic Role Labeling

Han Wu, Haochen Tan, Kun Xu et al. · tencent-ai

While conversational semantic role labeling (CSRL) has shown its usefulness on Chinese conversational tasks, it is still under-explored in non-Chinese languages due to the lack of multilingual CSRL annotations for the parser training. To avoid expensive data collection and error-propagation of translation-based methods, we present a simple but effective approach to perform zero-shot cross-lingual CSRL. Our model implicitly learns language-agnostic, conversational structure-aware and semantically rich representations with the hierarchical encoders and elaborately designed pre-training objectives. Experimental results show that our model outperforms all baselines by large margins on two newly collected English CSRL test sets. More importantly, we confirm the usefulness of CSRL to non-Chinese conversational tasks such as the question-in-context rewriting task in English and the multi-turn dialogue response generation tasks in English, German and Japanese by incorporating the CSRL information into the downstream conversation-based models. We believe this finding is significant and will facilitate the research of non-Chinese dialogue tasks which suffer the problems of ellipsis and anaphora.

CLJul 4, 2022

Discourse-Aware Graph Networks for Textual Logical Reasoning

Yinya Huang, Lemao Liu, Kun Xu et al.

Textual logical reasoning, especially question-answering (QA) tasks with logical reasoning, requires awareness of particular logical structures. The passage-level logical relations represent entailment or contradiction between propositional units (e.g., a concluding sentence). However, such structures are unexplored as current QA systems focus on entity-based relations. In this work, we propose logic structural-constraint modeling to solve the logical reasoning QA and introduce discourse-aware graph networks (DAGNs). The networks first construct logic graphs leveraging in-line discourse connectives and generic logic theories, then learn logic representations by end-to-end evolving the logic relations with an edge-reasoning mechanism and updating the graph features. This pipeline is applied to a general encoder, whose fundamental features are joined with the high-level logic features for answer prediction. Experiments on three textual logical reasoning datasets demonstrate the reasonability of the logical structures built in DAGNs and the effectiveness of the learned logic features. Moreover, zero-shot transfer results show the features' generality to unseen logical texts.

NAJan 19, 2016

An Alternative Analysis of Discontinuous Galerkin Method for Hyperbolic Conservation Law

Kun Xu, Chang Liu, Xiaodong Ren

The development and application of the Discontinuous Galerkin (DG) method have attracted great attention in computational fluid dynamics (CFD) com- munity in the past decades. The underlying reason for such an intensive investigation is due to favorable properties of the DG method, such as higher order, compactness, and easy parallelization. However, for the compressible flow simulations, the DG method is also associated with unfavorable prop- erties, such as the frequent instabilities in non-smooth regions. Due to the finite element formulation, except the update of the cell averaged conserva- tive flow variable, it becomes rather difficulty to fully understand the physical mechanism in the DG method for the updates of other degrees of freedom in- side each element. In this short note, based on the linear advection equation we are going to analyze the DG method in an alternative way. The results obtained may be useful for the CFD community for its further development of the DG methods.

CVFeb 5, 2024Code

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

Yang Jin, Zhicheng Sun, Kun Xu et al. · pku

In light of recent advances in multimodal Large Language Models (LLMs), there is increasing attention to scaling them from image-text data to more informative real-world videos. Compared to static images, video poses unique challenges for effective large-scale pre-training due to the modeling of its spatiotemporal dynamics. In this paper, we address such limitations in video-language pre-training with an efficient video decomposition that represents each video as keyframes and temporal motions. These are then adapted to an LLM using well-designed tokenizers that discretize visual and temporal information as a few tokens, thus enabling unified generative pre-training of videos, images, and text. At inference, the generated tokens from the LLM are carefully recovered to the original continuous pixel space to create various video content. Our proposed framework is both capable of comprehending and generating image and video content, as demonstrated by its competitive performance across 13 multimodal benchmarks in image and video understanding and generation. Our code and models are available at https://video-lavit.github.io.

LGMar 12, 2024Code

Harder Tasks Need More Experts: Dynamic Routing in MoE Models

Quzhe Huang, Zhenwei An, Nan Zhuang et al. · pku

In this paper, we introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models, aiming to enhance computational efficiency and model performance by adjusting the number of activated experts based on input difficulty. Unlike traditional MoE approaches that rely on fixed Top-K routing, which activates a predetermined number of experts regardless of the input's complexity, our method dynamically selects experts based on the confidence level in expert selection for each input. This allows for a more efficient utilization of computational resources, activating more experts for complex tasks requiring advanced reasoning and fewer for simpler tasks. Through extensive evaluations, our dynamic routing method demonstrates substantial improvements over conventional Top-2 routing across various benchmarks, achieving an average improvement of 0.7% with less than 90% activated parameters. Further analysis shows our model dispatches more experts to tasks requiring complex reasoning skills, like BBH, confirming its ability to dynamically allocate computational resources in alignment with the input's complexity. Our findings also highlight a variation in the number of experts needed across different layers of the transformer model, offering insights into the potential for designing heterogeneous MoE frameworks. The code and models are available at https://github.com/ZhenweiAn/Dynamic_MoE.

SPDec 27, 2022

Semantic optical fiber communication system

Zhenming Yu, Hongyu Huang, Liming Cheng et al.

The current optical communication systems minimize bit or symbol errors without considering the semantic meaning behind digital bits, thus transmitting a lot of unnecessary information. We propose and experimentally demonstrate a semantic optical fiber communication (SOFC) system. Instead of encoding information into bits for transmission, semantic information is extracted from the source using deep learning. The generated semantic symbols are then directly transmitted through an optical fiber. Compared with the bit-based structure, the SOFC system achieved higher information compression and a more stable performance, especially in the low received optical power regime, and enhanced the robustness against optical link impairments. This work introduces an intelligent optical communication system at the human analytical thinking level, which is a significant step toward a breakthrough in the current optical communication architecture.

CLApr 27, 2022

Distant finetuning with discourse relations for stance classification

Lifeng Jin, Kun Xu, Linfeng Song et al.

Approaches for the stance classification task, an important task for understanding argumentation in debates and detecting fake news, have been relying on models which deal with individual debate topics. In this paper, in order to train a system independent from topics, we propose a new method to extract data with silver labels from raw text to finetune a model for stance classification. The extraction relies on specific discourse relation information, which is shown as a reliable and accurate source for providing stance information. We also propose a 3-stage training framework where the noisy level in the data used for finetuning decreases over different stages going from the most noisy to the least noisy. Detailed experiments show that the automatically annotated dataset as well as the 3-stage training help improve model performance in stance classification. Our approach ranks 1st among 26 competing teams in the stance classification track of the NLPCC 2021 shared task Argumentative Text Understanding for AI Debater, which confirms the effectiveness of our approach.

CLFeb 27, 2024Code

Probing Multimodal Large Language Models for Global and Local Semantic Representations

Mingxu Tao, Quzhe Huang, Kun Xu et al. · pku

The advancement of Multimodal Large Language Models (MLLMs) has greatly accelerated the development of applications in understanding integrated texts and images. Recent works leverage image-caption datasets to train MLLMs, achieving state-of-the-art performance on image-to-text tasks. However, there are few studies exploring which layers of MLLMs make the most effort to the global image information, which plays vital roles in multimodal comprehension and generation. In this study, we find that the intermediate layers of models can encode more global semantic information, whose representation vectors perform better on visual-language entailment tasks, rather than the topmost layers. We further probe models regarding local semantic representations through object recognition tasks. We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information. Our code and data are released via https://github.com/kobayashikanna01/probing_MLLM_rep.

NAMar 25

A High-Order Finite Volume GENO Scheme with Implicit Time Integration for Three-Temperature Radiation Diffusion Equations

Fengxiang Zhao, Yaqing Yang, Yibing Chen et al.

This study presents a high-order finite volume scheme capable of large time-step integration for three-temperature radiation diffusion (3TRD) equations, where conservation is naturally achieved through energy update. To handle local large gradients and discontinuities in temperature, a central generalized ENO (GENO) reconstruction is developed for diffusion systems, which achieves essentially non-oscillatory reconstruction for discontinuous solutions. Compared to conventional nonlinear reconstruction methods, its most distinctive feature is the central-type symmetric sub-stencils, which ensure consistency between the numerics and the isotropic nature of thermal diffusion. Additionally, the central GENO method provides smooth states of temperature and temperature gradient at interfaces, facilitating the evaluation of numerical fluxes. Furthermore, interface flux evaluation for cases with discontinuous physical property parameters is modeled. To address the extremely small time-step issue caused by stiff diffusion and source terms, a dual-time-stepping method based on implicit time discretization is developed for the first time in 3TRD systems, with the advantage of decoupling temporal discretization from complex nonlinear spatial discretization. A series of numerical examples validates the high accuracy, physical property preservation, strong robustness, and large time-step integration capability of the present high-order central GENO scheme.

CVMay 23, 2024Code

RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance

Zhicheng Sun, Zhenhao Yang, Yang Jin et al.

Customizing diffusion models to generate identity-preserving images from user-provided reference images is an intriguing new problem. The prevalent approaches typically require training on extensive domain-specific images to achieve identity preservation, which lacks flexibility across different use cases. To address this issue, we exploit classifier guidance, a training-free technique that steers diffusion models using an existing classifier, for personalized image generation. Our study shows that based on a recent rectified flow framework, the major limitation of vanilla classifier guidance in requiring a special classifier can be resolved with a simple fixed-point solution, allowing flexible personalization with off-the-shelf image discriminators. Moreover, its solving procedure proves to be stable when anchored to a reference flow trajectory, with a convergence guarantee. The derived method is implemented on rectified flow with different off-the-shelf image discriminators, delivering advantageous personalization results for human faces, live subjects, and certain objects. Code is available at https://github.com/feifeiobama/RectifID.

CLOct 8, 2023

Exploring the Usage of Chinese Pinyin in Pretraining

Baojun Wang, Kun Xu, Lifeng Shang

Unlike alphabetic languages, Chinese spelling and pronunciation are different. Both characters and pinyin take an important role in Chinese language understanding. In Chinese NLP tasks, we almost adopt characters or words as model input, and few works study how to use pinyin. However, pinyin is essential in many scenarios, such as error correction and fault tolerance for ASR-introduced errors. Most of these errors are caused by the same or similar pronunciation words, and we refer to this type of error as SSP(the same or similar pronunciation) errors for short. In this work, we explore various ways of using pinyin in pretraining models and propose a new pretraining method called PmBERT. Our method uses characters and pinyin in parallel for pretraining. Through delicate pretraining tasks, the characters and pinyin representation are fused, which can enhance the error tolerance for SSP errors. We do comprehensive experiments and ablation tests to explore what makes a robust phonetic enhanced Chinese language model. The experimental results on both the constructed noise-added dataset and the public error-correction dataset demonstrate that our model is more robust compared to SOTA models.

CVDec 24, 2025

Learning from Next-Frame Prediction: Autoregressive Video Modeling Encodes Effective Representations

Jinghan Li, Yang Jin, Hao Jiang et al.

Recent advances in pretraining general foundation models have significantly improved performance across diverse downstream tasks. While autoregressive (AR) generative models like GPT have revolutionized NLP, most visual generative pretraining methods still rely on BERT-style masked modeling, which often disregards the temporal information essential for video analysis. The few existing autoregressive visual pretraining methods suffer from issues such as inaccurate semantic localization and poor generation quality, leading to poor semantics. In this work, we propose NExT-Vid, a novel autoregressive visual generative pretraining framework that utilizes masked next-frame prediction to jointly model images and videos. NExT-Vid introduces a context-isolated autoregressive predictor to decouple semantic representation from target decoding, and a conditioned flow-matching decoder to enhance generation quality and diversity. Through context-isolated flow-matching pretraining, our approach achieves strong representations. Extensive experiments on large-scale pretrained models demonstrate that our proposed method consistently outperforms previous generative pretraining methods for visual representation learning via attentive probing in downstream classification.

CLMay 12

BitLM: Unlocking Multi-Token Language Generation with Bitwise Continuous Diffusion

Shaobin Zhuang, Yuang Ai, Jiaming Han et al.

Autoregressive language models generate text one token at a time, yet natural language is inherently structured in multi-token units, including phrases, n-grams, and collocations that carry meaning jointly. This one-token bottleneck limits both the expressiveness of the model during pre-training and its throughput at inference time. Existing remedies such as speculative decoding or diffusion-based language models either leave the underlying bottleneck intact or sacrifice the causal structure essential to language modeling. We propose BitLM, a language model that represents each token as a fixed-length binary code and employs a lightweight diffusion head to denoise multiple tokens in parallel within each block. Crucially, BitLM preserves left-to-right causal attention across blocks while making joint lexical decisions within each block, combining the reliability of autoregressive modeling with the parallelism of iterative refinement. By replacing the large-vocabulary softmax with bitwise denoising, BitLM reframes token generation as iterative commitment in a compact binary space, enabling more efficient pre-training and substantially faster inference without altering the causal foundation that makes language models effective. Our results demonstrate that the one-token-at-a-time paradigm is not a fundamental requirement but an interface choice, and that changing it can yield a stronger and faster language model. We hope BitLM points toward a promising direction for next-generation language model architectures.

IVAug 28, 2021Code

Self-supervised Neural Networks for Spectral Snapshot Compressive Imaging

Ziyi Meng, Zhenming Yu, Kun Xu et al.

We consider using {\bf\em untrained neural networks} to solve the reconstruction problem of snapshot compressive imaging (SCI), which uses a two-dimensional (2D) detector to capture a high-dimensional (usually 3D) data-cube in a compressed manner. Various SCI systems have been built in recent years to capture data such as high-speed videos, hyperspectral images, and the state-of-the-art reconstruction is obtained by the deep neural networks. However, most of these networks are trained in an end-to-end manner by a large amount of corpus with sometimes simulated ground truth, measurement pairs. In this paper, inspired by the untrained neural networks such as deep image priors (DIP) and deep decoders, we develop a framework by integrating DIP into the plug-and-play regime, leading to a self-supervised network for spectral SCI reconstruction. Extensive synthetic and real data results show that the proposed algorithm without training is capable of achieving competitive results to the training based networks. Furthermore, by integrating the proposed method with a pre-trained deep denoising prior, we have achieved state-of-the-art results. {Our code is available at \url{https://github.com/mengziyi64/CASSI-Self-Supervised}.}

AIFeb 16, 2021Code

GraphGallery: A Platform for Fast Benchmarking and Easy Development of Graph Neural Networks Based Intelligent Software

Jintang Li, Kun Xu, Liang Chen et al.

Graph Neural Networks (GNNs) have recently shown to be powerful tools for representing and analyzing graph data. So far GNNs is becoming an increasingly critical role in software engineering including program analysis, type inference, and code representation. In this paper, we introduce GraphGallery, a platform for fast benchmarking and easy development of GNNs based software. GraphGallery is an easy-to-use platform that allows developers to automatically deploy GNNs even with less domain-specific knowledge. It offers a set of implementations of common GNN models based on mainstream deep learning frameworks. In addition, existing GNNs toolboxes such as PyG and DGL can be easily incorporated into the platform. Experiments demonstrate the reliability of implementations and superiority in fast coding. The official source code of GraphGallery is available at https://github.com/EdisonLeeeee/GraphGallery and a demo video can be found at https://youtu.be/mv7Zs1YeaYo.

CLFeb 12, 2021Code

Structural Information Preserving for Graph-to-Text Generation

Linfeng Song, Ante Wang, Jinsong Su et al.

The task of graph-to-text generation aims at producing sentences that preserve the meaning of input graphs. As a crucial defect, the current state-of-the-art models may mess up or even drop the core structural information of input graphs when generating outputs. We propose to tackle this problem by leveraging richer training signals that can guide our model for preserving input information. In particular, we introduce two types of autoencoding losses, each individually focusing on different aspects (a.k.a. views) of input graphs. The losses are then back-propagated to better calibrate our model via multi-task training. Experiments on two benchmarks for graph-to-text generation show the effectiveness of our approach over a state-of-the-art baseline. Our code is available at \url{http://github.com/Soistesimmer/AMR-multiview}.

LGMar 10, 2020Code

A Survey of Adversarial Learning on Graphs

Liang Chen, Jintang Li, Jiaying Peng et al.

Deep learning models on graphs have achieved remarkable performance in various graph analysis tasks, e.g., node classification, link prediction, and graph clustering. However, they expose uncertainty and unreliability against the well-designed inputs, i.e., adversarial examples. Accordingly, a line of studies has emerged for both attack and defense addressed in different graph analysis tasks, leading to the arms race in graph adversarial learning. Despite the booming works, there still lacks a unified problem definition and a comprehensive review. To bridge this gap, we investigate and summarize the existing works on graph adversarial learning tasks systemically. Specifically, we survey and unify the existing works w.r.t. attack and defense in graph analysis tasks, and give appropriate definitions and taxonomies at the same time. Besides, we emphasize the importance of related evaluation metrics, investigate and summarize them comprehensively. Hopefully, our works can provide a comprehensive overview and offer insights for the relevant researchers. Latest advances in graph adversarial learning are summarized in our GitHub repository https://github.com/EdisonLeeeee/Graph-Adversarial-Learning.

CLFeb 24

CAMEL: Confidence-Gated Reflection for Reward Modeling

Zirui Zhu, Hailun Xu, Yang Luo et al.

Reward models play a fundamental role in aligning large language models with human preferences. Existing methods predominantly follow two paradigms: scalar discriminative preference models, which are efficient but lack interpretability, and generative judging models, which offer richer reasoning at the cost of higher computational overhead. We observe that the log-probability margin between verdict tokens strongly correlates with prediction correctness, providing a reliable proxy for instance difficulty without additional inference cost. Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances. To induce effective self-correction, we train the model via reinforcement learning with counterfactual prefix augmentation, which exposes the model to diverse initial verdicts and encourages genuine revision. Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters, while establishing a strictly better accuracy-efficiency Pareto frontier.

CLFeb 17, 2025

Following the Autoregressive Nature of LLM Embeddings via Compression and Alignment

Jingcheng Deng, Zhongtao Jiang, Liang Pang et al.

A new trend uses LLMs as dense text encoders via contrastive learning. However, since LLM embeddings predict the probability distribution of the next token, they are inherently generative and distributive, conflicting with contrastive learning, which requires embeddings to capture full-text semantics and align via cosine similarity. This discrepancy hinders the full utilization of LLMs' pre-training capabilities, resulting in inefficient learning. In response to this issue, we propose AutoRegEmbed, a new contrastive learning method built on embedding conditional probability distributions, which integrates two core tasks: information compression and conditional distribution alignment. The information compression task encodes text into the embedding space, ensuring that the embedding vectors capture global semantics. The conditional distribution alignment task focuses on aligning text embeddings with positive samples embeddings by leveraging the conditional distribution of embeddings while simultaneously reducing the likelihood of generating negative samples from text embeddings, thereby achieving embedding alignment and uniformity. Experimental results demonstrate that our method significantly outperforms traditional contrastive learning approaches and achieves performance comparable to state-of-the-art models when using the same amount of data.

CLOct 17, 2025

Latent Reasoning in LLMs as a Vocabulary-Space Superposition

Jingcheng Deng, Liang Pang, Zihao Wei et al.

Large language models (LLMs) demonstrate strong reasoning abilities with chain-of-thought prompting, but explicit reasoning introduces substantial computational overhead. Recent work on latent reasoning reduces this cost by reasoning in latent space without explicit supervision, but performance drops significantly. Our preliminary experiments suggest that this degradation stems from the unstructured latent space, which makes fitting latent tokens difficult. To address this, we restrict the latent space to the column space of the LLM vocabulary, treating latent reasoning as a superposition over vocabulary probabilities. Once latent reasoning concludes, it collapses into an eigenstate of explicit reasoning to yield the final answer. Based on this idea, we propose Latent-SFT, a two-stage learning framework. In the first stage, we design two specialized attention masks to guide the Latent Token Encoder in generating latent tokens, allowing the LLM to produce the correct answer conditioned on them. In the second stage, the Latent Token Encoder is discarded, and the LLM is directly trained to generate these latent tokens autonomously for latent reasoning, optimized with KL and CE losses. Latent-SFT sets a new state of the art on GSM8k, matching explicit SFT performance while cutting reasoning chains by up to 4 times and outperforming prior latent methods. On Math500 and AIME24, lexical probability-based latent reasoning also clearly surpasses hidden-state-based approaches. Our metrics of effective compression rate and effective global parallelism further show that latent reasoning is both the compression of a single path and the superposition of multiple paths.

CLFeb 17, 2025

On the Diminishing Returns of Complex Robust RAG Training in the Era of Powerful LLMs

Hanxing Ding, Shuchang Tao, Liang Pang et al.

Retrieval-augmented generation (RAG) systems traditionally employ sophisticated training strategies to enhance robustness against retrieval noise. In this work, we investigate a critical question: does the benefit of these complex robust training methods diminish as language models become more powerful? Through systematic evaluation across multiple model scales and question-answering datasets, our analysis reveals a consistent trend: \emph{the marginal robustness benefit of sophisticated training strategies decreases substantially as model capacity increases.} While smaller models show significant performance improvements from complex document selection and adversarial objectives, more capable models achieve comparable or even superior performance with simpler training approaches. Further investigation demonstrates that stronger models naturally exhibit better confidence calibration, cross-dataset generalization capability, and more effective attention patterns, even under simple training regimes. These findings suggest that as foundation models evolve, the engineering effort invested in complex robust training may yield diminishing returns, indicating that simplified RAG pipelines could suffice for powerful models while maintaining competitive performance.

CRFeb 13, 2025

Detecting Malicious Concepts Without Image Generation in AIGC

Kun Xu, Yushu Zhang, Shuren Qi et al.

The task of text-to-image generation has achieved tremendous success in practice, with emerging concept generation models capable of producing highly personalized and customized content. Fervor for concept generation is increasing rapidly among users, and platforms for concept sharing have sprung up. The concept owners may upload malicious concepts and disguise them with non-malicious text descriptions and example images to deceive users into downloading and generating malicious content. The platform needs a quick method to determine whether a concept is malicious to prevent the spread of malicious concepts. However, simply relying on concept image generation to judge whether a concept is malicious requires time and computational resources. Especially, as the number of concepts uploaded and downloaded on the platform continues to increase, this approach becomes impractical and poses a risk of generating malicious content. In this paper, we propose Concept QuickLook, the first systematic work to incorporate malicious concept detection into research, which performs detection based solely on concept files without generating any images. We define malicious concepts and design two work modes for detection: concept matching and fuzzy detection. Extensive experiments demonstrate that the proposed Concept QuickLook can detect malicious concepts and demonstrate practicality in concept sharing platforms. We also design robustness experiments to further validate the effectiveness of the solution. We hope this work can initiate malicious concept detection tasks and provide some inspiration.

CLFeb 2

WideSeek: Advancing Wide Research via Multi-Agent Scaling

Ziyang Huang, Haolin Ren, Xiaowei Yuan et al.

Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

CVFeb 15

UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Shaobin Zhuang, Yuang Ai, Jiaming Han et al.

Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook ($\mathit{2^{128}}$). For training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between token entropy loss and commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability cross various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring a remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). On general-domain, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizer and MLLM.

CVJun 27, 2025

Generating Attribute-Aware Human Motions from Textual Prompt

Xinghan Wang, Kun Xu, Fei Li et al.

Text-driven human motion generation has recently attracted considerable attention, allowing models to generate human motions based on textual descriptions. However, current methods neglect the influence of human attributes-such as age, gender, weight, and height-which are key factors shaping human motion patterns. This work represents a pilot exploration for bridging this gap. We conceptualize each motion as comprising both attribute information and action semantics, where textual descriptions align exclusively with action semantics. To achieve this, a new framework inspired by Structural Causal Models is proposed to decouple action semantics from human attributes, enabling text-to-semantics prediction and attribute-controlled generation. The resulting model is capable of generating attribute-aware motion aligned with the user's text and attribute inputs. For evaluation, we introduce a comprehensive dataset containing attribute annotations for text-motion pairs, setting the first benchmark for attribute-aware motion generation. Extensive experiments validate our model's effectiveness.

CVMay 20, 2025

InstanceBEV: Unifying Instance and BEV Representation for 3D Panoptic Segmentation

Feng Li, Zhaoyue Wang, Enyuan Zhang et al.

BEV-based 3D perception has emerged as a focal point of research in end-to-end autonomous driving. However, existing BEV approaches encounter significant challenges due to the large feature space, complicating efficient modeling and hindering effective integration of global attention mechanisms. We propose a novel modeling strategy, called InstanceBEV, that synergistically combines the strengths of both map-centric approaches and object-centric approaches. Our method effectively extracts instance-level features within the BEV features, facilitating the implementation of global attention modeling in a highly compressed feature space, thereby addressing the efficiency challenges inherent in map-centric global modeling. Furthermore, our approach enables effective multi-task learning without introducing additional module. We validate the efficiency and accuracy of the proposed model through predicting occupancy, achieving 3D occupancy panoptic segmentation by combining instance information. Experimental results on the OCC3D-nuScenes dataset demonstrate that InstanceBEV, utilizing only 8 frames, achieves a RayPQ of 15.3 and a RayIoU of 38.2. This surpasses SparseOcc's RayPQ by 9.3% and RayIoU by 10.7%, showcasing the effectiveness of multi-task synergy.

LGMay 19, 2025

Efficient training for large-scale optical neural network using an evolutionary strategy and attention pruning

Zhiwei Yang, Zeyang Fan, Yihang Lai et al.

MZI-based block optical neural networks (BONNs), which can achieve large-scale network models, have increasingly drawn attentions. However, the robustness of the current training algorithm is not high enough. Moreover, large-scale BONNs usually contain numerous trainable parameters, resulting in expensive computation and power consumption. In this article, by pruning matrix blocks and directly optimizing the individuals in population, we propose an on-chip covariance matrix adaptation evolution strategy and attention-based pruning (CAP) algorithm for large-scale BONNs. The calculated results demonstrate that the CAP algorithm can prune 60% and 80% of the parameters for MNIST and Fashion-MNIST datasets, respectively, while only degrades the performance by 3.289% and 4.693%. Considering the influence of dynamic noise in phase shifters, our proposed CAP algorithm (performance degradation of 22.327% for MNIST dataset and 24.019% for Fashion-MNIST dataset utilizing a poor fabricated chip and electrical control with a standard deviation of 0.5) exhibits strongest robustness compared with both our previously reported block adjoint training algorithm (43.963% and 41.074%) and the covariance matrix adaptation evolution strategy (25.757% and 32.871%), respectively. Moreover, when 60% of the parameters are pruned, the CAP algorithm realizes 88.5% accuracy in experiment for the simplified MNIST dataset, which is similar to the simulation result without noise (92.1%). Additionally, we simulationally and experimentally demonstrate that using MZIs with only internal phase shifters to construct BONNs is an efficient way to reduce both the system area and the required trainable parameters. Notably, our proposed CAP algorithm show excellent potential for larger-scale network models and more complex tasks.

IVOct 25, 2024

Integration of Communication and Computational Imaging

Zhenming Yu, Liming Cheng, Hongyu Huang et al.

Communication enables the expansion of human visual perception beyond the limitations of time and distance, while computational imaging overcomes the constraints of depth and breadth. Although impressive achievements have been witnessed with the two types of technologies, the occlusive information flow between the two domains is a bottleneck hindering their ulterior progression. Herein, we propose a novel framework that integrates communication and computational imaging (ICCI) to break through the inherent isolation between communication and computational imaging for remote perception. By jointly considering the sensing and transmitting of remote visual information, the ICCI framework performs a full-link information transfer optimization, aiming to minimize information loss from the generation of the information source to the execution of the final vision tasks. We conduct numerical analysis and experiments to demonstrate the ICCI framework by integrating communication systems and snapshot compressive imaging systems. Compared with straightforward combination schemes, which sequentially execute sensing and transmitting, the ICCI scheme shows greater robustness against channel noise and impairments while achieving higher data compression. Moreover, an 80 km 27-band hyperspectral video perception with a rate of 30 fps is experimentally achieved. This new ICCI remote perception paradigm offers a highefficiency solution for various real-time computer vision tasks.

CLSep 23, 2021

CSAGN: Conversational Structure Aware Graph Network for Conversational Semantic Role Labeling

Han Wu, Kun Xu, Linqi Song

Conversational semantic role labeling (CSRL) is believed to be a crucial step towards dialogue understanding. However, it remains a major challenge for existing CSRL parser to handle conversational structural information. In this paper, we present a simple and effective architecture for CSRL which aims to address this problem. Our model is based on a conversational structure-aware graph network which explicitly encodes the speaker dependent information. We also propose a multi-task learning method to further improve the model. Experimental results on benchmark datasets show that our model with our proposed training objectives significantly outperforms previous baselines.

CLSep 18, 2021

DyLex: Incorporating Dynamic Lexicons into BERT for Sequence Labeling

Baojun Wang, Zhao Zhang, Kun Xu et al.

Incorporating lexical knowledge into deep learning models has been proved to be very effective for sequence labeling tasks. However, previous works commonly have difficulty dealing with large-scale dynamic lexicons which often cause excessive matching noise and problems of frequent updates. In this paper, we propose DyLex, a plug-in lexicon incorporation approach for BERT based sequence labeling tasks. Instead of leveraging embeddings of words in the lexicon as in conventional methods, we adopt word-agnostic tag embeddings to avoid re-training the representation while updating the lexicon. Moreover, we employ an effective supervised lexical knowledge denoising method to smooth out matching noise. Finally, we introduce a col-wise attention based knowledge fusion mechanism to guarantee the pluggability of the proposed framework. Experiments on ten datasets of three tasks show that the proposed framework achieves new SOTA, even with very large scale lexicons.

CLSep 10, 2021

Exophoric Pronoun Resolution in Dialogues with Topic Regularization

Xintong Yu, Hongming Zhang, Yangqiu Song et al.

Resolving pronouns to their referents has long been studied as a fundamental natural language understanding problem. Previous works on pronoun coreference resolution (PCR) mostly focus on resolving pronouns to mentions in text while ignoring the exophoric scenario. Exophoric pronouns are common in daily communications, where speakers may directly use pronouns to refer to some objects present in the environment without introducing the objects first. Although such objects are not mentioned in the dialogue text, they can often be disambiguated by the general topics of the dialogue. Motivated by this, we propose to jointly leverage the local context and global topics of dialogues to solve the out-of-text PCR problem. Extensive experiments demonstrate the effectiveness of adding topic regularization for resolving exophoric pronouns.

CLMay 28, 2021

Domain-Adaptive Pretraining Methods for Dialogue Understanding

Han Wu, Kun Xu, Linfeng Song et al.

Language models like BERT and SpanBERT pretrained on open-domain data have obtained impressive gains on various NLP tasks. In this paper, we probe the effectiveness of domain-adaptive pretraining objectives on downstream tasks. In particular, three objectives, including a novel objective focusing on modeling predicate-argument relations, are evaluated on two challenging dialogue understanding tasks. Experimental results demonstrate that domain-adaptive pretraining with proper objectives can significantly improve the performance of a strong baseline on these tasks, achieving the new state-of-the-art performances.

LGMay 10, 2021

Rethinking and Reweighting the Univariate Losses for Multi-Label Ranking: Consistency and Generalization

Guoqiang Wu, Chongxuan Li, Kun Xu et al.

(Partial) ranking loss is a commonly used evaluation measure for multi-label classification, which is usually optimized with convex surrogates for computational efficiency. Prior theoretical work on multi-label ranking mainly focuses on (Fisher) consistency analyses. However, there is a gap between existing theory and practice -- some pairwise losses can lead to promising performance but lack consistency, while some univariate losses are consistent but usually have no clear superiority in practice. In this paper, we attempt to fill this gap through a systematic study from two complementary perspectives of consistency and generalization error bounds of learning algorithms. Our results show that learning algorithms with the consistent univariate loss have an error bound of $O(c)$ ($c$ is the number of labels), while algorithms with the inconsistent pairwise loss depend on $O(\sqrt{c})$ as shown in prior work. This explains that the latter can achieve better performance than the former in practice. Moreover, we present an inconsistent reweighted univariate loss-based learning algorithm that enjoys an error bound of $O(\sqrt{c})$ for promising performance as well as the computational efficiency of univariate losses. Finally, experimental results validate our theoretical analyses.

CLApr 11, 2021

Conversational Semantic Role Labeling

Kun Xu, Han Wu, Linfeng Song et al.

Semantic role labeling (SRL) aims to extract the arguments for each predicate in an input sentence. Traditional SRL can fail to analyze dialogues because it only works on every single sentence, while ellipsis and anaphora frequently occur in dialogues. To address this problem, we propose the conversational SRL task, where an argument can be the dialogue participants, a phrase in the dialogue history or the current sentence. As the existing SRL datasets are in the sentence level, we manually annotate semantic roles for 3,000 chit-chat dialogues (27,198 sentences) to boost the research in this direction. Experiments show that while traditional SRL systems (even with the help of coreference resolution or rewriting) perform poorly for analyzing dialogues, modeling dialogue histories and participants greatly helps the performance, indicating that adapting SRL to conversations is very promising for universal dialogue understanding. Our initial study by applying CSRL to two mainstream conversational tasks, dialogue response generation and dialogue context rewriting, also confirms the usefulness of CSRL.

CVApr 9, 2021

Video-aided Unsupervised Grammar Induction

Songyang Zhang, Linfeng Song, Lifeng Jin et al.

We investigate video-aided grammar induction, which learns a constituency parser from both unlabeled text and its corresponding video. Existing methods of multi-modal grammar induction focus on learning syntactic grammars from text-image pairs, with promising results showing that the information from static images is useful in induction. However, videos provide even richer information, including not only static objects but also actions and state changes useful for inducing verb phrases. In this paper, we explore rich features (e.g. action, object, scene, audio, face, OCR and speech) from videos, taking the recent Compound PCFG model as the baseline. We further propose a Multi-Modal Compound PCFG model (MMC-PCFG) to effectively aggregate these rich features from different modalities. Our proposed MMC-PCFG is trained end-to-end and outperforms each individual modality and previous state-of-the-art systems on three benchmarks, i.e. DiDeMo, YouCook2 and MSRVTT, confirming the effectiveness of leveraging video information for unsupervised grammar induction.

CLJan 27, 2021

Joint Coreference Resolution and Character Linking for Multiparty Conversation

Jiaxin Bai, Hongming Zhang, Yangqiu Song et al.

Character linking, the task of linking mentioned people in conversations to the real world, is crucial for understanding the conversations. For the efficiency of communication, humans often choose to use pronouns (e.g., "she") or normal phrases (e.g., "that girl") rather than named entities (e.g., "Rachel") in the spoken language, which makes linking those mentions to real people a much more challenging than a regular entity linking task. To address this challenge, we propose to incorporate the richer context from the coreference relations among different mentions to help the linking. On the other hand, considering that finding coreference clusters itself is not a trivial task and could benefit from the global character information, we propose to jointly solve these two tasks. Specifically, we propose C$^2$, the joint learning model of Coreference resolution and Character linking. The experimental results demonstrate that C$^2$ can significantly outperform previous works on both tasks. Further analyses are conducted to analyze the contribution of all modules in the proposed model and the effect of all hyper-parameters.

CLDec 31, 2020

TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis

Haisong Zhang, Lemao Liu, Haiyun Jiang et al.

This technique report introduces TexSmart, a text understanding system that supports fine-grained named entity recognition (NER) and enhanced semantic analysis functionalities. Compared to most previous publicly available text understanding systems and tools, TexSmart holds some unique features. First, the NER function of TexSmart supports over 1,000 entity types, while most other public tools typically support several to (at most) dozens of entity types. Second, TexSmart introduces new semantic analysis functions like semantic expansion and deep semantic representation, that are absent in most previous systems. Third, a spectrum of algorithms (from very fast algorithms to those that are relatively slow but more accurate) are implemented for one function in TexSmart, to fulfill the requirements of different academic and industrial applications. The adoption of unsupervised or weakly-supervised algorithms is especially emphasized, with the goal of easily updating our models to include fresh data with less human annotation efforts. The main contents of this report include major functions of TexSmart, algorithms for achieving these functions, how to use the TexSmart toolkit and Web APIs, and evaluation results of some key algorithms.