h-index9
43papers
1,081citations
Novelty52%
AI Score60

43 Papers

MLDec 8, 2022Code
CausalEGM: a general causal inference framework by encoding generative modeling

Qiao Liu, Zhongren Chen, Wing Hung Wong

Although understanding and characterizing causal effects have become essential in observational studies, it is challenging when the confounders are high-dimensional. In this article, we develop a general framework $\textit{CausalEGM}$ for estimating causal effects by encoding generative modeling, which can be applied in both binary and continuous treatment settings. Under the potential outcome framework with unconfoundedness, we establish a bidirectional transformation between the high-dimensional confounders space and a low-dimensional latent space where the density is known (e.g., multivariate normal distribution). Through this, CausalEGM simultaneously decouples the dependencies of confounders on both treatment and outcome and maps the confounders to the low-dimensional latent space. By conditioning on the low-dimensional latent features, CausalEGM can estimate the causal effect for each individual or the average causal effect within a population. Our theoretical analysis shows that the excess risk for CausalEGM can be bounded through empirical process theory. Under an assumption on encoder-decoder networks, the consistency of the estimate can be guaranteed. In a series of experiments, CausalEGM demonstrates superior performance over existing methods for both binary and continuous treatments. Specifically, we find CausalEGM to be substantially more powerful than competing methods in the presence of large sample sizes and high dimensional confounders. The software of CausalEGM is freely available at https://github.com/SUwonglab/CausalEGM.

CLAug 22, 2023Code
Aspect-oriented Opinion Alignment Network for Aspect-Based Sentiment Classification

Xueyi Liu, Rui Hou, Yanglei Gan et al.

Aspect-based sentiment classification is a crucial problem in fine-grained sentiment analysis, which aims to predict the sentiment polarity of the given aspect according to its context. Previous works have made remarkable progress in leveraging attention mechanism to extract opinion words for different aspects. However, a persistent challenge is the effective management of semantic mismatches, which stem from attention mechanisms that fall short in adequately aligning opinions words with their corresponding aspect in multi-aspect sentences. To address this issue, we propose a novel Aspect-oriented Opinion Alignment Network (AOAN) to capture the contextual association between opinion words and the corresponding aspect. Specifically, we first introduce a neighboring span enhanced module which highlights various compositions of neighboring words and given aspects. In addition, we design a multi-perspective attention mechanism that align relevant opinion information with respect to the given aspect. Extensive experiments on three benchmark datasets demonstrate that our model achieves state-of-the-art results. The source code is available at https://github.com/AONE-NLP/ABSA-AOAN.

LGMay 31
GLIDE: Graph-guided Leap Inference for Diffusion Estimation of Spatio-Temporal Point Processes

Guanyu Zhou, Yao Liu, Yanglei Gan et al.

Spatio-temporal point processes (STPPs) provide a principled framework for modeling asynchronous events in continuous time and space. Recent diffusion-based approaches offer a flexible alternative to deterministic prediction by modeling complex conditional distributions, but their application to STPPs remains challenging: reverse sampling from pure noise is costly, and weak structural constraints in sparse spatial domains can lead to poorly localized probability mass. We propose \textbf{GLIDE} (Graph-guided Leap Inference for Diffusion Estimation), a conditional diffusion framework for next-event modeling in STPPs. GLIDE organizes historical events into a multi-scale historical graph and encodes temporal evolution and spatial topology through a dual-stream architecture, yielding a structured conditioning context for a dual-branch diffusion denoiser. It further introduces a prior-guided leap inference mechanism, in which a lightweight mean predictor provides a deterministic anchor and the reverse process starts from an intermediate diffusion step instead of from pure Gaussian noise. Experiments on multiple real-world datasets show that GLIDE improves both distribution fitting and next-event prediction, with the largest gains appearing on the spatial side. The results also indicate that prior-guided leap inference substantially reduces reverse-sampling cost while preserving the stochastic generation capability of diffusion models.

LGMay 31
FAiT: Frequency-Aware Inverted Transformer for Multivariate Time Series Forecasting

Peng He, Yao Liu, Yanglei Gan et al.

While Transformer-based architectures have established themselves as a dominant paradigm in Multivariate Time Series Forecasting (MTSF), their core self-attention mechanism inherently functions as a low-pass filter, systematically smoothing out high-frequency signals vital for sharp local changes. Recent advancements have increasingly incorporated frequency-domain operations to address this bias, however, most existing designs rely on fixed spectral bases and apply sequence-wise (uniform) modulation, implicitly assuming a time-invariant frequency response. This overlooks a key property of real-world series that their spectral characteristics often evolve over time, making uniform modulation insufficient for capturing fine-grained temporal dynamics. To tackle these limitations, we propose FAiT, a Frequency-Aware inverted Transformer. Specifically, FAiT rectifies the spectral bias internally through Inverted Attention, which interprets the attention map as a learnable low-pass operator and constructs a dedicated complementary high-pass branch by inverting the attention matrix to recover attenuated transient signals. Furthermore, FAiT introduces Dynamic Temporal-Frequency Modulation (DTFM), which synthesizes instance-conditioned weights to adaptively re-calibrate the energy of spectral sub-bands, enabling fine-grained control over evolving multi-scale patterns. Extensive experiments on widely used benchmarks demonstrate that FAiT consistently outperforms state-of-the-art Transformer-based and frequency-enhanced baselines, while maintaining computational efficiency.

CVSep 23, 2022
Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network

Honghu Pan, Qiao Liu, Yongyong Chen et al.

Existing methods for video-based person re-identification (ReID) mainly learn the appearance feature of a given pedestrian via a feature extractor and a feature aggregator. However, the appearance models would fail when different pedestrians have similar appearances. Considering that different pedestrians have different walking postures and body proportions, we propose to learn the discriminative pose feature beyond the appearance feature for video retrieval. Specifically, we implement a two-branch architecture to separately learn the appearance feature and pose feature, and then concatenate them together for inference. To learn the pose feature, we first detect the pedestrian pose in each frame through an off-the-shelf pose detector, and construct a temporal graph using the pose sequence. We then exploit a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph, which devises a global information propagation mechanism to simultaneously achieve the neighborhood aggregation of intra-frame nodes and message passing among inter-frame graphs. Finally, we propose a dual-attention method consisting of node-attention and time-attention to obtain the temporal graph representation from the node embeddings, where the self-attention mechanism is employed to learn the importance of each node and each frame. We verify the proposed method on three video-based ReID datasets, i.e., Mars, DukeMTMC and iLIDS-VID, whose experimental results demonstrate that the learned pose feature can effectively improve the performance of existing appearance models.

LGMar 12Code
Bridging Discrete Marks and Continuous Dynamics: Dual-Path Cross-Interaction for Marked Temporal Point Processes

Yuxiang Liu, Qiao Liu, Tong Luo et al.

Predicting irregularly spaced event sequences with discrete marks poses significant challenges due to the complex, asynchronous dependencies embedded within continuous-time data streams.Existing sequential approaches capture dependencies among event tokens but ignore the continuous evolution between events, while Neural Ordinary Differential Equation (Neural ODE) methods model smooth dynamics yet fail to account for how event types influence future timing.To overcome these limitations, we propose NEXTPP, a dual-channel framework that unifies discrete and continuous representations via Event-granular Neural Evolution with Cross-Interaction for Marked Temporal Point Processes. Specifically, NEXTPP encodes discrete event marks via a self-attention mechanism, simultaneously evolving a latent continuous-time state using a Neural ODE. These parallel streams are then fused through a crossattention module to enable explicit bidirectional interaction between continuous and discrete representations. The fused representations drive the conditional intensity function of the neural Hawkes process, while an iterative thinning sampler is employed to generate future events. Extensive evaluations on five real-world datasets demonstrate that NEXTPP consistently outperforms state-of-the-art models. The source code can be found at https://github.com/AONE-NLP/NEXTPP.

CVMar 14Code
GenLie: A Global-Enhanced Lie Detection Network under Sparsity and Semantic Interference

Zongshun Zhang, Yao Liu, Qiao Liu et al.

Video-based lie detection aims to identify deceptive behaviors from visual cues. Despite recent progress, its core challenge lies in learning sparse yet discriminative representations. Deceptive signals are typically subtle and short-lived, easily overwhelmed by redundant information, while individual and contextual variations introduce strong identity-related noise. To address this issue, we propose GenLie, a Global-Enhanced Lie Detection Network that performs local feature modeling under global supervision. Specifically, sparse and subtle deceptive cues are captured at the local level, while global supervision and optimization ensure robust and discriminative representations by suppressing identity-related noise. Experiments on three public datasets, covering both high- and low-stakes scenarios, show that GenLie consistently outperforms state-of-the-art methods. Source code is available at https://github.com/AliasDictusZ1/GenLie.

MLJan 8Code
A Bayesian Generative Modeling Approach for Arbitrary Conditional Inference

Qiao Liu, Wing Hung Wong

Modern data analysis increasingly requires flexible conditional inference P(X_B | X_A) where (X_A, X_B) is an arbitrary partition of observed variable X. Existing conditional inference methods lack this flexibility as they are tied to a fixed conditioning structure and cannot perform new conditional inference once trained. To solve this, we propose a Bayesian generative modeling (BGM) approach for arbitrary conditional inference without retraining. BGM learns a generative model of X through an iterative Bayesian updating algorithm where model parameters and latent variables are updated until convergence. Once trained, any conditional distribution can be obtained without retraining. Empirically, BGM achieves superior prediction performance with well calibrated predictive intervals, demonstrating that a single learned model can serve as a universal engine for conditional prediction with uncertainty quantification. We provide theoretical guarantees for the convergence of the stochastic iterative algorithm, statistical consistency and conditional-risk bounds. The proposed BGM framework leverages the power of AI to capture complex relationships among variables while adhering to Bayesian principles, emerging as a promising framework for advancing various applications in modern data science. The code for BGM is freely available at https://github.com/liuq-lab/bayesgm.

MLMay 7Code
BGM-IV: an AI-powered Bayesian generative modeling approach for instrumental variable analysis

Guyue Luo, Qiao Liu

Instrumental-variable (IV) regression enables causal estimation under endogeneity, but modern IV problems often involve nonlinear structural effects and high-dimensional covariates. Existing nonlinear IV methods directly learn the causal relation in observed feature space or rely on learned representations within two-stage or moment-based procedures, which can struggle when the causal information is embedded in a high-dimensional representation. We propose BGM-IV, a latent Bayesian generative modeling approach that reframes nonlinear IV regression as posterior inference in a causally structured latent space. BGM-IV infers latent components that separately capture shared confounding structure, outcome-specific variation, treatment-specific variation, and covariate-only nuisance information. To account for endogeneity, BGM-IV replaces the confounded outcome likelihood with an IV-integrated pseudo-likelihood that averages over instrument-induced treatment values within the latent model. Across various benchmark datasets, BGM-IV remains competitive in the classical low-dimensional regime and performs best in high-dimensional covariate regimes. Together, these results show that structured latent generative modeling provides a principled and effective strategy to nonlinear IV estimation with rich covariates. The code of BGM-IV is available at https://github.com/liuq-lab/BGM-IV.

MLMay 3Code
Missingness-aware Data Imputation via AI-powered Bayesian Generative Modeling

Qiao Liu

Missing data imputation remains a fundamental challenge in modern data science, especially when uncertainty quantification is essential. In this work, we propose MissBGM, an AI-powered missing data imputation method via Bayesian generative modeling that bridges the expressive flexibility of neural networks with the statistical rigor of Bayesian inference. Unlike existing methods that often focus on point estimates or treat the missingness mechanism implicitly, MissBGM explicitly and jointly models the data-generating and missingness mechanisms, providing principled posterior uncertainty over imputations rather than a single point estimate. We develop a stochastic optimization framework with alternating updates among missing values, model parameters, and latent variables until convergence. Our theoretical analysis shows that estimates of missing values from MissBGM converge consistently under mild assumptions. Empirically, we demonstrate that MissBGM achieves superior performance over traditional imputers and recent neural network-based methods across extensive experimental settings. These results establish MissBGM as a principled and scalable solution for modern missing data imputation. The code for MissBGM is open sourced at https://github.com/liuq-lab/MissBGM.

MLNov 23, 2022
Mutual Information Learned Regressor: an Information-theoretic Viewpoint of Training Regression Systems

Jirong Yi, Qiaosheng Zhang, Zhen Chen et al.

As one of the central tasks in machine learning, regression finds lots of applications in different fields. An existing common practice for solving regression problems is the mean square error (MSE) minimization approach or its regularized variants which require prior knowledge about the models. Recently, Yi et al., proposed a mutual information based supervised learning framework where they introduced a label entropy regularization which does not require any prior knowledge. When applied to classification tasks and solved via a stochastic gradient descent (SGD) optimization algorithm, their approach achieved significant improvement over the commonly used cross entropy loss and its variants. However, they did not provide a theoretical convergence analysis of the SGD algorithm for the proposed formulation. Besides, applying the framework to regression tasks is nontrivial due to the potentially infinite support set of the label. In this paper, we investigate the regression under the mutual information based supervised learning framework. We first argue that the MSE minimization approach is equivalent to a conditional entropy learning problem, and then propose a mutual information learning formulation for solving regression problems by using a reparameterization technique. For the proposed formulation, we give the convergence analysis of the SGD algorithm for solving it in practice. Finally, we consider a multi-output regression data model where we derive the generalization performance lower bound in terms of the mutual information associated with the underlying data distribution. The result shows that the high dimensionality can be a bless instead of a curse, which is controlled by a threshold. We hope our work will serve as a good starting point for further research on the mutual information based regression.

LGOct 3, 2022
Mutual Information Learned Classifiers: an Information-theoretic Viewpoint of Training Deep Learning Classification Systems

Jirong Yi, Qiaosheng Zhang, Zhen Chen et al.

Deep learning systems have been reported to acheive state-of-the-art performances in many applications, and one of the keys for achieving this is the existence of well trained classifiers on benchmark datasets which can be used as backbone feature extractors in downstream tasks. As a main-stream loss function for training deep neural network (DNN) classifiers, the cross entropy loss can easily lead us to find models which demonstrate severe overfitting behavior when no other techniques are used for alleviating it such as data augmentation. In this paper, we prove that the existing cross entropy loss minimization for training DNN classifiers essentially learns the conditional entropy of the underlying data distribution of the dataset, i.e., the information or uncertainty remained in the labels after revealing the input. In this paper, we propose a mutual information learning framework where we train DNN classifiers via learning the mutual information between the label and input. Theoretically, we give the population error probability lower bound in terms of the mutual information. In addition, we derive the mutual information lower and upper bounds for a concrete binary classification data model in $\mbR^n$, and also the error probability lower bound in this scenario. Besides, we establish the sample complexity for accurately learning the mutual information from empirical data samples drawn from the underlying data distribution. Empirically, we conduct extensive experiments on several benchmark datasets to support our theory. Without whistles and bells, the proposed mutual information learned classifiers (MILCs) acheive far better generalization performances than the state-of-the-art classifiers with an improvement which can exceed more than 10\% in testing accuracy.

LGSep 21, 2022
Mutual Information Learned Classifiers: an Information-theoretic Viewpoint of Training Deep Learning Classification Systems

Jirong Yi, Qiaosheng Zhang, Zhen Chen et al.

Deep learning systems have been reported to achieve state-of-the-art performances in many applications, and a key is the existence of well trained classifiers on benchmark datasets. As a main-stream loss function, the cross entropy can easily lead us to find models which demonstrate severe overfitting behavior. In this paper, we show that the existing cross entropy loss minimization problem essentially learns the label conditional entropy (CE) of the underlying data distribution of the dataset. However, the CE learned in this way does not characterize well the information shared by the label and the input. In this paper, we propose a mutual information learning framework where we train deep neural network classifiers via learning the mutual information between the label and the input. Theoretically, we give the population classification error lower bound in terms of the mutual information. In addition, we derive the mutual information lower and upper bounds for a concrete binary classification data model in $\mathbb{R}^n$, and also the error probability lower bound in this scenario. Empirically, we conduct extensive experiments on several benchmark datasets to support our theory. The mutual information learned classifiers (MILCs) achieve far better generalization performances than the conditional entropy learned classifiers (CELCs) with an improvement which can exceed more than 10\% in testing accuracy.

CLDec 19, 2023Code
Synergistic Anchored Contrastive Pre-training for Few-Shot Relation Extraction

Da Luo, Yanglei Gan, Rui Hou et al.

Few-shot Relation Extraction (FSRE) aims to extract relational facts from a sparse set of labeled corpora. Recent studies have shown promising results in FSRE by employing Pre-trained Language Models (PLMs) within the framework of supervised contrastive learning, which considers both instances and label facts. However, how to effectively harness massive instance-label pairs to encompass the learned representation with semantic richness in this learning paradigm is not fully explored. To address this gap, we introduce a novel synergistic anchored contrastive pre-training framework. This framework is motivated by the insight that the diverse viewpoints conveyed through instance-label pairs capture incomplete yet complementary intrinsic textual semantics. Specifically, our framework involves a symmetrical contrastive objective that encompasses both sentence-anchored and label-anchored contrastive losses. By combining these two losses, the model establishes a robust and uniform representation space. This space effectively captures the reciprocal alignment of feature distributions among instances and relational facts, simultaneously enhancing the maximization of mutual information across diverse perspectives within the same relation. Experimental results demonstrate that our framework achieves significant performance enhancements compared to baseline models in downstream FSRE tasks. Furthermore, our approach exhibits superior adaptability to handle the challenges of domain shift and zero-shot relation extraction. Our code is available online at https://github.com/AONE-NLP/FSRE-SaCon.

CVJul 28, 2024
Progressive Domain Adaptation for Thermal Infrared Object Tracking

Qiao Li, Kanlun Tan, Qiao Liu et al.

Due to the lack of large-scale labeled Thermal InfraRed (TIR) training datasets, most existing TIR trackers are trained directly on RGB datasets. However, tracking methods trained on RGB datasets suffer a significant drop-off in TIR data due to the domain shift issue. To this end, in this work, we propose a Progressive Domain Adaptation framework for TIR Tracking (PDAT), which transfers useful knowledge learned from RGB tracking to TIR tracking. The framework makes full use of large-scale labeled RGB datasets without requiring time-consuming and labor-intensive labeling of large-scale TIR data. Specifically, we first propose an adversarial-based global domain adaptation module to reduce domain gap on the feature level coarsely. Second, we design a clustering-based subdomain adaptation method to further align the feature distributions of the RGB and TIR datasets finely. These two domain adaptation modules gradually eliminate the discrepancy between the two domains, and thus learn domain-invariant fine-grained features through progressive training. Additionally, we collect a largescale TIR dataset with over 1.48 million unlabeled TIR images for training the proposed domain adaptation framework. Experimental results on five TIR tracking benchmarks show that the proposed method gains a nearly 6% success rate, demonstrating its effectiveness.

CVMar 27
MuDD: A Multimodal Deception Detection Dataset and GSR-Guided Progressive Distillation for Non-Contact Deception Detection

Peiyuan Jiang, Yao Liu, Yanglei Gan et al.

Non-contact automatic deception detection remains challenging because visual and auditory deception cues often lack stable cross-subject patterns. In contrast, galvanic skin response (GSR) provides more reliable physiological cues and has been widely used in contact-based deception detection. In this work, we leverage stable deception-related knowledge in GSR to guide representation learning in non-contact modalities through cross-modal knowledge distillation. A key obstacle, however, is the lack of a suitable dataset for this setting. To address this, we introduce MuDD, a large-scale Multimodal Deception Detection dataset containing recordings from 130 participants over 690 minutes. In addition to video, audio, and GSR, MuDD also provides Photoplethysmography, heart rate, and personality traits, supporting broader scientific studies of deception. Based on this dataset, we propose GSR-guided Progressive Distillation (GPD), a cross-modal distillation framework for mitigating the negative transfer caused by the large modality mismatch between GSR and non-contact signals. The core innovation of GPD is the integration of progressive feature-level and digit-level distillation with dynamic routing, which allows the model to adaptively determine how teacher knowledge should be transferred during training, leading to more stable cross-modal knowledge transfer. Extensive experiments and visualizations show that GPD outperforms existing methods and achieves state-of-the-art performance on both deception detection and concealed-digit identification.

AIMay 13
Conditional Attribute Estimation with Autoregressive Sequence Models

Erica Stutz, Giacomo Marino, Daniella Meeker et al.

Generative models are often trained with a next-token prediction objective, yet many downstream applications require the ability to estimate or control sequence-level properties. Next-token prediction can lead to overfitting of local patterns during training, underfitting of global structure, and requires significant downstream modifications or expensive sampling to guide or predict the global attributes of generated samples at inference time. Here, we introduce Conditional Attribute Transformers, a novel method for jointly estimating the next-token probability and the value of an attribute conditional on each potential next token selection. This framework enables three critical capabilities within a single forward pass, without modification of the input sequence: (1) per-token credit assignment across an entire sequence, by identifying how each token in a sequence is associated with an attribute's value; (2) counterfactual analysis, by quantifying attribute differences conditional on alternative next token choices; (3) steerable generation, by decoding sequences based on a combination of next-token and attribute likelihoods. Our approach achieves state of the art performance on sparse reward tasks, improves next-token prediction at sufficient model sizes, estimates attribute probabilities orders of magnitude faster than sampling, and can guide decoding of autoregressive sequence models on a range of language tasks.

CVMay 13
CA-GCL: Cross-Anatomy Global-Local Contrastive Learning for Robust 3D Medical Image Understanding

Hanwen Zhang, Yao Liu, Die Dai et al.

Fine-grained Vision-Language Pre-training (FVLP) demonstrates significant potential in 3D medical image understanding by aligning anatomy-level visual representations with corresponding textual descriptions. However, existing FVLP paradigms often suffer from severe representation collapse in the textual embedding space, where text embeddings of distinct anatomical structures become highly clustered and indistinguishable. This distributional degeneracy renders the model hypersensitive to prompt variations, hindering reliable clinical deployment. To address these challenges, we propose a novel Cross-Anatomy Global-Local Contrastive Learning framework (CA-GCL). CA-GCL introduces a global contrastive objective that enforces separation between anatomical categories in the latent space, effectively counteracting the aggregation tendency induced by local alignment. Furthermore, we incorporate a clinical-aware text augmentation strategy based on permutation invariance and partial completeness to enhance robustness against descriptive incompleteness. Extensive evaluations on the CT-RATE and Rad-ChestCT datasets demonstrate that CA-GCL consistently outperforms existing VLP paradigms in zero-shot abnormality detection, achieving superior performance while exhibiting strong cross-dataset generalization. Crucially, CA-GCL reduces performance variance across diverse prompt templates, transforming the collapsed textual similarity distribution into a bell-shaped distribution. These results validate CA-GCL as an effective framework for robust 3D medical image understanding.

AIFeb 9
Negative-Aware Diffusion Process for Temporal Knowledge Graph Extrapolation

Yanglei Gan, Peng He, Yuxiang Cai et al.

Temporal Knowledge Graph (TKG) reasoning seeks to predict future missing facts from historical evidence. While diffusion models (DM) have recently gained attention for their ability to capture complex predictive distributions, two gaps remain: (i) the generative path is conditioned only on positive evidence, overlooking informative negative context, and (ii) training objectives are dominated by cross-entropy ranking, which improves candidate ordering but provides little supervision over the calibration of the denoised embedding. To bridge this gap, we introduce Negative-Aware Diffusion model for TKG Extrapolation (NADEx). Specifically, NADEx encodes subject-centric histories of entities, relations and temporal intervals into sequential embeddings. NADEx perturbs the query object in the forward process and reconstructs it in reverse with a Transformer denoiser conditioned on the temporal-relational context. We further derive a cosine-alignment regularizer derived from batch-wise negative prototypes, which tightens the decision boundary against implausible candidates. Comprehensive experiments on four public TKG benchmarks demonstrate that NADEx delivers state-of-the-art performance.

CLMar 26
LLM-Driven Reasoning for Constraint-Aware Feature Selection in Industrial Systems

Yuhang Zhou, Zhuokai Zhao, Ke Li et al.

Feature selection is a crucial step in large-scale industrial machine learning systems, directly affecting model accuracy, efficiency, and maintainability. Traditional feature selection methods rely on labeled data and statistical heuristics, making them difficult to apply in production environments where labeled data are limited and multiple operational constraints must be satisfied. To address this, we propose Model Feature Agent (MoFA), a model-driven framework that performs sequential, reasoning-based feature selection using both semantic and quantitative feature information. MoFA incorporates feature definitions, importance scores, correlations, and metadata (e.g., feature groups or types) into structured prompts and selects features through interpretable, constraint-aware reasoning. We evaluate MoFA in three real-world industrial applications: (1) True Interest and Time-Worthiness Prediction, where it improves accuracy while reducing feature group complexity, (2) Value Model Enhancement, where it discovers high-order interaction terms that yield substantial engagement gains in online experiments, and (3) Notification Behavior Prediction, where it selects compact, high-value feature subsets that improve both model accuracy and inference efficiency. Together, these results demonstrate the practicality and effectiveness of LLM-based reasoning for feature selection in real production systems.

IRNov 9, 2025
Exploiting Inter-Session Information with Frequency-enhanced Dual-Path Networks for Sequential Recommendation

Peng He, Yao Liu, Yanglei Gan et al.

Sequential recommendation (SR) aims to predict a user's next item preference by modeling historical interaction sequences. Recent advances often integrate frequency-domain modules to compensate for self-attention's low-pass nature by restoring the high-frequency signals critical for personalized recommendations. Nevertheless, existing frequency-aware solutions process each session in isolation and optimize exclusively with time-domain objectives. Consequently, they overlook cross-session spectral dependencies and fail to enforce alignment between predicted and actual spectral signatures, leaving valuable frequency information under-exploited. To this end, we propose FreqRec, a Frequency-Enhanced Dual-Path Network for sequential Recommendation that jointly captures inter-session and intra-session behaviors via a learnable Frequency-domain Multi-layer Perceptrons. Moreover, FreqRec is optimized under a composite objective that combines cross entropy with a frequency-domain consistency loss, explicitly aligning predicted and true spectral signatures. Extensive experiments on three benchmarks show that FreqRec surpasses strong baselines and remains robust under data sparsity and noisy-log conditions.

CVNov 17, 2025Code
FusionFM: All-in-One Multi-Modal Image Fusion with Flow Matching

Huayi Zhu, Xiu Shu, Youqiang Xiong et al.

Current multi-modal image fusion methods typically rely on task-specific models, leading to high training costs and limited scalability. While generative methods provide a unified modeling perspective, they often suffer from slow inference due to the complex sampling trajectories from noise to image. To address this, we formulate image fusion as a direct probabilistic transport from source modalities to the fused image distribution, leveraging the flow matching paradigm to improve sampling efficiency and structural consistency. To mitigate the lack of high-quality fused images for supervision, we collect fusion results from multiple state-of-the-art models as priors, and employ a task-aware selection function to select the most reliable pseudo-labels for each task. We further introduce a Fusion Refiner module that employs a divide-and-conquer strategy to systematically identify, decompose, and enhance degraded components in selected pseudo-labels. For multi-task scenarios, we integrate elastic weight consolidation and experience replay mechanisms to preserve cross-task performance and enhance continual learning ability from both parameter stability and memory retention perspectives. Our approach achieves competitive performance across diverse fusion tasks, while significantly improving sampling efficiency and maintaining a lightweight model design. The code will be available at: https://github.com/Ist-Zhy/FusionFM.

MMAug 3, 2025Code
DRKF: Decoupled Representations with Knowledge Fusion for Multimodal Emotion Recognition

Peiyuan Jiang, Yao Liu, Qiao Liu et al.

Multimodal emotion recognition (MER) aims to identify emotional states by integrating and analyzing information from multiple modalities. However, inherent modality heterogeneity and inconsistencies in emotional cues remain key challenges that hinder performance. To address these issues, we propose a Decoupled Representations with Knowledge Fusion (DRKF) method for MER. DRKF consists of two main modules: an Optimized Representation Learning (ORL) Module and a Knowledge Fusion (KF) Module. ORL employs a contrastive mutual information estimation method with progressive modality augmentation to decouple task-relevant shared representations and modality-specific features while mitigating modality heterogeneity. KF includes a lightweight self-attention-based Fusion Encoder (FE) that identifies the dominant modality and integrates emotional information from other modalities to enhance the fused representation. To handle potential errors from incorrect dominant modality selection under emotionally inconsistent conditions, we introduce an Emotion Discrimination Submodule (ED), which enforces the fused representation to retain discriminative cues of emotional inconsistency. This ensures that even if the FE selects an inappropriate dominant modality, the Emotion Classification Submodule (EC) can still make accurate predictions by leveraging preserved inconsistency information. Experiments show that DRKF achieves state-of-the-art (SOTA) performance on IEMOCAP, MELD, and M3ED. The source code is publicly available at https://github.com/PANPANKK/DRKF.

CVAug 3, 2020Code
LSOTB-TIR:A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark

Qiao Liu, Xin Li, Zhenyu He et al.

In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTBTIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we re-train several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.

CVNov 26, 2019Code
Multi-Task Driven Feature Models for Thermal Infrared Tracking

Qiao Liu, Xin Li, Zhenyu He et al.

Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing the TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing the TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects in the levels of inter-class and intra-class respectively. These two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at {https://github.com/QiaoLiuHit/MMNet}.

CVJan 18, 2018Code
PTB-TIR: A Thermal Infrared Pedestrian Tracking Benchmark

Qiao Liu, Zhenyu He, Xin Li et al.

Thermal infrared (TIR) pedestrian tracking is one of the important components among numerous applications of computer vision, which has a major advantage: it can track pedestrians in total darkness. The ability to evaluate the TIR pedestrian tracker fairly, on a benchmark dataset, is significant for the development of this field. However, there is not a benchmark dataset. In this paper, we develop a TIR pedestrian tracking dataset for the TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for the attribute based evaluation. In addition to the dataset, we carry out the large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers.In addition, in order to gain more insight into the TIR pedestrian tracker, we divide its functions into three components: feature extractor, motion model, and observation model. Then, we conduct three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research. The dataset and evaluation toolkit can be downloaded at {https://github.com/QiaoLiuHit/PTB-TIR_Evaluation_toolkit}.

MMFeb 5
XEmoGPT: An Explainable Multimodal Emotion Recognition Framework with Cue-Level Perception and Reasoning

Hanwen Zhang, Yao Liu, Peiyuan Jiang et al.

Explainable Multimodal Emotion Recognition plays a crucial role in applications such as human-computer interaction and social media analytics. However, current approaches struggle with cue-level perception and reasoning due to two main challenges: 1) general-purpose modality encoders are pretrained to capture global structures and general semantics rather than fine-grained emotional cues, resulting in limited sensitivity to emotional signals; and 2) available datasets usually involve a trade-off between annotation quality and scale, which leads to insufficient supervision for emotional cues and ultimately limits cue-level reasoning. Moreover, existing evaluation metrics are inadequate for assessing cue-level reasoning performance. To address these challenges, we propose eXplainable Emotion GPT (XEmoGPT), a novel EMER framework capable of both perceiving and reasoning over emotional cues. It incorporates two specialized modules: the Video Emotional Cue Bridge (VECB) and the Audio Emotional Cue Bridge (AECB), which enhance the video and audio encoders through carefully designed tasks for fine-grained emotional cue perception. To further support cue-level reasoning, we construct a large-scale dataset, EmoCue, designed to teach XEmoGPT how to reason over multimodal emotional cues. In addition, we introduce EmoCue-360, an automated metric that extracts and matches emotional cues using semantic similarity, and release EmoCue-Eval, a benchmark of 400 expert-annotated samples covering diverse emotional scenarios. Experimental results show that XEmoGPT achieves strong performance in both emotional cue perception and reasoning.

MLJan 1, 2025
An AI-powered Bayesian generative modeling approach for causal inference in observational studies

Qiao Liu, Wing Hung Wong

Causal inference in observational studies with high-dimensional covariates presents significant challenges. We introduce CausalBGM, an AI-powered Bayesian generative modeling approach that captures the causal relationship among covariates, treatment, and outcome variables. The core innovation of CausalBGM lies in its ability to estimate the individual treatment effect (ITE) by learning individual-specific distributions of a low-dimensional latent feature set (e.g., latent confounders) that drives changes in both treatment and outcome. This approach not only effectively mitigates confounding effects but also provides comprehensive uncertainty quantification, offering reliable and interpretable causal effect estimates at the individual level. CausalBGM adopts a Bayesian model and uses a novel iterative algorithm to update the model parameters and the posterior distribution of latent features until convergence. This framework leverages the power of AI to capture complex dependencies among variables while adhering to the Bayesian principles. Extensive experiments demonstrate that CausalBGM consistently outperforms state-of-the-art methods, particularly in scenarios with high-dimensional covariates and large-scale datasets. Its Bayesian foundation ensures statistical rigor, providing robust and well-calibrated posterior intervals. By addressing key limitations of existing methods, CausalBGM emerges as a robust and promising framework for advancing causal inference in modern applications in fields such as genomics, healthcare, and social sciences. CausalBGM is maintained at the website https://causalbgm.readthedocs.io/.

CLOct 23, 2025
Mixture-of-Minds: Multi-Agent Reinforcement Learning for Table Understanding

Yuhang Zhou, Mingrui Zhang, Ke Li et al.

Understanding and reasoning over tables is a critical capability for many real-world applications. Large language models (LLMs) have shown promise on this task, but current approaches remain limited. Fine-tuning based methods strengthen language reasoning; yet they are prone to arithmetic errors and hallucination. In contrast, tool-based methods enable precise table manipulation but rely on rigid schemas and lack semantic understanding. These complementary drawbacks highlight the need for approaches that integrate robust reasoning with reliable table processing. In this work, we propose Mixture-of-Minds, a multi-agent framework that decomposes table reasoning into three specialized roles: planning, coding, and answering. This design enables each agent to focus on a specific aspect of the task while leveraging code execution for precise table manipulation. Building on this workflow, we introduce a self-improvement training framework that employs Monte Carlo Tree Search (MCTS) rollouts to generate pseudo-gold trajectories and optimize agents with reinforcement learning (RL). Extensive experiments show that Mixture-of-Minds delivers substantial gains, reaching 62.13% on TableBench and surpassing OpenAI-o4-mini-high. These results demonstrate the promise of combining structured multi-agent workflows with RL to advance table understanding.

CLApr 9, 2025
More diverse more adaptive: Comprehensive Multi-task Learning for Improved LLM Domain Adaptation in E-commerce

Tong Piao, Pei Tang, Zhipeng Zhang et al.

In recent years, Large Language Models (LLMs) have been widely applied across various domains due to their powerful domain adaptation capabilities. Previous studies have suggested that diverse, multi-modal data can enhance LLMs' domain adaptation performance. However, this hypothesis remains insufficiently validated in the e-commerce sector. To address this gap, we propose a comprehensive e-commerce multi-task framework and design empirical experiments to examine the impact of diverse data and tasks on LLMs from two perspectives: "capability comprehensiveness" and "task comprehensiveness." Specifically, we observe significant improvements in LLM performance by progressively introducing tasks related to new major capability areas and by continuously adding subtasks within different major capability domains. Furthermore, we observe that increasing model capacity amplifies the benefits of diversity, suggesting a synergistic relationship between model capacity and data diversity. Finally, we validate the best-performing model from our empirical experiments in the KDD Cup 2024, achieving a rank 5 in Task 1. This outcome demonstrates the significance of our research for advancing LLMs in the e-commerce domain.

LGNov 7, 2024
Sampling-guided Heterogeneous Graph Neural Network with Temporal Smoothing for Scalable Longitudinal Data Imputation

Zhaoyang Zhang, Ziqi Chen, Qiao Liu et al.

In this paper, we propose a novel framework, the Sampling-guided Heterogeneous Graph Neural Network (SHT-GNN), to effectively tackle the challenge of missing data imputation in longitudinal studies. Unlike traditional methods, which often require extensive preprocessing to handle irregular or inconsistent missing data, our approach accommodates arbitrary missing data patterns while maintaining computational efficiency. SHT-GNN models both observations and covariates as distinct node types, connecting observation nodes at successive time points through subject-specific longitudinal subnetworks, while covariate-observation interactions are represented by attributed edges within bipartite graphs. By leveraging subject-wise mini-batch sampling and a multi-layer temporal smoothing mechanism, SHT-GNN efficiently scales to large datasets, while effectively learning node representations and imputing missing data. Extensive experiments on both synthetic and real-world datasets, including the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, demonstrate that SHT-GNN significantly outperforms existing imputation methods, even with high missing data rates. The empirical results highlight SHT-GNN's robust imputation capabilities and superior performance, particularly in the context of complex, large-scale longitudinal data.

CVOct 12, 2024
DiffuTraj: A Stochastic Vessel Trajectory Prediction Approach via Guided Diffusion Process

Changlin Li, Yanglei Gan, Tian Lan et al.

Maritime vessel maneuvers, characterized by their inherent complexity and indeterminacy, requires vessel trajectory prediction system capable of modeling the multi-modality nature of future motion states. Conventional stochastic trajectory prediction methods utilize latent variables to represent the multi-modality of vessel motion, however, tends to overlook the complexity and dynamics inherent in maritime behavior. In contrast, we explicitly simulate the transition of vessel motion from uncertainty towards a state of certainty, effectively handling future indeterminacy in dynamic scenes. In this paper, we present a novel framework (\textit{DiffuTraj}) to conceptualize the trajectory prediction task as a guided reverse process of motion pattern uncertainty diffusion, in which we progressively remove uncertainty from maritime regions to delineate the intended trajectory. Specifically, we encode the previous states of the target vessel, vessel-vessel interactions, and the environment context as guiding factors for trajectory generation. Subsequently, we devise a transformer-based conditional denoiser to capture spatio-temporal dependencies, enabling the generation of trajectories better aligned for particular maritime environment. Comprehensive experiments on vessel trajectory prediction benchmarks demonstrate the superiority of our method.

IVFeb 17, 2022
Graph Convolutional Networks for Multi-modality Medical Imaging: Methods, Architectures, and Clinical Applications

Kexin Ding, Mu Zhou, Zichen Wang et al.

Image-based characterization and disease understanding involve integrative analysis of morphological, spatial, and topological information across biological scales. The development of graph convolutional networks (GCNs) has created the opportunity to address this information complexity via graph-driven architectures, since GCNs can perform feature aggregation, interaction, and reasoning with remarkable flexibility and efficiency. These GCNs capabilities have spawned a new wave of research in medical imaging analysis with the overarching goal of improving quantitative disease understanding, monitoring, and diagnosis. Yet daunting challenges remain for designing the important image-to-graph transformation for multi-modality medical imaging and gaining insights into model interpretation and enhanced clinical decision support. In this review, we present recent GCNs developments in the context of medical image analysis including imaging data from radiology and histopathology. We discuss the fast-growing use of graph network architectures in medical image analysis to improve disease diagnosis and patient outcomes in clinical practice. To foster cross-disciplinary research, we present GCNs technical advancements, emerging medical applications, identify common challenges in the use of image-based GCNs and their extensions in model interpretation, large-scale benchmarks that promise to transform the scope of medical image studies and related graph-driven medical research.

CVDec 13, 2021
An Informative Tracking Benchmark

Xin Li, Qiao Liu, Wenjie Pei et al.

Along with the rapid progress of visual tracking, existing benchmarks become less informative due to redundancy of samples and weak discrimination between current trackers, making evaluations on all datasets extremely time-consuming. Thus, a small and informative benchmark, which covers all typical challenging scenarios to facilitate assessing the tracker performance, is of great interest. In this work, we develop a principled way to construct a small and informative tracking benchmark (ITB) with 7% out of 1.2 M frames of existing and newly collected datasets, which enables efficient evaluation while ensuring effectiveness. Specifically, we first design a quality assessment mechanism to select the most informative sequences from existing benchmarks taking into account 1) challenging level, 2) discriminative strength, 3) and density of appearance variations. Furthermore, we collect additional sequences to ensure the diversity and balance of tracking scenarios, leading to a total of 20 sequences for each scenario. By analyzing the results of 15 state-of-the-art trackers re-trained on the same data, we determine the effective methods for robust tracking under each scenario and demonstrate new challenges for future research direction in this field.

CVOct 17, 2021
Active Learning for Deep Visual Tracking

Di Yuan, Xiaojun Chang, Yi Yang et al.

Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years. Generally, training a deep CNN model requires numerous labeled training samples, and the number and quality of these samples directly affect the representational capability of the trained model. However, this approach is restrictive in practice, because manually labeling such a large number of training samples is time-consuming and prohibitively expensive. In this paper, we propose an active learning method for deep visual tracking, which selects and annotates the unlabeled samples to train the deep CNNs model. Under the guidance of active learning, the tracker based on the trained deep CNNs model can achieve competitive tracking performance while reducing the labeling cost. More specifically, to ensure the diversity of selected samples, we propose an active learning method based on multi-frame collaboration to select those training samples that should be and need to be annotated. Meanwhile, considering the representativeness of these selected samples, we adopt a nearest neighbor discrimination method based on the average nearest neighbor distance to screen isolated samples and low-quality samples. Therefore, the training samples subset selected based on our method requires only a given budget to maintain the diversity and representativeness of the entire sample set. Furthermore, we adopt a Tversky loss to improve the bounding box estimation of our tracker, which can ensure that the tracker achieves more accurate target states. Extensive experimental results confirm that our active learning-based tracker (ALT) achieves competitive tracking accuracy and speed compared with state-of-the-art trackers on the seven most challenging evaluation benchmarks.

SDOct 12, 2021
Foster Strengths and Circumvent Weaknesses: a Speech Enhancement Framework with Two-branch Collaborative Learning

Wenxin Tai, Jiajia Li, Yixiang Wang et al.

Recent single-channel speech enhancement methods usually convert waveform to the time-frequency domain and use magnitude/complex spectrum as the optimizing target. However, both magnitude-spectrum-based methods and complex-spectrum-based methods have their respective pros and cons. In this paper, we propose a unified two-branch framework to foster strengths and circumvent weaknesses of different paradigms. The proposed framework could take full advantage of the apparent spectral regularity in magnitude spectrogram and break the bottleneck that magnitude-based methods have suffered. Within each branch, we use collaborative expert block and its variants as substitutes for regular convolution layers. Experiments on TIMIT benchmark demonstrate that our method is superior to existing state-of-the-art ones.

LGOct 3, 2021
Boost Neural Networks by Checkpoints

Feng Wang, Guoyizhe Wei, Qiao Liu et al.

Training multiple deep neural networks (DNNs) and averaging their outputs is a simple way to improve the predictive performance. Nevertheless, the multiplied training cost prevents this ensemble method to be practical and efficient. Several recent works attempt to save and ensemble the checkpoints of DNNs, which only requires the same computational cost as training a single network. However, these methods suffer from either marginal accuracy improvements due to the low diversity of checkpoints or high risk of divergence due to the cyclical learning rates they adopted. In this paper, we propose a novel method to ensemble the checkpoints, where a boosting scheme is utilized to accelerate model convergence and maximize the checkpoint diversity. We theoretically prove that it converges by reducing exponential loss. The empirical evaluation also indicates our proposed ensemble outperforms single model and existing ensembles in terms of accuracy and efficiency. With the same training budget, our method achieves 4.16% lower error on Cifar-100 and 6.96% on Tiny-ImageNet with ResNet-110 architecture. Moreover, the adaptive sample weights in our method make it an effective solution to address the imbalanced class distribution. In the experiments, it yields up to 5.02% higher accuracy over single EfficientNet-B0 on the imbalanced datasets.

AIApr 27, 2021
Relational Learning with Gated and Attentive Neighbor Aggregator for Few-Shot Knowledge Graph Completion

Guanglin Niu, Yang Li, Chengguang Tang et al.

Aiming at expanding few-shot relations' coverage in knowledge graphs (KGs), few-shot knowledge graph completion (FKGC) has recently gained more research interests. Some existing models employ a few-shot relation's multi-hop neighbor information to enhance its semantic representation. However, noise neighbor information might be amplified when the neighborhood is excessively sparse and no neighbor is available to represent the few-shot relation. Moreover, modeling and inferring complex relations of one-to-many (1-N), many-to-one (N-1), and many-to-many (N-N) by previous knowledge graph completion approaches requires high model complexity and a large amount of training instances. Thus, inferring complex relations in the few-shot scenario is difficult for FKGC models due to limited training instances. In this paper, we propose a few-shot relational learning with global-local framework to address the above issues. At the global stage, a novel gated and attentive neighbor aggregator is built for accurately integrating the semantics of a few-shot relation's neighborhood, which helps filtering the noise neighbors even if a KG contains extremely sparse neighborhoods. For the local stage, a meta-learning based TransH (MTransH) method is designed to model complex relations and train our model in a few-shot learning fashion. Extensive experiments show that our model outperforms the state-of-the-art FKGC approaches on the frequently-used benchmark datasets NELL-One and Wiki-One. Compared with the strong baseline model MetaR, our model achieves 5-shot FKGC performance improvements of 8.0% on NELL-One and 2.8% on Wiki-One by the metric Hits@10.

LGNov 14, 2020
Reinforced Molecular Optimization with Neighborhood-Controlled Grammars

Chencheng Xu, Qiao Liu, Minlie Huang et al.

A major challenge in the pharmaceutical industry is to design novel molecules with specific desired properties, especially when the property evaluation is costly. Here, we propose MNCE-RL, a graph convolutional policy network for molecular optimization with molecular neighborhood-controlled embedding grammars through reinforcement learning. We extend the original neighborhood-controlled embedding grammars to make them applicable to molecular graph generation and design an efficient algorithm to infer grammatical production rules from given molecules. The use of grammars guarantees the validity of the generated molecular structures. By transforming molecular graphs to parse trees with the inferred grammars, the molecular structure generation task is modeled as a Markov decision process where a policy gradient strategy is utilized. In a series of experiments, we demonstrate that our approach achieves state-of-the-art performance in a diverse range of molecular optimization tasks and exhibits significant superiority in optimizing molecular properties with a limited number of property evaluations.

CVJul 3, 2020
Accurate Bounding-box Regression with Distance-IoU Loss for Visual Tracking

Di Yuan, Xiu Shu, Nana Fan et al.

Most existing trackers are based on using a classifier and multi-scale estimation to estimate the target state. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. While trackers adopt a maximum overlap method based on an intersection-over-union (IoU) loss to mitigate this problem, there are defects in the IoU loss itself, that make it impossible to continue to optimize the objective function when a given bounding box is completely contained within/without another bounding box; this makes it very challenging to accurately estimate the target state. Accordingly, in this paper, we address the above-mentioned problem by proposing a novel tracking method based on a distance-IoU (DIoU) loss, such that the proposed tracker consists of target estimation and target classification. The target estimation part is trained to predict the DIoU score between the target ground-truth bounding-box and the estimated bounding-box. The DIoU loss can maintain the advantage provided by the IoU loss while minimizing the distance between the center points of two bounding boxes, thereby making the target estimation more accurate. Moreover, we introduce a classification part that is trained online and optimized with a Conjugate-Gradient-based strategy to guarantee real-time tracking speed. Comprehensive experimental results demonstrate that the proposed method achieves competitive tracking accuracy when compared to state-of-the-art trackers while with a real-time tracking speed.

LGApr 20, 2020
Roundtrip: A Deep Generative Neural Density Estimator

Qiao Liu, Jiaze Xu, Rui Jiang et al.

Density estimation is a fundamental problem in both statistics and machine learning. In this study, we proposed Roundtrip as a general-purpose neural density estimator based on deep generative models. Roundtrip retains the generative power of generative adversarial networks (GANs) but also provides estimates of density values. Unlike previous neural density estimators that put stringent conditions on the transformation from the latent space to the data space, Roundtrip enables the use of much more general mappings. In a series of experiments, Roundtrip achieves state-of-the-art performance in a diverse range of density estimation tasks.

CVJun 9, 2019
Learning Deep Multi-Level Similarity for Thermal Infrared Object Tracking

Qiao Liu, Xin Li, Zhenyu He et al.

Existing deep Thermal InfraRed (TIR) trackers only use semantic features to describe the TIR object, which lack the sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images.To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities on two convolutional layers using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple while effective relative entropy based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptive learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we construct the first large scale TIR video sequence dataset for training the proposed model. The proposed TIR dataset not only benefits the training for TIR tracking but also can be applied to numerous TIR vision tasks. Extensive experimental results on the VOT-TIR2015 and VOT-TIR2017 benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.

CVNov 27, 2017
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking

Xin Li, Qiao Liu, Nana Fan et al.

Most thermal infrared (TIR) tracking methods are discriminative, treating the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is coupled well to the objective of the tracking task. We propose a TIR tracker via a Hierarchical Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces the multiple hierarchical convolutional layers. Then, we propose a spatial-aware network to enhance the discriminative ability of the coalesced hierarchical feature. Subsequently, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the candidate that is most similar to the tracked target. Extensive experimental results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed method achieves favourable performance compared to the state-of-the-art methods.