Longbing Cao

LG
h-index19
89papers
5,169citations
Novelty47%
AI Score57

89 Papers

LGApr 26, 2023
Bayesian Federated Learning: A Survey

Longbing Cao, Hui Chen, Xuhui Fan et al.

Federated learning (FL) demonstrates its advantages in integrating distributed infrastructure, communication, computing and learning in a privacy-preserving manner. However, the robustness and capabilities of existing FL methods are challenged by limited and dynamic data and conditions, complexities including heterogeneities and uncertainties, and analytical explainability. Bayesian federated learning (BFL) has emerged as a promising approach to address these issues. This survey presents a critical overview of BFL, including its basic concepts, its relations to Bayesian learning in the context of FL, and a taxonomy of BFL from both Bayesian and federated perspectives. We categorize and discuss client- and server-side and FL-based BFL methods and their pros and cons. The limitations of the existing BFL methods and the future directions of BFL research further address the intricate requirements of real-life FL applications.

LGFeb 4, 2023
A Survey on Deep Learning based Time Series Analysis with Frequency Transformation

Kun Yi, Qi Zhang, Wei Fan et al.

Recently, frequency transformation (FT) has been increasingly incorporated into deep learning models to significantly enhance state-of-the-art accuracy and efficiency in time series analysis. The advantages of FT, such as high efficiency and a global view, have been rapidly explored and exploited in various time series tasks and applications, demonstrating the promising potential of FT as a new deep learning paradigm for time series analysis. Despite the growing attention and the proliferation of research in this emerging field, there is currently a lack of a systematic review and in-depth analysis of deep learning-based time series models with FT. It is also unclear why FT can enhance time series analysis and what its limitations are in the field. To address these gaps, we present a comprehensive review that systematically investigates and summarizes the recent research advancements in deep learning-based time series analysis with FT. Specifically, we explore the primary approaches used in current models that incorporate FT, the types of neural networks that leverage FT, and the representative FT-equipped models in deep time series analysis. We propose a novel taxonomy to categorize the existing methods in this field, providing a structured overview of the diverse approaches employed in incorporating FT into deep learning models for time series analysis. Finally, we highlight the advantages and limitations of FT for time series modeling and identify potential future research directions that can further contribute to the community of time series analysis.

NEJan 1, 2023
eVAE: Evolutionary Variational Autoencoder

Zhangkai Wu, Longbing Cao, Lei Qi

The surrogate loss of variational autoencoders (VAEs) poses various challenges to their training, inducing the imbalance between task fitting and representation inference. To avert this, the existing strategies for VAEs focus on adjusting the tradeoff by introducing hyperparameters, deriving a tighter bound under some mild assumptions, or decomposing the loss components per certain neural settings. VAEs still suffer from uncertain tradeoff learning.We propose a novel evolutionary variational autoencoder (eVAE) building on the variational information bottleneck (VIB) theory and integrative evolutionary neural learning. eVAE integrates a variational genetic algorithm into VAE with variational evolutionary operators including variational mutation, crossover, and evolution. Its inner-outer-joint training mechanism synergistically and dynamically generates and updates the uncertain tradeoff learning in the evidence lower bound (ELBO) without additional constraints. Apart from learning a lossy compression and representation of data under the VIB assumption, eVAE presents an evolutionary paradigm to tune critical factors of VAEs and deep neural networks and addresses the premature convergence and random search problem by integrating evolutionary optimization into deep learning. Experiments show that eVAE addresses the KL-vanishing problem for text generation with low reconstruction loss, generates all disentangled factors with sharp images, and improves the image generation quality,respectively. eVAE achieves better reconstruction loss, disentanglement, and generation-inference balance than its competitors.

LGJan 27, 2023
Learning Informative Representation for Fairness-aware Multivariate Time-series Forecasting: A Group-based Perspective

Hui He, Qi Zhang, Shoujin Wang et al.

Performance unfairness among variables widely exists in multivariate time series (MTS) forecasting models since such models may attend/bias to certain (advantaged) variables. Addressing this unfairness problem is important for equally attending to all variables and avoiding vulnerable model biases/risks. However, fair MTS forecasting is challenging and has been less studied in the literature. To bridge such significant gap, we formulate the fairness modeling problem as learning informative representations attending to both advantaged and disadvantaged variables. Accordingly, we propose a novel framework, named FairFor, for fairness-aware MTS forecasting. FairFor is based on adversarial learning to generate both group-independent and group-relevant representations for the downstream forecasting. The framework first leverages a spectral relaxation of the K-means objective to infer variable correlations and thus to group variables. Then, it utilizes a filtering&fusion component to filter the group-relevant information and generate group-independent representations via orthogonality regularization. The group-independent and group-relevant representations form highly informative representations, facilitating to sharing knowledge from advantaged variables to disadvantaged variables to guarantee fairness. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed FairFor for fair forecasting and significant performance improvement.

LGJun 19, 2022
Supervision Adaptation Balancing In-distribution Generalization and Out-of-distribution Detection

Zhilin Zhao, Longbing Cao, Kun-Yu Lin

The discrepancy between in-distribution (ID) and out-of-distribution (OOD) samples can lead to \textit{distributional vulnerability} in deep neural networks, which can subsequently lead to high-confidence predictions for OOD samples. This is mainly due to the absence of OOD samples during training, which fails to constrain the network properly. To tackle this issue, several state-of-the-art methods include adding extra OOD samples to training and assign them with manually-defined labels. However, this practice can introduce unreliable labeling, negatively affecting ID classification. The distributional vulnerability presents a critical challenge for non-IID deep learning, which aims for OOD-tolerant ID classification by balancing ID generalization and OOD detection. In this paper, we introduce a novel \textit{supervision adaptation} approach to generate adaptive supervision information for OOD samples, making them more compatible with ID samples. Firstly, we measure the dependency between ID samples and their labels using mutual information, revealing that the supervision information can be represented in terms of negative probabilities across all classes. Secondly, we investigate data correlations between ID and OOD samples by solving a series of binary regression problems, with the goal of refining the supervision information for more distinctly separable ID classes. Our extensive experiments on four advanced network architectures, two ID datasets, and eleven diversified OOD datasets demonstrate the efficacy of our supervision adaptation approach in improving both ID classification and OOD detection capabilities.

LGSep 1, 2022
Distributional Drift Adaptation with Temporal Conditional Variational Autoencoder for Multivariate Time Series Forecasting

Hui He, Qi Zhang, Kun Yi et al.

Due to the non-stationary nature, the distribution of real-world multivariate time series (MTS) changes over time, which is known as distribution drift. Most existing MTS forecasting models greatly suffer from distribution drift and degrade the forecasting performance over time. Existing methods address distribution drift via adapting to the latest arrived data or self-correcting per the meta knowledge derived from future data. Despite their great success in MTS forecasting, these methods hardly capture the intrinsic distribution changes, especially from a distributional perspective. Accordingly, we propose a novel framework temporal conditional variational autoencoder (TCVAE) to model the dynamic distributional dependencies over time between historical observations and future data in MTSs and infer the dependencies as a temporal conditional distribution to leverage latent variables. Specifically, a novel temporal Hawkes attention mechanism represents temporal factors subsequently fed into feed-forward networks to estimate the prior Gaussian distribution of latent variables. The representation of temporal factors further dynamically adjusts the structures of Transformer-based encoder and decoder to distribution changes by leveraging a gated attention mechanism. Moreover, we introduce conditional continuous normalization flow to transform the prior Gaussian to a complex and form-free distribution to facilitate flexible inference of the temporal conditional distribution. Extensive experiments conducted on six real-world MTS datasets demonstrate the TCVAE's superior robustness and effectiveness over the state-of-the-art MTS forecasting baselines. We further illustrate the TCVAE applicability through multifaceted case studies and visualization in real-world scenarios.

LGJun 19, 2022
Dual Representation Learning for Out-of-Distribution Detection

Zhilin Zhao, Longbing Cao

To classify in-distribution samples, deep neural networks explore strongly label-related information and discard weakly label-related information according to the information bottleneck. Out-of-distribution samples drawn from distributions differing from that of in-distribution samples could be assigned with unexpected high-confidence predictions because they could obtain minimum strongly label-related information. To distinguish in- and out-of-distribution samples, Dual Representation Learning (DRL) makes out-of-distribution samples harder to have high-confidence predictions by exploring both strongly and weakly label-related information from in-distribution samples. For a pretrained network exploring strongly label-related information to learn label-discriminative representations, DRL trains its auxiliary network exploring the remaining weakly label-related information to learn distribution-discriminative representations. Specifically, for a label-discriminative representation, DRL constructs its complementary distribution-discriminative representation by integrating diverse representations less similar to the label-discriminative representation. Accordingly, DRL combines label- and distribution-discriminative representations to detect out-of-distribution samples. Experiments show that DRL outperforms the state-of-the-art methods for out-of-distribution detection.

LGJun 19, 2022
Out-of-distribution Detection by Cross-class Vicinity Distribution of In-distribution Data

Zhilin Zhao, Longbing Cao, Kun-Yu Lin

Deep neural networks for image classification only learn to map in-distribution inputs to their corresponding ground truth labels in training without differentiating out-of-distribution samples from in-distribution ones. This results from the assumption that all samples are independent and identically distributed (IID) without distributional distinction. Therefore, a pretrained network learned from in-distribution samples treats out-of-distribution samples as in-distribution and makes high-confidence predictions on them in the test phase. To address this issue, we draw out-of-distribution samples from the vicinity distribution of training in-distribution samples for learning to reject the prediction on out-of-distribution inputs. A \textit{Cross-class Vicinity Distribution} is introduced by assuming that an out-of-distribution sample generated by mixing multiple in-distribution samples does not share the same classes of its constituents. We thus improve the discriminability of a pretrained network by finetuning it with out-of-distribution samples drawn from the cross-class vicinity distribution, where each out-of-distribution input corresponds to a complementary label. Experiments on various in-/out-of-distribution datasets show that the proposed method significantly outperforms the existing methods in improving the capacity of discriminating between in- and out-of-distribution samples.

LGNov 10, 2023
Frequency-domain MLPs are More Effective Learners in Time Series Forecasting

Kun Yi, Qi Zhang, Wei Fan et al.

Time series forecasting has played the key role in different industrial, including finance, traffic, energy, and healthcare domains. While existing literatures have designed many sophisticated architectures based on RNNs, GNNs, or Transformers, another kind of approaches based on multi-layer perceptrons (MLPs) are proposed with simple structure, low complexity, and {superior performance}. However, most MLP-based forecasting methods suffer from the point-wise mappings and information bottleneck, which largely hinders the forecasting performance. To overcome this problem, we explore a novel direction of applying MLPs in the frequency domain for time series forecasting. We investigate the learned patterns of frequency-domain MLPs and discover their two inherent characteristic benefiting forecasting, (i) global view: frequency spectrum makes MLPs own a complete view for signals and learn global dependencies more easily, and (ii) energy compaction: frequency-domain MLPs concentrate on smaller key part of frequency components with compact signal energy. Then, we propose FreTS, a simple yet effective architecture built upon Frequency-domain MLPs for Time Series forecasting. FreTS mainly involves two stages, (i) Domain Conversion, that transforms time-domain signals into complex numbers of frequency domain; (ii) Frequency Learning, that performs our redesigned MLPs for the learning of real and imaginary part of frequency components. The above stages operated on both inter-series and intra-series scales further contribute to channel-wise and time-wise dependency learning. Extensive experiments on 13 real-world benchmarks (including 7 benchmarks for short-term forecasting and 6 benchmarks for long-term forecasting) demonstrate our consistent superiority over state-of-the-art methods.

LGOct 6, 2022
Edge-Varying Fourier Graph Networks for Multivariate Time Series Forecasting

Kun Yi, Qi Zhang, Liang Hu et al.

The key problem in multivariate time series (MTS) analysis and forecasting aims to disclose the underlying couplings between variables that drive the co-movements. Considerable recent successful MTS methods are built with graph neural networks (GNNs) due to their essential capacity for relational modeling. However, previous work often used a static graph structure of time-series variables for modeling MTS failing to capture their ever-changing correlations over time. To this end, a fully-connected supra-graph connecting any two variables at any two timestamps is adaptively learned to capture the high-resolution variable dependencies via an efficient graph convolutional network. Specifically, we construct the Edge-Varying Fourier Graph Networks (EV-FGN) equipped with Fourier Graph Shift Operator (FGSO) which efficiently performs graph convolution in the frequency domain. As a result, a high-efficiency scale-free parameter learning scheme is derived for MTS analysis and forecasting according to the convolution theorem. Extensive experiments show that EV-FGN outperforms state-of-the-art methods on seven real-world MTS datasets.

LGNov 10, 2023
FourierGNN: Rethinking Multivariate Time Series Forecasting from a Pure Graph Perspective

Kun Yi, Qi Zhang, Wei Fan et al.

Multivariate time series (MTS) forecasting has shown great importance in numerous industries. Current state-of-the-art graph neural network (GNN)-based forecasting methods usually require both graph networks (e.g., GCN) and temporal networks (e.g., LSTM) to capture inter-series (spatial) dynamics and intra-series (temporal) dependencies, respectively. However, the uncertain compatibility of the two networks puts an extra burden on handcrafted model designs. Moreover, the separate spatial and temporal modeling naturally violates the unified spatiotemporal inter-dependencies in real world, which largely hinders the forecasting performance. To overcome these problems, we explore an interesting direction of directly applying graph networks and rethink MTS forecasting from a pure graph perspective. We first define a novel data structure, hypervariate graph, which regards each series value (regardless of variates or timestamps) as a graph node, and represents sliding windows as space-time fully-connected graphs. This perspective considers spatiotemporal dynamics unitedly and reformulates classic MTS forecasting into the predictions on hypervariate graphs. Then, we propose a novel architecture Fourier Graph Neural Network (FourierGNN) by stacking our proposed Fourier Graph Operator (FGO) to perform matrix multiplications in Fourier space. FourierGNN accommodates adequate expressiveness and achieves much lower complexity, which can effectively and efficiently accomplish the forecasting. Besides, our theoretical analysis reveals FGO's equivalence to graph convolutions in the time domain, which further verifies the validity of FourierGNN. Extensive experiments on seven datasets have demonstrated our superior performance with higher efficiency and fewer parameters compared with state-of-the-art methods.

AIApr 4, 2022
Explicit and Implicit Pattern Relation Analysis for Discovering Actionable Negative Sequences

Wei Wang, Longbing Cao

Real-life events, behaviors and interactions produce sequential data. An important but rarely explored problem is to analyze those nonoccurring (also called negative) yet important sequences, forming negative sequence analysis (NSA). A typical NSA area is to discover negative sequential patterns (NSPs) consisting of important non-occurring and occurring elements and patterns. The limited existing work on NSP mining relies on frequentist and downward closure property-based pattern selection, producing large and highly redundant NSPs, nonactionable for business decision-making. This work makes the first attempt for actionable NSP discovery. It builds an NSP graph representation, quantify both explicit occurrence and implicit non-occurrence-based element and pattern relations, and then discover significant, diverse and informative NSPs in the NSP graph to represent the entire NSP set for discovering actionable NSPs. A DPP-based NSP representation and actionable NSP discovery method EINSP introduces novel and significant contributions for NSA and sequence analysis: (1) it represents NSPs by a determinantal point process (DPP) based graph; (2) it quantifies actionable NSPs in terms of their statistical significance, diversity, and strength of explicit/implicit element/pattern relations; and (3) it models and measures both explicit and implicit element/pattern relations in the DPP-based NSP graph to represent direct and indirect couplings between NSP items, elements and patterns. We substantially analyze the effectiveness of EINSP in terms of various theoretical and empirical aspects including complexity, item/pattern coverage, pattern size and diversity, implicit pattern relation strength, and data factors.

LGJun 19, 2022
Gray Learning from Non-IID Data with Out-of-distribution Samples

Zhilin Zhao, Longbing Cao, Chang-Dong Wang

The integrity of training data, even when annotated by experts, is far from guaranteed, especially for non-IID datasets comprising both in- and out-of-distribution samples. In an ideal scenario, the majority of samples would be in-distribution, while samples that deviate semantically would be identified as out-of-distribution and excluded during the annotation process. However, experts may erroneously classify these out-of-distribution samples as in-distribution, assigning them labels that are inherently unreliable. This mixture of unreliable labels and varied data types makes the task of learning robust neural networks notably challenging. We observe that both in- and out-of-distribution samples can almost invariably be ruled out from belonging to certain classes, aside from those corresponding to unreliable ground-truth labels. This opens the possibility of utilizing reliable complementary labels that indicate the classes to which a sample does not belong. Guided by this insight, we introduce a novel approach, termed \textit{Gray Learning} (GL), which leverages both ground-truth and complementary labels. Crucially, GL adaptively adjusts the loss weights for these two label types based on prediction confidence levels. By grounding our approach in statistical learning theory, we derive bounds for the generalization error, demonstrating that GL achieves tight constraints even in non-IID settings. Extensive experimental evaluations reveal that our method significantly outperforms alternative approaches grounded in robust statistics.

CLNov 5, 2022
Tri-Attention: Explicit Context-Aware Attention Mechanism for Natural Language Processing

Rui Yu, Yifeng Li, Wenpeng Lu et al.

In natural language processing (NLP), the context of a word or sentence plays an essential role. Contextual information such as the semantic representation of a passage or historical dialogue forms an essential part of a conversation and a precise understanding of the present phrase or sentence. However, the standard attention mechanisms typically generate weights using query and key but ignore context, forming a Bi-Attention framework, despite their great success in modeling sequence alignment. This Bi-Attention mechanism does not explicitly model the interactions between the contexts, queries and keys of target sequences, missing important contextual information and resulting in poor attention performance. Accordingly, a novel and general triple-attention (Tri-Attention) framework expands the standard Bi-Attention mechanism and explicitly interacts query, key, and context by incorporating context as the third dimension in calculating relevance scores. Four variants of Tri-Attention are generated by expanding the two-dimensional vector-based additive, dot-product, scaled dot-product, and bilinear operations in Bi-Attention to the tensor operations for Tri-Attention. Extensive experiments on three NLP tasks demonstrate that Tri-Attention outperforms about 30 state-of-the-art non-attention, standard Bi-Attention, contextual Bi-Attention approaches and pretrained neural language models1.

LGAug 6, 2024
Can LLMs Serve As Time Series Anomaly Detectors?

Manqing Dong, Hao Huang, Longbing Cao

An emerging topic in large language models (LLMs) is their application to time series forecasting, characterizing mainstream and patternable characteristics of time series. A relevant but rarely explored and more challenging question is whether LLMs can detect and explain time series anomalies, a critical task across various real-world applications. In this paper, we investigate the capabilities of LLMs, specifically GPT-4 and LLaMA3, in detecting and explaining anomalies in time series. Our studies reveal that: 1) LLMs cannot be directly used for time series anomaly detection. 2) By designing prompt strategies such as in-context learning and chain-of-thought prompting, GPT-4 can detect time series anomalies with results competitive to baseline methods. 3) We propose a synthesized dataset to automatically generate time series anomalies with corresponding explanations. By applying instruction fine-tuning on this dataset, LLaMA3 demonstrates improved performance in time series anomaly detection tasks. In summary, our exploration shows the promising potential of LLMs as time series anomaly detectors.

SDMay 20
DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech

Xu Zhang, Longbing Cao, Zhangkai Wu

Diffusion and flow-matching based text-to-speech (TTS) models excel in naturalness but often lack explicit emotion control, as emotional signals remain entangled with speaker identity. We discover that emotion embedding emerges as a linearly decodable direction of frozen hidden states, nearly orthogonal to the direction embedding speaker identity. This inspires a plug-and-play framework DUET for emotion control over pretrained diffusion and flow-matching based TTS models. During generation, DUET unifies dual-space control to achieve fine-grained emotion intervention in a single per-step update: hidden space steering shifts generation along the target emotion direction, while mel-space guidance refines spectral details through gradients backpropagated from a differentiable vocoder. We validate DUET on five architecturally diverse pretrained TTS backbones across three datasets, where it outperforms 10 supervised state-of-the-art emotional TTS baselines across paradigms and achieves the highest human-rated emotion appropriateness. To further showcase its qualitative behavior, we deploy DUET on an Ameca humanoid robot, where it produces richly expressive emotional speech on the humanoid, demonstrating the strong potential for plug-and-play affective interaction for embodied agents.

LGJul 27, 2024
Graph Memory Learning: Imitating Lifelong Remembering and Forgetting of Brain Networks

Jiaxing Miao, Liang Hu, Qi Zhang et al.

Graph data in real-world scenarios undergo rapid and frequent changes, making it challenging for existing graph models to effectively handle the continuous influx of new data and accommodate data withdrawal requests. The approach to frequently retraining graph models is resource intensive and impractical. To address this pressing challenge, this paper introduces a new concept of graph memory learning. Its core idea is to enable a graph model to selectively remember new knowledge but forget old knowledge. Building on this approach, the paper presents a novel graph memory learning framework - Brain-inspired Graph Memory Learning (BGML), inspired by brain network dynamics and function-structure coupling strategies. BGML incorporates a multi-granular hierarchical progressive learning mechanism rooted in feature graph grain learning to mitigate potential conflict between memorization and forgetting in graph memory learning. This mechanism allows for a comprehensive and multi-level perception of local details within evolving graphs. In addition, to tackle the issue of unreliable structures in newly added incremental information, the paper introduces an information self-assessment ownership mechanism. This mechanism not only facilitates the propagation of incremental information within the model but also effectively preserves the integrity of past experiences. We design five types of graph memory learning tasks: regular, memory, unlearning, data-incremental, and class-incremental to evaluate BGML. Its excellent performance is confirmed through extensive experiments on multiple real-world node classification datasets.

LGSep 27, 2023
Bayesian Personalized Federated Learning with Shared and Personalized Uncertainty Representations

Hui Chen, Hengyu Liu, Longbing Cao et al.

Bayesian personalized federated learning (BPFL) addresses challenges in existing personalized FL (PFL). BPFL aims to quantify the uncertainty and heterogeneity within and across clients towards uncertainty representations by addressing the statistical heterogeneity of client data. In PFL, some recent preliminary work proposes to decompose hidden neural representations into shared and local components and demonstrates interesting results. However, most of them do not address client uncertainty and heterogeneity in FL systems, while appropriately decoupling neural representations is challenging and often ad hoc. In this paper, we make the first attempt to introduce a general BPFL framework to decompose and jointly learn shared and personalized uncertainty representations on statistically heterogeneous client data over time. A Bayesian federated neural network BPFed instantiates BPFL by jointly learning cross-client shared uncertainty and client-specific personalized uncertainty over statistically heterogeneous and randomly participating clients. We further involve continual updating of prior distribution in BPFed to speed up the convergence and avoid catastrophic forgetting. Theoretical analysis and guarantees are provided in addition to the experimental evaluation of BPFed against the diversified baselines.

CVAug 24, 2024
Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

Jiwei Guan, Tianyu Ding, Longbing Cao et al.

Vision-language pretraining (VLP) with transformers has demonstrated exceptional performance across numerous multimodal tasks. However, the adversarial robustness of these models has not been thoroughly investigated. Existing multimodal attack methods have largely overlooked cross-modal interactions between visual and textual modalities, particularly in the context of cross-attention mechanisms. In this paper, we study the adversarial vulnerability of recent VLP transformers and design a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces adversarial perturbations in both visual and textual modalities under white-box settings. JMTFA strategically targets attention relevance scores to disrupt important features within each modality, generating adversarial samples by fusing perturbations and leading to erroneous model predictions. Experimental results indicate that the proposed approach achieves high attack success rates on vision-language understanding and reasoning downstream tasks compared to existing baselines. Notably, our findings reveal that the textual modality significantly influences the complex fusion processes within VLP transformers. Moreover, we observe no apparent relationship between model size and adversarial robustness under our proposed attacks. These insights emphasize a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.

LGNov 14, 2023
Distilling the Unknown to Unveil Certainty

Zhilin Zhao, Longbing Cao, Yixuan Zhang et al.

Out-of-distribution (OOD) detection is critical for identifying test samples that deviate from in-distribution (ID) data, ensuring network robustness and reliability. This paper presents a flexible framework for OOD knowledge distillation that extracts OOD-sensitive information from a network to develop a binary classifier capable of distinguishing between ID and OOD samples in both scenarios, with and without access to training ID data. To accomplish this, we introduce Confidence Amendment (CA), an innovative methodology that transforms an OOD sample into an ID one while progressively amending prediction confidence derived from the network to enhance OOD sensitivity. This approach enables the simultaneous synthesis of both ID and OOD samples, each accompanied by an adjusted prediction confidence, thereby facilitating the training of a binary classifier sensitive to OOD. Theoretical analysis provides bounds on the generalization error of the binary classifier, demonstrating the pivotal role of confidence amendment in enhancing OOD sensitivity. Extensive experiments spanning various datasets and network architectures confirm the efficacy of the proposed method in detecting OOD samples.

LGJul 18, 2024
Robust Multivariate Time Series Forecasting against Intra- and Inter-Series Transitional Shift

Hui He, Qi Zhang, Kun Yi et al.

The non-stationary nature of real-world Multivariate Time Series (MTS) data presents forecasting models with a formidable challenge of the time-variant distribution of time series, referred to as distribution shift. Existing studies on the distribution shift mostly adhere to adaptive normalization techniques for alleviating temporal mean and covariance shifts or time-variant modeling for capturing temporal shifts. Despite improving model generalization, these normalization-based methods often assume a time-invariant transition between outputs and inputs but disregard specific intra-/inter-series correlations, while time-variant models overlook the intrinsic causes of the distribution shift. This limits model expressiveness and interpretability of tackling the distribution shift for MTS forecasting. To mitigate such a dilemma, we present a unified Probabilistic Graphical Model to Jointly capturing intra-/inter-series correlations and modeling the time-variant transitional distribution, and instantiate a neural framework called JointPGM for non-stationary MTS forecasting. Specifically, JointPGM first employs multiple Fourier basis functions to learn dynamic time factors and designs two distinct learners: intra-series and inter-series learners. The intra-series learner effectively captures temporal dynamics by utilizing temporal gates, while the inter-series learner explicitly models spatial dynamics through multi-hop propagation, incorporating Gumbel-softmax sampling. These two types of series dynamics are subsequently fused into a latent variable, which is inversely employed to infer time factors, generate final prediction, and perform reconstruction. We validate the effectiveness and efficiency of JointPGM through extensive experiments on six highly non-stationary MTS datasets, achieving state-of-the-art forecasting performance of MTS forecasting.

LGSep 23, 2023
C$^2$VAE: Gaussian Copula-based VAE Differing Disentangled from Coupled Representations with Contrastive Posterior

Zhangkai Wu, Longbing Cao

We present a self-supervised variational autoencoder (VAE) to jointly learn disentangled and dependent hidden factors and then enhance disentangled representation learning by a self-supervised classifier to eliminate coupled representations in a contrastive manner. To this end, a Contrastive Copula VAE (C$^2$VAE) is introduced without relying on prior knowledge about data in the probabilistic principle and involving strong modeling assumptions on the posterior in the neural architecture. C$^2$VAE simultaneously factorizes the posterior (evidence lower bound, ELBO) with total correlation (TC)-driven decomposition for learning factorized disentangled representations and extracts the dependencies between hidden features by a neural Gaussian copula for copula coupled representations. Then, a self-supervised contrastive classifier differentiates the disentangled representations from the coupled representations, where a contrastive loss regularizes this contrastive classification together with the TC loss for eliminating entangled factors and strengthening disentangled representations. C$^2$VAE demonstrates a strong effect in enhancing disentangled representation learning. C$^2$VAE further contributes to improved optimization addressing the TC-based VAE instability and the trade-off between reconstruction and representation.

LGOct 2, 2023
R-divergence for Estimating Model-oriented Distribution Discrepancy

Zhilin Zhao, Longbing Cao

Real-life data are often non-IID due to complex distributions and interactions, and the sensitivity to the distribution of samples can differ among learning models. Accordingly, a key question for any supervised or unsupervised model is whether the probability distributions of two given datasets can be considered identical. To address this question, we introduce R-divergence, designed to assess model-oriented distribution discrepancies. The core insight is that two distributions are likely identical if their optimal hypothesis yields the same expected risk for each distribution. To estimate the distribution discrepancy between two datasets, R-divergence learns a minimum hypothesis on the mixed data and then gauges the empirical risk difference between them. We evaluate the test power across various unsupervised and supervised tasks and find that R-divergence achieves state-of-the-art performance. To demonstrate the practicality of R-divergence, we employ R-divergence to train robust neural networks on samples with noisy labels.

CLFeb 4
Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models

Yu Zhang, Xinchen Li, Jialei Zhou et al.

Block-wise decoding effectively improves the inference speed and quality in diffusion language models (DLMs) by combining inter-block sequential denoising and intra-block parallel unmasking. However, existing block-wise decoding methods typically partition blocks in a rigid and fixed manner, which inevitably fragments complete semantic or syntactic constituents, leading to suboptimal performance. Inspired by the entropy reduction hypothesis (ERH), we recognize that constituent boundaries offer greater opportunities for uncertainty reduction, which motivates us to employ entropy analysis for identifying constituent boundaries. Therefore, we propose Swordsman, an entropy-driven adaptive block-wise decoding framework for DLMs. Swordsman adaptively partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. In addition, Swordsman dynamically adjusts unmasking thresholds conditioned on the real-time unmasking status within a block, further improving both efficiency and stability. As a training-free framework, supported by KV Cache, Swordsman demonstrates state-of-the-art performance across extensive evaluations.

ROMar 17
UGotMe: An Embodied System for Affective Human-Robot Interaction

Peizhen Li, Longbing Cao, Xiao-Ming Wu et al.

Equipping humanoid robots with the capability to understand emotional states of human interactants and express emotions appropriately according to situations is essential for affective human-robot interaction. However, enabling current vision-aware multimodal emotion recognition models for affective human-robot interaction in the real-world raises embodiment challenges: addressing the environmental noise issue and meeting real-time requirements. First, in multiparty conversation scenarios, the noises inherited in the visual observation of the robot, which may come from either 1) distracting objects in the scene or 2) inactive speakers appearing in the field of view of the robot, hinder the models from extracting emotional cues from vision inputs. Secondly, realtime response, a desired feature for an interactive system, is also challenging to achieve. To tackle both challenges, we introduce an affective human-robot interaction system called UGotMe designed specifically for multiparty conversations. Two denoising strategies are proposed and incorporated into the system to solve the first issue. Specifically, to filter out distracting objects in the scene, we propose extracting face images of the speakers from the raw images and introduce a customized active face extraction strategy to rule out inactive speakers. As for the second issue, we employ efficient data transmission from the robot to the local server to improve realtime response capability. We deploy UGotMe on a human robot named Ameca to validate its real-time inference capabilities in practical scenarios. Videos demonstrating real-world deployment are available at https://lipzh5.github.io/HumanoidVLE/.

CLMay 11
PHAGE: Patent Heterogeneous Attention-Guided Graph Encoder for Representation Learning

Yongmin Yoo, Qiongkai Xu, Zhangkai Wu et al.

Patent claims form a directed dependency structure in which dependent claims inherit and refine the scope of earlier claims; however, existing patent encoders linearize claims as text and discard this hierarchy. Directly encoding this structure into self-attention poses two challenges: claim dependencies mix relation types that differ in semantics and extraction reliability, and the dependency graph is defined over claims while Transformers attend over tokens. PHAGE addresses the first challenge through a deterministic graph construction pipeline that separates near-deterministic legal citations from noisier rule-based technical relations, preserving type distinctions as heterogeneous edges. It addresses the second through a connectivity mask and learnable relation-aware biases that lift claim-level topology into token-level attention, allowing the encoder to differentially weight each relation type. A dual-granularity contrastive objective then aligns representations with both inter-patent taxonomy and intra-patent topology. PHAGE outperforms all baselines on classification, retrieval, and clustering, showing that intra-document claim topology is a stronger inductive bias than inter-document structure and that this bias persists in the encoder weights after training.

AIMay 10
Beyond ESG Scores: Learning Dynamic Constraints for Sequential Portfolio Optimization

Xin Li, Yan Ke, Longbing Cao

ESG-aware portfolio optimization is increasingly important for sustainable capital allocation, yet most learning-based methods still operationalize ESG by appending static scores to the policy observation or reward. This creates a mismatch for sequential control: ESG scores are noisy, provider-dependent, low-frequency, and temporally misaligned with sequential portfolio decisions, while financial evidence suggests that ESG is better treated as a portfolio preference, risk-exposure, or hedge dimension than as a robust alpha factor. We propose to impose ESG constraints without modifying the financial policy's observation or reward, using a Multimodal Action-Conditioned Constraint Field (MACF) that learns mechanism-specific ESG costs from point-in-time multimodal evidence and contemplated portfolio transitions. We then introduce MACF-X, a family of optimizer-specific adapters that converts MACF costs and uncertainties into native constrained-optimization interfaces through a shared slack- and uncertainty-aware pressure layer. Across multiple constraint-integration interfaces, MACF-X reduces tail ESG budget pressure while maintaining competitive financial performance. Ablations show that this improvement depends on dynamic evidence inputs and three-head decomposition, while static ESG-score proxies are nearly indistinguishable from score-shuffled noise baselines.

LGMar 19
MLOW: Interpretable Low-Rank Frequency Magnitude Decomposition of Multiple Effects for Time Series Forecasting

Runze Yang, Longbing Cao, Xiaoming Wu et al.

Separating multiple effects in time series is fundamental yet challenging for time-series forecasting (TSF). However, existing TSF models cannot effectively learn interpretable multi-effect decomposition by their smoothing-based temporal techniques. Here, a new interpretable frequency-based decomposition pipeline MLOW captures the insight: a time series can be represented as a magnitude spectrum multiplied by the corresponding phase-aware basis functions, and the magnitude spectrum distribution of a time series always exhibits observable patterns for different effects. MLOW learns a low-rank representation of the magnitude spectrum to capture dominant trending and seasonal effects. We explore low-rank methods, including PCA, NMF, and Semi-NMF, and find that none can simultaneously achieve interpretable, efficient and generalizable decomposition. Thus, we propose hyperplane-nonnegative matrix factorization (Hyperplane-NMF). Further, to address the frequency (spectral) leakage restricting high-quality low-rank decomposition, MLOW enables a flexible selection of input horizons and frequency levels via a mathematical mechanism. Visual analysis demonstrates that MLOW enables interpretable and hierarchical multiple-effect decomposition, robust to noises. It can also enable plug-and-play in existing TSF backbones with remarkable performance improvement but minimal architectural modifications.

MMDec 18, 2023
Frequency Spectrum is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector

An Lao, Qi Zhang, Chongyang Shi et al.

Multimodal content, such as mixing text with images, presents significant challenges to rumor detection in social media. Existing multimodal rumor detection has focused on mixing tokens among spatial and sequential locations for unimodal representation or fusing clues of rumor veracity across modalities. However, they suffer from less discriminative unimodal representation and are vulnerable to intricate location dependencies in the time-consuming fusion of spatial and sequential tokens. This work makes the first attempt at multimodal rumor detection in the frequency domain, which efficiently transforms spatial features into the frequency spectrum and obtains highly discriminative spectrum features for multimodal representation and fusion. A novel Frequency Spectrum Representation and fUsion network (FSRU) with dual contrastive learning reveals the frequency spectrum is more effective for multimodal representation and fusion, extracting the informative components for rumor detection. FSRU involves three novel mechanisms: utilizing the Fourier transform to convert features in the spatial domain to the frequency domain, the unimodal spectrum compression, and the cross-modal spectrum co-selection module in the frequency domain. Substantial experiments show that FSRU achieves satisfactory multimodal rumor detection performance.

LGFeb 3, 2024
DE$^3$-BERT: Distance-Enhanced Early Exiting for BERT based on Prototypical Networks

Jianing He, Qi Zhang, Weiping Ding et al.

Early exiting has demonstrated its effectiveness in accelerating the inference of pre-trained language models like BERT by dynamically adjusting the number of layers executed. However, most existing early exiting methods only consider local information from an individual test sample to determine their exiting indicators, failing to leverage the global information offered by sample population. This leads to suboptimal estimation of prediction correctness, resulting in erroneous exiting decisions. To bridge the gap, we explore the necessity of effectively combining both local and global information to ensure reliable early exiting during inference. Purposefully, we leverage prototypical networks to learn class prototypes and devise a distance metric between samples and class prototypes. This enables us to utilize global information for estimating the correctness of early predictions. On this basis, we propose a novel Distance-Enhanced Early Exiting framework for BERT (DE$^3$-BERT). DE$^3$-BERT implements a hybrid exiting strategy that supplements classic entropy-based local information with distance-based global information to enhance the estimation of prediction correctness for more reliable early exiting decisions. Extensive experiments on the GLUE benchmark demonstrate that DE$^3$-BERT consistently outperforms state-of-the-art models under different speed-up ratios with minimal storage or computational overhead, yielding a better trade-off between model performance and inference efficiency. Additionally, an in-depth analysis further validates the generality and interpretability of our method.

CVMay 25, 2025
Enhancing Text-to-Image Diffusion Transformer via Split-Text Conditioning

Yu Zhang, Jialei Zhou, Xinchen Li et al.

Current text-to-image diffusion generation typically employs complete-text conditioning. Due to the intricate syntax, diffusion transformers (DiTs) inherently suffer from a comprehension defect of complete-text captions. One-fly complete-text input either overlooks critical semantic details or causes semantic confusion by simultaneously modeling diverse semantic primitive types. To mitigate this defect of DiTs, we propose a novel split-text conditioning framework named DiT-ST. This framework converts a complete-text caption into a split-text caption, a collection of simplified sentences, to explicitly express various semantic primitives and their interconnections. The split-text caption is then injected into different denoising stages of DiT-ST in a hierarchical and incremental manner. Specifically, DiT-ST leverages Large Language Models to parse captions, extracting diverse primitives and hierarchically sorting out and constructing these primitives into a split-text input. Moreover, we partition the diffusion denoising process according to its differential sensitivities to diverse semantic primitive types and determine the appropriate timesteps to incrementally inject tokens of diverse semantic primitive types into input tokens via cross-attention. In this way, DiT-ST enhances the representation learning of specific semantic primitive types across different stages. Extensive experiments validate the effectiveness of our proposed DiT-ST in mitigating the complete-text comprehension defect.

LGMay 24, 2024
ParamReL: Learning Parameter Space Representation via Progressively Encoding Bayesian Flow Networks

Zhangkai Wu, Xuhui Fan, Jin Li et al.

The recently proposed Bayesian Flow Networks~(BFNs) show great potential in modeling parameter spaces, offering a unified strategy for handling continuous, discretized, and discrete data. However, BFNs cannot learn high-level semantic representation from the parameter space since {common encoders, which encode data into one static representation, cannot capture semantic changes in parameters.} This motivates a new direction: learning semantic representations hidden in the parameter spaces to characterize mixed-typed noisy data. {Accordingly, we propose a representation learning framework named ParamReL, which operates in the parameter space to obtain parameter-wise latent semantics that exhibit progressive structures. Specifically, ParamReL proposes a \emph{self-}encoder to learn latent semantics directly from parameters, rather than from observations. The encoder is then integrated into BFNs, enabling representation learning with various formats of observations. Mutual information terms further promote the disentanglement of latent semantics and capture meaningful semantics simultaneously.} We illustrate {conditional generation and reconstruction} in ParamReL via expanding BFNs, and extensive {quantitative} experimental results demonstrate the {superior effectiveness} of ParamReL in learning parameter representation.

LGJan 7, 2024
Weakly Augmented Variational Autoencoder in Time Series Anomaly Detection

Zhangkai Wu, Longbing Cao, Qi Zhang et al.

Due to their unsupervised training and uncertainty estimation, deep Variational Autoencoders (VAEs) have become powerful tools for reconstruction-based Time Series Anomaly Detection (TSAD). Existing VAE-based TSAD methods, either statistical or deep, tune meta-priors to estimate the likelihood probability for effectively capturing spatiotemporal dependencies in the data. However, these methods confront the challenge of inherent data scarcity, which is often the case in anomaly detection tasks. Such scarcity easily leads to latent holes, discontinuous regions in latent space, resulting in non-robust reconstructions on these discontinuous spaces. We propose a novel generative framework that combines VAEs with self-supervised learning (SSL) to address this issue.

CLMay 25, 2025
PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims

Yongmin Yoo, Qiongkai Xu, Longbing Cao

High-stakes texts such as patent claims, medical records, and technical reports are structurally complex and demand a high degree of reliability and precision. While large language models (LLMs) have recently been applied to automate their generation in high-stakes domains, reliably evaluating such outputs remains a major challenge. Conventional natural language generation (NLG) metrics are effective for generic documents but fail to capture the structural and legal characteristics essential to evaluating complex high-stakes documents. To address this gap, we propose PatentScore, a multi-dimensional evaluation framework specifically designed for one of the most intricate and rigorous domains, patent claims. PatentScore integrates hierarchical decomposition of claim elements, validation patterns grounded in legal and technical standards, and scoring across structural, semantic, and legal dimensions. In experiments on our dataset which consists of 400 Claim1, PatentScore achieved the highest correlation with expert annotations ($r = 0.819$), significantly outperforming widely used NLG metrics. This work establishes a new standard for evaluating LLM-generated patent claims, providing a solid foundation for research on patent generation and validation.

LGApr 24, 2024
FedSI: Federated Subnetwork Inference for Efficient Uncertainty Quantification

Hui Chen, Hengyu Liu, Zhangkai Wu et al.

While deep neural networks (DNNs) based personalized federated learning (PFL) is demanding for addressing data heterogeneity and shows promising performance, existing methods for federated learning (FL) suffer from efficient systematic uncertainty quantification. The Bayesian DNNs-based PFL is usually questioned of either over-simplified model structures or high computational and memory costs. In this paper, we introduce FedSI, a novel Bayesian DNNs-based subnetwork inference PFL framework. FedSI is simple and scalable by leveraging Bayesian methods to incorporate systematic uncertainties effectively. It implements a client-specific subnetwork inference mechanism, selects network parameters with large variance to be inferred through posterior distributions, and fixes the rest as deterministic ones. FedSI achieves fast and scalable inference while preserving the systematic uncertainties to the fullest extent. Extensive experiments on three different benchmark datasets demonstrate that FedSI outperforms existing Bayesian and non-Bayesian FL baselines in heterogeneous FL scenarios.

CLApr 5
Adaptive Cost-Efficient Evaluation for Reliable Patent Claim Validation

Yongmin Yoo, Qiongkai Xu, Longbing Cao

Automated validation of patent claims demands zero-defect tolerance, as even a single structural flaw can render a claim legally defective. Existing evaluation paradigms suffer from a rigidity-resource dilemma: lightweight encoders struggle with nuanced legal dependencies, while exhaustive verification via Large Language Models (LLMs) is prohibitively costly. To bridge this gap, we propose ACE (Adaptive Cost-efficient Evaluation), a hybrid framework that uses predictive entropy to route only high-uncertainty claims to an expert LLM. The expert then executes a Chain of Patent Thought (CoPT) protocol grounded in 35 U.S.C. statutory standards. This design enables ACE to handle long-range legal dependencies more effectively while preserving efficiency. ACE achieves the best F1 among the evaluated methods at 94.95\%, while reducing operational costs by 78\% compared to standalone LLM deployments. We also construct ACE-40k, a 40,000-claim benchmark with MPEP-grounded error annotations, to facilitate further research.

CLJul 30, 2025
CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

Dongchen Li, Jitao Liang, Wei Li et al.

Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and multilingual nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these diverse settings, CliCARE significantly outperforms strong baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by expert oncologists.

LGJul 26, 2025
Modeling enzyme temperature stability from sequence segment perspective

Ziqi Zhang, Shiheng Chen, Runze Yang et al.

Developing enzymes with desired thermal properties is crucial for a wide range of industrial and research applications, and determining temperature stability is an essential step in this process. Experimental determination of thermal parameters is labor-intensive, time-consuming, and costly. Moreover, existing computational approaches are often hindered by limited data availability and imbalanced distributions. To address these challenges, we introduce a curated temperature stability dataset designed for model development and benchmarking in enzyme thermal modeling. Leveraging this dataset, we present the \textit{Segment Transformer}, a novel deep learning framework that enables efficient and accurate prediction of enzyme temperature stability. The model achieves state-of-the-art performance with an RMSE of 24.03, MAE of 18.09, and Pearson and Spearman correlations of 0.33, respectively. These results highlight the effectiveness of incorporating segment-level representations, grounded in the biological observation that different regions of a protein sequence contribute unequally to thermal behavior. As a proof of concept, we applied the Segment Transformer to guide the engineering of a cutinase enzyme. Experimental validation demonstrated a 1.64-fold improvement in relative activity following heat treatment, achieved through only 17 mutations and without compromising catalytic function.

CVFeb 24, 2025
SCoT: Unifying Consistency Models and Rectified Flows via Straight-Consistent Trajectories

Zhangkai Wu, Xuhui Fan, Hongyu Wu et al.

Pre-trained diffusion models are commonly used to generate clean data (e.g., images) from random noises, effectively forming pairs of noises and corresponding clean images. Distillation on these pre-trained models can be viewed as the process of constructing advanced trajectories within the pair to accelerate sampling. For instance, consistency model distillation develops consistent projection functions to regulate trajectories, although sampling efficiency remains a concern. Rectified flow method enforces straight trajectories to enable faster sampling, yet relies on numerical ODE solvers, which may introduce approximation errors. In this work, we bridge the gap between the consistency model and the rectified flow method by proposing a Straight Consistent Trajectory~(SCoT) model. SCoT enjoys the benefits of both approaches for fast sampling, producing trajectories with consistent and straight properties simultaneously. These dual properties are strategically balanced by targeting two critical objectives: (1) regulating the gradient of SCoT's mapping to a constant, (2) ensuring trajectory consistency. Extensive experimental results demonstrate the effectiveness and efficiency of SCoT.

CVFeb 1
Adaptive Visual Autoregressive Acceleration via Dual-Linkage Entropy Analysis

Yu Zhang, Jingyi Liu, Feng Liu et al.

Visual AutoRegressive modeling (VAR) suffers from substantial computational cost due to the massive token count involved. Failing to account for the continuous evolution of modeling dynamics, existing VAR token reduction methods face three key limitations: heuristic stage partition, non-adaptive schedules, and limited acceleration scope, thereby leaving significant acceleration potential untapped. Since entropy variation intrinsically reflects the transition of predictive uncertainty, it offers a principled measure to capture modeling dynamics evolution. Therefore, we propose NOVA, a training-free token reduction acceleration framework for VAR models via entropy analysis. NOVA adaptively determines the acceleration activation scale during inference by online identifying the inflection point of scale entropy growth. Through scale-linkage and layer-linkage ratio adjustment, NOVA dynamically computes distinct token reduction ratios for each scale and layer, pruning low-entropy tokens while reusing the cache derived from the residuals at the prior scale to accelerate inference and maintain generation quality. Extensive experiments and analyses validate NOVA as a simple yet effective training-free acceleration framework.

CVNov 28, 2025
Markovian Scale Prediction: A New Era of Visual Autoregressive Generation

Yu Zhang, Jingyi Liu, Yiwei Shi et al.

Visual AutoRegressive modeling (VAR) based on next-scale prediction has revitalized autoregressive visual generation. Although its full-context dependency, i.e., modeling all previous scales for next-scale prediction, facilitates more stable and comprehensive representation learning by leveraging complete information flow, the resulting computational inefficiency and substantial overhead severely hinder VAR's practicality and scalability. This motivates us to develop a new VAR model with better performance and efficiency without full-context dependency. To address this, we reformulate VAR as a non-full-context Markov process, proposing Markov-VAR. It is achieved via Markovian Scale Prediction: we treat each scale as a Markov state and introduce a sliding window that compresses certain previous scales into a compact history vector to compensate for historical information loss owing to non-full-context dependency. Integrating the history vector with the Markov state yields a representative dynamic state that evolves under a Markov process. Extensive experiments demonstrate that Markov-VAR is extremely simple yet highly effective: Compared to VAR on ImageNet, Markov-VAR reduces FID by 10.5% (256 $\times$ 256) and decreases peak memory consumption by 83.8% (1024 $\times$ 1024). We believe that Markov-VAR can serve as a foundation for future research on visual autoregressive generation and other downstream tasks.

HCNov 28, 2025
SimClinician: A Multimodal Simulation Testbed for Reliable Psychologist AI Collaboration in Mental Health Diagnosis

Filippo Cenacchi, Longbing Cao, Deborah Richards

AI based mental health diagnosis is often judged by benchmark accuracy, yet in practice its value depends on how psychologists respond whether they accept, adjust, or reject AI suggestions. Mental health makes this especially challenging: decisions are continuous and shaped by cues in tone, pauses, word choice, and nonverbal behaviors of patients. Current research rarely examines how AI diagnosis interface design influences these choices, leaving little basis for reliable testing before live studies. We present SimClinician, an interactive simulation platform, to transform patient data into psychologist AI collaborative diagnosis. Contributions include: (1) a dashboard integrating audio, text, and gaze-expression patterns; (2) an avatar module rendering de-identified dynamics for analysis; (3) a decision layer that maps AI outputs to multimodal evidence, letting psychologists review AI reasoning, and enter a diagnosis. Tested on the E-DAIC corpus (276 clinical interviews, expanded to 480,000 simulations), SimClinician shows that a confirmation step raises acceptance by 23%, keeping escalations below 9%, and maintaining smooth interaction flow.

LGNov 28, 2025
Learning When to Ask: Simulation-Trained Humanoids for Mental-Health Diagnosis

Filippo Cenacchi, Deborah Richards, Longbing Cao

Testing humanoid robots with users is slow, causes wear, and limits iteration and diversity. Yet screening agents must master conversational timing, prosody, backchannels, and what to attend to in faces and speech for Depression and PTSD. Most simulators omit policy learning with nonverbal dynamics; many controllers chase task accuracy while underweighting trust, pacing, and rapport. We virtualise the humanoid as a conversational agent to train without hardware burden. Our agent-centred, simulation-first pipeline turns interview data into 276 Unreal Engine MetaHuman patients with synchronised speech, gaze/face, and head-torso poses, plus PHQ-8 and PCL-C flows. A perception-fusion-policy loop decides what and when to speak, when to backchannel, and how to avoid interruptions, under a safety shield. Training uses counterfactual replay (bounded nonverbal perturbations) and an uncertainty-aware turn manager that probes to reduce diagnostic ambiguity. Results are simulation-only; the humanoid is the transfer target. In comparing three controllers, a custom TD3 (Twin Delayed DDPG) outperformed PPO and CEM, achieving near-ceiling coverage with steadier pace at comparable rewards. Decision-quality analyses show negligible turn overlap, aligned cut timing, fewer clarification prompts, and shorter waits. Performance stays stable under modality dropout and a renderer swap, and rankings hold on a held-out patient split. Contributions: (1) an agent-centred simulator that turns interviews into 276 interactive patients with bounded nonverbal counterfactuals; (2) a safe learning loop that treats timing and rapport as first-class control variables; (3) a comparative study (TD3 vs PPO/CEM) with clear gains in completeness and social timing; and (4) ablations and robustness analyses explaining the gains and enabling clinician-supervised humanoid pilots.

CVOct 27, 2025
Nested AutoRegressive Models

Hongyu Wu, Xuhui Fan, Zhangkai Wu et al.

AutoRegressive (AR) models have demonstrated competitive performance in image generation, achieving results comparable to those of diffusion models. However, their token-by-token image generation mechanism remains computationally intensive and existing solutions such as VAR often lead to limited sample diversity. In this work, we propose a Nested AutoRegressive~(NestAR) model, which proposes nested AutoRegressive architectures in generating images. NestAR designs multi-scale modules in a hierarchical order. These different scaled modules are constructed in an AR architecture, where one larger-scale module is conditioned on outputs from its previous smaller-scale module. Within each module, NestAR uses another AR structure to generate ``patches'' of tokens. The proposed nested AR architecture reduces the overall complexity from $\mathcal{O}(n)$ to $\mathcal{O}(\log n)$ in generating $n$ image tokens, as well as increases image diversities. NestAR further incorporates flow matching loss to use continuous tokens, and develops objectives to coordinate these multi-scale modules in model training. NestAR achieves competitive image generation performance while significantly lowering computational cost.

CVOct 27, 2025
FAME: Fairness-aware Attention-modulated Video Editing

Zhangkai Wu, Xuhui Fan, Zhongyuan Xie et al.

Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose \textbf{FAME} for \textit{Fairness-aware Attention-modulated Video Editing} that mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self attention and prompt-to-region cross attention to mitigate the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self attention, FAME introduces a region constrained attention mask combined with time decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross attention, it reweights tokens to region matching scores by incorporating fairness sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the right visual regions and prevent temporal drift across frames. Extensive experiments on new VE fairness-oriented benchmark \textit{FairVE} demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.

CVOct 27, 2025
VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Zhangkai Wu, Xuhui Fan, Zhongyuan Xie et al.

Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose~\textbf{VALA} (\textbf{V}ariational \textbf{A}lignment for \textbf{L}atent \textbf{A}nchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA propose a variational framework with a contrastive learning objective. Therefore, it can transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.

CLOct 23, 2025
Tri-Modal Severity Fused Diagnosis across Depression and Post-traumatic Stress Disorders

Filippo Cenacchi, Deborah Richards, Longbing Cao

Depression and post traumatic stress disorder (PTSD) often co-occur with connected symptoms, complicating automated assessment, which is often binary and disorder specific. Clinically useful diagnosis needs severity aware cross disorder estimates and decision support explanations. Our unified tri modal affective severity framework synchronizes and fuses interview text with sentence level transformer embeddings, audio with log Mel statistics with deltas, and facial signals with action units, gaze, head and pose descriptors to output graded severities for diagnosing both depression (PHQ-8; 5 classes) and PTSD (3 classes). Standardized features are fused via a calibrated late fusion classifier, yielding per disorder probabilities and feature-level attributions. This severity aware tri-modal affective fusion approach is demoed on multi disorder concurrent depression and PTSD assessment. Stratified cross validation on DAIC derived corpora outperforms unimodal/ablation baselines. The fused model matches the strongest unimodal baseline on accuracy and weighted F1, while improving decision curve utility and robustness under noisy or missing modalities. For PTSD specifically, fusion reduces regression error and improves class concordance. Errors cluster between adjacent severities; extreme classes are identified reliably. Ablations show text contributes most to depression severity, audio and facial cues are critical for PTSD, whereas attributions align with linguistic and behavioral markers. Our approach offers reproducible evaluation and clinician in the loop support for affective clinical decision making.

CLOct 6, 2025
Self-Filtered Distillation with LLMs-generated Trust Indicators for Reliable Patent Classification

Yongmin Yoo, Xu Zhang, Longbing Cao

Large language models (LLMs) increasingly generate natural language rationales to enhance interpretability, but these often contain logical errors, label mismatches, and domain-specific misalignments. Directly using such rationales as supervision risks propagating noise and undermining training stability. To address this challenge, we introduce Self-Filtered Distillation, a framework specifically tailored for patent classification, which treats LLM-generated rationales as trust signals rather than ground-truth supervision. The framework employs selective distillation guided by three unsupervised trust metrics: (1) Self-Consistency, which measures the stability of LLM-generated rationales across multiple generations; (2) Class Entailment Alignment, which assesses semantic coherence with patent-specific class definitions; and (3) LLM Agreement Scoring, which validates rationale-label plausibility. These metrics are integrated into a unified trust score that primarily weights training samples while optionally filtering out extremely low-trust cases, enabling reasoning-aware supervision. Experiments on the USPTO-2M dataset, a widely used benchmark for patent classification, show that our method outperforms label-based learning and conventional distillation in accuracy, stability, and interpretability, establishing a reliable paradigm for leveraging reasoning-aware trust indicators in patent analytics.

LGSep 22, 2025
Revealing Multimodal Causality with Large Language Models

Jin Li, Shoujin Wang, Qi Zhang et al.

Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.

LGJul 13, 2025
Fourier Basis Mapping: A Time-Frequency Learning Framework for Time Series Forecasting

Runze Yang, Longbing Cao, Xin You et al.

The integration of Fourier transform and deep learning opens new avenues for time series forecasting. We reconsider the Fourier transform from a basis functions perspective. Specifically, the real and imaginary parts of the frequency components can be regarded as the coefficients of cosine and sine basis functions at tiered frequency levels, respectively. We find that existing Fourier-based methods face inconsistent starting cycles and inconsistent series length issues. They fail to interpret frequency components precisely and overlook temporal information. Accordingly, the novel Fourier Basis Mapping (FBM) method addresses these issues by integrating time-frequency features through Fourier basis expansion and mapping in the time-frequency space. Our approach extracts explicit frequency features while preserving temporal characteristics. FBM supports plug-and-play integration with various types of neural networks by only adjusting the first initial projection layer for better performance. First, we propose FBM-L, FBM-NL, and FBM-NP to enhance linear, MLP-based, and Transformer-based models, respectively, demonstrating the effectiveness of time-frequency features. Next, we propose a synergetic model architecture, termed FBM-S, which decomposes the seasonal, trend, and interaction effects into three separate blocks, each designed to model time-frequency features in a specialized manner. Finally, we introduce several techniques tailored for time-frequency features, including interaction masking, centralization, patching, rolling window projection, and multi-scale down-sampling. The results are validated on diverse real-world datasets for both long-term and short-term forecasting tasks with SOTA performance.