Xiaohan Li

CV
h-index: 43
35 papers
612 citations
Novelty: 50%
AI Score: 56

35 Papers

LG · Oct 28, 2022
Mitigating Health Disparities in EHR via Deconfounder

Zheng Liu, Xiaohan Li, Philip Yu

Health disparities, or inequalities between different patient demographics, are becoming crucial in medical decision-making, especially in Electronic Health Record (EHR) predictive modeling. To ensure fairness with respect to sensitive attributes, conventional studies mainly adopt calibration or re-weighting methods to balance performance among different demographic groups. However, we argue that these methods have some limitations. First, these methods usually entail a trade-off between the model's performance and fairness. Second, many methods attribute unfairness entirely to the data collection process, a claim that lacks substantial evidence. In this paper, we provide an empirical study exploring the possibility of using the deconfounder to address the disparity issue in healthcare. Our study can be summarized in two parts. The first part is a pilot study demonstrating the exacerbation of disparity when unobserved confounders exist. The second part proposes a novel framework, Parity Medical Deconfounder (PriMeD), to deal with the disparity issue in healthcare datasets. Inspired by the deconfounder theory, PriMeD adopts a Conditional Variational Autoencoder (CVAE) to learn latent factors (substitute confounders) for observational data, and extensive experiments are provided to show its effectiveness.

LG · Aug 22, 2023
A Counterfactual Fair Model for Longitudinal Electronic Health Records via Deconfounder

Zheng Liu, Xiaohan Li, Philip Yu

The fairness issue of clinical data modeling, especially on Electronic Health Records (EHRs), is of utmost importance due to EHR's complex latent structure and potential selection bias. It is frequently necessary to mitigate health disparity while keeping the model's overall accuracy in practice. However, traditional methods often encounter the trade-off between accuracy and fairness, as they fail to capture the underlying factors beyond observed data. To tackle this challenge, we propose a novel model called Fair Longitudinal Medical Deconfounder (FLMD) that aims to achieve both fairness and accuracy in longitudinal Electronic Health Records (EHR) modeling. Drawing inspiration from the deconfounder theory, FLMD employs a two-stage training process. In the first stage, FLMD captures unobserved confounders for each encounter, which effectively represent underlying medical factors beyond observed EHR, such as patient genotypes and lifestyle habits. These unobserved confounders are crucial for addressing the accuracy/fairness dilemma. In the second stage, FLMD combines the learned latent representation with other relevant features to make predictions. By incorporating appropriate fairness criteria, such as counterfactual fairness, FLMD ensures that it maintains high prediction accuracy while simultaneously minimizing health disparities. We conducted comprehensive experiments on two real-world EHR datasets to demonstrate the effectiveness of FLMD. Apart from the comparison of baseline methods and FLMD variants in terms of fairness and accuracy, we assessed the performance of all models on disturbed/imbalanced and synthetic datasets to showcase the superiority of FLMD across different settings and provide valuable insights into its capabilities.

IR · Jan 10, 2023
Time-aware Hyperbolic Graph Attention Network for Session-based Recommendation

Xiaohan Li, Yuqing Liu, Zheng Liu et al.

Session-based Recommendation (SBR) aims to predict the items users will be interested in next, based on their previous browsing sessions. Existing methods model sessions as graphs or sequences to estimate user interests based on their interacted items to make recommendations. In recent years, graph-based methods have achieved outstanding performance on SBR. However, none of these methods consider temporal information, which is a crucial feature in SBR as it indicates timeliness or currency. Besides, the session graphs exhibit a hierarchical structure and have been shown to be well suited to hyperbolic geometry. However, few papers design models in hyperbolic spaces, and this direction remains underexplored. In this paper, we propose Time-aware Hyperbolic Graph Attention Network (TA-HGAT), a novel hyperbolic graph neural network framework to build a session-based recommendation model considering temporal information. More specifically, there are three components in TA-HGAT. First, a hyperbolic projection module transforms the item features into hyperbolic space. Second, the time-aware graph attention module models time intervals between items and the users' current interests. Third, an evolutionary loss at the end of the model provides an accurate prediction of the recommended item based on the given timestamp. TA-HGAT is built in a hyperbolic space to learn the hierarchical structure of session graphs. Experimental results show that the proposed TA-HGAT has the best performance compared to ten baseline models on two real-world datasets.
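
The hyperbolic projection step can be illustrated with a minimal sketch. The abstract does not say which model of hyperbolic space is used, so the Poincaré ball, the exponential map at the origin, the curvature parameter `c`, and the function name `exp_map_origin` are all assumptions made for illustration only.

```python
import math

def exp_map_origin(v, c=1.0):
    """Map a Euclidean (tangent) vector v onto the Poincare ball of curvature -c.

    Assumed formula: exp_0(v) = tanh(sqrt(c) * ||v||) * v / (sqrt(c) * ||v||).
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)  # the origin maps to itself
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

item_feature = [3.0, 4.0]  # hypothetical Euclidean item embedding
projected = exp_map_origin(item_feature)
radius = math.sqrt(sum(x * x for x in projected))  # always strictly below 1 for c = 1
```

Under these assumptions the projected point keeps the direction of the input while its norm is squashed into the open unit ball, which is what lets distances near the boundary grow quickly and encode hierarchy.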

36.7 · RO · May 5
Driving Style Recognition Like an Expert Using Semantic Privileged Information from Large Language Models

Zhaokun Chen, Chaopeng Zhang, Xiaohan Li et al.

Existing driving style recognition systems largely depend on low-level sensor-derived features for training, neglecting the rich semantic reasoning capability inherent to human experts. This discrepancy results in a fundamental misalignment between algorithmic classifications and expert judgments. To bridge this gap, we propose a novel framework that integrates Semantic Privileged Information (SPI) derived from large language models (LLMs) to align recognition outcomes with human-interpretable reasoning. First, we introduce DriBehavGPT, an interactive LLM-based module that generates natural-language descriptions of driving behaviors. These descriptions are then encoded into machine learning-compatible representations via text embedding and dimensionality reduction. Finally, we incorporate them as privileged information into Support Vector Machine Plus (SVM+) for training, enabling the model to approximate human-like interpretation patterns. Experiments across diverse real-world driving scenarios demonstrate that our SPI-enhanced framework outperforms conventional methods, achieving F1-score improvements of 7.6% (car-following) and 7.9% (lane-changing). Importantly, SPI is exclusively used during training, while inference relies solely on sensor data, ensuring computational efficiency without sacrificing performance. These results highlight the pivotal role of semantic behavioral representations in improving recognition accuracy while advancing interpretable, human-centric driving systems.

IR · Nov 16, 2023
Group-Aware Interest Disentangled Dual-Training for Personalized Recommendation

Xiaolong Liu, Liangwei Yang, Zhiwei Liu et al. · salesforce

Personalized recommender systems aim to predict users' preferences for items. They have become an indispensable part of online services. Online social platforms enable users to form groups based on their common interests. The users' group participation on social platforms reveals their interests and can be utilized as side information to mitigate the data sparsity and cold-start problem in recommender systems. Users join different groups out of different interests. In this paper, we generate group representation from the user's interests and propose IGRec (Interest-based Group enhanced Recommendation) to utilize the group information accurately. It consists of four modules: (1) an interest disentangler via self-gating that disentangles users' interests from their initial embedding representation; (2) an interest aggregator that generates the interest-based group representation by Gumbel-Softmax aggregation on the group members' interests; (3) an interest-based group aggregation that fuses a user's representation with the representations of the groups the user participates in; and (4) a dual-trained rating prediction module that utilizes both user-item and group-item interactions. We conduct extensive experiments on three publicly available datasets. Results show IGRec can effectively alleviate the data sparsity problem and enhance the recommender system with interest-based group representation. Experiments on the group recommendation task further show the informativeness of interest-based group representation.
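
A minimal sketch of Gumbel-Softmax aggregation over group members' interests is given here. It is not the IGRec implementation: the per-member scores, the temperature `tau`, and the function names `gumbel_softmax` and `aggregate_group` are hypothetical, and the sketch only shows the general technique of perturbing scores with Gumbel noise before a temperature-scaled softmax.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return soft selection weights: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)        # avoid log(0)
        gumbel = -math.log(-math.log(u))    # Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)                        # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_group(member_interests, scores, tau=0.5, rng=random):
    """Weighted sum of member interest vectors using Gumbel-Softmax weights."""
    weights = gumbel_softmax(scores, tau, rng)
    dim = len(member_interests[0])
    vector = [sum(w * m[d] for w, m in zip(weights, member_interests))
              for d in range(dim)]
    return vector, weights

rng = random.Random(0)  # fixed seed so the sketch is reproducible
member_interests = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical interest vectors
scores = [2.0, 0.5, 1.0]                                  # hypothetical relevance scores
group_vector, weights = aggregate_group(member_interests, scores, tau=0.5, rng=rng)
```

Lower values of `tau` push the weights toward a near one-hot choice of a single member, while higher values spread the weight more evenly.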

CV · Mar 3 · Code
Kling-MotionControl Technical Report

Kling Team, Jialu Chen, Yikang Ding et al.

Character animation aims to generate lifelike videos by transferring motion dynamics from a driving video to a reference image. Recent strides in generative models have paved the way for high-fidelity character animation. In this work, we present Kling-MotionControl, a unified DiT-based framework engineered specifically for robust, precise, and expressive holistic character animation. Leveraging a divide-and-conquer strategy within a cohesive system, the model orchestrates heterogeneous motion representations tailored to the distinct characteristics of body, face, and hands, effectively reconciling large-scale structural stability with fine-grained articulatory expressiveness. To ensure robust cross-identity generalization, we incorporate adaptive identity-agnostic learning, facilitating natural motion retargeting for diverse characters ranging from realistic humans to stylized cartoons. Simultaneously, we guarantee faithful appearance preservation through meticulous identity injection and fusion designs, further supported by a subject library mechanism that leverages comprehensive reference contexts. To ensure practical utility, we implement an advanced acceleration framework utilizing multi-stage distillation, boosting inference speed by over 10x. Kling-MotionControl distinguishes itself through intelligent semantic motion understanding and precise text responsiveness, allowing for flexible control beyond visual inputs. Human preference evaluations demonstrate that Kling-MotionControl delivers superior performance compared to leading commercial and open-source solutions, achieving exceptional fidelity in holistic motion control, open domain generalization, and visual quality and coherence. These results establish Kling-MotionControl as a robust solution for high-quality, controllable, and lifelike character animation.

IR · Nov 16, 2022
Mitigating Frequency Bias in Next-Basket Recommendation via Deconfounders

Xiaohan Li, Zheng Liu, Luyi Ma et al.

Recent studies on Next-basket Recommendation (NBR) have achieved much progress by leveraging Personalized Item Frequency (PIF) as one of the main features, which measures the frequency of the user's interactions with the item. However, taking the PIF as an explicit feature incurs bias towards frequent items. Items that a user purchases frequently are assigned higher weights in the PIF-based recommender system and appear more frequently in the personalized recommendation list. As a result, the system will lose the fairness and balance between items that the user frequently purchases and items that the user never purchases. We refer to this systematic bias on personalized recommendation lists as frequency bias, which narrows users' browsing scope and reduces the system utility. We adopt causal inference theory to address this issue. Considering the influence of historical purchases on users' future interests, the user and item representations can be viewed as unobserved confounders in the causal diagram. In this paper, we propose a deconfounder model named FENDER (Frequency-aware Deconfounder for Next-basket Recommendation) to mitigate the frequency bias. With the deconfounder theory and the causal diagram we propose, FENDER decomposes PIF with a neural tensor layer to obtain substitute confounders for users and items. Then, FENDER performs unbiased recommendations considering the effect of these substitute confounders. Experimental results demonstrate that FENDER has derived diverse and fair results compared to ten baseline models on three datasets while achieving competitive performance. Further experiments illustrate how FENDER balances users' historical purchases and potential interests.
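
To make the notion of Personalized Item Frequency concrete, here is a small sketch. The abstract only says PIF measures how often a user interacts with an item, so normalizing by the number of baskets, the function name `personalized_item_frequency`, and the sample purchase history are assumptions for illustration.

```python
from collections import Counter

def personalized_item_frequency(baskets):
    """Fraction of a user's baskets that contain each item (assumed definition)."""
    counts = Counter(item for basket in baskets for item in set(basket))
    n_baskets = len(baskets)
    return {item: count / n_baskets for item, count in counts.items()}

# Hypothetical purchase history for one user.
history = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["milk", "bread", "soap"],
]
pif = personalized_item_frequency(history)

# Ranking purely by PIF always surfaces the most frequently bought items first,
# which is the kind of frequency bias the abstract describes.
ranked = sorted(pif, key=pif.get, reverse=True)
```

Items the user has never purchased have no PIF entry at all, so a recommender that relies on this feature alone has no way to rank them.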

CV · Dec 18, 2025
Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci et al.

We present Kling-Omni, a generalist generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Adopting an end-to-end perspective, Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks, integrating them into a holistic system. Unlike disjointed pipeline approaches, Kling-Omni supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation to deliver cinematic-quality and highly-intelligent video content creation. To support these capabilities, we constructed a comprehensive data system that serves as the foundation for multimodal video creation. The framework is further empowered by efficient large-scale pre-training strategies and infrastructure optimizations for inference. Comprehensive evaluations reveal that Kling-Omni demonstrates exceptional capabilities in in-context generation, reasoning-based editing, and multimodal instruction following. Moving beyond a content creation tool, we believe Kling-Omni is a pivotal advancement toward multimodal world simulators capable of perceiving, reasoning, generating and interacting with the dynamic and complex worlds.

LG · Apr 12, 2023
Towards Spatio-temporal Sea Surface Temperature Forecasting via Static and Dynamic Learnable Personalized Graph Convolution Network

Xiaohan Li, Gaowei Zhang, Kai Huang et al.

Sea surface temperature (SST) is uniquely important to the Earth's atmosphere since its dynamics are a major force in shaping local and global climate and profoundly affect our ecosystems. Accurate forecasting of SST brings significant economic and social implications, for example, better preparation for extreme weather such as severe droughts or tropical cyclones months ahead. However, such a task faces unique challenges due to the intrinsic complexity and uncertainty of ocean systems. Recently, deep learning techniques, such as graph neural networks (GNNs), have been applied to address this task. Even though these methods have had some success, they frequently have serious drawbacks when it comes to investigating dynamic spatiotemporal dependencies between signals. To solve this problem, this paper proposes a novel static and dynamic learnable personalized graph convolution network (SD-LPGC). Specifically, two graph learning layers are first constructed to respectively model the stable long-term and short-term evolutionary patterns hidden in the multivariate SST signals. Then, a learnable personalized convolution layer is designed to fuse this information. Our experiments on real SST datasets demonstrate the state-of-the-art performance of the proposed approach on the forecasting task.

CV · Jul 14, 2023
LEST: Large-scale LiDAR Semantic Segmentation with Transformer

Chuanyu Luo, Nuo Cheng, Sikun Ma et al.

Large-scale LiDAR-based point cloud semantic segmentation is a critical task in autonomous driving perception. Almost all of the previous state-of-the-art LiDAR semantic segmentation methods are variants of sparse 3D convolution. Although the Transformer architecture is becoming popular in the field of natural language processing and 2D computer vision, its application to large-scale point cloud semantic segmentation is still limited. In this paper, we propose a LiDAR sEmantic Segmentation architecture with pure Transformer, LEST. LEST comprises two novel components: a Space Filling Curve (SFC) Grouping strategy and a Distance-based Cosine Linear Transformer, DISCO. On the public nuScenes semantic segmentation validation set and SemanticKITTI test set, our model outperforms all the other state-of-the-art methods.

CV · Jul 25, 2023
Mini-PointNetPlus: a local feature descriptor in deep learning model for 3d environment perception

Chuanyu Luo, Nuo Cheng, Sikun Ma et al.

Common deep learning models for 3D environment perception often use pillarization/voxelization methods to convert point cloud data into pillars/voxels and then process it with a 2D/3D convolutional neural network (CNN). The pioneering work PointNet has been widely applied as a local feature descriptor, a fundamental component in deep learning models for 3D perception, to extract features of a point cloud. This is achieved by using a symmetric max-pooling operator which provides unique pillar/voxel features. However, by ignoring most of the points, the max-pooling operator causes an information loss, which reduces the model performance. To address this issue, we propose a novel local feature descriptor, mini-PointNetPlus, as a plug-and-play alternative to PointNet. Our basic idea is to separately project the data points to the individual features considered, each leading to a permutation invariant. Thus, the proposed descriptor transforms an unordered point cloud into a stable order. The vanilla PointNet is proven to be a special case of our mini-PointNetPlus. Because the proposed descriptor fully utilizes the features, we demonstrate in experiments a considerable performance improvement for 3D perception.
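
The behaviour of the symmetric max-pooling operator described above can be seen in a small sketch. The per-point feature values and the helper name `max_pool` are hypothetical; the sketch shows only the PointNet-style baseline pooling, not mini-PointNetPlus itself.

```python
def max_pool(points):
    """Channel-wise max over a list of per-point feature vectors."""
    return [max(channel) for channel in zip(*points)]

# Hypothetical per-point features for the four points inside one pillar.
pillar = [
    [0.1, 0.9, 0.2],
    [0.8, 0.3, 0.1],
    [0.5, 0.4, 0.7],
    [0.2, 0.2, 0.3],
]
pooled = max_pool(pillar)

# Reordering the points does not change the pooled feature (permutation invariance).
shuffled = [pillar[2], pillar[0], pillar[3], pillar[1]]
pooled_shuffled = max_pool(shuffled)

# Indices of the points that supply at least one channel maximum.
contributing = {
    max(range(len(pillar)), key=lambda i: pillar[i][c])
    for c in range(len(pillar[0]))
}
```

In this toy pillar only points 0, 1 and 2 supply a maximum, so the fourth point has no influence on the pooled feature, which illustrates the information loss the abstract refers to.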

CV · Dec 15, 2025
KlingAvatar 2.0 Technical Report

Kling Team, Jialu Chen, Yikang Ding et al.

Avatar video generation models have achieved remarkable progress in recent years. However, prior work exhibits limited efficiency in generating long-duration high-resolution videos, suffering from temporal drifting, quality degradation, and weak prompt following as video length increases. To address these challenges, we propose KlingAvatar 2.0, a spatio-temporal cascade framework that performs upscaling in both spatial resolution and temporal dimension. The framework first generates low-resolution blueprint video keyframes that capture global semantics and motion, and then refines them into high-resolution, temporally coherent sub-clips using a first-last frame strategy, while retaining smooth temporal transitions in long-form videos. To enhance cross-modal instruction fusion and alignment in extended videos, we introduce a Co-Reasoning Director composed of three modality-specific large language model (LLM) experts. These experts reason about modality priorities and infer underlying user intent, converting inputs into detailed storylines through multi-turn dialogue. A Negative Director further refines negative prompts to improve instruction alignment. Building on these components, we extend the framework to support ID-specific multi-character control. Extensive experiments demonstrate that our model effectively addresses the challenges of efficient, multimodally aligned long-form high-resolution video generation, delivering enhanced visual clarity, realistic lip-teeth rendering with accurate lip synchronization, strong identity preservation, and coherent multimodal instruction following.

IR · Oct 22, 2020 · Code
Basket Recommendation with Multi-Intent Translation Graph Neural Network

Zhiwei Liu, Xiaohan Li, Ziwei Fan et al.

The problem of basket recommendation (BR) is to recommend a ranking list of items to the current basket. Existing methods solve this problem by assuming the items within the same basket are correlated by one semantic relation, thus optimizing the item embeddings. However, this assumption breaks when there exist multiple intents within a basket. For example, assuming a basket contains {bread, cereal, yogurt, soap, detergent} where {bread, cereal, yogurt} are correlated through the "breakfast" intent, while {soap, detergent} are of "cleaning" intent, ignoring multiple relations among the items spoils the ability of the model to learn the embeddings. To resolve this issue, it is required to discover the intents within the basket. However, retrieving a multi-intent pattern is rather challenging, as intents are latent within the basket. Additionally, intents within the basket may also be correlated. Moreover, discovering a multi-intent pattern requires modeling high-order interactions, as the intents across different baskets are also correlated. To this end, we propose a new framework named as Multi-Intent Translation Graph Neural Network (MITGNN). MITGNN models T intents as tail entities translated from one corresponding basket embedding via T relation vectors. The relation vectors are learned through multi-head aggregators to handle user and item information. Additionally, MITGNN propagates multiple intents across our defined basket graph to learn the embeddings of users and items by aggregating neighbors. Extensive experiments on two real-world datasets prove the effectiveness of our proposed model on both transductive and inductive BR. The code is available online at https://github.com/JimLiu96/MITGNN.
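
The idea of obtaining several intents by translating one basket embedding with several relation vectors can be sketched as follows. The additive form (tail = head + relation, as in TransE-style models) is an assumption about what "translated" means here, and the function name `translate_intents` and all numeric values are hypothetical; the authors' repository linked above is the authoritative implementation.

```python
def translate_intents(basket_embedding, relation_vectors):
    """Return one intent embedding per relation vector: intent_t = basket + r_t."""
    return [
        [b + r for b, r in zip(basket_embedding, relation)]
        for relation in relation_vectors
    ]

basket_embedding = [1.0, 0.0]        # hypothetical basket embedding
relation_vectors = [
    [0.5, 0.5],                      # hypothetical relation for a "breakfast" intent
    [-1.0, 2.0],                     # hypothetical relation for a "cleaning" intent
]
intents = translate_intents(basket_embedding, relation_vectors)
```

With two relation vectors the single basket embedding yields two distinct intent embeddings, one per relation.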

IR · Feb 29, 2024
LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction

Chenhao Fang, Xiaohan Li, Zezhong Fan et al.

Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
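
The idea of iteratively learning per-model weights and aggregating labels by weighted vote can be sketched as below. This is not the LLM-ensemble algorithm from the paper: the log-odds weight update, the Laplace smoothing, the function names, and the toy predictions are assumptions chosen to show one common form of iterative weighted voting.

```python
import math
from collections import defaultdict

def weighted_vote(labels, weights):
    """Pick the label with the largest total weight."""
    score = defaultdict(float)
    for label, weight in zip(labels, weights):
        score[label] += weight
    return max(score, key=score.get)

def ensemble(predictions, iterations=10):
    """predictions[m][i] is the label that model m assigns to item i."""
    n_models, n_items = len(predictions), len(predictions[0])
    weights = [1.0] * n_models  # start from a plain majority vote
    consensus = []
    for _ in range(iterations):
        # Step 1: aggregate labels with the current weights.
        consensus = [
            weighted_vote([predictions[m][i] for m in range(n_models)], weights)
            for i in range(n_items)
        ]
        # Step 2: re-estimate each model's weight from its agreement with the consensus.
        for m in range(n_models):
            agree = sum(predictions[m][i] == consensus[i] for i in range(n_items))
            accuracy = (agree + 1) / (n_items + 2)  # smoothing keeps the log-odds finite
            weights[m] = max(math.log(accuracy / (1 - accuracy)), 0.0)
    return consensus, weights

# Hypothetical attribute values predicted by three models for four products.
predictions = [
    ["red", "blue", "green", "red"],
    ["red", "blue", "green", "blue"],
    ["blue", "blue", "red", "blue"],
]
consensus, weights = ensemble(predictions)
```

In this toy run the second model agrees with the consensus most often and ends with the largest weight, while the third model's weight drops to zero.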

85.6 · AI · Apr 29
SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

Dianyu Liu, Chuan Qin, Xi Chen et al.

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Our Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.

IR · Oct 16, 2024
Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations

Luyi Ma, Xiaohan Li, Zezhong Fan et al.

Integrating diverse data modalities is crucial for enhancing the performance of personalized recommendation systems. Traditional models, which often rely on singular data sources, lack the depth needed to accurately capture the multifaceted nature of item features and user behaviors. This paper introduces a novel framework for multi-behavior recommendations that fuses three modalities (visual, textual, and graph data) through alignment with large language models (LLMs). By incorporating visual information, we capture contextual and aesthetic item characteristics; textual data provides insights into user interests and item features in detail; and graph data elucidates relationships within the item-behavior heterogeneous graphs. Our proposed model, called Triple Modality Fusion (TMF), utilizes the power of LLMs to align and integrate these three modalities, achieving a comprehensive representation of user behaviors. The LLM models the user's interactions, including behaviors and item features, in natural language. Initially, the LLM is warmed up using only natural language-based prompts. We then devise the modality fusion module based on cross-attention and self-attention mechanisms to integrate different modalities from other models into the same embedding space and incorporate them into an LLM. Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy. Further ablation studies validate the effectiveness of our model design and the benefits of TMF.

CV · Apr 17, 2024
Prompt Optimizer of Text-to-Image Diffusion Models for Abstract Concept Understanding

Zezhong Fan, Xiaohan Li, Chenhao Fang et al.

The rapid evolution of text-to-image diffusion models has opened the door to generative AI, enabling the translation of textual descriptions into visually compelling images with remarkable quality. However, a persistent challenge within this domain is the optimization of prompts to effectively convey abstract concepts into concrete objects. For example, text encoders can hardly express "peace", but can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC) specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model, and then fine-tuned with a curated dataset of abstract concept prompts. The dataset is created with GPT-4 to extend the abstract concept to a scene and concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy, focusing on the alignment between the images generated by a stable diffusion model and the optimized prompts. Through extensive experiments, we demonstrate that our proposed POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the description of abstract concepts and alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.

CL · Oct 17, 2024
RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs

Jiatan Huang, Mingchen Li, Zonghai Yao et al.

Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, along with intricate topological structures, makes this task particularly challenging. As the nature of relational path information could enhance the inference ability of Large Language Models (LLMs), efficiently retrieving more complex relational path information from TKGs presents another key challenge. To tackle these challenges, we first develop a Dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage. We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions. We conduct rigorous expert evaluation to validate the quality of our synthesized queries. We then introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, to automatically extract relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain, as its relation types and entities are complex and publicly available. Experimental results indicate that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.

CL · Oct 16, 2024
Iter-AHMCL: Alleviate Hallucination for Large Language Model via Iterative Model-level Contrastive Learning

Huiwen Wu, Xiaohan Li, Xiaogang Xu et al.

The development of Large Language Models (LLMs) has significantly advanced various AI applications in commercial and scientific research fields, such as scientific literature summarization, writing assistance, and knowledge graph construction. However, a significant challenge is the high risk of hallucination during LLM inference, which can lead to security concerns like factual inaccuracies, inconsistent information, and fabricated content. To tackle this issue, it is essential to develop effective methods for reducing hallucination while maintaining the original capabilities of the LLM. This paper introduces a novel approach called Iterative Model-level Contrastive Learning (Iter-AHMCL) to address hallucination. This method modifies the representation layers of pre-trained LLMs by using contrastive 'positive' and 'negative' models, trained on data with and without hallucinations. By leveraging the differences between these two models, we create a more straightforward pathway to eliminate hallucinations, and the iterative nature of contrastive learning further enhances performance. Experimental validation on four pre-trained foundation LLMs (LLaMA2, Alpaca, LLaMA3, and Qwen) fine-tuned with a specially designed dataset shows that our approach achieves an average improvement of 10.1 points on the TruthfulQA benchmark. Comprehensive experiments demonstrate the effectiveness of Iter-AHMCL in reducing hallucination while maintaining the general capabilities of LLMs.

CV · Aug 26, 2025
MIDAS: Multimodal Interactive Digital-humAn Synthesis via Real-time Autoregressive Video Generation

Ming Chen, Liyuan Cui, Wenyuan Zhang et al.

Recently, interactive digital human video generation has attracted widespread attention and achieved remarkable progress. However, building a practical system that can interact with diverse input signals in real time remains challenging for existing methods, which often struggle with heavy computational cost and limited controllability. In this work, we introduce an autoregressive video generation framework that enables interactive multimodal control and low-latency extrapolation in a streaming manner. With minimal modifications to a standard large language model (LLM), our framework accepts multimodal condition encodings including audio, pose, and text, and outputs spatially and semantically coherent representations to guide the denoising process of a diffusion head. To support this, we construct a large-scale dialogue dataset of approximately 20,000 hours from multiple sources, providing rich conversational scenarios for training. We further introduce a deep compression autoencoder with up to 64× reduction ratio, which effectively alleviates the long-horizon inference burden of the autoregressive model. Extensive experiments on duplex conversation, multilingual human synthesis, and interactive world model highlight the advantages of our approach in low latency, high efficiency, and fine-grained multimodal controllability.

LGMay 22, 2024
CG-FedLLM: How to Compress Gradients in Federated Fine-tuning for Large Language Models

Huiwen Wu, Xiaogang Xu, Deyi Zhang et al.

The success of current Large Language Models (LLMs) hinges on extensive training data that is collected and stored centrally, an approach called Centralized Learning (CL). However, collecting data this way poses a privacy threat, and one potential solution is Federated Learning (FL), which transfers gradients, not raw data, among clients. Unlike traditional networks, FL for LLMs incurs significant communication costs due to their enormous number of parameters. This study introduces an innovative approach to compress gradients to improve communication efficiency during LLM FL, formulating a new FL pipeline named CG-FedLLM. This approach integrates an encoder on the client side to acquire the compressed gradient features and a decoder on the server side to reconstruct the gradients. We also developed a novel training strategy that comprises Temporal-ensemble Gradient-Aware Pre-training (TGAP) to identify characteristic gradients of the target model and Federated AutoEncoder-Involved Fine-tuning (FAF) to compress gradients adaptively. Extensive experiments confirm that our approach reduces communication costs and improves performance (e.g., an average improvement of 3 points over traditional CL- and FL-based fine-tuning with LLaMA on the well-recognized C-Eval benchmark). This improvement arises because our encoder-decoder, trained via TGAP and FAF, can filter gradients while selectively preserving critical features. Furthermore, we present a series of experimental analyses focusing on the signal-to-noise ratio, compression rate, and robustness within this privacy-centric framework, providing insight into developing more efficient and secure LLMs.

CVSep 11, 2025
Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Yikang Ding, Jiwen Liu, Wenyuan Zhang et al.

Recent advances in audio-driven avatar video generation have significantly enhanced audio-visual realism. However, existing methods treat instruction conditioning merely as low-level tracking driven by acoustic or visual cues, without modeling the communicative purpose conveyed by the instructions. This limitation compromises their narrative coherence and character expressiveness. To bridge this gap, we introduce Kling-Avatar, a novel cascaded framework that unifies multimodal instruction understanding with photorealistic portrait generation. Our approach adopts a two-stage pipeline. In the first stage, we design a multimodal large language model (MLLM) director that produces a blueprint video conditioned on diverse instruction signals, thereby governing high-level semantics such as character motion and emotions. In the second stage, guided by blueprint keyframes, we generate multiple sub-clips in parallel using a first-last frame strategy. This global-to-local framework preserves fine-grained details while faithfully encoding the high-level intent behind multimodal instructions. Our parallel architecture also enables fast and stable generation of long-duration videos, making it suitable for real-world applications such as digital human livestreaming and vlogging. To comprehensively evaluate our method, we construct a benchmark of 375 curated samples covering diverse instructions and challenging scenarios. Extensive experiments demonstrate that Kling-Avatar is capable of generating vivid, fluent, long-duration videos at up to 1080p and 48 fps, achieving superior performance in lip synchronization accuracy, emotion and dynamic expressiveness, instruction controllability, identity preservation, and cross-domain generalization. These results establish Kling-Avatar as a new benchmark for semantically grounded, high-fidelity audio-driven avatar synthesis.

CVJul 25, 2025
DINO-SLAM: DINO-informed RGB-D SLAM for Neural Implicit and Explicit Representations

Ziren Gong, Xiaohan Li, Fabio Tosi et al.

This paper presents DINO-SLAM, a DINO-informed design strategy to enhance neural implicit (Neural Radiance Field, NeRF) and explicit (3D Gaussian Splatting, 3DGS) representations in SLAM systems through more comprehensive scene representations. To this end, we rely on a Scene Structure Encoder (SSE) that enriches DINO features into Enhanced DINO ones (EDINO) to capture hierarchical scene elements and their structural relationships. Building upon it, we propose two foundational paradigms for NeRF and 3DGS SLAM systems integrating EDINO features. Our DINO-informed pipelines achieve superior performance on the Replica, ScanNet, and TUM datasets compared to state-of-the-art methods.

LGDec 22, 2024
DR-Encoder: Encode Low-rank Gradients with Random Prior for Large Language Models Differentially Privately

Huiwen Wu, Deyi Zhang, Xiaohan Li et al.

The emergence of Large Language Models (LLMs) has shown their superiority in a wide range of disciplines, including language understanding and translation, relational logic reasoning, and even solving partial differential equations. The Transformer is the pervasive backbone architecture for foundation model construction. It is vital to research how to adjust the Transformer architecture to achieve an end-to-end privacy guarantee in LLM fine-tuning. In this paper, we investigate three potential sources of information leakage during a federated fine-tuning procedure for LLMs (FedLLM). Based on these potential leaks, we provide an end-to-end privacy guarantee solution for FedLLM by inserting two-stage randomness. The first stage is to train a gradient auto-encoder with a Gaussian random prior based on the statistical information of the gradients generated by local clients. The second stage is to fine-tune the overall LLM with a differential privacy guarantee by adding appropriate Gaussian noise. We show the efficiency and accuracy gains of our proposed method with several foundation models and two popular evaluation benchmarks. Furthermore, we present a comprehensive privacy analysis with Gaussian Differential Privacy (GDP) and Rényi Differential Privacy (RDP).
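As a rough illustration of what "adding Gaussian noise for a differential privacy guarantee" can look like, the sketch below shows the generic clip-then-noise step commonly used in differentially private gradient descent. It is not the DR-Encoder method itself: the function name `clip_and_noise`, the parameters `clip_norm` and `noise_multiplier`, and the toy gradient are all assumptions made for this example, and no privacy accounting is performed.

```python
import math
import random


def clip_and_noise(grad, clip_norm, noise_multiplier, rng):
    """Clip a gradient to L2 norm <= clip_norm, then add Gaussian noise.

    Generic Gaussian-mechanism step (an assumption for illustration),
    not the procedure described in the abstract above.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    # Scale down only when the gradient exceeds the clipping bound.
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    clipped = [g * scale for g in grad]
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [g + rng.gauss(0.0, sigma) for g in clipped]
    return clipped, noisy


rng = random.Random(0)  # fixed seed so the example is repeatable
clipped, noisy = clip_and_noise([3.0, 4.0], clip_norm=1.0,
                                noise_multiplier=0.5, rng=rng)
print(clipped)  # the clipped gradient has L2 norm 1.0
print(noisy)    # the clipped gradient plus Gaussian noise
```

Clipping bounds each client's contribution, which is what lets the noise scale be tied to a privacy guarantee; the actual guarantee depends on an accounting method such as the GDP or RDP analyses the abstract mentions.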

CVSep 24, 2025
LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

Zezhong Fan, Xiaohan Li, Luyi Ma et al.

Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but they struggle to capture semantic richness in visual scenes. To bridge this gap, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images containing target objects, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. We then leverage compositional diffusion (a method traditionally used in robotics) to synthesize bounding boxes that respect the object relations encoded in the scene graph for spatial layouts. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout guided by designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism, and aesthetic alignment.

CVJul 29, 2025
Ov3R: Open-Vocabulary Semantic 3D Reconstruction from RGB Videos

Ziren Gong, Xiaohan Li, Fabio Tosi et al.

We present Ov3R, a novel framework for open-vocabulary semantic 3D reconstruction from RGB video streams, designed to advance Spatial AI. The system features two key components: CLIP3R, a CLIP-informed 3D reconstruction module that predicts dense point maps from overlapping clips while embedding object-level semantics; and 2D-3D OVS, a 2D-3D open-vocabulary semantic module that lifts 2D features into 3D by learning fused descriptors integrating spatial, geometric, and semantic cues. Unlike prior methods, Ov3R incorporates CLIP semantics directly into the reconstruction process, enabling globally consistent geometry and fine-grained semantic alignment. Our framework achieves state-of-the-art performance in both dense 3D reconstruction and open-vocabulary 3D segmentation, marking a step forward toward real-time, semantics-aware Spatial AI.

CLMay 7, 2025
Natural Language Generation in Healthcare: A Review of Methods and Applications

Mengxian Lyu, Xiaohan Li, Ziyi Chen et al.

Natural language generation (NLG) is the key technology to achieve generative artificial intelligence (AI). With the breakthroughs in large language models (LLMs), NLG has been widely used in various medical applications, demonstrating the potential to enhance clinical workflows, support clinical decision-making, and improve clinical documentation. Heterogeneous and diverse medical data modalities, such as medical text, images, and knowledge bases, are utilized in NLG. Researchers have proposed many generative models and applied them in a number of healthcare applications. There is a need for a comprehensive review of NLG methods and applications in the medical domain. In this study, we systematically reviewed 113 scientific publications from a total of 3,988 NLG-related articles identified using a literature search, focusing on data modality, model architecture, clinical applications, and evaluation methods. Following PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines, we categorize key methods, identify clinical applications, and assess their capabilities, limitations, and emerging challenges. This timely review covers the key NLG technologies and medical applications and provides valuable insights for future studies to leverage NLG to transform medical discovery and healthcare.

APP-PHApr 11, 2025
Bayesian Reasoning Enabled by Spin-Orbit Torque Magnetic Tunnel Junctions

Yingqian Xu, Xiaohan Li, Caihua Wan et al.

Bayesian networks play an increasingly important role in data mining, inference, and reasoning with the rapid development of artificial intelligence. In this paper, we present proof-of-concept experiments demonstrating the use of spin-orbit torque magnetic tunnel junctions (SOT-MTJs) in Bayesian network reasoning. Not only can the target probability distribution function (PDF) of a Bayesian network be precisely formulated by a conditional probability table as usual, but it can also be quantitatively parameterized by a probabilistic forward-propagating neural network. Moreover, the parameters of the network can approach the optimum through a simple point-by-point training algorithm, which removes the need to memorize all historical data or to statistically summarize the conditional probabilities behind them, significantly improving storage efficiency and economizing data pretreatment. Furthermore, we developed a simple medical diagnostic system using the SOT-MTJ as a random number generator and sampler, showcasing the application of SOT-MTJ-based Bayesian reasoning. This SOT-MTJ-based Bayesian reasoning shows great promise in the field of artificial probabilistic neural networks, broadening the scope of spintronic device applications and providing an efficient and low-storage solution for complex reasoning tasks.

CLMar 19, 2024
Automatic Summarization of Doctor-Patient Encounter Dialogues Using Large Language Model through Prompt Tuning

Mengxian Lyu, Cheng Peng, Xiaohan Li et al.

Automatic text summarization (ATS) is an emerging technology to assist clinicians in providing continuous and coordinated care. This study presents an approach to summarize doctor-patient dialogues using generative large language models (LLMs). We developed prompt-tuning algorithms to instruct generative LLMs to summarize clinical text. We examined the prompt-tuning strategies, the size of soft prompts, and the few-shot learning ability of GatorTronGPT, a generative clinical LLM developed using 277 billion clinical and general English words with up to 20 billion parameters. We compared GatorTronGPT with a previous solution based on fine-tuning of a widely used T5 model, using the clinical benchmark dataset MTS-DIALOG. The experimental results show that the GatorTronGPT-20B model achieved the best performance on all evaluation metrics. The proposed solution has a low computing cost because the LLM parameters are not updated during prompt-tuning. This study demonstrates the efficiency of generative clinical LLMs for clinical ATS through prompt tuning.

IRMay 17, 2023
Knowledge Graph Completion Models are Few-shot Learners: An Empirical Study of Relation Labeling in E-commerce with LLMs

Jiao Chen, Luyi Ma, Xiaohan Li et al.

Knowledge Graphs (KGs) play a crucial role in enhancing e-commerce system performance by providing structured information about entities and their relationships, such as complementary or substitutable relations between products or product types, which can be utilized in recommender systems. However, relation labeling in KGs remains a challenging task due to the dynamic nature of e-commerce domains and the associated cost of human labor. Recently, breakthroughs in Large Language Models (LLMs) have shown surprising results in numerous natural language processing tasks. In this paper, we conduct an empirical study of LLMs for relation labeling in e-commerce KGs, investigating their powerful learning capabilities in natural language and effectiveness in predicting relations between product types with limited labeled data. We evaluate various LLMs, including PaLM and GPT-3.5, on benchmark datasets, demonstrating their ability to achieve competitive performance compared to humans on relation labeling tasks using just 1 to 5 labeled examples per relation. Additionally, we experiment with different prompt engineering techniques to examine their impact on model performance. Our results show that LLMs significantly outperform existing KG completion models in relation labeling for e-commerce KGs and exhibit performance strong enough to replace human labeling.

CVJan 30, 2022
MVP-Net: Multiple View Pointwise Semantic Segmentation of Large-Scale Point Clouds

Chuanyu Luo, Xiaohan Li, Nuo Cheng et al.

Semantic segmentation of 3D point clouds is an essential task for autonomous driving environment perception. The pipeline of most pointwise point cloud semantic segmentation methods includes point sampling, neighbor searching, feature aggregation, and classification. Neighbor searching methods such as the K-nearest neighbors (KNN) algorithm have been widely applied. However, the complexity of KNN is always an efficiency bottleneck. In this paper, we propose an end-to-end neural architecture, Multiple View Pointwise Net (MVP-Net), to efficiently and directly infer large-scale outdoor point clouds without KNN or any complex pre/postprocessing. Instead, assumption-based space filling curves and multi-rotation of the point cloud are introduced for point feature aggregation and receptive field expansion. Numerical experiments show that the proposed MVP-Net is 11 times faster than the most efficient pointwise semantic segmentation method, RandLA-Net, and achieves the same accuracy on the large-scale benchmark SemanticKITTI dataset.
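To make the idea of replacing a neighbor search with a space filling curve more concrete, here is a minimal sketch using a Z-order (Morton) curve. This is a stand-in chosen for illustration: the abstract does not say which curves MVP-Net uses, and the names `morton3d` and `z_order`, the `cell` size, and the sample points are all hypothetical.

```python
def morton3d(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three non-negative grid indices."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code


def z_order(points, cell=1.0, bits=10):
    """Return point indices sorted along a Z-order curve.

    Points are quantized to a grid of size `cell`; points sharing a
    grid cell keep their original relative order (sorted is stable).
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    limit = (1 << bits) - 1  # clamp so indices fit in `bits` bits

    def key(i):
        q = [min(int((points[i][d] - mins[d]) / cell), limit)
             for d in range(3)]
        return morton3d(q[0], q[1], q[2], bits)

    return sorted(range(len(points)), key=key)


# Hypothetical points: two spatial clusters given in interleaved order.
points = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0),
          (0.5, 0.2, 0.1), (5.2, 5.1, 5.3)]
order = z_order(points)
print(order)  # indices of points from the same cluster end up adjacent
```

After sorting, approximate neighbors can be read from a fixed window of adjacent indices, which costs one sort rather than a per-point KNN query. A single curve can separate some points that are close in space, which is one plausible reason to combine several views or rotations.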

IRNov 28, 2021
Pre-training Recommender Systems via Reinforced Attentive Multi-relational Graph Neural Network

Xiaohan Li, Zhiwei Liu, Stephen Guo et al.

Recently, Graph Neural Networks (GNNs) have proven their effectiveness for recommender systems. Existing studies have applied GNNs to capture collaborative relations in the data. However, in real-world scenarios, the relations in a recommendation graph can be of various kinds. For example, two movies may be associated either by the same genre or by the same director/actor. If we use a single graph to represent all these relations, the graph can be too complex to process. To address this issue, we bring in the idea of pre-training to process the complex graph step by step. Based on the idea of divide-and-conquer, we separate the large graph into three sub-graphs: user graph, item graph, and user-item interaction graph. Then the user and item embeddings are pre-trained from the user and item graphs, respectively. To conduct pre-training, we construct the multi-relational user graph and item graph based on their attributes. In this paper, we propose a novel Reinforced Attentive Multi-relational Graph Neural Network (RAM-GNN) to pre-train the user and item embeddings on the user and item graphs prior to the recommendation step. Specifically, we design a relation-level attention layer to learn the importance of different relations. Next, a Reinforced Neighbor Sampler (RNS) is applied to search for the optimal filtering threshold for sampling top-k similar neighbors in the graph, which avoids the over-smoothing issue. We initialize the recommendation model with the pre-trained user/item embeddings. Finally, an aggregation-based GNN model is utilized to learn from the collaborative relations in the user-item interaction graph and provide recommendations. Our experiments demonstrate that RAM-GNN outperforms other state-of-the-art graph-based recommendation models and multi-relational graph neural networks.

LGJan 17, 2021
Heterogeneous Similarity Graph Neural Network on Electronic Health Records

Zheng Liu, Xiaohan Li, Hao Peng et al.

Mining Electronic Health Records (EHRs) has become a promising topic because of the rich information they contain. By learning from EHRs, machine learning models can be built to help human experts make medical decisions and thus improve healthcare quality. Recently, many models based on sequential or graph models have been proposed to achieve this goal. EHRs contain multiple entities and relations and can be viewed as a heterogeneous graph. However, previous studies ignore the heterogeneity in EHRs. On the other hand, current heterogeneous graph neural networks cannot simply be used on an EHR graph because of the existence of hub nodes in it. To address this issue, we propose Heterogeneous Similarity Graph Neural Network (HSGNN) to analyze EHRs with a novel heterogeneous GNN. Our framework consists of two parts: one is a preprocessing method and the other is an end-to-end GNN. The preprocessing method normalizes edges and splits the EHR graph into multiple homogeneous graphs, each of which contains partial information of the original EHR graph. The GNN takes all homogeneous graphs as input and fuses them into one graph to make a prediction. Experimental results show that HSGNN outperforms other baselines in the diagnosis prediction task.

IRJan 8, 2021
Dynamic Graph Collaborative Filtering

Xiaohan Li, Mengqi Zhang, Shu Wu et al.

Dynamic recommendation is essential for modern recommender systems to provide real-time predictions based on sequential data. In real-world scenarios, the popularity of items and interests of users change over time. Based on this assumption, many previous works focus on interaction sequences and learn evolutionary embeddings of users and items. However, we argue that sequence-based models are not able to capture collaborative information among users and items directly. Here we propose Dynamic Graph Collaborative Filtering (DGCF), a novel framework leveraging dynamic graphs to capture collaborative and sequential relations of both items and users at the same time. We propose three update mechanisms: zero-order 'inheritance', first-order 'propagation', and second-order 'aggregation', to represent the impact on a user or item when a new interaction occurs. Based on them, we update related user and item embeddings simultaneously when interactions occur in turn, and then use the latest embeddings to make recommendations. Extensive experiments conducted on three public datasets show that DGCF significantly outperforms the state-of-the-art dynamic recommendation methods by up to 30%. Our approach achieves higher performance when the dataset contains less action repetition, indicating the effectiveness of integrating dynamic collaborative information.

MLJun 3, 2016
Statistical Pattern Recognition for Driving Styles Based on Bayesian Probability and Kernel Density Estimation

Wenshuo Wang, Junqiang Xi, Xiaohan Li

Driving styles have a great influence on vehicle fuel economy, active safety, and drivability. To recognize driving styles of path-tracking behaviors for different drivers, a statistical pattern-recognition method is developed to deal with the uncertainty of driving styles or characteristics based on probability density estimation. First, to describe driver path-tracking styles, vehicle speed and throttle opening are selected as the discriminative parameters, and a conditional kernel density function of vehicle speed and throttle opening is built, respectively, to describe the uncertainty and probability of two representative driving styles, e.g., aggressive and normal. Meanwhile, a posterior probability of each element in the feature vector is obtained using full Bayesian theory. Second, a Euclidean distance method is used to decide which class the driver belongs to, instead of calculating the complex covariance between every two elements of the feature vectors. By comparing the Euclidean distance between every element in the feature vector, driving styles are classified into seven levels ranging from low normal to high aggressive. Subsequently, to show the benefits of the proposed pattern-recognition method, a cross-validation method is used to compare it with a fuzzy logic-based pattern-recognition method. The experimental results show that the proposed statistical pattern-recognition method for driving styles based on kernel density estimation is more efficient and stable than the fuzzy logic-based method.
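The first step described above, a class-conditional kernel density combined with Bayes' rule, can be sketched roughly as follows. This is a simplified one-feature illustration, not the paper's implementation: the speed samples, the bandwidth, the equal priors, and the names `gaussian_kde` and `posterior` are all assumptions, and the Euclidean-distance step and the seven-level grading are omitted.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a function of x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def posterior(x, densities, priors):
    """Bayes' rule: P(class | x) is proportional to p(x | class) * P(class)."""
    joint = {c: densities[c](x) * priors[c] for c in densities}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}


# Hypothetical vehicle-speed samples (km/h) for two driving styles.
speeds = {
    "normal": [40.0, 45.0, 50.0, 55.0],
    "aggressive": [70.0, 75.0, 80.0, 85.0],
}
densities = {c: gaussian_kde(s, bandwidth=5.0) for c, s in speeds.items()}
priors = {"normal": 0.5, "aggressive": 0.5}  # assumed equal priors

print(posterior(78.0, densities, priors))  # mass concentrates on "aggressive"
print(posterior(45.0, densities, priors))  # mass concentrates on "normal"
```

A second feature such as throttle opening would get its own density per class in the same way, giving one posterior per element of the feature vector.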