Kaize Shi

LG
h-index17
18papers
106citations
Novelty48%
AI Score52

18 Papers

CLAug 9, 2023
LLaMA-E: Empowering E-commerce Authoring with Object-Interleaved Instruction Following

Kaize Shi, Xueyao Sun, Dingxian Wang et al.

E-commerce authoring entails creating engaging, diverse, and targeted content to enhance preference elicitation and retrieval experience. While Large Language Models (LLMs) have revolutionized content generation, they often fall short in e-commerce applications due to their limited memorization of domain-specific features. This paper proposes LLaMA-E, the unified e-commerce authoring models that address the contextual preferences of customers, sellers, and platforms, the essential objects in e-commerce operation. We design the instruction set derived from tasks of ads generation, query-enhanced product title rewriting, product classification, purchase intent speculation, and general e-commerce Q&A. The instruction formulation ensures the interleaved cover of the presented and required object features, allowing the alignment of base models to parameterise e-commerce knowledge comprehensively. The proposed LLaMA-E models achieve state-of-the-art evaluation performance and exhibit the advantage in zero-shot practical applications. To our knowledge, this is the first LLM tailored to empower authoring applications with comprehensive scenario understanding by integrating features focused on participated objects.

LGSep 1, 2022
Distributional Drift Adaptation with Temporal Conditional Variational Autoencoder for Multivariate Time Series Forecasting

Hui He, Qi Zhang, Kun Yi et al.

Due to the non-stationary nature, the distribution of real-world multivariate time series (MTS) changes over time, which is known as distribution drift. Most existing MTS forecasting models greatly suffer from distribution drift and degrade the forecasting performance over time. Existing methods address distribution drift via adapting to the latest arrived data or self-correcting per the meta knowledge derived from future data. Despite their great success in MTS forecasting, these methods hardly capture the intrinsic distribution changes, especially from a distributional perspective. Accordingly, we propose a novel framework temporal conditional variational autoencoder (TCVAE) to model the dynamic distributional dependencies over time between historical observations and future data in MTSs and infer the dependencies as a temporal conditional distribution to leverage latent variables. Specifically, a novel temporal Hawkes attention mechanism represents temporal factors subsequently fed into feed-forward networks to estimate the prior Gaussian distribution of latent variables. The representation of temporal factors further dynamically adjusts the structures of Transformer-based encoder and decoder to distribution changes by leveraging a gated attention mechanism. Moreover, we introduce conditional continuous normalization flow to transform the prior Gaussian to a complex and form-free distribution to facilitate flexible inference of the temporal conditional distribution. Extensive experiments conducted on six real-world MTS datasets demonstrate the TCVAE's superior robustness and effectiveness over the state-of-the-art MTS forecasting baselines. We further illustrate the TCVAE applicability through multifaceted case studies and visualization in real-world scenarios.

MAMar 26Code
Belief-Driven Multi-Agent Collaboration via Approximate Perfect Bayesian Equilibrium for Social Simulation

Weiwei Fang, Lin Li, Kaize Shi et al.

High-fidelity social simulation is pivotal for addressing complex Web societal challenges, yet it demands agents capable of authentically replicating the dynamic spectrum of human interaction. Current LLM-based multi-agent frameworks, however, predominantly adhere to static interaction topologies, failing to capture the fluid oscillation between cooperative knowledge synthesis and competitive critical reasoning seen in real-world scenarios. This rigidity often leads to unrealistic ``groupthink'' or unproductive deadlocks, undermining the credibility of simulations for decision support. To bridge this gap, we propose \textit{BEACOF}, a \textit{belief-driven adaptive collaboration framework} inspired by Perfect Bayesian Equilibrium (PBE). By modeling social interaction as a dynamic game of incomplete information, BEACOF rigorously addresses the circular dependency between collaboration type selection and capability estimation. Agents iteratively refine probabilistic beliefs about peer capabilities and autonomously modulate their collaboration strategy, thereby ensuring sequentially rational decisions under uncertainty. Validated across adversarial (judicial), open-ended (social) and mixed (medical) scenarios, BEACOF prevents coordination failures and fosters robust convergence toward high-quality solutions, demonstrating superior potential for reliable social simulation. Source codes and datasets are publicly released at: https://github.com/WUT-IDEA/BEACOF.

CLMar 25
CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

Kaize Shi, Xueyao Sun, Qika Lin et al.

Retrieval-augmented generation (RAG) has shown promising results in enhancing Q&A by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web Q&A benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.

IRMar 26
Hyena Operator for Fast Sequential Recommendation

Jiahao Liu, Lin Li, Zhiyuan Li et al.

Sequential recommendation models, particularly those based on attention, achieve strong accuracy but incur quadratic complexity, making long user histories prohibitively expensive. Sub-quadratic operators such as Hyena provide efficient alternatives in language modeling, but their potential in recommendation remains underexplored. We argue that Hyena faces challenges in recommendation due to limited representation capacity on sparse, long user sequences. To address these challenges, we propose HyenaRec, a novel sequential recommender that integrates polynomial-based kernel parameterization with gated convolutions. Specifically, we design convolutional kernels using Legendre orthogonal polynomials, which provides a smooth and compact basis for modeling long-term temporal dependencies. A complementary gating mechanism captures fine-grained short-term behavioral bursts, yielding a hybrid architecture that balances global temporal evolution with localized user interests under sparse feedback. This construction enhances expressiveness while scaling linearly with sequence length. Extensive experiments on multiple real-world datasets demonstrate that HyenaRec consistently outperforms Attention-, Recurrent-, and other baselines in ranking accuracy. Moreover, it trains significantly faster (up to 6x speedup), with particularly pronounced advantages on long-sequence scenarios where efficiency is maintained without sacrificing accuracy. These results highlight polynomial-based kernel parameterization as a principled and scalable alternative to attention for sequential recommendation.

CLMar 15
Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

Yuanchi Ma, Kaize Shi, Hui He et al.

Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp's narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.

LGFeb 23, 2024
Deep Coupling Network For Multivariate Time Series Forecasting

Kun Yi, Qi Zhang, Hui He et al.

Multivariate time series (MTS) forecasting is crucial in many real-world applications. To achieve accurate MTS forecasting, it is essential to simultaneously consider both intra- and inter-series relationships among time series data. However, previous work has typically modeled intra- and inter-series relationships separately and has disregarded multi-order interactions present within and between time series data, which can seriously degrade forecasting accuracy. In this paper, we reexamine intra- and inter-series relationships from the perspective of mutual information and accordingly construct a comprehensive relationship learning mechanism tailored to simultaneously capture the intricate multi-order intra- and inter-series couplings. Based on the mechanism, we propose a novel deep coupling network for MTS forecasting, named DeepCN, which consists of a coupling mechanism dedicated to explicitly exploring the multi-order intra- and inter-series relationships among time series data concurrently, a coupled variable representation module aimed at encoding diverse variable patterns, and an inference module facilitating predictions through one forward step. Extensive experiments conducted on seven real-world datasets demonstrate that our proposed DeepCN achieves superior performance compared with the state-of-the-art baselines.

LGFeb 18, 2024
A Temporally Disentangled Contrastive Diffusion Model for Spatiotemporal Imputation

Yakun Chen, Kaize Shi, Zhangkai Wu et al.

Spatiotemporal data analysis is pivotal across various domains, such as transportation, meteorology, and healthcare. The data collected in real-world scenarios are often incomplete due to device malfunctions and network errors. Spatiotemporal imputation aims to predict missing values by exploiting the spatial and temporal dependencies in the observed data. Traditional imputation approaches based on statistical and machine learning techniques require the data to conform to their distributional assumptions, while graph and recurrent neural networks are prone to error accumulation problems due to their recurrent structures. Generative models, especially diffusion models, can potentially circumvent the reliance on inaccurate, previously imputed values for future predictions; However, diffusion models still face challenges in generating stable results. We propose to address these challenges by designing conditional information to guide the generative process and expedite the training process. We introduce a conditional diffusion framework called C$^2$TSD, which incorporates disentangled temporal (trend and seasonality) representations as conditional information and employs contrastive learning to improve generalizability. Our extensive experiments on three real-world datasets demonstrate the superior performance of our approach compared to a number of state-of-the-art baselines.

LGFeb 2, 2025
A Survey of Quantized Graph Representation Learning: Connecting Graph Structures with Large Language Models

Qika Lin, Zhen Peng, Kaize Shi et al.

Recent years have witnessed rapid advances in graph representation learning, with the continuous embedding approach emerging as the dominant paradigm. However, such methods encounter issues regarding parameter efficiency, interpretability, and robustness. Thus, Quantized Graph Representation (QGR) learning has recently gained increasing interest, which represents the graph structure with discrete codes instead of conventional continuous embeddings. Given its analogous representation form to natural language, QGR also possesses the capability to seamlessly integrate graph structures with large language models (LLMs). As this emerging paradigm is still in its infancy yet holds significant promise, we undertake this thorough survey to promote its rapid future prosperity. We first present the background of the general quantization methods and their merits. Moreover, we provide an in-depth demonstration of current QGR studies from the perspectives of quantized strategies, training objectives, distinctive designs, knowledge graph quantization, and applications. We further explore the strategies for code dependence learning and integration with LLMs. At last, we give discussions and conclude future directions, aiming to provide a comprehensive picture of QGR and inspire future research.

CVJan 12, 2024
APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning

Guiming Cao, Kaize Shi, Hong Fu et al.

Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research including the challenge of the sensitivity to text input and the tuning process across multi-modal prompts. With the advanced utilization of the V-L model like CLIP, recent approaches deploy learnable prompts instead of hand-craft prompts to boost the generalization performance and address the aforementioned challenges. Inspired by layer-wise training, which is wildly used in image fusion, we note that using a sequential training process to adapt different modalities branches of CLIP efficiently facilitates the improvement of generalization. In the context of addressing the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe) for tuning both modalities prompts, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, which indicates a competitive generalization performance in line with the state-of-the-art. Preeminently, APLe shows robustness and favourable performance in prompt-length experiments with an absolute advantage in adopting the V-L models.

LGNov 18, 2025
Certified Signed Graph Unlearning

Junpeng Zhao, Lin Li, Kaixi Hu et al.

Signed graphs model complex relationships through positive and negative edges, with widespread real-world applications. Given the sensitive nature of such data, selective removal mechanisms have become essential for privacy protection. While graph unlearning enables the removal of specific data influences from Graph Neural Networks (GNNs), existing methods are designed for conventional GNNs and overlook the unique heterogeneous properties of signed graphs. When applied to Signed Graph Neural Networks (SGNNs), these methods lose critical sign information, degrading both model utility and unlearning effectiveness. To address these challenges, we propose Certified Signed Graph Unlearning (CSGU), which provides provable privacy guarantees while preserving the sociological principles underlying SGNNs. CSGU employs a three-stage method: (1) efficiently identifying minimal influenced neighborhoods via triangular structures, (2) applying sociological theories to quantify node importance for optimal privacy budget allocation, and (3) performing importance-weighted parameter updates to achieve certified modifications with minimal utility degradation. Extensive experiments demonstrate that CSGU outperforms existing methods, achieving superior performance in both utility preservation and unlearning effectiveness on SGNNs.

CLNov 24, 2025
Concept than Document: Context Compression via AMR-based Conceptual Entropy

Kaize Shi, Xueyao Sun, Xiaohui Tao et al.

Large Language Models (LLMs) face information overload when handling long contexts, particularly in Retrieval-Augmented Generation (RAG) where extensive supporting documents often introduce redundant content. This issue not only weakens reasoning accuracy but also increases computational overhead. We propose an unsupervised context compression framework that exploits Abstract Meaning Representation (AMR) graphs to preserve semantically essential information while filtering out irrelevant text. By quantifying node-level entropy within AMR graphs, our method estimates the conceptual importance of each node, enabling the retention of core semantics. Specifically, we construct AMR graphs from raw contexts, compute the conceptual entropy of each node, and screen significant informative nodes to form a condensed and semantically focused context than raw documents. Experiments on the PopQA and EntityQuestions datasets show that our method outperforms vanilla and other baselines, achieving higher accuracy while substantially reducing context length. To the best of our knowledge, this is the first work introducing AMR-based conceptual entropy for context compression, demonstrating the potential of stable linguistic features in context engineering.

CVOct 27, 2025
FAME: Fairness-aware Attention-modulated Video Editing

Zhangkai Wu, Xuhui Fan, Zhongyuan Xie et al.

Training-free video editing (VE) models tend to fall back on gender stereotypes when rendering profession-related prompts. We propose \textbf{FAME} for \textit{Fairness-aware Attention-modulated Video Editing} that mitigates profession-related gender biases while preserving prompt alignment and temporal consistency for coherent VE. We derive fairness embeddings from existing minority representations by softly injecting debiasing tokens into the text encoder. Simultaneously, FAME integrates fairness modulation into both temporal self attention and prompt-to-region cross attention to mitigate the motion corruption and temporal inconsistency caused by directly introducing fairness cues. For temporal self attention, FAME introduces a region constrained attention mask combined with time decay weighting, which enhances intra-region coherence while suppressing irrelevant inter-region interactions. For cross attention, it reweights tokens to region matching scores by incorporating fairness sensitive similarity masks derived from debiasing prompt embeddings. Together, these modulations keep fairness-sensitive semantics tied to the right visual regions and prevent temporal drift across frames. Extensive experiments on new VE fairness-oriented benchmark \textit{FairVE} demonstrate that FAME achieves stronger fairness alignment and semantic fidelity, surpassing existing VE baselines.

CVOct 27, 2025
VALA: Learning Latent Anchors for Training-Free and Temporally Consistent

Zhangkai Wu, Xuhui Fan, Zhongyuan Xie et al.

Recent advances in training-free video editing have enabled lightweight and precise cross-frame generation by leveraging pre-trained text-to-image diffusion models. However, existing methods often rely on heuristic frame selection to maintain temporal consistency during DDIM inversion, which introduces manual bias and reduces the scalability of end-to-end inference. In this paper, we propose~\textbf{VALA} (\textbf{V}ariational \textbf{A}lignment for \textbf{L}atent \textbf{A}nchors), a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. To learn meaningful assignments, VALA propose a variational framework with a contrastive learning objective. Therefore, it can transform cross-frame latent representations into compressed latent anchors that preserve both content and temporal coherence. Our method can be fully integrated into training-free text-to-image based video editing models. Extensive experiments on real-world video editing benchmarks show that VALA achieves state-of-the-art performance in inversion fidelity, editing quality, and temporal consistency, while offering improved efficiency over prior methods.

LGSep 27, 2025
Factor Decorrelation Enhanced Data Removal from Deep Predictive Models

Wenhao Yang, Lin Li, Xiaohui Tao et al.

The imperative of user privacy protection and regulatory compliance necessitates sensitive data removal in model training, yet this process often induces distributional shifts that undermine model performance-particularly in out-of-distribution (OOD) scenarios. We propose a novel data removal approach that enhances deep predictive models through factor decorrelation and loss perturbation. Our approach introduces: (1) a discriminative-preserving factor decorrelation module employing dynamic adaptive weight adjustment and iterative representation updating to reduce feature redundancy and minimize inter-feature correlations. (2) a smoothed data removal mechanism with loss perturbation that creates information-theoretic safeguards against data leakage during removal operations. Extensive experiments on five benchmark datasets show that our approach outperforms other baselines and consistently achieves high predictive accuracy and robustness even under significant distribution shifts. The results highlight its superior efficiency and adaptability in both in-distribution and out-of-distribution scenarios.

CVJul 28, 2025
Endoscopic Depth Estimation Based on Deep Learning: A Survey

Ke Niu, Zeyun Liu, Xue Feng et al.

Endoscopic depth estimation is a critical technology for improving the safety and precision of minimally invasive surgery. It has attracted considerable attention from researchers in medical imaging, computer vision, and robotics. Over the past decade, a large number of methods have been developed. Despite the existence of several related surveys, a comprehensive overview focusing on recent deep learning-based techniques is still limited. This paper endeavors to bridge this gap by systematically reviewing the state-of-the-art literature. Specifically, we provide a thorough survey of the field from three key perspectives: data, methods, and applications. Firstly, at the data level, we describe the acquisition process of publicly available datasets. Secondly, at the methodological level, we introduce both monocular and stereo deep learning-based approaches for endoscopic depth estimation. Thirdly, at the application level, we identify the specific challenges and corresponding solutions for the clinical implementation of depth estimation technology, situated within concrete clinical scenarios. Finally, we outline potential directions for future research, such as domain adaptation, real-time implementation, and the synergistic fusion of depth information with sensor technologies, thereby providing a valuable starting point for researchers to engage with and advance the field toward clinical translation.

CVApr 5, 2025
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering

Jiaqi Deng, Kaize Shi, Zonghan Wu et al.

Knowledge-based Vision Question Answering (KB-VQA) systems address complex visual-grounded questions with knowledge retrieved from external knowledge bases. The tasks of knowledge retrieval and answer generation tasks both necessitate precise multimodal understanding of question context and external knowledge. However, existing methods treat these two stages as separate modules with limited interaction during training, which hinders bi-directional parametric knowledge sharing, ultimately leading to suboptimal performance. To fully exploit the cross-task synergy in KB-VQA, we propose a unified retrieval-augmented VQA framework with collaborative parametric knowledge calibration. The proposed framework can effectively adapt general multimodal pre-trained models for fine-grained, knowledge-intensive tasks while enabling the retriever and generator to collaboratively enhance and share their parametric knowledge during both training and inference. To enhance fine-grained understanding of questions and external documents, we also integrate late interaction mechanism into the proposed training framework. Additionally, we introduce a reflective-answering mechanism that allows the model to explicitly evaluate and refine its knowledge boundary. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7\% improvement in answering accuracy, and brings an average 7.5\% boost in base MLLMs' VQA performance.

CLMay 6, 2024
Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation

Kaize Shi, Xueyao Sun, Qing Li et al.

Large Language Models (LLMs) have made significant strides in information acquisition. However, their overreliance on potentially flawed parametric knowledge leads to hallucinations and inaccuracies, particularly when handling long-tail, domain-specific queries. Retrieval Augmented Generation (RAG) addresses this limitation by incorporating external, non-parametric knowledge. Nevertheless, the retrieved long-context documents often contain noisy, irrelevant information alongside vital knowledge, negatively diluting LLMs' attention. Inspired by the supportive role of essential concepts in individuals' reading comprehension, we propose a novel concept-based RAG framework with the Abstract Meaning Representation (AMR)-based concept distillation algorithm. The proposed algorithm compresses the cluttered raw retrieved documents into a compact set of crucial concepts distilled from the informative nodes of AMR by referring to reliable linguistic features. The concepts explicitly constrain LLMs to focus solely on vital information in the inference process. We conduct extensive experiments on open-domain question-answering datasets to empirically evaluate the proposed method's effectiveness. The results indicate that the concept-based RAG framework outperforms other baseline methods, particularly as the number of supporting documents increases, while also exhibiting robustness across various backbone LLMs. This emphasizes the distilled concepts are informative for augmenting the RAG process by filtering out interference information. To the best of our knowledge, this is the first work introducing AMR to enhance the RAG, presenting a potential solution to augment inference performance with semantic-based context compression.