LGApr 15, 2022Code
Graph Pooling for Graph Neural Networks: Progress, Challenges, and OpportunitiesChuang Liu, Yibing Zhan, Jia Wu et al.
Graph neural networks have emerged as a leading architecture for many graph-level tasks, such as graph classification and graph generation. As an essential component of the architecture, graph pooling is indispensable for obtaining a holistic graph-level representation of the whole graph. Although a great variety of methods have been proposed in this promising and fast-developing research field, to the best of our knowledge, little effort has been made to systematically summarize these works. To set the stage for the development of future works, in this paper, we attempt to fill this gap by providing a broad review of recent methods for graph pooling. Specifically, 1) we first propose a taxonomy of existing graph pooling methods with a mathematical summary for each category; 2) then, we provide an overview of the libraries related to graph pooling, including the commonly used datasets, model architectures for downstream tasks, and open-source implementations; 3) next, we further outline the applications that incorporate the idea of graph pooling in a variety of domains; 4) finally, we discuss certain critical challenges facing current studies and share our insights on future potential directions for research on the improvement of graph pooling.
CVNov 25, 2022
3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion ModelsGang Li, Heliang Zheng, Chaoyue Wang et al.
Text-guided diffusion models have shown superior performance in image/video generation and editing. While few explorations have been performed in 3D scenarios. In this paper, we discuss three fundamental and interesting problems on this topic. First, we equip text-guided diffusion models to achieve 3D-consistent generation. Specifically, we integrate a NeRF-like neural field to generate low-resolution coarse results for a given camera view. Such results can provide 3D priors as condition information for the following diffusion process. During denoising diffusion, we further enhance the 3D consistency by modeling cross-view correspondences with a novel two-stream (corresponding to two different views) asynchronous diffusion process. Second, we study 3D local editing and propose a two-step solution that can generate 360-degree manipulated results by editing an object from a single view. Step 1, we propose to perform 2D local editing by blending the predicted noises. Step 2, we conduct a noise-to-text inversion process that maps 2D blended noises into the view-independent text embedding space. Once the corresponding text embedding is obtained, 360-degree images can be generated. Last but not least, we extend our model to perform one-shot novel view synthesis by fine-tuning on a single image, firstly showing the potential of leveraging text guidance for novel view synthesis. Extensive experiments and various applications show the prowess of our 3DDesigner. The project page is available at https://3ddesigner-diffusion.github.io/.
LGMar 1, 2023
OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML SystemChao Xue, Wei Liu, Shuai Xie et al.
Automated machine learning (AutoML) seeks to build ML models with minimal human effort. While considerable research has been conducted in the area of AutoML in general, aiming to take humans out of the loop when building artificial intelligence (AI) applications, scant literature has focused on how AutoML works well in open-environment scenarios such as the process of training and updating large models, industrial supply chains or the industrial metaverse, where people often face open-loop problems during the search process: they must continuously collect data, update data and models, satisfy the requirements of the development and deployment environment, support massive devices, modify evaluation metrics, etc. Addressing the open-environment issue with pure data-driven approaches requires considerable data, computing resources, and effort from dedicated data engineers, making current AutoML systems and platforms inefficient and computationally intractable. Human-computer interaction is a practical and feasible way to tackle the problem of open-environment AI. In this paper, we introduce OmniForce, a human-centered AutoML (HAML) system that yields both human-assisted ML and ML-assisted human techniques, to put an AutoML system into practice and build adaptive AI in open-environment scenarios. Specifically, we present OmniForce in terms of ML version management; pipeline-driven development and deployment collaborations; a flexible search strategy framework; and widely provisioned and crowdsourced application algorithms, including large models. Furthermore, the (large) models constructed by OmniForce can be automatically turned into remote services in a few minutes; this process is dubbed model as a service (MaaS). Experimental results obtained in multiple search spaces and real-world use cases demonstrate the efficacy and efficiency of OmniForce.
CVMar 16, 2022
Coverage Optimization of Camera Network for Continuous Deformable ObjectChang Li, Xi Chen, Li Chai
In this paper, a deformable object is considered for cameras deployment with the aim of visual coverage. The object contour is discretized into sampled points as meshes, and the deformation is represented as continuous trajectories for the sampled points. To reduce the computational complexity, some feature points are carefully selected representing the continuous deformation process, and the visual coverage for the deformable object is transferred to cover the specific feature points. In particular, the vertexes of a rectangle that can contain the entire deformation trajectory of every sampled point on the object contour are chosen as the feature points. An improved wolf pack algorithm is then proposed to solve the optimization problem. Finally, simulation results are given to demonstrate the effectiveness of the proposed deployment method of camera network.
99.4CLApr 3Code
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token EfficiencyAichen Cai, Anmeng Zhang, Anyu Li et al.
We introduce JoyAI-LLM Flash, an efficient Mixture-of-Experts (MoE) language model designed to redefine the trade-off between strong performance and token efficiency in the sub-50B parameter regime. JoyAI-LLM Flash is pretrained on a massive corpus of 20 trillion tokens and further optimized through a rigorous post-training pipeline, including supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and large-scale reinforcement learning (RL) across diverse environments. To improve token efficiency, JoyAI-LLM Flash strategically balances \emph{thinking} and \emph{non-thinking} cognitive modes and introduces FiberPO, a novel RL algorithm inspired by fibration theory that decomposes trust-region maintenance into global and local components, providing unified multi-scale stability control for LLM policy optimization. To enhance architectural sparsity, the model comprises 48B total parameters while activating only 2.7B parameters per forward pass, achieving a substantially higher sparsity ratio than contemporary industry leading models of comparable scale. To further improve inference throughput, we adopt a joint training-inference co-design that incorporates dense Multi-Token Prediction (MTP) and Quantization-Aware Training (QAT). We release the checkpoints for both JoyAI-LLM-48B-A3B Base and its post-trained variants on Hugging Face to support the open-source community.
CVAug 24, 2023
DD-GCN: Directed Diffusion Graph Convolutional Network for Skeleton-based Human Action RecognitionChang Li, Qian Huang, Yingchi Mao
Graph Convolutional Networks (GCNs) have been widely used in skeleton-based human action recognition. In GCN-based methods, the spatio-temporal graph is fundamental for capturing motion patterns. However, existing approaches ignore the physical dependency and synchronized spatio-temporal correlations between joints, which limits the representation capability of GCNs. To solve these problems, we construct the directed diffusion graph for action modeling and introduce the activity partition strategy to optimize the weight sharing mechanism of graph convolution kernels. In addition, we present the spatio-temporal synchronization encoder to embed synchronized spatio-temporal semantics. Finally, we propose Directed Diffusion Graph Convolutional Network (DD-GCN) for action recognition, and the experiments on three public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA, demonstrate the state-of-the-art performance of our method.
CVJul 22, 2023
An Intelligent Remote Sensing Image Quality Inspection SystemYijiong Yu, Tao Wang, Kang Ran et al.
Due to the inevitable presence of quality problems, quality inspection of remote sensing images is indeed an indispensable step between the acquisition and the application of them. However, traditional manual inspection suffers from low efficiency. Hence, we propose a novel deep learning-based two-step intelligent system consisting of multiple advanced computer vision models, which first performs image classification by SwinV2 and then accordingly adopts the most appropriate method, such as semantic segmentation by Segformer, to localize the quality problems. Results demonstrate that the proposed method exhibits excellent performance and efficiency, surpassing traditional methods. Furthermore, we conduct an initial exploration of applying multimodal models to remote sensing image quality inspection.
88.7SEMar 24
The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool OrchestrationHaoyuan Xu, Chang Li, Xinyan Ma et al.
Tool use enables large language models (LLMs) to access external information, invoke software systems, and act in digital environments beyond what can be solved from model parameters alone. Early research mainly studied whether a model could select and execute a correct single tool call. As agent systems evolve, however, the central problem has shifted from isolated invocation to multi-tool orchestration over long trajectories with intermediate state, execution feedback, changing environments, and practical constraints such as safety, cost, and verifiability. We comprehensively review recent progress in multi-tool LLM agents and analyzes the state of the art in this rapidly developing area. First, we unify task formulations and distinguish single-call tool use from long-horizon orchestration. Then, we organize the literature around six core dimensions: inference-time planning and execution, training and trajectory construction, safety and control, efficiency under resource constraints, capability completeness in open environments, and benchmark design and evaluation. We further summarize representative applications in software engineering, enterprise workflows, graphical user interfaces, and mobile systems. Finally, we discuss major challenges and outline future directions for building reliable, scalable, and verifiable multi-tool agents.
LGOct 20, 2023
BRFL: A Blockchain-based Byzantine-Robust Federated Learning ModelYang Li, Chunhe Xia, Chang Li et al.
With the increasing importance of machine learning, the privacy and security of training data have become critical. Federated learning, which stores data in distributed nodes and shares only model parameters, has gained significant attention for addressing this concern. However, a challenge arises in federated learning due to the Byzantine Attack Problem, where malicious local models can compromise the global model's performance during aggregation. This article proposes the Blockchain-based Byzantine-Robust Federated Learning (BRLF) model that combines federated learning with blockchain technology. This integration enables traceability of malicious models and provides incentives for locally trained clients. Our approach involves selecting the aggregation node based on Pearson's correlation coefficient, and we perform spectral clustering and calculate the average gradient within each cluster, validating its accuracy using local dataset of the aggregation nodes. Experimental results on public datasets demonstrate the superior byzantine robustness of our secure aggregation algorithm compared to other baseline byzantine robust aggregation methods, and proved our proposed model effectiveness in addressing the resource consumption problem.
CVNov 1, 2023
Graph Representation Learning for Infrared and Visible Image FusionJing Li, Lu Bai, Bin Yang et al.
Infrared and visible image fusion aims to extract complementary features to synthesize a single fused image. Many methods employ convolutional neural networks (CNNs) to extract local features due to its translation invariance and locality. However, CNNs fail to consider the image's non-local self-similarity (NLss), though it can expand the receptive field by pooling operations, it still inevitably leads to information loss. In addition, the transformer structure extracts long-range dependence by considering the correlativity among all image patches, leading to information redundancy of such transformer-based methods. However, graph representation is more flexible than grid (CNN) or sequence (transformer structure) representation to address irregular objects, and graph can also construct the relationships among the spatially repeatable details or texture with far-space distance. Therefore, to address the above issues, it is significant to convert images into the graph space and thus adopt graph convolutional networks (GCNs) to extract NLss. This is because the graph can provide a fine structure to aggregate features and propagate information across the nearest vertices without introducing redundant information. Concretely, we implement a cascaded NLss extraction pattern to extract NLss of intra- and inter-modal by exploring interactions of different image pixels in intra- and inter-image positional distance. We commence by preforming GCNs on each intra-modal to aggregate features and propagate information to extract independent intra-modal NLss. Then, GCNs are performed on the concatenate intra-modal NLss features of infrared and visible images, which can explore the cross-domain NLss of inter-modal to reconstruct the fused image. Ablation studies and extensive experiments illustrates the effectiveness and superiority of the proposed method on three datasets.
CVDec 22, 2023Code
FM-OV3D: Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D DetectionDongmei Zhang, Chang Li, Ray Zhang et al.
The superior performances of pre-trained foundation models in various visual tasks underscore their potential to enhance the 2D models' open-vocabulary ability. Existing methods explore analogous applications in the 3D space. However, most of them only center around knowledge extraction from singular foundation models, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from various foundation models can improve knowledge transfer from 2D pre-trained visual language models to the 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary without facing constraints from original 3D datasets. Specifically, to learn the open-vocabulary 3D localization ability, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition ability, We leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion models, and cross-modal discriminative models like CLIP. The experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and successfully achieves state-of-the-art performance in open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
95.9MTRL-SCIApr 13
Agentic LLM Reasoning in a Self-Driving Laboratory for Air-Sensitive Lithium Halide Spinel ConductorsYuxing Fei, Bernardus Rendy, Xiaochen Yang et al.
Self-driving laboratories promise to accelerate materials discovery. Yet current automated solid-state synthesis platforms are limited to ambient conditions, thereby precluding their use for air-sensitive materials. Here, we present A-Lab for Glovebox Powder Solid-state Synthesis (A-Lab GPSS), a robotic platform capable of synthesizing and characterizing air-sensitive inorganic materials under strict air-free conditions. By integrating an agentic AI framework into the A-Lab GPSS platform, we structure autonomous experimental design through abductive and inductive reasoning. We deploy this platform to explore the vast compositional space of lithium halide spinel solid-state ionic conductors. Across a synthesis campaign comprising 352 samples with diverse compositions, the system explores a broad chemical space, experimentally realizing 72% of the 171 possible pairwise combinations among the 19 metals considered in this study. Over the course of the campaign, the fraction of compositions exhibiting both good ionic conductivity (> 0.05 mS/cm) and high halide spinel phase purity increases from 1.33% in the first 75 agent-proposed samples to 5.33% in the final 75. Furthermore, by inspecting the AI's reasoning processes, we reveal distinct yet complementary discovery strategies: abductive reasoning interrogates abnormal observations within already explored regions, whereas inductive reasoning expands the search into broader, previously unvisited chemical space. This work establishes a scalable platform for the autonomous discovery of complex, air-sensitive solid-state materials.
56.6SDMay 6
Stage-adaptive audio diffusion modelingXuanhao Zhang, Chang Li
Recent progress in diffusion-based audio generation and restoration has substantially improved performance across heterogeneous conditioning regimes, including text-conditioned audio generation and audio-conditioned super-resolution. However, training audio diffusion models remains computationally expensive, and most existing pipelines still rely on static optimization recipes that treat the relative importance of training signals as fixed throughout learning. In this work, we argue that a major source of inefficiency lies in the evolving balance between semantic acquisition and generation-oriented refinement. Early training places stronger emphasis on acquiring condition-aligned semantic structure and coarse global organization, whereas later training increasingly emphasizes temporal consistency, perceptual fidelity, and fine-detail refinement. To characterize this evolving balance, we introduce a progress-based regime variable derived from the training-time slope of an SSL-space discrepancy, which measures semantic progress during training. Based on this signal, we develop three complementary stage-aware mechanisms: decayed SSL guidance for early semantic bootstrapping, self-adaptive timestep sampling driven by the regime variable, and structure-aware regularization activated from convergent grouped organization in parameter space. We evaluate these mechanisms on text-conditioned audio generation and audio-conditioned super-resolution. Across both settings, the proposed stage-aware strategies improve convergence behavior and yield gains on the primary generation and spectral reconstruction metrics over standard static baselines. These results support the view that efficient audio diffusion training can benefit from treating external guidance, internal organization, and optimization emphasis as stage-dependent components rather than fixed ingredients.
SDMay 24, 2024Code
Quality-aware Masked Diffusion Transformer for Enhanced Music GenerationChang Li, Ruoyu Wang, Lijuan Liu et al.
Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address low-quality captions' issue. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset with both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/, code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.
CLDec 29, 2025
AI Meets Brain: Memory Systems from Cognitive Neuroscience to Autonomous AgentsJiafeng Liang, Hao Li, Chang Li et al.
Memory serves as the pivotal nexus bridging past and future, providing both humans and AI systems with invaluable concepts and experience to navigate complex tasks. Recent research on autonomous agents has increasingly focused on designing efficient memory workflows by drawing on cognitive neuroscience. However, constrained by interdisciplinary barriers, existing works struggle to assimilate the essence of human memory mechanisms. To bridge this gap, we systematically synthesizes interdisciplinary knowledge of memory, connecting insights from cognitive neuroscience with LLM-driven agents. Specifically, we first elucidate the definition and function of memory along a progressive trajectory from cognitive neuroscience through LLMs to agents. We then provide a comparative analysis of memory taxonomy, storage mechanisms, and the complete management lifecycle from both biological and artificial perspectives. Subsequently, we review the mainstream benchmarks for evaluating agent memory. Additionally, we explore memory security from dual perspectives of attack and defense. Finally, we envision future research directions, with a focus on multimodal memory systems and skill acquisition.
55.2AIMay 7
Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicinePeisong Zhang, Manqiang Peng, Yuxuan Wu et al.
Estimating individualized treatment effects from longitudinal observational data is central to data-driven medicine, yet existing methods face a fundamental limitation: reducing confounding bias often suppresses clinically informative heterogeneity, degrading patient-specific predictions. Here, we identify this tension as a bias-precision paradox in causal representation learning and introduce sampling-based maximum mean discrepancy (sMMD), a stochastic alignment strategy that replaces global adversarial balancing with subset-level matching. We instantiate this approach in a framework for counterfactual outcome prediction with attribution-grounded interpretability. Across two large-scale ICU cohorts (n = 27,783), our framework improves accuracy under distribution shift, reducing error by up to 11.5% and substantially increasing recall in high-risk tasks. Mechanistic analyses show that sMMD selectively preserves clinically decisive variables. In human-AI evaluation, our method outperforms clinicians-in-training and large language models, and improves clinician accuracy by 14.7% while reducing decision time, enabling interpretable, real-time clinical decision support.
CVFeb 22
BriMA: Bridged Modality Adaptation for Multi-Modal Continual Action Quality AssessmentKanglei Zhou, Chang Li, Qingyi Pan et al.
Action Quality Assessment (AQA) aims to score how well an action is performed and is widely used in sports analysis, rehabilitation assessment, and human skill evaluation. Multi-modal AQA has recently achieved strong progress by leveraging complementary visual and kinematic cues, yet real-world deployments often suffer from non-stationary modality imbalance, where certain modalities become missing or intermittently available due to sensor failures or annotation gaps. Existing continual AQA methods overlook this issue and assume that all modalities remain complete and stable throughout training, which restricts their practicality. To address this challenge, we introduce Bridged Modality Adaptation (BriMA), an innovative approach to multi-modal continual AQA under modality-missing conditions. BriMA consists of a memory-guided bridging imputation module that reconstructs missing modalities using both task-agnostic and task-specific representations, and a modality-aware replay mechanism that prioritizes informative samples based on modality distortion and distribution drift. Experiments on three representative multi-modal AQA datasets (RG, Fis-V, and FS1000) show that BriMA consistently improves performance under different modality-missing conditions, achieving 6--8\% higher correlation and 12--15\% lower error on average. These results demonstrate a step toward robust multi-modal AQA systems under real-world deployment constraints.
SDFeb 3
PACE: Pretrained Audio Continual LearningChang Li, Kanglei Zhou, Liyuan Wang
Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
IVNov 17, 2024Code
DBF-Net: A Dual-Branch Network with Feature Fusion for Ultrasound Image SegmentationGuoping Xu, Ximing Wu, Wentao Liao et al.
Accurately segmenting lesions in ultrasound images is challenging due to the difficulty in distinguishing boundaries between lesions and surrounding tissues. While deep learning has improved segmentation accuracy, there is limited focus on boundary quality and its relationship with body structures. To address this, we introduce UBBS-Net, a dual-branch deep neural network that learns the relationship between body and boundary for improved segmentation. We also propose a feature fusion module to integrate body and boundary information. Evaluated on three public datasets, UBBS-Net outperforms existing methods, achieving Dice Similarity Coefficients of 81.05% for breast cancer, 76.41% for brachial plexus nerves, and 87.75% for infantile hemangioma segmentation. Our results demonstrate the effectiveness of UBBS-Net for ultrasound image segmentation. The code is available at https://github.com/apple1986/DBF-Net.
LGDec 29, 2025
Exploiting the Prior of Generative Time Series ImputationYuYang Miao, Chang Li, Zehua Chen
Time series imputation, i.e., filling the missing values of a time recording, finds various applications in electricity, finance, and weather modelling. Previous methods have introduced generative models such as diffusion probabilistic models and Schrodinger bridge models to conditionally generate the missing values from Gaussian noise or directly from linear interpolation results. However, as their prior is not informative to the ground-truth target, their generation process inevitably suffer increased burden and limited imputation accuracy. In this work, we present Bridge-TS, building a data-to-data generation process for generative time series imputation and exploiting the design of prior with two novel designs. Firstly, we propose expert prior, leveraging a pretrained transformer-based module as an expert to fill the missing values with a deterministic estimation, and then taking the results as the prior of ground truth target. Secondly, we explore compositional priors, utilizing several pretrained models to provide different estimation results, and then combining them in the data-to-data generation process to achieve a compositional priors-to-target imputation process. Experiments conducted on several benchmark datasets such as ETT, Exchange, and Weather show that Bridge-TS reaches a new record of imputation accuracy in terms of mean square error and mean absolute error, demonstrating the superiority of improving prior for generative time series imputation.
CVJul 16, 2024
MRIo3DS-Net: A Mutually Reinforcing Images to 3D Surface RNN-like framework for model-adaptation indoor 3D reconstructionChang Li, Jiao Guo, Yufei Zhao et al.
This paper is the first to propose an end-to-end framework of mutually reinforcing images to 3D surface recurrent neural network-like for model-adaptation indoor 3D reconstruction,where multi-view dense matching and point cloud surface optimization are mutually reinforced by a RNN-like structure rather than being treated as a separate issue.The characteristics are as follows:In the multi-view dense matching module, the model-adaptation strategy is used to fine-tune and optimize a Transformer-based multi-view dense matching DNN,so that it has the higher image feature for matching and detail expression capabilities;In the point cloud surface optimization module,the 3D surface reconstruction network based on 3D implicit field is optimized by using model-adaptation strategy,which solves the problem of point cloud surface optimization without knowing normal vector of 3D surface.To improve and finely reconstruct 3D surfaces from point cloud,smooth loss is proposed and added to this module;The MRIo3DS-Net is a RNN-like framework,which utilizes the finely optimized 3D surface obtained by PCSOM to recursively reinforce the differentiable warping for optimizing MVDMM.This refinement leads to achieving better dense matching results, and better dense matching results leads to achieving better 3D surface results recursively and mutually.Hence, model-adaptation strategy can better collaborate the differences between the two network modules,so that they complement each other to achieve the better effect;To accelerate the transfer learning and training convergence from source domain to target domain,a multi-task loss function based on Bayesian uncertainty is used to adaptively adjust the weights between the two networks loss functions of MVDMM and PCSOM;In this multi-task cascade network framework,any modules can be replaced by any state-of-the-art networks to achieve better 3D reconstruction results.
SDJul 11, 2025
FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio GenerationYuxuan Jiang, Zehua Chen, Zeqian Ju et al.
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: https://freeaudio.github.io/FreeAudio/
CVJul 3, 2024
ISWSST: Index-space-wave State Superposition Transformers for Multispectral Remotely Sensed Imagery Semantic SegmentationChang Li, Pengfei Zhang, Yu Wang
Currently the semantic segmentation task of multispectral remotely sensed imagery (MSRSI) faces the following problems: 1) Usually, only single domain feature (i.e., space domain or frequency domain) is considered; 2) downsampling operation in encoder generally leads to the accuracy loss of edge extraction; 3) multichannel features of MSRSI are not fully considered; and 4) prior knowledge of remote sensing is not fully utilized. To solve the aforementioned issues, an index-space-wave state superposition Transformer (ISWSST) is the first to be proposed for MSRSI semantic segmentation by the inspiration from quantum mechanics, whose superiority is as follows: 1) index, space and wave states are superposed or fused to simulate quantum superposition by adaptively voting decision (i.e., ensemble learning idea) for being a stronger classifier and improving the segmentation accuracy; 2) a lossless wavelet pyramid encoder-decoder module is designed to losslessly reconstruct image and simulate quantum entanglement based on wavelet transform and inverse wavelet transform for avoiding the edge extraction loss; 3) combining multispectral features (i.e. remote sensing index and channel attention mechanism) is proposed to accurately extract ground objects from original resolution images; and 4) quantum mechanics are introduced to interpret the underlying superiority of ISWSST. Experiments show that ISWSST is validated and superior to the state-of-the-art architectures for the MSRSI segmentation task, which improves the segmentation and edge extraction accuracy effectively. Codes will be available publicly after our paper is accepted.
SDSep 22, 2025
Audio Super-Resolution with Latent Bridge ModelsChang Li, Zehua Chen, Liyuan Wang et al.
Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-toHR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.
CVMar 24, 2024
Dual-modal Prior Semantic Guided Infrared and Visible Image Fusion for Intelligent Transportation SystemJing Li, Lu Bai, Bin Yang et al.
Infrared and visible image fusion (IVF) plays an important role in intelligent transportation system (ITS). The early works predominantly focus on boosting the visual appeal of the fused result, and only several recent approaches have tried to combine the high-level vision task with IVF. However, they prioritize the design of cascaded structure to seek unified suitable features and fit different tasks. Thus, they tend to typically bias toward to reconstructing raw pixels without considering the significance of semantic features. Therefore, we propose a novel prior semantic guided image fusion method based on the dual-modality strategy, improving the performance of IVF in ITS. Specifically, to explore the independent significant semantic of each modality, we first design two parallel semantic segmentation branches with a refined feature adaptive-modulation (RFaM) mechanism. RFaM can perceive the features that are semantically distinct enough in each semantic segmentation branch. Then, two pilot experiments based on the two branches are conducted to capture the significant prior semantic of two images, which then is applied to guide the fusion task in the integration of semantic segmentation branches and fusion branches. In addition, to aggregate both high-level semantics and impressive visual effects, we further investigate the frequency response of the prior semantics, and propose a multi-level representation-adaptive fusion (MRaF) module to explicitly integrate the low-frequent prior semantic with the high-frequent details. Extensive experiments on two public datasets demonstrate the superiority of our method over the state-of-the-art image fusion approaches, in terms of either the visual appeal or the high-level semantics.
CVNov 8, 2025
U(PM)$^2$:Unsupervised polygon matching with pre-trained models for challenging stereo imagesChang Li, Xingtao Peng
Stereo image matching is a fundamental task in computer vision, photogrammetry and remote sensing, but there is an almost unexplored field, i.e., polygon matching, which faces the following challenges: disparity discontinuity, scale variation, training requirement, and generalization. To address the above-mentioned issues, this paper proposes a novel U(PM)$^2$: low-cost unsupervised polygon matching with pre-trained models by uniting automatically learned and handcrafted features, of which pipeline is as follows: firstly, the detector leverages the pre-trained segment anything model to obtain masks; then, the vectorizer converts the masks to polygons and graphic structure; secondly, the global matcher addresses challenges from global viewpoint changes and scale variation based on bidirectional-pyramid strategy with pre-trained LoFTR; finally, the local matcher further overcomes local disparity discontinuity and topology inconsistency of polygon matching by local-joint geometry and multi-feature matching strategy with Hungarian algorithm. We benchmark our U(PM)$^2$ on the ScanNet and SceneFlow datasets using our proposed new metric, which achieved state-of-the-art accuracy at a competitive speed and satisfactory generalization performance at low cost without any training requirement.
CLOct 4, 2025
Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language ModelsCanhui Wu, Qiong Cao, Chang Li et al.
Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking." Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce \textbf{Step Pruner (SP)}, an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by \textbf{69.7\%}.
SDSep 28, 2025
AudioMoG: Guiding Audio Generation with Mixture-of-GuidanceJunyou Wang, Zehua Chen, Binjie Yuan et al.
Guidance methods have demonstrated significant improvements in cross-modal audio generation, including text-to-audio (T2A) and video-to-audio (V2A) generation. The popularly adopted method, classifier-free guidance (CFG), steers generation by emphasizing condition alignment, enhancing fidelity but often at the cost of diversity. Recently, autoguidance (AG) has been explored for audio generation, encouraging the sampling to faithfully reconstruct the target distribution and showing increased diversity. Despite these advances, they usually rely on a single guiding principle, e.g., condition alignment in CFG or score accuracy in AG, leaving the full potential of guidance for audio generation untapped. In this work, we explore enriching the composition of the guidance method and present a mixture-of-guidance framework, AudioMoG. Within the design space, AudioMoG can exploit the complementary advantages of distinctive guiding principles by fulfilling their cumulative benefits. With a reduced form, AudioMoG can consider parallel complements or recover a single guiding principle, without sacrificing generality. We experimentally show that, given the same inference speed, AudioMoG approach consistently outperforms single guidance in T2A generation across sampling steps, concurrently showing advantages in V2A, text-to-music, and image generation. These results highlight a "free lunch" in current cross-modal audio generation systems: higher quality can be achieved through mixed guiding principles at the sampling stage without sacrificing inference efficiency. Demo samples are available at: https://audio-mog.github.io.
SDSep 25, 2025
Prompt-aware classifier free guidance for diffusion modelsXuanhao Zhang, Chang Li
Diffusion models have achieved remarkable progress in image and audio generation, largely due to Classifier-Free Guidance. However, the choice of guidance scale remains underexplored: a fixed scale often fails to generalize across prompts of varying complexity, leading to oversaturation or weak alignment. We address this gap by introducing a prompt-aware framework that predicts scale-dependent quality and selects the optimal guidance at inference. Specifically, we construct a large synthetic dataset by generating samples under multiple scales and scoring them with reliable evaluation metrics. A lightweight predictor, conditioned on semantic embeddings and linguistic complexity, estimates multi-metric quality curves and determines the best scale via a utility function with regularization. Experiments on MSCOCO~2014 and AudioCaps show consistent improvements over vanilla CFG, enhancing fidelity, alignment, and perceptual preference. This work demonstrates that prompt-aware scale selection provides an effective, training-free enhancement for pretrained diffusion backbones.
CLSep 23, 2025
Trace Is In Sentences: Unbiased Lightweight ChatGPT-Generated Text DetectorMo Mu, Dianqiao Lei, Chang Li
The widespread adoption of ChatGPT has raised concerns about its misuse, highlighting the need for robust detection of AI-generated text. Current word-level detectors are vulnerable to paraphrasing or simple prompts (PSP), suffer from biases induced by ChatGPT's word-level patterns (CWP) and training data content, degrade on modified text, and often require large models or online LLM interaction. To tackle these issues, we introduce a novel task to detect both original and PSP-modified AI-generated texts, and propose a lightweight framework that classifies texts based on their internal structure, which remains invariant under word-level changes. Our approach encodes sentence embeddings from pre-trained language models and models their relationships via attention. We employ contrastive learning to mitigate embedding biases from autoregressive generation and incorporate a causal graph with counterfactual methods to isolate structural features from topic-related biases. Experiments on two curated datasets, including abstract comparisons and revised life FAQs, validate the effectiveness of our method.
CVAug 14, 2025
Towards Powerful and Practical Patch Attacks for 2D Object Detection in Autonomous DrivingYuxin Cao, Yedi Zhang, Wentao He et al.
Learning-based autonomous driving systems remain critically vulnerable to adversarial patches, posing serious safety and security risks in their real-world deployment. Black-box attacks, notable for their high attack success rate without model knowledge, are especially concerning, with their transferability extensively studied to reduce computational costs compared to query-based attacks. Previous transferability-based black-box attacks typically adopt mean Average Precision (mAP) as the evaluation metric and design training loss accordingly. However, due to the presence of multiple detected bounding boxes and the relatively lenient Intersection over Union (IoU) thresholds, the attack effectiveness of these approaches is often overestimated, resulting in reduced success rates in practical attacking scenarios. Furthermore, patches trained on low-resolution data often fail to maintain effectiveness on high-resolution images, limiting their transferability to autonomous driving datasets. To fill this gap, we propose P$^3$A, a Powerful and Practical Patch Attack framework for 2D object detection in autonomous driving, specifically optimized for high-resolution datasets. First, we introduce a novel metric, Practical Attack Success Rate (PASR), to more accurately quantify attack effectiveness with greater relevance for pedestrian safety. Second, we present a tailored Localization-Confidence Suppression Loss (LCSL) to improve attack transferability under PASR. Finally, to maintain the transferability for high-resolution datasets, we further incorporate the Probabilistic Scale-Preserving Padding (PSPP) into the patch attack pipeline as a data preprocessing step. Extensive experiments show that P$^3$A outperforms state-of-the-art attacks on unseen models and unseen high-resolution datasets, both under the proposed practical IoU-based evaluation metric and the previous mAP-based metrics.
AIJul 22, 2025
Learning Temporal Abstractions via Variational Homomorphisms in Option-Induced Abstract MDPsChang Li, Yaren Zhang, Haoran Lv et al.
Large Language Models (LLMs) have shown remarkable reasoning ability through explicit Chain-of-Thought (CoT) prompting, but generating these step-by-step textual explanations is computationally expensive and slow. To overcome this, we aim to develop a framework for efficient, implicit reasoning, where the model "thinks" in a latent space without generating explicit text for every step. We propose that these latent thoughts can be modeled as temporally-extended abstract actions, or options, within a hierarchical reinforcement learning framework. To effectively learn a diverse library of options as latent embeddings, we first introduce the Variational Markovian Option Critic (VMOC), an off-policy algorithm that uses variational inference within the HiT-MDP framework. To provide a rigorous foundation for using these options as an abstract reasoning space, we extend the theory of continuous MDP homomorphisms. This proves that learning a policy in the simplified, abstract latent space, for which VMOC is suited, preserves the optimality of the solution to the original, complex problem. Finally, we propose a cold-start procedure that leverages supervised fine-tuning (SFT) data to distill human reasoning demonstrations into this latent option space, providing a rich initialization for the model's reasoning capabilities. Extensive experiments demonstrate that our approach achieves strong performance on complex logical reasoning benchmarks and challenging locomotion tasks, validating our framework as a principled method for learning abstract skills for both language and control.
CVJul 20, 2025
A Novel Downsampling Strategy Based on Information Complementarity for Medical Image SegmentationWenbo Yue, Chang Li, Guoping Xu
In convolutional neural networks (CNNs), downsampling operations are crucial to model performance. Although traditional downsampling methods (such as maximum pooling and cross-row convolution) perform well in feature aggregation, receptive field expansion, and computational reduction, they may lead to the loss of key spatial information in semantic segmentation tasks, thereby affecting the pixel-by-pixel prediction accuracy.To this end, this study proposes a downsampling method based on information complementarity - Hybrid Pooling Downsampling (HPD). The core is to replace the traditional method with MinMaxPooling, and effectively retain the light and dark contrast and detail features of the image by extracting the maximum value information of the local area.Experiment on various CNN architectures on the ACDC and Synapse datasets show that HPD outperforms traditional methods in segmentation performance, and increases the DSC coefficient by 0.5% on average. The results show that the HPD module provides an efficient solution for semantic segmentation tasks.
SDFeb 7, 2025
Latent Swap Joint Diffusion for 2D Long-Form Latent GenerationYusheng Dai, Chenxi Wang, Chang Li et al.
This paper introduces Swap Forward (SaFa), a modality-agnostic and efficient method to generate seamless and coherence long spectrum and panorama through latent swap joint diffusion across multi-views. We first investigate the spectrum aliasing problem in spectrum-based audio generation caused by existing joint diffusion methods. Through a comparative analysis of the VAE latent representation of Mel-spectra and RGB images, we identify that the failure arises from excessive suppression of high-frequency components during the spectrum denoising process due to the averaging operator. To address this issue, we propose Self-Loop Latent Swap, a frame-level bidirectional swap applied to the overlapping region of adjacent views. Leveraging stepwise differentiated trajectories of adjacent subviews, this swap operator adaptively enhances high-frequency components and avoid spectrum distortion. Furthermore, to improve global cross-view consistency in non-overlapping regions, we introduce Reference-Guided Latent Swap, a unidirectional latent swap operator that provides a centralized reference trajectory to synchronize subview diffusions. By refining swap timing and intervals, we can achieve a cross-view similarity-diversity balance in a forward-only manner. Quantitative and qualitative experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based methods in audio generation using both U-Net and DiT models, along with effective longer length adaptation. It also adapts well to panorama generation, achieving comparable performance with 2 $\sim$ 20 $\times$ faster speed and greater model generalizability. More generation demos are available at https://swapforward.github.io/
CVNov 26, 2024
ΩSFormer: Dual-Modal Ω-like Super-Resolution Transformer Network for Cross-scale and High-accuracy Terraced Field Vectorization ExtractionChang Li, Yu Wang, Ce Zhang et al.
Terraced field is a significant engineering practice for soil and water conservation (SWC). Terraced field extraction from remotely sensed imagery is the foundation for monitoring and evaluating SWC. This study is the first to propose a novel dual-modal Ω-like super-resolution Transformer network for intelligent TFVE, offering the following advantages: (1) reducing edge segmentation error from conventional multi-scale downsampling encoder, through fusing original high-resolution features with downsampling features at each step of encoder and leveraging a multi-head attention mechanism; (2) improving the accuracy of TFVE by proposing a Ω-like network structure, which fully integrates rich high-level features from both spectral and terrain data to form cross-scale super-resolution features; (3) validating an optimal fusion scheme for cross-modal and cross-scale (i.e., inconsistent spatial resolution between remotely sensed imagery and DEM) super-resolution feature extraction; (4) mitigating uncertainty between segmentation edge pixels by a coarse-to-fine and spatial topological semantic relationship optimization (STSRO) segmentation strategy; (5) leveraging contour vibration neural network to continuously optimize parameters and iteratively vectorize terraced fields from semantic segmentation results. Moreover, a DMRVD for deep-learning-based TFVE was created for the first time, which covers nine study areas in four provinces of China, with a total coverage area of 22441 square kilometers. To assess the performance of ΩSFormer, classic and SOTA networks were compared. The mIOU of ΩSFormer has improved by 0.165, 0.297 and 0.128 respectively, when compared with best accuracy single-modal remotely sensed imagery, single-modal DEM and dual-modal result.
CVJun 23, 2024
UDHF2-Net: Uncertainty-diffusion-model-based High-Frequency TransFormer Network for Remotely Sensed Imagery InterpretationPengfei Zhang, Chang Li, Yongjun Zhang et al.
Remotely sensed imagery interpretation (RSII) faces the three major problems: (1) objective representation of spatial distribution patterns; (2) edge uncertainty problem caused by downsampling encoder and intrinsic edge noises (e.g., mixed pixel and edge occlusion etc.); and (3) false detection problem caused by geometric registration error in change detection. To solve the aforementioned problems, uncertainty-diffusion-model-based high-Frequency TransFormer network (UDHF2-Net) is the first to be proposed, whose superiorities are as follows: (1) a spatially-stationary-and-non-stationary high-frequency connection paradigm (SHCP) is proposed to enhance the interaction of spatially frequency-wise stationary and non-stationary features to yield high-fidelity edge extraction result. Inspired by HRFormer, SHCP proposes high-frequency-wise stream to replace high-resolution-wise stream in HRFormer through the whole encoder-decoder process with parallel frequency-wise high-to-low streams, so it improves the edge extraction accuracy by continuously remaining high-frequency information; (2) a mask-and-geo-knowledge-based uncertainty diffusion module (MUDM), which is a self-supervised learning strategy, is proposed to improve the edge accuracy of extraction and change detection by gradually removing the simulated spectrum noises based on geo-knowledge and the generated diffused spectrum noises; (3) a frequency-wise semi-pseudo-Siamese UDHF2-Net is the first to be proposed to balance accuracy and complexity for change detection. Besides the aforementioned spectrum noises in semantic segmentation, MUDM is also a self-supervised learning strategy to effectively reduce the edge false change detection from the generated imagery with geometric registration error.
SPDec 8, 2021
Toward Open-World Electroencephalogram Decoding Via Deep Learning: A Comprehensive SurveyXun Chen, Chang Li, Aiping Liu et al.
Electroencephalogram (EEG) decoding aims to identify the perceptual, semantic, and cognitive content of neural processing based on non-invasively measured brain activity. Traditional EEG decoding methods have achieved moderate success when applied to data acquired in static, well-controlled lab environments. However, an open-world environment is a more realistic setting, where situations affecting EEG recordings can emerge unexpectedly, significantly weakening the robustness of existing methods. In recent years, deep learning (DL) has emerged as a potential solution for such problems due to its superior capacity in feature extraction. It overcomes the limitations of defining `handcrafted' features or features extracted using shallow architectures, but typically requires large amounts of costly, expertly-labelled data - something not always obtainable. Combining DL with domain-specific knowledge may allow for development of robust approaches to decode brain activity even with small-sample data. Although various DL methods have been proposed to tackle some of the challenges in EEG decoding, a systematic tutorial overview, particularly for open-world applications, is currently lacking. This article therefore provides a comprehensive survey of DL methods for open-world EEG decoding, and identifies promising research directions to inspire future studies for EEG decoding in real-world applications.
IROct 12, 2021
Optimizing Ranking Systems Online as BanditsChang Li
Ranking system is the core part of modern retrieval and recommender systems, where the goal is to rank candidate items given user contexts. Optimizing ranking systems online means that the deployed system can serve user requests, e.g., queries in the web search, and optimize the ranking policy by learning from user interactions, e.g., clicks. Bandit is a general online learning framework and can be used in our optimization task. However, due to the unique features of ranking, there are several challenges in designing bandit algorithms for ranking system optimization. In this dissertation, we study and propose solutions for four challenges in optimizing ranking systems online: effectiveness, safety, nonstationarity, and diversification. First, the effectiveness is related to how fast the algorithm learns from interactions. We study the effective online ranker evaluation task and propose the MergeDTS algorithm to solve the problem effectively. Second, the deployed algorithm should be safe, which means the algorithm only displays reasonable content to user requests. To solve the safe online learning to rank problem, we propose the BubbleRank algorithm. Third, as users change their preferences constantly, the algorithm should handle the nonstationarity. We formulate this nonstationary online learning to rank problem as cascade non-stationary bandits and propose CascadeDUCB and CascadeSWUCB algorithms to solve the problem. Finally, the contents in ranked lists should be diverse. We consider the results diversification task and propose the CascadeHybird algorithm that considers both the item relevance and results diversification when learning from user interactions.
ROAug 11, 2021
Elastic Tactile Simulation Towards Tactile-Visual PerceptionYikai Wang, Wenbing Huang, Bin Fang et al.
Tactile sensing plays an important role in robotic perception and manipulation tasks. To overcome the real-world limitations of data collection, simulating tactile response in a virtual environment comes as a desirable direction of robotic research. In this paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation. Most existing works model the tactile sensor as a rigid multi-body, which is incapable of reflecting the elastic property of the tactile sensor as well as characterizing the fine-grained physical interaction between the two objects. By contrast, EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied to regulate the deformation of particles during contact. With the tactile simulation by EIP, we further propose a tactile-visual perception network that enables information fusion between tactile data and visual images. The perception network is based on a global-to-local fusion mechanism where multi-scale tactile features are aggregated to the corresponding local region of the visual modality with the guidance of tactile positions and directions. The fusion method exhibits superiority regarding the 3D geometric reconstruction task.
IRMay 11, 2021
Federated Unbiased Learning to RankChang Li, Hua Ouyang
Unbiased Learning to Rank (ULTR) studies the problem of learning a ranking function based on biased user interactions. In this framework, ULTR algorithms have to rely on a large amount of user data that are collected, stored, and aggregated by central servers. In this paper, we consider an on-device search setting, where users search against their personal corpora on their local devices, and the goal is to learn a ranking function from biased user interactions. Due to privacy constraints, users' queries, personal documents, results lists, and raw interaction data will not leave their devices, and ULTR has to be carried out via Federated Learning (FL). Directly applying existing ULTR algorithms on users' devices could suffer from insufficient training data due to the limited amount of local interactions. To address this problem, we propose the FedIPS algorithm, which learns from user interactions on-device under the coordination of a central server and uses click propensities to remove the position bias in user interactions. Our evaluation of FedIPS on the Yahoo and Istella datasets shows that FedIPS is robust over a range of position biases.
CVJan 19, 2021
Human Action Recognition Based on Multi-scale Feature Maps from Depth Video SequencesChang Li, Qian Huang, Xing Li et al.
Human action recognition is an active research area in computer vision. Although great process has been made, previous methods mostly recognize actions based on depth data at only one scale, and thus they often neglect multi-scale features that provide additional information action recognition in practical application scenarios. In this paper, we present a novel framework focusing on multi-scale motion information to recognize human actions from depth video sequences. We propose a multi-scale feature map called Laplacian pyramid depth motion images(LP-DMI). We employ depth motion images (DMI) as the templates to generate the multi-scale static representation of actions. Then, we caculate LP-DMI to enhance multi-scale dynamic information of motions and reduces redundant static information in human bodies. We further extract the multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features. Finally, we utilize extreme learning machine (ELM) for action classification. The proposed method yeilds the recognition accuracy of 93.41%, 85.12%, 91.94% on public MSRAction3D dataset, UTD-MHAD and DHA dataset. Through extensive experiments, we prove that our method outperforms state-of-the-art benchmarks.
HCOct 13, 2020
Intrinsic motivation in virtual assistant interaction for fostering spontaneous interactionsChang Li, Hideyoshi Yanagisawa
With the growing utility of today's conversational virtual assistants, the importance of user motivation in human-AI interaction is becoming more obvious. However, previous studies in this and related fields, such as human-computer interaction and human-robot interaction, scarcely discussed intrinsic motivation and its affecting factors. Those studies either treated motivation as an inseparable concept or focused on non-intrinsic motivation. The current study aims to cover intrinsic motivation by taking an affective-engineering approach. A novel motivation model is proposed, in which intrinsic motivation is affected by two factors that derive from user interactions with virtual assistants: expectation of capability and uncertainty. Experiments are conducted where these two factors are manipulated by making participants believe they are interacting with the smart speaker "Amazon Echo". Intrinsic motivation is measured both by using questionnaires and by covertly monitoring a five-minute free-choice period in the experimenter's absence, during which the participants could decide for themselves whether to interact with the virtual assistants. Results of the first experiment showed that high expectation engenders more intrinsically motivated interaction compared with low expectation. The results also suggested suppressive effects by uncertainty on intrinsic motivation, though we had not hypothesized before experiments. We then revised our hypothetical model of action selection accordingly and conducted a verification experiment of uncertainty's effects. Results of the verification experiment showed that reducing uncertainty encourages more interactions and causes the motivation behind these interactions to shift from non-intrinsic to intrinsic.
LGDec 1, 2019
Cascading Hybrid Bandits: Online Learning to Rank for Relevance and DiversityChang Li, Haoyun Feng, Maarten de Rijke
Relevance ranking and result diversification are two core areas in modern recommender systems. Relevance ranking aims at building a ranked list sorted in decreasing order of item relevance, while result diversification focuses on generating a ranked list of items that covers a broad range of topics. In this paper, we study an online learning setting that aims to recommend a ranked list with $K$ items that maximizes the ranking utility, i.e., a list whose items are relevant and whose topics are diverse. We formulate it as the cascade hybrid bandits (CHB) problem. CHB assumes the cascading user behavior, where a user browses the displayed list from top to bottom, clicks the first attractive item, and stops browsing the rest. We propose a hybrid contextual bandit approach, called CascadeHybrid, for solving this problem. CascadeHybrid models item relevance and topical diversity using two independent functions and simultaneously learns those functions from user click feedback. We conduct experiments to evaluate CascadeHybrid on two real-world recommendation datasets: MovieLens and Yahoo music datasets. Our experimental results show that CascadeHybrid outperforms the baselines. In addition, we prove theoretical guarantees on the $n$-step performance demonstrating the soundness of CascadeHybrid.
LGMay 29, 2019
Cascading Non-Stationary Bandits: Online Learning to Rank in the Non-Stationary Cascade ModelChang Li, Maarten de Rijke
Non-stationarity appears in many online applications such as web search and advertising. In this paper, we study the online learning to rank problem in a non-stationary environment where user preferences change abruptly at an unknown moment in time. We consider the problem of identifying the K most attractive items and propose cascading non-stationary bandits, an online learning variant of the cascading model, where a user browses a ranked list from top to bottom and clicks on the first attractive item. We propose two algorithms for solving this non-stationary problem: CascadeDUCB and CascadeSWUCB. We analyze their performance and derive gap-dependent upper bounds on the n-step regret of these algorithms. We also establish a lower bound on the regret for cascading non-stationary bandits and show that both algorithms match the lower bound up to a logarithmic factor. Finally, we evaluate their performance on a real-world web search click dataset.
IRDec 12, 2018
Online Learning to Rank with List-level Feedback for Image FilteringChang Li, Artem Grotov, Ilya Markov et al.
Online learning to rank (OLTR) via implicit feedback has been extensively studied for document retrieval in cases where the feedback is available at the level of individual items. To learn from item-level feedback, the current algorithms require certain assumptions about user behavior. In this paper, we study a more general setup: OLTR with list-level feedback, where the feedback is provided only at the level of an entire ranked list. We propose two methods that allow online learning to rank in this setup. The first method, PGLearn, uses a ranking model to generate policies and optimizes it online using policy gradients. The second method, RegLearn, learns to combine individual document relevance scores by directly predicting the observed list-level feedback through regression. We evaluate the proposed methods on the image filtering task, in which deep neural networks (DNNs) are used to rank images in response to a set of standing queries. We show that PGLearn does not perform well in OLTR with list-level feedback. RegLearn, instead, shows good performance in both online and offline metrics.
IRDec 11, 2018
MergeDTS: A Method for Effective Large-Scale Online Ranker EvaluationChang Li, Ilya Markov, Maarten de Rijke et al.
Online ranker evaluation is one of the key challenges in information retrieval. While the preferences of rankers can be inferred by interleaving methods, the problem of how to effectively choose the ranker pair that generates the interleaved list without degrading the user experience too much is still challenging. On the one hand, if two rankers have not been compared enough, the inferred preference can be noisy and inaccurate. On the other, if two rankers are compared too many times, the interleaving process inevitably hurts the user experience too much. This dilemma is known as the exploration versus exploitation tradeoff. It is captured by the $K$-armed dueling bandit problem, which is a variant of the $K$-armed bandit problem, where the feedback comes in the form of pairwise preferences. Today's deployed search systems can evaluate a large number of rankers concurrently, and scaling effectively in the presence of numerous rankers is a critical aspect of $K$-armed dueling bandit problems. In this paper, we focus on solving the large-scale online ranker evaluation problem under the so-called Condorcet assumption, where there exists an optimal ranker that is preferred to all other rankers. We propose Merge Double Thompson Sampling (MergeDTS), which first utilizes a divide-and-conquer strategy that localizes the comparisons carried out by the algorithm to small batches of rankers, and then employs Thompson Sampling (TS) to reduce the comparisons between suboptimal rankers inside these small batches. The effectiveness (regret) and efficiency (time complexity) of MergeDTS are extensively evaluated using examples from the domain of online evaluation for web search. Our main finding is that for large-scale Condorcet ranker evaluation problems, MergeDTS outperforms the state-of-the-art dueling bandit algorithms.
LGJun 18, 2018
Incremental Sparse Bayesian Ordinal RegressionChang Li, Maarten de Rijke
Ordinal Regression (OR) aims to model the ordering information between different data categories, which is a crucial topic in multi-label learning. An important class of approaches to OR models the problem as a linear combination of basis functions that map features to a high dimensional non-linear space. However, most of the basis function-based algorithms are time consuming. We propose an incremental sparse Bayesian approach to OR tasks and introduce an algorithm to sequentially learn the relevant basis functions in the ordinal scenario. Our method, called Incremental Sparse Bayesian Ordinal Regression (ISBOR), automatically optimizes the hyper-parameters via the type-II maximum likelihood method. By exploiting fast marginal likelihood optimization, ISBOR can avoid big matrix inverses, which is the main bottleneck in applying basis function-based algorithms to OR tasks on large-scale datasets. We show that ISBOR can make accurate predictions with parsimonious basis functions while offering automatic estimates of the prediction uncertainty. Extensive experiments on synthetic and real word datasets demonstrate the efficiency and effectiveness of ISBOR compared to other basis function-based OR approaches.
LGJun 15, 2018
BubbleRank: Safe Online Learning to Re-Rank via Implicit Click FeedbackChang Li, Branislav Kveton, Tor Lattimore et al.
In this paper, we study the problem of safe online learning to re-rank, where user feedback is used to improve the quality of displayed lists. Learning to rank has traditionally been studied in two settings. In the offline setting, rankers are typically learned from relevance labels created by judges. This approach has generally become standard in industrial applications of ranking, such as search. However, this approach lacks exploration and thus is limited by the information content of the offline training data. In the online setting, an algorithm can experiment with lists and learn from feedback on them in a sequential fashion. Bandit algorithms are well-suited for this setting but they tend to learn user preferences from scratch, which results in a high initial cost of exploration. This poses an additional challenge of safe exploration in ranked lists. We propose BubbleRank, a bandit algorithm for safe re-ranking that combines the strengths of both the offline and online settings. The algorithm starts with an initial base list and improves it online by gradually exchanging higher-ranked less attractive items for lower-ranked more attractive items. We prove an upper bound on the n-step regret of BubbleRank that degrades gracefully with the quality of the initial base list. Our theoretical findings are supported by extensive experiments on a large-scale real-world click dataset.
LGSep 18, 2016
Probabilistic Feature Selection and Classification Vector MachineBingbing Jiang, Chang Li, Maarten de Rijke et al.
Sparse Bayesian learning is a state-of-the-art supervised learning algorithm that can choose a subset of relevant samples from the input data and make reliable probabilistic predictions. However, in the presence of high-dimensional data with irrelevant features, traditional sparse Bayesian classifiers suffer from performance degradation and low efficiency by failing to eliminate irrelevant features. To tackle this problem, we propose a novel sparse Bayesian embedded feature selection method that adopts truncated Gaussian distributions as both sample and feature priors. The proposed method, called probabilistic feature selection and classification vector machine (PFCVMLP ), is able to simultaneously select relevant features and samples for classification tasks. In order to derive the analytical solutions, Laplace approximation is applied to compute approximate posteriors and marginal likelihoods. Finally, parameters and hyperparameters are optimized by the type-II maximum likelihood method. Experiments on three datasets validate the performance of PFCVMLP along two dimensions: classification performance and effectiveness for feature selection. Finally, we analyze the generalization performance and derive a generalization error bound for PFCVMLP . By tightening the bound, the importance of feature selection is demonstrated.