Jihong Guan

CV
h-index51
36papers
1,145citations
Novelty53%
AI Score60

36 Papers

CRJun 2
Ghost: Plausible Yet Unlearnable Trajectories via On-Manifold Substitution for Next-POI Privacy

Zhenyu Yu, Jihong Guan, Shuigeng Zhou

A publisher who releases check-in trajectories inadvertently publishes a strong predictor of every user's future locations. We address this risk by generating unlearnable trajectories, perturbed sequences that yield victim models with degraded next-Point-of-Interest (next-POI) accuracy on clean test inputs. Direct ports of image-domain unlearnable examples fail on two counts. The published data must remain geographically and semantically plausible, and the perturbation must resist purification adversaries that exploit the structure of randomized defences. We propose Ghost, a manifold-aligned framework whose perturbations look like plausible human check-in sequences yet leave no learnable signal behind. Ghost steers each substitution onto the real-trajectory manifold through a frozen trajectory language model, so a denoising-bridge adversary has nothing to invert and a context-free frequency-table adversary recovers a near-uniform distribution. Across two standard benchmarks, and four attacker postures, Ghost achieves protection-gap competitive with the strongest deterministic baseline (PGD) while attaining the lowest restored accuracy under the bigram adaptive purification adversary on both datasets, and lies within one per-cell standard deviation of PGD on the protection-versus-purification-resistance plane. Ablations confirm the manifold prior subsumes the entropy-floor knob of prior randomized defences, with the frequency-table adversary's survival gap remaining within 0.04 even when twenty percent of the pairs are leaked.

AIMay 31Code
FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

Hongxu Ma, Han Zhou, Chenghou Jin et al.

Watch time has emerged as a pivotal metric for optimizing deep user engagement in short-video recommender systems. However, current methods of watch time prediction (WTP) suffer from inherent paradigm-specific limitations. Direct Regression faces mean-collapse due to unimodal Gaussian assumptions, while Ordinal Regression is hampered by quantization errors from rigid discretization. Similarly, Discrete Generative Regression struggles with high inference latency and heuristic vocabulary design. Beyond these specific flaws, a shared deficiency is the inability to capture the intrinsic multimodality and heterogeneity of User-Item Interaction Patterns. To address these challenges, we first revisit the WTP problem from a causal perspective and identify these user-specific patterns as structural confounders that modulate watch time outcomes, where identical interests manifest as distinct watch time outcomes conditioned on diverse user habits. Then, we formally propose a new (or the fourth) paradigm -- Continuous Generative Regression, and introduce FlowTime, a novel method utilizing a One-step Generative Variational Autoencoder. FlowTime effectively circumvents the latency of iterative denoising while maintaining the expressivity of continuous latent spaces. Furthermore, we design a Flow-based Personalized Prior that leverages NFs to warp a standard Gaussian prior into a complex, history-conditioned manifold, thereby enabling the adaptive modeling of multimodal interaction patterns. Finally, we build TimeRec, the first open-source WTP Library, alongside a novel personalization metric to establish a rigorous benchmarking standard. Extensive offline experiments and online A/B tests demonstrate FlowTime's significant superiority over SOTA methods.

CLApr 24, 2022Code
EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification

Minyi Zhao, Lu Zhang, Yi Xu et al.

Recent works have empirically shown the effectiveness of data augmentation (DA) in NLP tasks, especially for those suffering from data scarcity. Intuitively, given the size of generated data, their diversity and quality are crucial to the performance of targeted tasks. However, to the best of our knowledge, most existing methods consider only either the diversity or the quality of augmented data, thus cannot fully mine the potential of DA for NLP. In this paper, we present an easy and plug-in data augmentation framework EPiDA to support effective text classification. EPiDA employs two mechanisms: relative entropy maximization (REM) and conditional entropy minimization (CEM) to control data generation, where REM is designed to enhance the diversity of augmented data while CEM is exploited to ensure their semantic consistency. EPiDA can support efficient and continuous data generation for effective classifier training. Extensive experiments show that EPiDA outperforms existing SOTA methods in most cases, though not using any agent networks or pre-trained generation networks, and it works well with various DA algorithms and classification models. Code is available at https://github.com/zhaominyiz/EPiDA.

SIJul 4, 2023
All in One: Multi-task Prompting for Graph Neural Networks

Xiangguo Sun, Hong Cheng, Jia Li et al.

Recently, ''pre-training and fine-tuning'' has been adopted as a standard workflow for many graph tasks since it can take general graph knowledge to relieve the lack of graph annotations from each application. However, graph tasks with node level, edge level, and graph level are far diversified, making the pre-training pretext often incompatible with these multiple tasks. This gap may even cause a ''negative transfer'' to the specific application, leading to poor results. Inspired by the prompt learning in natural language processing (NLP), which has presented significant effectiveness in leveraging prior knowledge for various NLP tasks, we study the prompting topic for graphs with the motivation of filling the gap between pre-trained models and various graph tasks. In this paper, we propose a novel multi-task prompting method for graph models. Specifically, we first unify the format of graph prompts and language prompts with the prompt token, token structure, and inserting pattern. In this way, the prompting idea from NLP can be seamlessly introduced to the graph area. Then, to further narrow the gap between various graph tasks and state-of-the-art pre-training strategies, we further study the task space of various graph applications and reformulate downstream problems to the graph-level task. Afterward, we introduce meta-learning to efficiently learn a better initialization for the multi-task prompt of graphs so that our prompting framework can be more reliable and general for different tasks. We conduct extensive experiments, results from which demonstrate the superiority of our method.

CVMar 27, 2022
Recent Few-Shot Object Detection Algorithms: A Survey with Performance Comparison

Tianying Liu, Lu Zhang, Yang Wang et al.

The generic object detection (GOD) task has been successfully tackled by recent deep neural networks, trained by an avalanche of annotated training samples from some common classes. However, it is still non-trivial to generalize these object detectors to the novel long-tailed object classes, which have only few labeled training samples. To this end, the Few-Shot Object Detection (FSOD) has been topical recently, as it mimics the humans' ability of learning to learn, and intelligently transfers the learned generic object knowledge from the common heavy-tailed, to the novel long-tailed object classes. Especially, the research in this emerging field has been flourishing in recent years with various benchmarks, backbones, and methodologies proposed. To review these FSOD works, there are several insightful FSOD survey articles [58, 59, 74, 78] that systematically study and compare them as the groups of fine-tuning/transfer learning, and meta-learning methods. In contrast, we review the existing FSOD algorithms from a new perspective under a new taxonomy based on their contributions, i.e., data-oriented, model-oriented, and algorithm-oriented. Thus, a comprehensive survey with performance comparison is conducted on recent achievements of FSOD. Furthermore, we also analyze the technical challenges, the merits and demerits of these methods, and envision the future directions of FSOD. Specifically, we give an overview of FSOD, including the problem definition, common datasets, and evaluation protocols. The taxonomy is then proposed that groups FSOD methods into three types. Following this taxonomy, we provide a systematic review of the advances in FSOD. Finally, further discussions on performance, challenges, and future directions are presented.

AIApr 20Code
One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction

Jihong Guan, Jiaqi Wang, Wengen Li et al.

Knowledge Graphs (KGs) are composed of triples, and the goal of Knowledge Graph Completion (KGC) is to infer the missing factual triples. Traditional KGC tasks predict missing elements in a triple given one or two of its elements. As a more realistic task, the Triple Set Prediction (TSP) task aims to infer the set of missing triples conditioned only on the observed knowledge graph, without assuming any partial information about the missing triples. Existing TSP methods predict the set of missing triples in a triple-by-triple manner, falling short in capturing the dependencies among the predicted triples to ensure consistency. To address this issue, we propose a novel discrete diffusion model termed DiffTSP that treats TSP as a generative task. DiffTSP progressively adds noise to the KG through a discrete diffusion process, achieved by masking relational edges. The reverse process then gradually recovers the complete KG conditioned on the incomplete graph. To this end, we design a structure-aware denoising network that integrates a relational context encoder with a relational graph diffusion transformer for knowledge graph generation. DiffTSP can generate the complete set of triples in a one-pass manner while ensuring the dependencies among the predicted triples. Our approach achieves state-of-the-art performance on three public datasets. Code: https://github.com/ADMIS-TONGJI/DiffTSP.

CVOct 8, 2022
Hierarchical Few-Shot Object Detection: Problem, Benchmark and Method

Lu Zhang, Yang Wang, Jiaogen Zhou et al.

Few-shot object detection (FSOD) is to detect objects with a few examples. However, existing FSOD methods do not consider hierarchical fine-grained category structures of objects that exist widely in real life. For example, animals are taxonomically classified into orders, families, genera and species etc. In this paper, we propose and solve a new problem called hierarchical few-shot object detection (Hi-FSOD), which aims to detect objects with hierarchical categories in the FSOD paradigm. To this end, on the one hand, we build the first large-scale and high-quality Hi-FSOD benchmark dataset HiFSOD-Bird, which contains 176,350 wild-bird images falling to 1,432 categories. All the categories are organized into a 4-level taxonomy, consisting of 32 orders, 132 families, 572 genera and 1,432 species. On the other hand, we propose the first Hi-FSOD method HiCLPL, where a hierarchical contrastive learning approach is developed to constrain the feature space so that the feature distribution of objects is consistent with the hierarchical taxonomy and the model's generalization power is strengthened. Meanwhile, a probabilistic loss is designed to enable the child nodes to correct the classification errors of their parent nodes in the taxonomy. Extensive experiments on the benchmark dataset HiFSOD-Bird show that our method HiCLPL outperforms the existing FSOD methods.

BMMar 13, 2023
Molecular Property Prediction by Semantic-invariant Contrastive Learning

Ziqiao Zhang, Ailin Xie, Jihong Guan et al.

Contrastive learning have been widely used as pretext tasks for self-supervised pre-trained molecular representation learning models in AI-aided drug design and discovery. However, exiting methods that generate molecular views by noise-adding operations for contrastive learning may face the semantic inconsistency problem, which leads to false positive pairs and consequently poor prediction performance. To address this problem, in this paper we first propose a semantic-invariant view generation method by properly breaking molecular graphs into fragment pairs. Then, we develop a Fragment-based Semantic-Invariant Contrastive Learning (FraSICL) model based on this view generation method for molecular property prediction. The FraSICL model consists of two branches to generate representations of views for contrastive learning, meanwhile a multi-view fusion and an auxiliary similarity loss are introduced to make better use of the information contained in different fragment-pair views. Extensive experiments on various benchmark datasets show that with the least number of pre-training samples, FraSICL can achieve state-of-the-art performance, compared with major existing counterpart models.

CVAug 18, 2023
Meta-ZSDETR: Zero-shot DETR with Meta-learning

Lu Zhang, Chenbo Zhang, Jiajia Zhao et al.

Zero-shot object detection aims to localize and recognize objects of unseen classes. Most of existing works face two problems: the low recall of RPN in unseen classes and the confusion of unseen classes with background. In this paper, we present the first method that combines DETR and meta-learning to perform zero-shot object detection, named Meta-ZSDETR, where model training is formalized as an individual episode based meta-learning task. Different from Faster R-CNN based methods that firstly generate class-agnostic proposals, and then classify them with visual-semantic alignment module, Meta-ZSDETR directly predict class-specific boxes with class-specific queries and further filter them with the predicted accuracy from classification head. The model is optimized with meta-contrastive learning, which contains a regression head to generate the coordinates of class-specific boxes, a classification head to predict the accuracy of generated boxes, and a contrastive head that utilizes the proposed contrastive-reconstruction loss to further separate different classes in visual space. We conduct extensive experiments on two benchmark datasets MS COCO and PASCAL VOC. Experimental results show that our method outperforms the existing ZSD methods by a large margin.

LGJul 20, 2023
Spatial-Temporal Data Mining for Ocean Science: Data, Methodologies, and Opportunities

Hanchen Yang, Wengen Li, Shuyu Wang et al.

With the rapid amassing of spatial-temporal (ST) ocean data, many spatial-temporal data mining (STDM) studies have been conducted to address various oceanic issues, including climate forecasting and disaster warning. Compared with typical ST data (e.g., traffic data), ST ocean data is more complicated but with unique characteristics, e.g., diverse regionality and high sparsity. These characteristics make it difficult to design and train STDM models on ST ocean data. To the best of our knowledge, a comprehensive survey of existing studies remains missing in the literature, which hinders not only computer scientists from identifying the research issues in ocean data mining but also ocean scientists to apply advanced STDM techniques. In this paper, we provide a comprehensive survey of existing STDM studies for ocean science. Concretely, we first review the widely-used ST ocean datasets and highlight their unique characteristics. Then, typical ST ocean data quality enhancement techniques are explored. Next, we classify existing STDM studies in ocean science into four types of tasks, i.e., prediction, event detection, pattern mining, and anomaly detection, and elaborate on the techniques for these tasks. Finally, promising research opportunities are discussed. This survey can help scientists from both computer science and ocean science better understand the fundamental concepts, key techniques, and open challenges of STDM for ocean science.

BMOct 5, 2023
Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules with Desirable Properties

Siyuan Guo, Jihong Guan, Shuigeng Zhou

In the past decade, Artificial Intelligence driven drug design and discovery has been a hot research topic, where an important branch is molecule generation by generative models, from GAN-based models and VAE-based models to the latest diffusion-based models. However, most existing models pursue only the basic properties like validity and uniqueness of the generated molecules, a few go further to explicitly optimize one single important molecular property (e.g. QED or PlogP), which makes most generated molecules little usefulness in practice. In this paper, we present a novel approach to generating molecules with desirable properties, which expands the diffusion model framework with multiple innovative designs. The novelty is two-fold. On the one hand, considering that the structures of molecules are complex and diverse, and molecular properties are usually determined by some substructures (e.g. pharmacophores), we propose to perform diffusion on two structural levels: molecules and molecular fragments respectively, with which a mixed Gaussian distribution is obtained for the reverse diffusion process. To get desirable molecular fragments, we develop a novel electronic effect based fragmentation method. On the other hand, we introduce two ways to explicitly optimize multiple molecular properties under the diffusion model framework. First, as potential drug molecules must be chemically valid, we optimize molecular validity by an energy-guidance function. Second, since potential drug molecules should be desirable in various properties, we employ a multi-objective mechanism to optimize multiple molecular properties simultaneously. Extensive experiments with two benchmark datasets QM9 and ZINC250k show that the molecules generated by our proposed method have better validity, uniqueness, novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated by current SOTA models.

CRJul 26, 2023
Flexible Differentially Private Vertical Federated Learning with Adaptive Feature Embeddings

Yuxi Mi, Hongquan Liu, Yewei Xia et al.

The emergence of vertical federated learning (VFL) has stimulated concerns about the imperfection in privacy protection, as shared feature embeddings may reveal sensitive information under privacy attacks. This paper studies the delicate equilibrium between data privacy and task utility goals of VFL under differential privacy (DP). To address the generality issue of prior arts, this paper advocates a flexible and generic approach that decouples the two goals and addresses them successively. Specifically, we initially derive a rigorous privacy guarantee by applying norm clipping on shared feature embeddings, which is applicable across various datasets and models. Subsequently, we demonstrate that task utility can be optimized via adaptive adjustments on the scale and distribution of feature embeddings in an accuracy-appreciative way, without compromising established DP mechanisms. We concretize our observation into the proposed VFL-AFE framework, which exhibits effectiveness against privacy attacks and the capacity to retain favorable task utility, as substantiated by extensive experiments.

CVJul 31, 2023
HiREN: Towards Higher Supervision Quality for Better Scene Text Image Super-Resolution

Minyi Zhao, Yi Xu, Bingjia Li et al.

Scene text image super-resolution (STISR) is an important pre-processing technique for text recognition from low-resolution scene images. Nowadays, various methods have been proposed to extract text-specific information from high-resolution (HR) images to supervise STISR model training. However, due to uncontrollable factors (e.g. shooting equipment, focus, and environment) in manually photographing HR images, the quality of HR images cannot be guaranteed, which unavoidably impacts STISR performance. Observing the quality issue of HR images, in this paper we propose a novel idea to boost STISR by first enhancing the quality of HR images and then using the enhanced HR images as supervision to do STISR. Concretely, we develop a new STISR framework, called High-Resolution ENhancement (HiREN) that consists of two branches and a quality estimation module. The first branch is developed to recover the low-resolution (LR) images, and the other is an HR quality enhancement branch aiming at generating high-quality (HQ) text images based on the HR images to provide more accurate supervision to the LR images. As the degradation from HQ to HR may be diverse, and there is no pixel-level supervision for HQ image generation, we design a kernel-guided enhancement network to handle various degradation, and exploit the feedback from a recognizer and text-level annotations as weak supervision signal to train the HR enhancement branch. Then, a quality estimation module is employed to evaluate the qualities of HQ images, which are used to suppress the erroneous supervision information by weighting the loss of each image. Extensive experiments on TextZoom show that HiREN can work well with most existing STISR methods and significantly boost their performances.

CVSep 22, 2024
One Model for Two Tasks: Cooperatively Recognizing and Recovering Low-Resolution Scene Text Images by Iterative Mutual Guidance

Minyi Zhao, Yang Wang, Jihong Guan et al.

Scene text recognition (STR) from high-resolution (HR) images has been significantly successful, however text reading on low-resolution (LR) images is still challenging due to insufficient visual information. Therefore, recently many scene text image super-resolution (STISR) models have been proposed to generate super-resolution (SR) images for the LR ones, then STR is done on the SR images, which thus boosts recognition performance. Nevertheless, these methods have two major weaknesses. On the one hand, STISR approaches may generate imperfect or even erroneous SR images, which mislead the subsequent recognition of STR models. On the other hand, as the STISR and STR models are jointly optimized, to pursue high recognition accuracy, the fidelity of SR images may be spoiled. As a result, neither the recognition performance nor the fidelity of STISR models are desirable. Then, can we achieve both high recognition performance and good fidelity? To this end, in this paper we propose a novel method called IMAGE (the abbreviation of Iterative MutuAl GuidancE) to effectively recognize and recover LR scene text images simultaneously. Concretely, IMAGE consists of a specialized STR model for recognition and a tailored STISR model to recover LR images, which are optimized separately. And we develop an iterative mutual guidance mechanism, with which the STR model provides high-level semantic information as clue to the STISR model for better super-resolution, meanwhile the STISR model offers essential low-level pixel clue to the STR model for more accurate recognition. Extensive experiments on two LR datasets demonstrate the superiority of our method over the existing works on both recognition performance and super-resolution fidelity.

CVOct 9, 2023
A Lightweight Video Anomaly Detection Model with Weak Supervision and Adaptive Instance Selection

Yang Wang, Jiaogen Zhou, Jihong Guan

Video anomaly detection is to determine whether there are any abnormal events, behaviors or objects in a given video, which enables effective and intelligent public safety management. As video anomaly labeling is both time-consuming and expensive, most existing works employ unsupervised or weakly supervised learning methods. This paper focuses on weakly supervised video anomaly detection, in which the training videos are labeled whether or not they contain any anomalies, but there is no information about which frames the anomalies are located. However, the uncertainty of weakly labeled data and the large model size prevent existing methods from wide deployment in real scenarios, especially the resource-limit situations such as edge-computing. In this paper, we develop a lightweight video anomaly detection model. On the one hand, we propose an adaptive instance selection strategy, which is based on the model's current status to select confident instances, thereby mitigating the uncertainty of weakly labeled data and subsequently promoting the model's performance. On the other hand, we design a lightweight multi-level temporal correlation attention module and an hourglass-shaped fully connected layer to construct the model, which can reduce the model parameters to only 0.56\% of the existing methods (e.g. RTFM). Our extensive experiments on two public datasets UCF-Crime and ShanghaiTech show that our model can achieve comparable or even superior AUC score compared to the state-of-the-art methods, with a significantly reduced number of model parameters.

CVJan 16
IDDR-NGP: Incorporating Detectors for Distractor Removal with Instant Neural Radiance Field

Xianliang Huang, Jiajie Gou, Shuhang Chen et al.

This paper presents the first unified distractor removal method, named IDDR-NGP, which directly operates on Instant-NPG. The method is able to remove a wide range of distractors in 3D scenes, such as snowflakes, confetti, defoliation and petals, whereas existing methods usually focus on a specific type of distractors. By incorporating implicit 3D representations with 2D detectors, we demonstrate that it is possible to efficiently restore 3D scenes from multiple corrupted images. We design the learned perceptual image patch similarity~( LPIPS) loss and the multi-view compensation loss (MVCL) to jointly optimize the rendering results of IDDR-NGP, which could aggregate information from multi-view corrupted images. All of them can be trained in an end-to-end manner to synthesize high-quality 3D scenes. To support the research on distractors removal in implicit 3D representations, we build a new benchmark dataset that consists of both synthetic and real-world distractors. To validate the effectiveness and robustness of IDDR-NGP, we provide a wide range of distractors with corresponding annotated labels added to both realistic and synthetic scenes. Extensive experimental results demonstrate the effectiveness and robustness of IDDR-NGP in removing multiple types of distractors. In addition, our approach achieves results comparable with the existing SOTA desnow methods and is capable of accurately removing both realistic and synthetic distractors.

CVMay 20
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration

Yi Liu, Jia Ma, Wengen Li et al.

Recent advances in Image Restoration (IR) have been largely driven by generative methods such as Diffusion Models and Flow Matching, which excel in synthesizing realistic textures while suffering from slow multi-step inference and compromised pixel fidelity. In contrast, classical regression-based IR methods excel precisely in these aspects, offering single-step efficiency and high pixel-level reconstruction fidelity. To bridge this gap, we propose DiSI, a unified framework that Disentangles the underlying Stochastic Interpolant process into independent generation and regression components. This decoupling endows DiSI with remarkable versatility, enabling a continuous and controllable transition from a pure regression process to a fully generative one. Technically, we instantiate this framework with two specific sampling trajectories, accompanied by a unified sampler for high-quality, few-step inference on arbitrary trajectories. Furthermore, we design a dual-branch U-Net style transformer network in pixel space, using a dedicated branch to enhance conditional guidance while ensuring high throughput. Extensive experiments demonstrate that DiSI efficiently achieves competitive results on various IR tasks, while uniquely offering the inference-time flexibility to control the distortion-perception trade-off within a single model.

AIApr 25Code
AdaMamba: Adaptive Frequency-Gated Mamba for Long-Term Time Series Forecasting

Xudong Jiang, Mingshan Loo, Hanchen Yang et al.

Accurate long-term time series forecasting (LTSF) requires the capture of complex long-range dependencies and dynamic periodic patterns. Recent advances in frequency-domain analysis offer a global perspective for uncovering temporal characteristics. However, real-world time series often exhibit pronounced cross-domain heterogeneity where variables that appear synchronized in the time domain can differ substantially in the frequency domain. Existing frequency-based LTSF methods often rely on implicit assumptions of cross-domain homogeneity, which limits their ability to adapt to such intricate variability. To effectively integrate frequency-domain analysis with temporal dependency learning, we propose AdaMamba, a novel framework that endogenizes adaptive and context-aware frequency analysis within the Mamba state-space update process. Specifically, AdaMamba introduces an interactive patch encoding module to capture inter-variable interaction dynamics. Then, we develop an adaptive frequency-gated state-space module that generates input-dependent frequency bases, and generalizes the conventional temporal forgetting gate into a unified time-frequency forgetting gate. This allows dynamic calibration of state transitions based on learned frequency-domain importance, while preserving Mamba's capability in modeling long-range dependencies. Extensive experiments on seven public LTSF benchmarks and two domain-specific datasets demonstrate that AdaMamba consistently outperforms state-of-the-art methods in forecasting accu racy while maintaining competitive computational efficiency. The code of AdaMamba is available at https://github.com/XDjiang25/AdaMamba.

AO-PHApr 21
An Adaptive Spatiotemporal Clustering Framework for 3D Ocean Subsurface Temperature Reconstruction

Ming Shan Loo, Wengen Li, Xudong Jiang et al.

The reconstruction of ocean subsurface temperature (OST) using satellite remote sensing data holds significant scientific value for advancing the understanding of ocean dynamics and climate variability. However, the scarcity of subsurface observations, combined with the high degree of nonlinearity and spatiotemporal heterogeneity in subsurface processes, poses substantial challenges to the accuracy and generalization capability of traditional reconstruction methods. To address these limitations, this study proposes an adaptive framework that could capture both vertical structural dependencies and temporal variation patterns of OST via spatio-temporal clustering. By incorporating this framework with various deep learning models, e.g., dual-path convolutional neural networks (DP-CNN), Attention U-Net, and Vision Transformer (ViT), the OST field can be accurately reconstructed at a global scale only using surface observations, i.e., sea surface temperature (SST), sea surface salinity (SSS), sea surface height (SSH), and sea surface wind (SSW). Experimental results demonstrate that multiple deep learning methods using the proposed framework largely outperform their original counterparts, yielding improvements in RMSE ranging from 12.4\% to 27.2\%. This study provides a reliable solution for subsurface temperature reconstruction, offering important implications for meteorological modeling and climate change assessment.

CVMar 31, 2025Code
Effective Cloud Removal for Remote Sensing Images by an Improved Mean-Reverting Denoising Model with Elucidated Design Space

Yi Liu, Wengen Li, Jihong Guan et al.

Cloud removal (CR) remains a challenging task in remote sensing image processing. Although diffusion models (DM) exhibit strong generative capabilities, their direct applications to CR are suboptimal, as they generate cloudless images from random noise, ignoring inherent information in cloudy inputs. To overcome this drawback, we develop a new CR model EMRDM based on mean-reverting diffusion models (MRDMs) to establish a direct diffusion process between cloudy and cloudless images. Compared to current MRDMs, EMRDM offers a modular framework with updatable modules and an elucidated design space, based on a reformulated forward process and a new ordinary differential equation (ODE)-based backward process. Leveraging our framework, we redesign key MRDM modules to boost CR performance, including restructuring the denoiser via a preconditioning technique, reorganizing the training process, and improving the sampling process by introducing deterministic and stochastic samplers. To achieve multi-temporal CR, we further develop a denoising network for simultaneously denoising sequential images. Experiments on mono-temporal and multi-temporal datasets demonstrate the superior performance of EMRDM. Our code is available at https://github.com/Ly403/EMRDM.

CVMay 10
SSDA: Bridging Spectral and Structural Gaps via Dual Adaptation for Vision-Based Time Series Forecasting

Mingrui Zhang, Hanchen Yang, Wengen Li et al.

Large vision models (LVMs) have recently proven to be surprisingly effective time series forecasters, simply by rendering temporal data as images. This success, how ever, rests on a largely unexamined premise: the rendered time series images are sufficiently close to natural images for knowledge in pre-trained models to transfer effectively. We argue that two gaps still remain, i.e., spectral and structural gaps, fundamentally limiting the potential of LVMs for time series forecasting. Spectrally, we systematically reveal that rendered time series images exhibit a markedly shallower power spectrum than the natural images LVMs are pre-trained to recognize. Structurally, reshaping 1D temporal sequences into 2D grids fabricates spurious spatial adjacencies while severing genuine temporal continuities, misleading the spatial inductive biases of pre-trained LVMs. To bridge these gaps, we propose SSDA, a dual-branch network that spectrally and structurally adapts to unlock the full potential of LVMs for time series forecasting. At the data level, a Spectral Magnitude Aligner (SMA) applies 2D FFT to selectively enhance the magnitude spectrum toward natural-image statistics while preserving phase. At the model level, a Structural-Guided Low-Rank Adaptation (SG-LoRA) injects position-aware temporal encodings into patch embeddings and adapts at tention via low-rank updates. The two branches are further adaptively fused to produce the final forecast. Extensive experiments on seven real-world benchmarks demonstrate that SSDA consistently outperforms strong LVM- and LLM-based baselines under both full-shot and few-shot settings. Code is publicly available at https://anonymous.4open.science/r/SSDA-8C5B.

LGJun 24, 2024Code
CausalFormer: An Interpretable Transformer for Temporal Causal Discovery

Lingbai Kong, Wengen Li, Hanchen Yang et al.

Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping process from the model parameters to the causality and fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the utilization of the whole deep learning models in temporal causal discovery, we proposed an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution which aggregates each input time series along the temporal dimension under the temporal priority constraint. Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer on discovering temporal causality. Our code is available at https://github.com/lingbai-kong/CausalFormer.

QMDec 8, 2024Code
M$^{3}$-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery

Siyuan Guo, Lexuan Wang, Chang Jin et al.

This paper introduces M$^{3}$-20M, a large-scale Multi-Modal Molecule dataset that contains over 20 million molecules, with the data mainly being integrated from existing databases and partially generated by large language models. Designed to support AI-driven drug design and discovery, M$^{3}$-20M is 71 times more in the number of molecules than the largest existing dataset, providing an unprecedented scale that can highly benefit the training or fine-tuning of models, including large language models for drug design and discovery tasks. This dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M$^{3}$-20M in drug design and discovery, we conduct extensive experiments on two key tasks: molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, GPT-4, and Llama3-8b. Our experimental results show that M$^{3}$-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and achieve higher property prediction accuracy than existing single-modal datasets, which validates the value and potential of M$^{3}$-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.

LGDec 28, 2023
Molecular Property Prediction Based on Graph Structure Learning

Bangyi Zhao, Weixia Xu, Jihong Guan et al.

Molecular property prediction (MPP) is a fundamental but challenging task in the computer-aided drug discovery process. More and more recent works employ different graph-based models for MPP, which have made considerable progress in improving prediction performance. However, current models often ignore relationships between molecules, which could be also helpful for MPP. For this sake, in this paper we propose a graph structure learning (GSL) based MPP approach, called GSL-MPP. Specifically, we first apply graph neural network (GNN) over molecular graphs to extract molecular representations. Then, with molecular fingerprints, we construct a molecular similarity graph (MSG). Following that, we conduct graph structure learning on the MSG (i.e., molecule-level graph structure learning) to get the final molecular embeddings, which are the results of fusing both GNN encoded molecular representations and the relationships among molecules, i.e., combining both intra-molecule and inter-molecule information. Finally, we use these molecular embeddings to perform MPP. Extensive experiments on seven various benchmark datasets show that our method could achieve state-of-the-art performance in most cases, especially on classification tasks. Further visualization studies also demonstrate the good molecular representations of our method.

LGJul 31, 2025
OKG-LLM: Aligning Ocean Knowledge Graph with Observation Data via LLMs for Global Sea Surface Temperature Prediction

Hanchen Yang, Jiaqi Wang, Jiannong Cao et al.

Sea surface temperature (SST) prediction is a critical task in ocean science, supporting various applications, such as weather forecasting, fisheries management, and storm tracking. While existing data-driven methods have demonstrated significant success, they often neglect to leverage the rich domain knowledge accumulated over the past decades, limiting further advancements in prediction accuracy. The recent emergence of large language models (LLMs) has highlighted the potential of integrating domain knowledge for downstream tasks. However, the application of LLMs to SST prediction remains underexplored, primarily due to the challenge of integrating ocean domain knowledge and numerical data. To address this issue, we propose Ocean Knowledge Graph-enhanced LLM (OKG-LLM), a novel framework for global SST prediction. To the best of our knowledge, this work presents the first systematic effort to construct an Ocean Knowledge Graph (OKG) specifically designed to represent diverse ocean knowledge for SST prediction. We then develop a graph embedding network to learn the comprehensive semantic and structural knowledge within the OKG, capturing both the unique characteristics of individual sea regions and the complex correlations between them. Finally, we align and fuse the learned knowledge with fine-grained numerical SST data and leverage a pre-trained LLM to model SST patterns for accurate prediction. Extensive experiments on the real-world dataset demonstrate that OKG-LLM consistently outperforms state-of-the-art methods, showcasing its effectiveness, robustness, and potential to advance SST prediction. The codes are available in the online repository.

CVJul 14, 2025
Fine-Grained Zero-Shot Object Detection

Hongxu Ma, Chenbo Zhang, Lu Zhang et al.

Zero-shot object detection (ZSD) aims to leverage semantic descriptions to localize and recognize objects of both seen and unseen classes. Existing ZSD works are mainly coarse-grained object detection, where the classes are visually quite different, thus are relatively easy to distinguish. However, in real life we often have to face fine-grained object detection scenarios, where the classes are too similar to be easily distinguished. For example, detecting different kinds of birds, fishes, and flowers. In this paper, we propose and solve a new problem called Fine-Grained Zero-Shot Object Detection (FG-ZSD for short), which aims to detect objects of different classes with minute differences in details under the ZSD paradigm. We develop an effective method called MSHC for the FG-ZSD task, which is based on an improved two-stage detector and employs a multi-level semantics-aware embedding alignment loss, ensuring tight coupling between the visual and semantic spaces. Considering that existing ZSD datasets are not suitable for the new FG-ZSD task, we build the first FG-ZSD benchmark dataset FGZSD-Birds, which contains 148,820 images falling into 36 orders, 140 families, 579 genera and 1432 species. Extensive experiments on FGZSD-Birds show that our method outperforms existing ZSD models.

LGDec 28, 2024
Generative Regression Based Watch Time Prediction for Short-Video Recommendation

Hongxu Ma, Kai Tian, Tao Zhang et al.

Watch time prediction (WTP) has emerged as a pivotal task in short video recommendation systems, designed to quantify user engagement through continuous interaction modeling. Predicting users' watch times on videos often encounters fundamental challenges, including wide value ranges and imbalanced data distributions, which can lead to significant estimation bias when directly applying regression techniques. Recent studies have attempted to address these issues by converting the continuous watch time estimation into an ordinal regression task. While these methods demonstrate partial effectiveness, they exhibit notable limitations: (1) the discretization process frequently relies on bucket partitioning, inherently reducing prediction flexibility and accuracy and (2) the interdependencies among different partition intervals remain underutilized, missing opportunities for effective error correction. Inspired by language modeling paradigms, we propose a novel Generative Regression (GR) framework that reformulates WTP as a sequence generation task. Our approach employs \textit{structural discretization} to enable nearly lossless value reconstruction while maintaining prediction fidelity. Through carefully designed vocabulary construction and label encoding schemes, each watch time is bijectively mapped to a token sequence. To mitigate the training-inference discrepancy caused by teacher-forcing, we introduce a \textit{curriculum learning with embedding mixup} strategy that gradually transitions from guided to free-generation modes. We evaluate our method against state-of-the-art approaches on two public datasets and one industrial dataset. We also perform online A/B testing on the Kuaishou App to confirm the real-world effectiveness. The results conclusively show that GR outperforms existing techniques significantly.

LGMar 11, 2024
All in One: Multi-Task Prompting for Graph Neural Networks (Extended Abstract)

Xiangguo Sun, Hong Cheng, Jia Li et al.

This paper is an extended abstract of our original work published in KDD23, where we won the best research paper award (Xiangguo Sun, Hong Cheng, Jia Li, Bo Liu, and Jihong Guan. All in one: Multi-task prompting for graph neural networks. KDD 23) The paper introduces a novel approach to bridging the gap between pre-trained graph models and the diverse tasks they're applied to, inspired by the success of prompt learning in NLP. Recognizing the challenge of aligning pre-trained models with varied graph tasks (node level, edge level, and graph level), which can lead to negative transfer and poor performance, we propose a multi-task prompting method for graphs. This method involves unifying graph and language prompt formats, enabling NLP's prompting strategies to be adapted for graph tasks. By analyzing the task space of graph applications, we reformulate problems to fit graph-level tasks and apply meta-learning to improve prompt initialization for multiple tasks. Experiments show our method's effectiveness in enhancing model performance across different graph tasks. Beyond the original work, in this extended abstract, we further discuss the graph prompt from a bigger picture and provide some of the latest work toward this area.

LGDec 23, 2024
Fast Causal Discovery by Approximate Kernel-based Generalized Score Functions with Linear Computational Complexity

Yixin Ren, Haocheng Zhang, Yewei Xia et al.

Score-based causal discovery methods can effectively identify causal relationships by evaluating candidate graphs and selecting the one with the highest score. One popular class of scores is kernel-based generalized score functions, which can adapt to a wide range of scenarios and work well in practice because they circumvent assumptions about causal mechanisms and data distributions. Despite these advantages, kernel-based generalized score functions pose serious computational challenges in time and space, with a time complexity of $\mathcal{O}(n^3)$ and a memory complexity of $\mathcal{O}(n^2)$, where $n$ is the sample size. In this paper, we propose an approximate kernel-based generalized score function with $\mathcal{O}(n)$ time and space complexities by using low-rank technique and designing a set of rules to handle the complex composite matrix operations required to calculate the score, as well as developing sampling algorithms for different data types to benefit the handling of diverse data types efficiently. Our extensive causal discovery experiments on both synthetic and real-world data demonstrate that compared to the state-of-the-art method, our method can not only significantly reduce computational costs, but also achieve comparable accuracy, especially for large datasets.

CVOct 11, 2025
ImmerIris: A Large-Scale Dataset and Benchmark for Immersive Iris Recognition in Open Scenes

Yuxi Mi, Qiuyang Yuan, Zhizhou Zhong et al.

In egocentric applications such as augmented and virtual reality, immersive iris recognition is emerging as an accurate and seamless way to identify persons. While classic systems acquire iris images on-axis, i.e., via dedicated frontal sensors in controlled settings, the immersive setup primarily captures off-axis irises through tilt-placed headset cameras, with only mild control in open scenes. This yields unique challenges, including perspective distortion, intensified quality degradations, and intra-class variations in iris texture. Datasets capturing these challenges remain scarce. To fill this gap, this paper introduces ImmerIris, a large-scale dataset collected via VR headsets, containing 499,791 ocular images from 564 subjects. It is, to the best of current knowledge, the largest public dataset and among the first dedicated to off-axis acquisition. Based on ImmerIris, evaluation protocols are constructed to benchmark recognition methods under different challenging factors. Current methods, primarily designed for classic on-axis imagery, perform unsatisfactorily on the immersive setup, mainly due to reliance on fallible normalization. To this end, this paper further proposes a normalization-free paradigm that directly learns from ocular images with minimal adjustment. Despite its simplicity, this approach consistently outperforms normalization-based counterparts, pointing to a promising direction for robust immersive recognition.

GNAug 23, 2025
scI2CL: Effectively Integrating Single-cell Multi-omics by Intra- and Inter-omics Contrastive Learning

Wuchao Liu, Han Peng, Wengen Li et al.

Single-cell multi-omics data contain huge information of cellular states, and analyzing these data can reveal valuable insights into cellular heterogeneity, diseases, and biological processes. However, as cell differentiation \& development is a continuous and dynamic process, it remains challenging to computationally model and infer cell interaction patterns based on single-cell multi-omics data. This paper presents scI2CL, a new single-cell multi-omics fusion framework based on intra- and inter-omics contrastive learning, to learn comprehensive and discriminative cellular representations from complementary multi-omics data for various downstream tasks. Extensive experiments of four downstream tasks validate the effectiveness of scI2CL and its superiority over existing peers. Concretely, in cell clustering, scI2CL surpasses eight state-of-the-art methods on four widely-used real-world datasets. In cell subtyping, scI2CL effectively distinguishes three latent monocyte cell subpopulations, which are not discovered by existing methods. Simultaneously, scI2CL is the only method that correctly constructs the cell developmental trajectory from hematopoietic stem and progenitor cells to Memory B cells. In addition, scI2CL resolves the misclassification of cell types between two subpopulations of CD4+ T cells, while existing methods fail to precisely distinguish the mixed cells. In summary, scI2CL can accurately characterize cross-omics relationships among cells, thus effectively fuses multi-omics data and learns discriminative cellular representations to support various downstream analysis tasks.

LGMay 29, 2025
Score-based Generative Modeling for Conditional Independence Testing

Yixin Ren, Chenghou Jin, Yewei Xia et al.

Determining conditional independence (CI) relationships between random variables is a fundamental yet challenging task in machine learning and statistics, especially in high-dimensional settings. Existing generative model-based CI testing methods, such as those utilizing generative adversarial networks (GANs), often struggle with undesirable modeling of conditional distributions and training instability, resulting in subpar performance. To address these issues, we propose a novel CI testing method via score-based generative modeling, which achieves precise Type I error control and strong testing power. Concretely, we first employ a sliced conditional score matching scheme to accurately estimate conditional score and use Langevin dynamics conditional sampling to generate null hypothesis samples, ensuring precise Type I error control. Then, we incorporate a goodness-of-fit stage into the method to verify generated samples and enhance interpretability in practice. We theoretically establish the error bound of conditional distributions modeled by score-based generative models and prove the validity of our CI tests. Extensive experiments on both synthetic and real-world datasets show that our method significantly outperforms existing state-of-the-art methods, providing a promising way to revitalize generative model-based CI testing.

GNDec 27, 2023
scRNA-seq Data Clustering by Cluster-aware Iterative Contrastive Learning

Weikang Jiang, Jinxian Wang, Jihong Guan et al.

Single-cell RNA sequencing (scRNA-seq) enables researchers to analyze gene expression at single-cell level. One important task in scRNA-seq data analysis is unsupervised clustering, which helps identify distinct cell types, laying down the foundation for other downstream analysis tasks. In this paper, we propose a novel method called Cluster-aware Iterative Contrastive Learning (CICL in short) for scRNA-seq data clustering, which utilizes an iterative representation learning and clustering framework to progressively learn the clustering structure of scRNA-seq data with a cluster-aware contrastive loss. CICL consists of a Transformer encoder, a clustering head, a projection head and a contrastive loss module. First, CICL extracts the feature vectors of the original and augmented data by the Transformer encoder. Then, it computes the clustering centroids by K-means and employs the student t-distribution to assign pseudo-labels to all cells in the clustering head. The projection-head uses a Multi-Layer Perceptron (MLP) to obtain projections of the augmented data. At last, both pseudo-labels and projections are used in the contrastive loss to guide the model training. Such a process goes iteratively so that the clustering result becomes better and better. Extensive experiments on 25 real world scRNA-seq datasets show that CICL outperforms the SOTA methods. Concretely, CICL surpasses the existing methods by from 14% to 280%, and from 5% to 133% on average in terms of performance metrics ARI and NMI respectively.

AIFeb 9, 2022
Identifying Backdoor Attacks in Federated Learning via Anomaly Detection

Yuxi Mi, Yiheng Sun, Jihong Guan et al.

Federated learning has seen increased adoption in recent years in response to the growing regulatory demand for data privacy. However, the opaque local training process of federated learning also sparks rising concerns about model faithfulness. For instance, studies have revealed that federated learning is vulnerable to backdoor attacks, whereby a compromised participant can stealthily modify the model's behavior in the presence of backdoor triggers. This paper proposes an effective defense against the attack by examining shared model updates. We begin with the observation that the embedding of backdoors influences the participants' local model weights in terms of the magnitude and orientation of their model gradients, which can manifest as distinguishable disparities. We enable a robust identification of backdoors by studying the statistical distribution of the models' subsets of gradients. Concretely, we first segment the model gradients into fragment vectors that represent small portions of model parameters. We then employ anomaly detection to locate the distributionally skewed fragments and prune the participants with the most outliers. We embody the findings in a novel defense method, ARIBA. We demonstrate through extensive analyses that our proposed methods effectively mitigate state-of-the-art backdoor attacks with minimal impact on task utility.

LGMar 17, 2019
Learning Competitive and Discriminative Reconstructions for Anomaly Detection

Kai Tian, Shuigeng Zhou, Jianping Fan et al.

Most of the existing methods for anomaly detection use only positive data to learn the data distribution, thus they usually need a pre-defined threshold at the detection stage to determine whether a test instance is an outlier. Unfortunately, a good threshold is vital for the performance and it is really hard to find an optimal one. In this paper, we take the discriminative information implied in unlabeled data into consideration and propose a new method for anomaly detection that can learn the labels of unlabelled data directly. Our proposed method has an end-to-end architecture with one encoder and two decoders that are trained to model inliers and outliers' data distributions in a competitive way. This architecture works in a discriminative manner without suffering from overfitting, and the training algorithm of our model is adopted from SGD, thus it is efficient and scalable even for large-scale datasets. Empirical studies on 7 datasets including KDD99, MNIST, Caltech-256, and ImageNet etc. show that our model outperforms the state-of-the-art methods.

CVJun 22, 2018
Global Semantic Consistency for Zero-Shot Learning

Fan Wu, Kai Tian, Jihong Guan et al.

In image recognition, there are many cases where training samples cannot cover all target classes. Zero-shot learning (ZSL) utilizes the class semantic information to classify samples of the unseen categories that have no corresponding samples contained in the training set. In this paper, we propose an end-to-end framework, called Global Semantic Consistency Network (GSC-Net for short), which makes complete use of the semantic information of both seen and unseen classes, to support effective zero-shot learning. We also adopt a soft label embedding loss to further exploit the semantic relationships among classes. To adapt GSC-Net to a more practical setting, Generalized Zero-shot Learning (GZSL), we introduce a parametric novelty detection mechanism. Our approach achieves the state-of-the-art performance on both ZSL and GZSL tasks over three visual attribute datasets, which validates the effectiveness and advantage of the proposed framework.