Qinmu Peng

CV
h-index: 28
Papers: 31
Citations: 1,152
Novelty: 49%
AI Score: 50

31 Papers

CV · Mar 7, 2022 · Code
MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning

Shiming Chen, Ziming Hong, Guo-Sen Xie et al. · pku

The key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes, and thus achieve a desirable knowledge transfer to unseen classes. Prior works either simply align the global features of an image with its associated class semantic vector or utilize unidirectional attention to learn the limited latent semantic representations, which could not effectively discover the intrinsic semantic knowledge (e.g., attribute semantics) between visual and attribute features. To solve the above dilemma, we propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features for ZSL. MSDN incorporates an attribute$\rightarrow$visual attention sub-net that learns attribute-based visual features, and a visual$\rightarrow$attribute attention sub-net that learns visual-based attribute features. By further introducing a semantic distillation loss, the two mutual attention sub-nets are capable of learning collaboratively and teaching each other throughout the training process. The proposed MSDN yields significant improvements over the strong baselines, leading to new state-of-the-art performances on three popular challenging benchmarks, i.e., CUB, SUN, and AWA2. Our code is available at: \url{https://github.com/shiming-chen/MSDN}.
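The abstract does not spell out the form of the semantic distillation loss. As a minimal sketch of how two sub-nets can "teach each other", the snippet below penalizes disagreement between their class predictions with a Jensen-Shannon divergence plus a squared L2 term. The function names and the choice of JS + L2 are illustrative assumptions, not the authors' confirmed formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def mutual_distillation_loss(scores_a2v, scores_v2a):
    """Assumed loss: symmetric disagreement between the two sub-nets.

    scores_a2v: class scores from the attribute->visual sub-net (hypothetical)
    scores_v2a: class scores from the visual->attribute sub-net (hypothetical)
    """
    p = softmax(scores_a2v)
    q = softmax(scores_v2a)
    l2 = sum((x - y) ** 2 for x, y in zip(p, q))
    return js_divergence(p, q) + l2
```

Because the loss is symmetric in its two arguments, neither sub-net is a fixed teacher: each one is pulled toward the other's prediction during training.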

CV · Mar 27, 2023
Semantic-visual Guided Transformer for Few-shot Class-incremental Learning

Wenhao Qiu, Sichao Fu, Jingyi Zhang et al.

Few-shot class-incremental learning (FSCIL) has recently attracted extensive attention in various areas. Existing FSCIL methods depend heavily on the robustness of the feature backbone pre-trained on base classes. In recent years, different Transformer variants have made significant progress in feature representation learning across many fields. Nevertheless, the progress of the Transformer in FSCIL scenarios has not yet achieved the potential promised in other fields. In this paper, we develop a semantic-visual guided Transformer (SV-T) to enhance the feature extracting capacity of the pre-trained feature backbone on incremental classes. Specifically, we first utilize the visual (image) labels provided by the base classes to supervise the optimization of the Transformer. Then, a text encoder is introduced to automatically generate the corresponding semantic (text) labels for each image from the base classes. Finally, the constructed semantic labels are further applied to the Transformer to guide the updating of its hyperparameters. Our SV-T can take full advantage of more supervision information from base classes and further enhance the training robustness of the feature backbone. More importantly, our SV-T is an independent method, which can be directly applied to existing FSCIL architectures for acquiring embeddings of various incremental classes. Extensive experiments on three benchmarks, two FSCIL architectures, and two Transformer variants show that our proposed SV-T obtains a significant improvement in comparison to the existing state-of-the-art FSCIL methods.

LG · May 6, 2022
Deep Supervised Information Bottleneck Hashing for Cross-modal Retrieval based Computer-aided Diagnosis

Yufeng Shi, Shuhuang Chen, Xinge You et al.

Mapping X-ray images, radiology reports, and other medical data to binary codes in a common space can help clinicians retrieve pathology-related data from heterogeneous modalities (i.e., hashing-based cross-modal medical data retrieval) and provides a new view for promoting computer-aided diagnosis. Nevertheless, there remains a barrier to boosting medical retrieval accuracy: how to reveal the ambiguous semantics of medical data without the distraction of superfluous information. To circumvent this drawback, we propose Deep Supervised Information Bottleneck Hashing (DSIBH), which effectively strengthens the discriminability of hash codes. Specifically, the Deep Deterministic Information Bottleneck (Yu, Yu, and Principe 2021) for single modality is extended to the cross-modal scenario. Benefiting from this, the superfluous information is reduced, which facilitates the discriminability of hash codes. Experimental results demonstrate the superior accuracy of the proposed DSIBH compared with state-of-the-art methods in cross-modal medical data retrieval tasks.

CV · Apr 11, 2023
Another Vertical View: A Hierarchical Network for Heterogeneous Trajectory Prediction via Spectrums

Beihao Xia, Conghao Wong, Duanquan Xu et al.

With the fast development of AI-related techniques, the applications of trajectory prediction are no longer limited to easier scenes and trajectories. More and more trajectories with different forms, such as coordinates, bounding boxes, and even high-dimensional human skeletons, need to be analyzed and forecasted. Among these heterogeneous trajectories, interactions between different elements within a frame of trajectory, which we call "Dimension-wise Interactions", would be more complex and challenging. However, most previous approaches focus mainly on a specific form of trajectories, and potential dimension-wise interactions have received less attention. In this work, we expand the trajectory prediction task by introducing the trajectory dimensionality $M$, thus extending its application scenarios to heterogeneous trajectories. We first introduce the Haar transform as an alternative to the Fourier transform to better capture the time-frequency properties of each trajectory dimension. Then, we adopt the bilinear structure to model and fuse two factors simultaneously, including the time-frequency response and the dimension-wise interaction, to forecast heterogeneous trajectories via trajectory spectrums hierarchically in a generic way. Experiments show that the proposed model outperforms most state-of-the-art methods on ETH-UCY, SDD, nuScenes, and Human3.6M with heterogeneous trajectories, including 2D coordinates, 2D/3D bounding boxes, and 3D human skeletons.
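To make the Haar transform step concrete, here is a minimal sketch of a single-level orthonormal Haar transform applied independently to each of the $M$ trajectory dimensions. The function names and the single-level choice are assumptions for illustration; the abstract does not state how many levels the model uses or how the spectrums are fed to the network.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform on an even-length sequence.

    Returns (approximation, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Invert haar_step, recovering the original sequence."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


def transform_trajectory(trajectory):
    """Apply haar_step to each dimension of a trajectory.

    trajectory: list of frames, each frame a list of M values
    (e.g. M = 2 for 2D coordinates, M = 4 for a 2D bounding box).
    """
    dims = list(zip(*trajectory))  # regroup values by dimension
    return [haar_step(list(d)) for d in dims]
```

The approximation coefficients carry the coarse motion trend of a dimension, while the detail coefficients capture frame-to-frame variation localized in time, which is the property the Fourier transform lacks.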

CV · Mar 7, 2023
Filter Pruning based on Information Capacity and Independence

Xiaolong Tang, Shuo Ye, Yufeng Shi et al.

Filter pruning has gained widespread adoption for the purpose of compressing and speeding up convolutional neural networks (CNNs). However, existing approaches are still far from practical applications due to biased filter selection and heavy computation cost. This paper introduces a new filter pruning method that selects filters in an interpretable, multi-perspective, and lightweight manner. Specifically, we evaluate the contributions of filters from both individual and overall perspectives. For the amount of information contained in each filter, a new metric called information capacity is proposed. Inspired by the information theory, we utilize the interpretable entropy to measure the information capacity, and develop a feature-guided approximation process. For correlations among filters, another metric called information independence is designed. Since the aforementioned metrics are evaluated in a simple but effective way, we can identify and prune the least important filters with less computation cost. We conduct comprehensive experiments on benchmark datasets employing various widely-used CNN architectures to evaluate the performance of our method. For instance, on ILSVRC-2012, our method outperforms state-of-the-art methods by reducing FLOPs by 77.4% and parameters by 69.3% for ResNet-50 with only a minor decrease in accuracy of 2.64%.
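As a rough illustration of an entropy-based "information capacity" score, the sketch below estimates the Shannon entropy of each filter's activations with a histogram and marks the lowest-scoring filters for pruning. The histogram estimator, the bin count, and both function names are assumptions; the abstract does not give the exact metric or its feature-guided approximation, and the independence metric is not modeled here.

```python
import math


def histogram_entropy(values, bins=8):
    """Shannon entropy (in bits) of a histogram over the given activations."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant feature map carries no information
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the max value into the last bin
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def filters_to_prune(feature_maps, prune_ratio):
    """Return indices of the filters with the lowest entropy scores.

    feature_maps: one flat list of activations per filter (hypothetical layout).
    """
    scores = [histogram_entropy(fm) for fm in feature_maps]
    k = int(len(feature_maps) * prune_ratio)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:k])
```

Scoring each filter from its own activations needs only one forward pass over a small batch, which is in the spirit of the lightweight selection the abstract describes.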

LG · Sep 6, 2023
Towards Unsupervised Graph Completion Learning on Graphs with Features and Structure Missing

Sichao Fu, Qinmu Peng, Yang He et al.

In recent years, graph neural networks (GNN) have achieved significant developments in a variety of graph analytical tasks. Nevertheless, GNN's superior performance will suffer from serious damage when the collected node features or structure relationships are partially missing owing to numerous unpredictable factors. Recently emerged graph completion learning (GCL) has received increasing attention, which aims to reconstruct the missing node features or structure relationships under the guidance of a specifically supervised task. Although these proposed GCL methods have achieved great success, they still have the following problems: the reliance on labels, and the bias of the reconstructed node features and structure relationships. Besides, the generalization ability of the existing GCL still faces a huge challenge when both collected node features and structure relationships are partially missing at the same time. To solve the above issues, we propose a more general GCL framework with the aid of self-supervised learning for improving the task performance of the existing GNN variants on graphs with features and structure missing, termed unsupervised GCL (UGCL). Specifically, to avoid the mismatch between missing node features and structure during the message-passing process of GNN, we separate the feature reconstruction and structure reconstruction and design a personalized model for each in turn. Then, a dual contrastive loss on the structure level and feature level is introduced to maximize the mutual information of node representations from feature reconstructing and structure reconstructing paths for providing more supervision signals. Finally, the reconstructed node features and structure can be applied to the downstream node classification task. Extensive experiments on eight datasets, three GNN variants and five missing rates demonstrate the effectiveness of our proposed method.
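A contrastive loss between two reconstruction paths can be sketched with the standard InfoNCE objective, which is a lower bound on mutual information. The code below is an assumption about the general shape of such a loss, not the paper's dual loss: `view_a` and `view_b` stand for node representations from the feature and structure paths, and the temperature value is arbitrary.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def info_nce(view_a, view_b, temperature=0.5):
    """InfoNCE loss: node i in view_a should match node i in view_b.

    All other nodes in view_b act as negatives for node i.
    """
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        m = max(logits)
        # log-sum-exp over all candidates, computed stably
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n
```

Minimizing this value pushes the two paths to agree on each node while keeping different nodes apart, which supplies a training signal without any labels.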

CV · Sep 15, 2023
Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions

Tianxu Wu, Shuo Ye, Shuhuang Chen et al.

The challenge in fine-grained visual categorization lies in how to explore the subtle differences between different subclasses and achieve accurate discrimination. Previous research has relied on large-scale annotated data and pre-trained deep models to achieve the objective. However, when only a limited amount of samples is available, similar methods may become less effective. Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation. However, the high level of detail required for fine-grained images makes it challenging for existing methods to be directly employed. To address this issue, we propose a novel approach termed the detail reinforcement diffusion model (DRDM), which leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components including discriminative semantic recombination (DSR) and spatial knowledge reference (SKR). Specifically, DSR is designed to extract implicit similarity relationships from the labels and reconstruct the semantic mapping between labels and instances, which enables better discrimination of subtle differences between different subclasses. Furthermore, we introduce the SKR module, which incorporates the distributions of different datasets as references in the feature space. This allows the SKR to aggregate the high-dimensional distribution of subclass features in few-shot FGVC tasks, thus expanding the decision boundary. Through these two critical components, we effectively utilize the knowledge from large models to address the issue of data scarcity, resulting in improved performance for fine-grained visual recognition tasks. Extensive experiments demonstrate the consistent performance gain offered by our DRDM.

LG · Feb 16, 2023
Self-supervised Guided Hypergraph Feature Propagation for Semi-supervised Classification with Missing Node Features

Chengxiang Lei, Sichao Fu, Yuetian Wang et al.

Graph neural networks (GNNs) with missing node features have recently received increasing interest. Such missing node features seriously hurt the performance of the existing GNNs. Some recent methods have been proposed to reconstruct the missing node features by the information propagation among nodes with known and unknown attributes. Although these methods have achieved superior performance, how to exactly exploit the complex data correlations among nodes to reconstruct missing node features is still a great challenge. To solve the above problem, we propose a self-supervised guided hypergraph feature propagation (SGHFP). Specifically, the feature hypergraph is first generated according to the node features with missing information. Then, the reconstructed node features produced by the previous iteration are fed to a two-layer GNN to construct a pseudo-label hypergraph. Before each iteration, the constructed feature hypergraph and pseudo-label hypergraph are fused effectively, which can better preserve the higher-order data correlations among nodes. After that, we apply the fused hypergraph to the feature propagation for reconstructing missing features. Finally, the node features reconstructed by multi-iteration optimization are applied to the downstream semi-supervised classification task. Extensive experiments demonstrate that the proposed SGHFP outperforms existing methods for semi-supervised classification with missing node features.
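The core idea of feature propagation can be shown on an ordinary graph: unknown features are repeatedly replaced by the average of their neighbors while known features are clamped to their observed values. This sketch deliberately simplifies the method in the abstract; it uses a plain adjacency list instead of a fused hypergraph, scalar features, and no pseudo-label step, and the function name is hypothetical.

```python
def propagate_features(adjacency, features, known, iterations=50):
    """Fill in missing scalar node features by neighbor averaging.

    adjacency: list where adjacency[i] is the list of neighbors of node i
    features:  observed value per node (ignored where known[i] is False)
    known:     boolean mask, True where the feature was observed
    """
    n = len(features)
    x = [f if k else 0.0 for f, k in zip(features, known)]  # unknowns start at 0
    for _ in range(iterations):
        new_x = []
        for i in range(n):
            if known[i]:
                new_x.append(features[i])  # clamp observed features
                continue
            neighbors = adjacency[i]
            if neighbors:
                new_x.append(sum(x[j] for j in neighbors) / len(neighbors))
            else:
                new_x.append(x[i])  # isolated node: nothing to propagate
        x = new_x
    return x
```

On a path graph whose endpoints are known, the missing interior values converge to a linear interpolation between the endpoints, which is the smoothness assumption that propagation-based reconstruction relies on.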

CV · Aug 4, 2024
What Happens Without Background? Constructing Foreground-Only Data for Fine-Grained Tasks

Yuetian Wang, Wenjin Hou, Qinmu Peng et al.

Fine-grained recognition, a pivotal task in visual signal processing, aims to distinguish between similar subclasses based on discriminative information present in samples. However, prevailing methods often erroneously focus on background areas, neglecting the capture of genuinely effective discriminative information from the subject, thus impeding practical application. To facilitate research into the impact of background noise on models and enhance their ability to concentrate on the subject's discriminative features, we propose an engineered pipeline that leverages the capabilities of SAM and Detic to create fine-grained datasets with only foreground subjects, devoid of background. Extensive cross-experiments validate this approach as a preprocessing step prior to training, enhancing algorithmic performance and holding potential for further modal expansion of the data.

CV · Sep 26, 2022
Deep Manifold Hashing: A Divide-and-Conquer Approach for Semi-Paired Unsupervised Cross-Modal Retrieval

Yufeng Shi, Xinge You, Jiamiao Xu et al.

Hashing that projects data into binary codes has shown strong performance in cross-modal retrieval due to its low storage usage and high query speed. Despite their empirical success in some scenarios, existing cross-modal hashing methods usually fail to cross the modality gap when fully-paired data with plenty of labeled information is nonexistent. To circumvent this drawback, motivated by the Divide-and-Conquer strategy, we propose Deep Manifold Hashing (DMH), a novel method of dividing the problem of semi-paired unsupervised cross-modal retrieval into three sub-problems and building one simple yet efficient model for each sub-problem. Specifically, the first model is constructed for obtaining modality-invariant features by complementing semi-paired data based on manifold learning, whereas the second model and the third model aim to learn hash codes and hash functions respectively. Extensive experiments on three benchmarks demonstrate the superiority of our DMH compared with the state-of-the-art fully-paired and semi-paired unsupervised cross-modal hashing methods.
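The storage and speed benefits of hashing come from comparing short binary codes with the Hamming distance. The following is a generic sketch of that retrieval step only, assuming embeddings from both modalities already live in a common space; sign-based binarization and the helper names are illustrative assumptions and do not reproduce DMH's learned hash functions.

```python
def to_hash_code(embedding):
    """Binarize a real-valued embedding by the sign of each component."""
    return [1 if v >= 0 else 0 for v in embedding]


def hamming_distance(code_a, code_b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def retrieve(query_embedding, database, top_k=1):
    """Rank database items by Hamming distance to the query code.

    database: list of (name, embedding) pairs, possibly from another modality.
    """
    q = to_hash_code(query_embedding)
    ranked = sorted(
        database,
        key=lambda item: hamming_distance(q, to_hash_code(item[1])),
    )
    return [name for name, _ in ranked[:top_k]]
```

In practice the database codes would be precomputed and packed into machine words, so each comparison reduces to an XOR and a bit count.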

43.5 · CV · Apr 7
WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

Yizhuo Xu, Chaojian Yu, Yuanjie Shao et al.

Composed Image Retrieval (CIR) task aims to retrieve target images based on reference images and modification texts. Current CIR methods primarily rely on fine-tuning vision-language pre-trained models. However, we find that these approaches commonly suffer from severe overfitting, posing challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during the fine-tuning process, we apply adversarial perturbations to the model weights for regularization, where these perturbations are generated in the opposite direction of gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
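The weight-regularization idea, perturbing weights against the descent direction before computing the update, can be sketched on a one-parameter regression problem. Everything below is an illustrative assumption (the scalar model, the sign-based perturbation, and the `gamma` and `lr` values); the abstract does not specify how WRF4CIR scales or applies its perturbations.

```python
def mse_and_grad(w, data):
    """Mean squared error of y = w * x and its gradient with respect to w."""
    n = len(data)
    loss = sum((w * x - y) ** 2 for x, y in data) / n
    grad = sum(2 * (w * x - y) * x for x, y in data) / n
    return loss, grad


def perturbed_step(w, data, lr=0.1, gamma=0.05):
    """One update using the gradient evaluated at adversarially perturbed weights."""
    _, grad = mse_and_grad(w, data)
    direction = 1.0 if grad > 0 else -1.0 if grad < 0 else 0.0
    # Step along the gradient (ascent), i.e. opposite to gradient descent,
    # so the perturbed weights fit the training data slightly worse.
    w_adv = w + gamma * direction
    _, grad_adv = mse_and_grad(w_adv, data)
    # Apply the perturbed gradient to the original, unperturbed weights.
    return w - lr * grad_adv
```

Because the gradient is taken at a point where the loss is higher, the optimizer is discouraged from settling into sharp minima that fit the few training triplets too closely.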

CV · Dec 3, 2021 · Code
TransZero: Attribute-guided Transformer for Zero-Shot Learning

Shiming Chen, Ziming Hong, Yang Liu et al.

Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen ones. Semantic knowledge is learned from attribute descriptions shared between different classes, which act as strong priors for localizing object attributes that represent discriminative region features, enabling significant visual-semantic interaction. Although some attention-based models have attempted to learn such region features in a single image, the transferability and discriminative attribute localization of visual features are typically neglected. In this paper, we propose an attribute-guided Transformer network, termed TransZero, to refine visual features and learn attribute localization for discriminative visual embedding representations in ZSL. Specifically, TransZero takes a feature augmentation encoder to alleviate the cross-dataset bias between ImageNet and ZSL benchmarks, and improves the transferability of visual features by reducing the entangled relative geometry relationships among region features. To learn locality-augmented visual features, TransZero employs a visual-semantic decoder to localize the image regions most relevant to each attribute in a given image, under the guidance of semantic attribute information. Then, the locality-augmented visual features and semantic vectors are used to conduct effective visual-semantic interaction in a visual-semantic embedding network. Extensive experiments show that TransZero achieves the new state of the art on three ZSL benchmarks. The codes are available at: \url{https://github.com/shiming-chen/TransZero}.

CV · Sep 30, 2021 · Code
HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

Shiming Chen, Guo-Sen Xie, Yang Liu et al.

Zero-shot learning (ZSL) tackles the unseen class recognition problem, transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains by adopting a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we take two task-specific encoders to encode the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus making the visual and semantic feature manifolds more closely aligned. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions using a common encoder. Finally, the structure and distribution adaptation are derived in a unified framework under two partially-aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at \url{https://github.com/shiming-chen/HSVA} .
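For the distribution adaptation step, the 2-Wasserstein distance between two Gaussians has a closed form. The sketch below assumes diagonal covariances, as is typical for variational autoencoder latents; the abstract only says "multivariate Gaussian", so the diagonal restriction and the function name are assumptions.

```python
import math


def wasserstein2_diag_gaussians(mu_a, std_a, mu_b, std_b):
    """2-Wasserstein distance between N(mu_a, diag(std_a^2)) and N(mu_b, diag(std_b^2)).

    For diagonal covariances the squared distance is
    ||mu_a - mu_b||^2 + ||std_a - std_b||^2.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(std_a, std_b))
    return math.sqrt(mean_term + std_term)
```

Unlike the KL divergence, this distance stays finite and informative even when the two latent distributions barely overlap, which makes it a convenient alignment objective.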

CV · Jul 29, 2021 · Code
FREE: Feature Refinement for Generalized Zero-Shot Learning

Shiming Chen, Wenjie Wang, Beihao Xia et al.

Generalized zero-shot learning (GZSL) has achieved significant progress, with many efforts dedicated to overcoming the problems of visual-semantic domain gap and seen-unseen bias. However, most existing methods directly use feature extraction models trained on ImageNet alone, ignoring the cross-dataset bias between ImageNet and GZSL benchmarks. Such a bias inevitably results in poor-quality visual features for GZSL tasks, which potentially limits the recognition performance on both seen and unseen classes. In this paper, we propose a simple yet effective GZSL method, termed feature refinement for generalized zero-shot learning (FREE), to tackle the above problem. FREE employs a feature refinement (FR) module that incorporates \textit{semantic$\rightarrow$visual} mapping into a unified generative model to refine the visual features of seen and unseen class samples. Furthermore, we propose a self-adaptive margin center loss (SAMC-loss) that cooperates with a semantic cycle-consistency loss to guide FR to learn class- and semantically-relevant representations, and concatenate the features in FR to extract the fully refined features. Extensive experiments on five benchmark datasets demonstrate the significant performance gain of FREE over its baseline and current state-of-the-art methods. Our codes are available at https://github.com/shiming-chen/FREE .

LG · Oct 11, 2024
Towards Cross-domain Few-shot Graph Anomaly Detection

Jiazhen Chen, Sichao Fu, Zhibin Zhang et al.

Few-shot graph anomaly detection (GAD) has recently garnered increasing attention, which aims to discern anomalous patterns among abundant unlabeled test nodes under the guidance of a limited number of labeled training nodes. Existing few-shot GAD approaches typically adopt meta-training methods trained on richly labeled auxiliary networks to facilitate rapid adaptation to target networks that possess sparse labels. However, these proposed methods often assume that the auxiliary and target networks follow the same data distribution, an assumption that rarely holds in practical settings. This paper explores a more prevalent and complex scenario of cross-domain few-shot GAD, where the goal is to identify anomalies within sparsely labeled target graphs using auxiliary graphs from a related, yet distinct domain. The challenge here is nontrivial owing to inherent data distribution discrepancies between the source and target domains, compounded by the uncertainties of sparse labeling in the target domain. In this paper, we propose a simple and effective framework, termed CDFS-GAD, specifically designed to tackle the aforementioned challenges. CDFS-GAD first introduces a domain-adaptive graph contrastive learning module, which is aimed at enhancing cross-domain feature alignment. Then, a prompt tuning module is further designed to extract domain-specific features tailored to each domain. Moreover, a domain-adaptive hypersphere classification loss is proposed to enhance the discrimination between normal and anomalous instances under minimal supervision, utilizing domain-sensitive norms. Lastly, a self-training strategy is introduced to further refine the predicted scores, enhancing their reliability in few-shot settings. Extensive experiments on twelve real-world cross-domain data pairs demonstrate the effectiveness of the proposed CDFS-GAD framework in comparison to various existing GAD methods.
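A hypersphere classification loss pulls normal samples toward a center and pushes labeled anomalies away from it. The sketch below follows the commonly used form from the one-class classification literature (squared distance for normal samples, a log term for anomalies). It is an assumption about the base loss only; the domain-adaptive part with domain-sensitive norms described in the abstract is not modeled.

```python
import math


def hypersphere_loss(embedding, center, is_anomaly, eps=1e-9):
    """Loss for one sample relative to a hypersphere center.

    Normal samples are penalized by their squared distance to the center.
    Anomalies are penalized when they fall close to the center.
    """
    dist_sq = sum((z - c) ** 2 for z, c in zip(embedding, center))
    if is_anomaly:
        # exp(-dist_sq) acts as the probability of being normal;
        # the clamp avoids log(0) when the anomaly sits on the center.
        return -math.log(max(1.0 - math.exp(-dist_sq), eps))
    return dist_sq
```

At test time, the distance to the center itself can serve as the anomaly score, so no separate classifier head is required.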

LG · Jan 25, 2025
Semi-supervised Anomaly Detection with Extremely Limited Labels in Dynamic Graphs

Jiazhen Chen, Sichao Fu, Zheng Ma et al.

Semi-supervised graph anomaly detection (GAD) has recently received increasing attention, which aims to distinguish anomalous patterns from graphs under the guidance of a moderate amount of labeled data and a large volume of unlabeled data. Although these proposed semi-supervised GAD methods have achieved great success, their superior performance will be seriously degraded when the provided labels are extremely limited due to some unpredictable factors. Besides, the existing methods primarily focus on anomaly detection in static graphs, and little effort has been made to consider the continuous evolution of graphs over time (dynamic graphs). To address these challenges, we propose a novel GAD framework (EL$^{2}$-DGAD) to tackle the anomaly detection problem in dynamic graphs with extremely limited labels. Specifically, a transformer-based graph encoder model is designed to more effectively preserve evolving graph structures beyond the local neighborhood. Then, we incorporate an ego-context hypersphere classification loss to classify temporal interactions according to their structure and temporal neighborhoods while ensuring the normal samples are mapped compactly against anomalous data. Finally, the above loss is further augmented with an ego-context contrasting module which utilizes unlabeled data to enhance model generalization. Extensive experiments on four datasets and three label rates demonstrate the effectiveness of the proposed method in comparison to the existing GAD methods.

74.7 · CR · Apr 7
Stealthy and Adjustable Text-Guided Backdoor Attacks on Multimodal Pretrained Models

Yiyang Zhang, Chaojian Yu, Ziming Hong et al.

Multimodal pretrained models are vulnerable to backdoor attacks, yet most existing methods rely on visual or multimodal triggers, which are impractical since visually embedded triggers rarely occur in real-world data. To overcome this limitation, we propose a novel Text-Guided Backdoor (TGB) attack on multimodal pretrained models, where commonly occurring words in textual descriptions serve as backdoor triggers, significantly improving stealthiness and practicality. Furthermore, we introduce visual adversarial perturbations on poisoned samples to modulate the model's learning of textual triggers, enabling a controllable and adjustable TGB attack. Extensive experiments on downstream tasks built upon multimodal pretrained models, including Composed Image Retrieval (CIR) and Visual Question Answering (VQA), demonstrate that TGB achieves practicality and stealthiness with adjustable attack success rates across diverse realistic settings, revealing critical security vulnerabilities in multimodal pretrained models.

CV · Aug 11, 2025
Prototype-Guided Curriculum Learning for Zero-Shot Learning

Lei Wang, Shiming Chen, Guo-Sen Xie et al.

In Zero-Shot Learning (ZSL), embedding-based methods enable knowledge transfer from seen to unseen classes by learning a visual-semantic mapping from seen-class images to class-level semantic prototypes (e.g., attributes). However, these semantic prototypes are manually defined and may introduce noisy supervision for two main reasons: (i) instance-level mismatch: variations in perspective, occlusion, and annotation bias will cause discrepancies between individual samples and the class-level semantic prototypes; and (ii) class-level imprecision: the manually defined semantic prototypes may not accurately reflect the true semantics of the class. Consequently, the visual-semantic mapping will be misled, reducing the effectiveness of knowledge transfer to unseen classes. In this work, we propose a prototype-guided curriculum learning framework (dubbed CLZSL), which mitigates instance-level mismatches through a Prototype-Guided Curriculum Learning (PCL) module and addresses class-level imprecision via a Prototype Update (PUP) module. Specifically, the PCL module prioritizes samples with high cosine similarity between their visual mappings and the class-level semantic prototypes, and progressively advances to less-aligned samples, thereby reducing the interference of instance-level mismatches to achieve accurate visual-semantic mapping. Besides, the PUP module dynamically updates the class-level semantic prototypes by leveraging the visual mappings learned from instances, thereby reducing class-level imprecision and further improving the visual-semantic mapping. Experiments were conducted on standard benchmark datasets (AWA2, SUN, and CUB) to verify the effectiveness of our method.
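The curriculum in the PCL module can be sketched as a sample-selection rule: rank samples by cosine similarity to their class prototype and admit a growing fraction as training proceeds. The linear pacing schedule, the `start_fraction` parameter, and the function names below are assumptions for illustration; the abstract does not describe the pacing function, and the prototype update (PUP) is not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def curriculum_subset(visual_mappings, prototypes, labels, epoch, total_epochs,
                      start_fraction=0.5):
    """Indices of the samples to train on at the given epoch.

    Well-aligned samples come first; less-aligned ones are added over time.
    """
    sims = [cosine(v, prototypes[y]) for v, y in zip(visual_mappings, labels)]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    # Assumed linear pacing from start_fraction at epoch 0 to 1.0 at the last epoch.
    progress = min(1.0, epoch / max(1, total_epochs - 1))
    fraction = start_fraction + (1.0 - start_fraction) * progress
    keep = max(1, math.ceil(fraction * len(order)))
    return order[:keep]
```

Early epochs therefore see only samples that agree with their prototype, and the noisier, mismatched instances enter once the mapping has stabilized.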

CV · Dec 3, 2024
Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction

Ziqian Zou, Conghao Wong, Beihao Xia et al.

Understanding and anticipating human movement has become more critical and challenging in diverse applications such as autonomous driving and surveillance. The complex interactions brought by different relations between agents are a crucial reason that poses challenges to this task. Researchers have put much effort into designing a system using rule-based or data-based models to extract and validate the patterns between pedestrian trajectories and these interactions, which has not been adequately addressed yet. Inspired by how humans perceive social interactions at different levels of relation to themselves, this work proposes the GrouP ConCeption (GPCC for short) model composed of the Group method, which categorizes nearby agents into either group members or non-group members based on a long-term distance kernel function, and the Conception module, which perceives both visual and acoustic information surrounding the target agent. Evaluated across multiple datasets, the GPCC model demonstrates significant improvements in trajectory prediction accuracy, validating its effectiveness in modeling both social and individual dynamics. The qualitative analysis also indicates that the GPCC framework leverages grouping and perception cues in an intuitive, human-like way, supporting the proposed model's explainability in pedestrian trajectory forecasting.
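The grouping step can be illustrated with a simple stand-in for the "long-term distance kernel": an agent counts as a group member if its mean distance to the target over all observed frames stays below a threshold. The mean-distance rule, the threshold value, and the function name are assumptions; the abstract does not define the actual kernel function.

```python
import math


def group_members(target_track, neighbor_tracks, threshold=2.0):
    """Names of neighbors that stay close to the target over the whole observation.

    target_track:    list of (x, y) positions, one per observed frame
    neighbor_tracks: dict mapping a neighbor name to a track of the same length
    """
    members = []
    for name, track in neighbor_tracks.items():
        mean_dist = sum(
            math.dist(p, q) for p, q in zip(target_track, track)
        ) / len(target_track)
        if mean_dist < threshold:
            members.append(name)
    return members
```

Averaging over the full observation window is what makes the criterion "long-term": a passerby who is briefly close but then diverges is not counted as walking with the target.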

CV · Feb 17, 2022
CSCNet: Contextual Semantic Consistency Network for Trajectory Prediction in Crowded Spaces

Beihao Xia, Conghao Wong, Qinmu Peng et al.

Trajectory prediction aims to predict the movement trend of agents such as pedestrians, bikers, and vehicles. It is helpful for analyzing and understanding human activities in crowded spaces and is widely applied in many areas such as surveillance video analysis and autonomous driving systems. Thanks to the success of deep learning, trajectory prediction has made significant progress. The current methods are dedicated to studying the agents' future trajectories under the social interaction and the sceneries' physical constraints. Moreover, how to deal with these factors still catches researchers' attention. However, they ignore the \textbf{Semantic Shift Phenomenon} when modeling these interactions in various prediction sceneries. There exist several kinds of semantic deviations within or between social and physical interactions, which we call the "\textbf{Gap}". In this paper, we propose a \textbf{C}ontextual \textbf{S}emantic \textbf{C}onsistency \textbf{Net}work (\textbf{CSCNet}) to predict agents' future activities with powerful and efficient context constraints. We utilize a well-designed context-aware transfer to obtain the intermediate representations from the scene images and trajectories. Then we eliminate the differences between social and physical interactions by aligning activity semantics and scene semantics to cross the Gap. Experiments demonstrate that CSCNet performs better than most of the current methods quantitatively and qualitatively.

CVOct 14, 2021
View Vertically: A Hierarchical Network for Trajectory Prediction via Fourier Spectrums

Conghao Wong, Beihao Xia, Ziming Hong et al.

Understanding and forecasting future trajectories of agents are critical for behavior analysis, robot navigation, autonomous cars, and other related applications. Previous methods mostly treat trajectory prediction as time sequence generation. In contrast, this work studies agents' trajectories in a "vertical" view, i.e., modeling and forecasting trajectories from the spectral domain. Different frequency bands in the trajectory spectrums could hierarchically reflect agents' motion preferences at different scales. The low-frequency and high-frequency portions could represent their coarse motion trends and fine motion variations, respectively. Accordingly, we propose a hierarchical network V$^2$-Net, which contains two sub-networks, to hierarchically model and predict agents' trajectories with trajectory spectrums. The coarse-level keypoints estimation sub-network first predicts the "minimal" spectrums of agents' trajectories on several "key" frequency portions. Then the fine-level spectrum interpolation sub-network interpolates the spectrums to reconstruct the final predictions. Experimental results demonstrate the competitiveness and superiority of V$^2$-Net on both the ETH-UCY benchmark and the Stanford Drone Dataset.
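The claim that low-frequency portions carry the coarse motion trend can be checked with a plain discrete Fourier transform. The sketch below is not V$^2$-Net and involves no learning; it only transforms one coordinate of a trajectory, keeps the lowest frequency bins (plus their conjugate mirrors, so the result stays real), and transforms back. The function names `dft` and `low_pass` and the sample coordinates are illustrative assumptions.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform of a real-valued sequence."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]

def low_pass(signal, keep):
    """Rebuild `signal` from the bins whose frequency index is below `keep`.

    Bin k and bin n - k describe the same frequency for real input, so both
    are kept or dropped together.
    """
    n = len(signal)
    kept = [c if min(k, n - k) < keep else 0 for k, c in enumerate(dft(signal))]
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(kept)).real / n
        for t in range(n)
    ]

# x coordinate of a pedestrian over 8 frames: steady walk plus small jitter.
xs = [0.0, 1.0, 2.2, 2.9, 4.1, 5.0, 6.2, 6.9]
coarse = low_pass(xs, keep=2)            # mean plus the slowest oscillation
exact = low_pass(xs, keep=len(xs) // 2 + 1)  # all bins, recovers the input
```

Applying the same function to the y coordinate gives the second half of a 2D trajectory; increasing `keep` moves from the coarse trend toward the fine variations.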

CVJul 2, 2021
MSN: Multi-Style Network for Trajectory Prediction

Conghao Wong, Beihao Xia, Qinmu Peng et al.

Trajectory prediction aims to forecast agents' possible future locations considering their observations along with the video context. It is strongly needed by many autonomous platforms like tracking, detection, robot navigation, and self-driving cars. Agents' internal personality factors, their interactive behaviors with neighbors, and the influence of their surroundings all impact agents' future planning. However, many previous methods model and predict agents' behaviors with the same strategy or feature distribution, making it difficult for them to produce predictions with sufficient style differences. This paper proposes the Multi-Style Network (MSN), which uses two sub-networks for style proposal and stylized prediction to adaptively provide multi-style predictions in a novel categorical way. The proposed network contains a series of style channels, and each channel is bound to a unique and specific behavior style. We use agents' end-point plannings and their interaction context as the basis for the behavior classification, so as to adaptively learn multiple diverse behavior styles through these channels. Then, we assume that the target agents may plan their future behaviors according to each of these categorized styles, thus utilizing different style channels to make predictions with significant style differences in parallel. Experiments show that the proposed MSN outperforms current state-of-the-art methods by up to 10% quantitatively on two widely used datasets, and presents better multi-style characteristics qualitatively.

CVOct 8, 2020
BGM: Building a Dynamic Guidance Map without Visual Images for Trajectory Prediction

Beihao Xia, Conghao Wong, Heng Li et al.

Visual images usually contain the informative context of the environment, thereby helping to predict agents' behaviors. However, they can hardly capture dynamic effects on agents' actual behaviors because their semantics are relatively fixed. To solve this problem, we propose a deterministic model named BGM that constructs a guidance map to represent the dynamic semantics; this avoids the need for visual images and reflects, for each agent, how activities differ across periods. We first record all agents' activities in the scene within a recent period to construct a guidance map and then feed it to a Context CNN to obtain their context features. We adopt a Historical Trajectory Encoder to extract the trajectory features and then combine them with the context features as the input of the social-energy-based trajectory decoder, thus obtaining predictions that meet the social rules. Experiments demonstrate that BGM achieves state-of-the-art prediction accuracy on the two widely used ETH and UCY datasets and handles more complex scenarios.
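One way to picture a guidance map built without images is as an occupancy grid over recently visited positions. The abstract does not give the construction details, so the sketch below is only an assumption-laden illustration: it bins recent agent positions into a fixed grid and normalizes the counts. The function name `build_guidance_map`, the grid size, and the scene bounds are all hypothetical.

```python
def build_guidance_map(recent_positions, bounds, grid=(4, 4)):
    """Bin recent (x, y) positions into a normalized occupancy grid.

    Assumption: the map is a simple visit-frequency grid; positions outside
    `bounds` are ignored. Rows index y, columns index x.
    """
    (x_min, x_max), (y_min, y_max) = bounds
    rows, cols = grid
    counts = [[0] * cols for _ in range(rows)]
    for x, y in recent_positions:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue
        row = int((y - y_min) / (y_max - y_min) * rows)
        col = int((x - x_min) / (x_max - x_min) * cols)
        counts[row][col] += 1
    total = sum(map(sum, counts))
    if total == 0:
        return [[0.0] * cols for _ in range(rows)]
    return [[c / total for c in row] for row in counts]

# Positions observed in the last few frames; the final one lies outside the scene.
recent = [(0.5, 0.5), (0.6, 0.4), (3.5, 0.5), (9.0, 9.0)]
guidance = build_guidance_map(recent, bounds=((0.0, 4.0), (0.0, 4.0)))
```

Because the grid is rebuilt from a sliding window of recent activity, it changes over time, unlike a static scene image. In the pipeline described above, such a map would then be passed to a CNN to extract context features.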

CVMar 22, 2020
Modal Regression based Structured Low-rank Matrix Recovery for Multi-view Learning

Jiamiao Xu, Fangzhao Wang, Qinmu Peng et al.

Low-rank Multi-view Subspace Learning (LMvSL) has shown great potential in cross-view classification in recent years. Despite their empirical success, existing LMvSL-based methods cannot handle view discrepancy and discriminancy well simultaneously, which leads to performance degradation when there is a large discrepancy among multi-view data. To circumvent this drawback, motivated by block-diagonal representation learning, we propose Structured Low-rank Matrix Recovery (SLMR), a unique method that effectively removes view discrepancy and improves discriminancy through the recovery of a structured low-rank matrix. Furthermore, recent low-rank modeling provides a satisfactory solution for data contaminated by noise that follows a predefined distribution, such as a Gaussian or Laplacian distribution. However, these models are not practical, since complicated noise in practice may violate those assumptions and the distribution is generally unknown in advance. To alleviate this limitation, modal regression is incorporated into the framework of SLMR (termed MR-SLMR). Different from previous LMvSL-based methods, our MR-SLMR can handle any zero-mode noise variable, which covers a wide range of noise, such as Gaussian noise, random noise, and outliers. The alternating direction method of multipliers (ADMM) framework and half-quadratic theory are used to efficiently optimize MR-SLMR. Experimental results on four public databases demonstrate the superiority of MR-SLMR and its robustness to complicated noise.

CVMar 13, 2020
A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction

Beihao Xia, Conghao Wang, Qinmu Peng et al.

It remains challenging to automatically predict multi-agent trajectories due to multiple interactions, including agent-to-agent and scene-to-agent interactions. Although recent methods have achieved promising performance, most of them consider only the spatial influence of the interactions and ignore the fact that temporal influence always accompanies spatial influence. Moreover, methods based on scene information always require extra segmented scene images to generate multiple socially acceptable trajectories. To address these limitations, we propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC). First, a spatial-temporal attention mechanism is presented to explore the most useful and important information. Second, we construct a joint feature sequence based on the sequence and instant state information so that the generated trajectories maintain spatial continuity. Experiments are performed on the two widely used ETH-UCY datasets and demonstrate that the proposed model achieves state-of-the-art prediction accuracy and handles more complex scenarios.

CVNov 11, 2019
Kernelized Similarity Learning and Embedding for Dynamic Texture Synthesis

Shiming Chen, Peng Zhang, Guo-Sen Xie et al.

Dynamic texture (DT) exhibits statistical stationarity in the spatial domain and stochastic repetitiveness in the temporal dimension, indicating that different frames of DT possess a high similarity correlation that is critical prior knowledge. However, existing methods cannot effectively learn a promising synthesis model for high-dimensional DT from a small amount of training data. In this paper, we propose a novel DT synthesis method, which makes full use of similarity prior knowledge to address this issue. Our method is based on the proposed kernel similarity embedding, which not only mitigates the high-dimensionality and small-sample issues, but also has the advantage of modeling nonlinear feature relationships. Specifically, we first raise two hypotheses that are essential for a DT model to generate new frames using similarity correlation. Then, we integrate kernel learning and extreme learning machine into a unified synthesis model to learn kernel similarity embedding for representing DT. Extensive experiments on DT videos collected from the internet and two benchmark datasets, i.e., Gatech Graphcut Textures and Dyntex, demonstrate that the learned kernel similarity embedding can effectively exhibit the discriminative representation for DT. Accordingly, our method is capable of preserving the long-term temporal continuity of the synthesized DT sequences with excellent sustainability and generalization. Meanwhile, it effectively generates realistic DT videos with fast speed and low computation, compared with the state-of-the-art methods. The code and more synthesis videos are available at our project page https://shiming-chen.github.io/Similarity-page/Similarit.html.

CVMay 29, 2019
Closed-Loop Adaptation for Weakly-Supervised Semantic Segmentation

Zhengqiang Zhang, Shujian Yu, Shi Yin et al.

Weakly-supervised semantic segmentation aims to assign each pixel a semantic category under weak supervision, such as image-level tags. Most existing weakly-supervised semantic segmentation methods do not use any feedback from the segmentation output and can be considered open-loop systems. They are prone to accumulated errors because of the static seeds and the sensitive structure information. In this paper, we propose a generic self-adaptation mechanism for existing weakly-supervised semantic segmentation methods by introducing two feedback chains, thus constituting a closed-loop system. Specifically, the first chain iteratively produces dynamic seeds by incorporating cross-image structure information, whereas the second chain further expands seed regions by a customized random walk process to reconcile inner-image structure information characterized by superpixels. Experiments on PASCAL VOC 2012 suggest that our network outperforms state-of-the-art methods with significantly less computational and memory burden.

CVMar 21, 2019
Fast and accurate reconstruction of HARDI using a 1D encoder-decoder convolutional network

Shi Yin, Zhengqiang Zhang, Qinmu Peng et al.

High angular resolution diffusion imaging (HARDI) demands a larger number of data measurements than diffusion tensor imaging, restricting its use in practice. In this work, we explore a learning-based approach to reconstruct HARDI from a smaller number of measurements in q-space. The approach aims to directly learn the mapping between the measured and HARDI signals from HARDI acquisitions collected from other subjects. Specifically, the mapping is represented as a 1D encoder-decoder convolutional neural network under the guidance of compressed sensing (CS) theory for HARDI reconstruction. The proposed network architecture mainly consists of two parts: an encoder network that produces the sparse coefficients and a decoder network that yields the reconstruction result. Experimental results demonstrate that we can robustly reconstruct HARDI signals accurately and quickly.

CVJan 5, 2019
Fully-automatic segmentation of kidneys in clinical ultrasound images using a boundary distance regression network

Shi Yin, Zhengqiang Zhang, Hongming Li et al.

It remains challenging to automatically segment kidneys in clinical ultrasound images due to the kidneys' varied shapes and image intensity distributions, although semi-automatic methods have achieved promising performance. In this study, we developed a novel boundary distance regression deep neural network to segment the kidneys, informed by the fact that the kidney boundaries are relatively consistent across images in terms of their appearance. Particularly, we first use deep neural networks pre-trained for classification of natural images to extract high-level image features from ultrasound images, then these feature maps are used as input to learn kidney boundary distance maps using a boundary distance regression network, and finally the predicted boundary distance maps are classified as kidney pixels or non-kidney pixels using a pixel classification network in an end-to-end learning fashion. Experimental results have demonstrated that our method could effectively improve the performance of automatic kidney segmentation, significantly better than deep learning based pixel classification networks.

CVNov 12, 2018
Automatic kidney segmentation in ultrasound images using subsequent boundary distance regression and pixelwise classification networks

Shi Yin, Qinmu Peng, Hongming Li et al.

It remains challenging to automatically segment kidneys in clinical ultrasound (US) images due to the kidneys' varied shapes and image intensity distributions, although semi-automatic methods have achieved promising performance. In this study, we propose subsequent boundary distance regression and pixel classification networks to segment the kidneys, informed by the fact that the kidney boundaries have relatively homogeneous texture patterns across images. Particularly, we first use deep neural networks pre-trained for classification of natural images to extract high-level image features from US images, then these features are used as input to learn kidney boundary distance maps using a boundary distance regression network, and finally the predicted boundary distance maps are classified as kidney pixels or non-kidney pixels using a pixel classification network in an end-to-end learning fashion. We also adopted a data-augmentation method based on kidney shape registration to generate enriched training data from a small number of US images with manually segmented kidney labels. Experimental results have demonstrated that our method could effectively improve the performance of automatic kidney segmentation, significantly better than deep learning-based pixel classification networks.

CVMay 21, 2018
Coarse-to-Fine Salient Object Detection with Low-Rank Matrix Recovery

Qi Zheng, Shujian Yu, Xinge You et al.

Low-Rank Matrix Recovery (LRMR) has recently been applied to saliency detection by decomposing image features into a low-rank component associated with the background and a sparse component associated with visually salient regions. Despite its great potential, existing LRMR-based saliency detection methods seldom consider the inter-relationship among elements within these two components, and are thus prone to generating scattered or incomplete saliency maps. In this paper, we introduce a novel and efficient LRMR-based saliency detection model under a coarse-to-fine framework to circumvent this limitation. First, we roughly measure the saliency of image regions with a baseline LRMR model that integrates an $\ell_1$-norm sparsity constraint and a Laplacian smoothness regularization term. Given samples from the coarse saliency map, we then learn a projection that maps image features to refined saliency values, to significantly sharpen the object boundaries and to preserve the object entirety. We evaluate our framework against existing LRMR-based methods on three benchmark datasets. Experimental results validate the superiority of our method as well as the effectiveness of our suggested coarse-to-fine framework, especially for images containing multiple objects.
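For orientation, the low-rank plus sparse decomposition described above is commonly written in the following generic form. This is the standard convex formulation, not necessarily the exact objective used in the paper; the symbols $F$ (feature matrix), $L$ (low-rank background component), $S$ (sparse salient component), and $\lambda$ (trade-off weight) are notation assumed here for illustration.

```latex
\min_{L,\,S}\; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad F = L + S
```

Here $\|L\|_{*}$ is the nuclear norm (the sum of singular values, a convex surrogate for rank) and $\|S\|_{1}$ is the entrywise $\ell_1$ norm that encourages sparsity. The baseline model in the abstract additionally includes a Laplacian regularization term, whose exact form is not given here and is therefore omitted from this sketch.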