CVSep 6, 2023
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic SegmenterJinglong Wang, Xiawei Li, Jing Zhang et al.
The pre-trained text-image discriminative models, such as CLIP, has been explored for open-vocabulary semantic segmentation with unsatisfactory results due to the loss of crucial localization information and awareness of object shapes. Recently, there has been a growing interest in expanding the application of generative models from generation tasks to semantic segmentation. These approaches utilize generative models either for generating annotated data or extracting features to facilitate semantic segmentation. This typically involves generating a considerable amount of synthetic data or requiring additional mask annotations. To this end, we uncover the potential of generative text-to-image diffusion models (e.g., Stable Diffusion) as highly efficient open-vocabulary semantic segmenters, and introduce a novel training-free approach named DiffSegmenter. The insight is that to generate realistic objects that are semantically faithful to the input text, both the complete object shapes and the corresponding semantics are implicitly learned by diffusion models. We discover that the object shapes are characterized by the self-attention maps while the semantics are indicated through the cross-attention maps produced by the denoising U-Net, forming the basis of our segmentation results.Additionally, we carefully design effective textual prompts and a category filtering mechanism to further enhance the segmentation results. Extensive experiments on three benchmark datasets show that the proposed DiffSegmenter achieves impressive results for open-vocabulary semantic segmentation.
IVAug 10, 2022
KiPA22 Report: U-Net with Contour Regularization for Renal Structures SegmentationKangqing Ye, Peng Liu, Xiaoyang Zou et al.
Three-dimensional (3D) integrated renal structures (IRS) segmentation is important in clinical practice. With the advancement of deep learning techniques, many powerful frameworks focusing on medical image segmentation are proposed. In this challenge, we utilized the nnU-Net framework, which is the state-of-the-art method for medical image segmentation. To reduce the outlier prediction for the tumor label, we combine contour regularization (CR) loss of the tumor label with Dice loss and cross-entropy loss to improve this phenomenon.
IVAug 22, 2024
Automatic Organ and Pan-cancer Segmentation in Abdomen CT: the FLARE 2023 ChallengeJun Ma, Yao Zhang, Song Gu et al.
Organ and cancer segmentation in abdomen Computed Tomography (CT) scans is the prerequisite for precise cancer diagnosis and treatment. Most existing benchmarks and algorithms are tailored to specific cancer types, limiting their ability to provide comprehensive cancer analysis. This work presents the first international competition on abdominal organ and pan-cancer segmentation by providing a large-scale and diverse dataset, including 4650 CT scans with various cancer types from over 40 medical centers. The winning team established a new state-of-the-art with a deep learning-based cascaded framework, achieving average Dice Similarity Coefficient scores of 92.3% for organs and 64.9% for lesions on the hidden multi-national testing set. The dataset and code of top teams are publicly available, offering a benchmark platform to drive further innovations https://codalab.lisn.upsaclay.fr/competitions/12239.
LGJan 9
AIConfigurator: Lightning-Fast Configuration Optimization for Multi-Framework LLM ServingTianhao Xu, Yiming Liu, Xianglong Lu et al.
Optimizing Large Language Model (LLM) inference in production systems is increasingly difficult due to dynamic workloads, stringent latency/throughput targets, and a rapidly expanding configuration space. This complexity spans not only distributed parallelism strategies (tensor/pipeline/expert) but also intricate framework-specific runtime parameters such as those concerning the enablement of CUDA graphs, available KV-cache memory fractions, and maximum token capacity, which drastically impact performance. The diversity of modern inference frameworks (e.g., TRT-LLM, vLLM, SGLang), each employing distinct kernels and execution policies, makes manual tuning both framework-specific and computationally prohibitive. We present AIConfigurator, a unified performance-modeling system that enables rapid, framework-agnostic inference configuration search without requiring GPU-based profiling. AIConfigurator combines (1) a methodology that decomposes inference into analytically modelable primitives - GEMM, attention, communication, and memory operations while capturing framework-specific scheduling dynamics; (2) a calibrated kernel-level performance database for these primitives across a wide range of hardware platforms and popular open-weights models (GPT-OSS, Qwen, DeepSeek, LLama, Mistral); and (3) an abstraction layer that automatically resolves optimal launch parameters for the target backend, seamlessly integrating into production-grade orchestration systems. Evaluation on production LLM serving workloads demonstrates that AIConfigurator identifies superior serving configurations that improve performance by up to 40% for dense models (e.g., Qwen3-32B) and 50% for MoE architectures (e.g., DeepSeek-V3), while completing searches within 30 seconds on average. Enabling the rapid exploration of vast design spaces - from cluster topology down to engine specific flags.
CLNov 3, 2025
"Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI ReviewersQin Zhou, Zhexin Zhang, Zhi Li et al.
With the rapid advancement of AI models, their deployment across diverse tasks has become increasingly widespread. A notable emerging application is leveraging AI models to assist in reviewing scientific papers. However, recent reports have revealed that some papers contain hidden, injected prompts designed to manipulate AI reviewers into providing overly favorable evaluations. In this work, we present an early systematic investigation into this emerging threat. We propose two classes of attacks: (1) static attack, which employs a fixed injection prompt, and (2) iterative attack, which optimizes the injection prompt against a simulated reviewer model to maximize its effectiveness. Both attacks achieve striking performance, frequently inducing full evaluation scores when targeting frontier AI reviewers. Furthermore, we show that these attacks are robust across various settings. To counter this threat, we explore a simple detection-based defense. While it substantially reduces the attack success rate, we demonstrate that an adaptive attacker can partially circumvent this defense. Our findings underscore the need for greater attention and rigorous safeguards against prompt-injection threats in AI-assisted peer review.
CVJul 21, 2025Code
FreeCus: Free Lunch Subject-driven Customization in Diffusion TransformersYanbing Zhang, Zhe Wang, Qin Zhou et al.
In light of recent breakthroughs in text-to-image (T2I) generation, particularly with diffusion transformers (DiT), subject-driven technologies are increasingly being employed for high-fidelity customized production that preserves subject identity from reference inputs, enabling thrilling design workflows and engaging entertainment. Existing alternatives typically require either per-subject optimization via trainable text embeddings or training specialized encoders for subject feature extraction on large-scale datasets. Such dependencies on training procedures fundamentally constrain their practical applications. More importantly, current methodologies fail to fully leverage the inherent zero-shot potential of modern diffusion transformers (e.g., the Flux series) for authentic subject-driven synthesis. To bridge this gap, we propose FreeCus, a genuinely training-free framework that activates DiT's capabilities through three key innovations: 1) We introduce a pivotal attention sharing mechanism that captures the subject's layout integrity while preserving crucial editing flexibility. 2) Through a straightforward analysis of DiT's dynamic shifting, we propose an upgraded variant that significantly improves fine-grained feature extraction. 3) We further integrate advanced Multimodal Large Language Models (MLLMs) to enrich cross-modal semantic representations. Extensive experiments reflect that our method successfully unlocks DiT's zero-shot ability for consistent subject synthesis across diverse contexts, achieving state-of-the-art or comparable results compared to approaches that require additional training. Notably, our framework demonstrates seamless compatibility with existing inpainting pipelines and control modules, facilitating more compelling experiences. Our code is available at: https://github.com/Monalissaa/FreeCus.
AIOct 21, 2024Code
Large Language Models Powered Multiagent Ensemble for Mitigating Hallucination and Efficient Atrial Fibrillation Annotation of ECG ReportsJingwei Huang, Kuroush Nezafati, Ismael Villanueva-Miranda et al.
This study introduces a LLMs powered multiagent ensemble method to address challenges in hallucination and data labeling, particularly in large-scale EHR datasets. Manual labeling of such datasets requires domain expertise and is labor-intensive, time-consuming, expensive, and error-prone. To overcome this bottleneck, we developed an ensemble LLMs method and demonstrated its effectiveness in two real-world tasks: (1) labeling a large-scale unlabeled ECG dataset in MIMIC-IV; (2) identifying social determinants of health (SDOH) from the clinical notes of EHR. Trading off benefits and cost, we selected a pool of diverse open source LLMs with satisfactory performance. We treat each LLM's prediction as a vote and apply a mechanism of majority voting with minimal winning threshold for ensemble. We implemented an ensemble LLMs application for EHR data labeling tasks. By using the ensemble LLMs and natural language processing, we labeled MIMIC-IV ECG dataset of 623,566 ECG reports with an estimated accuracy of 98.2%. We applied the ensemble LLMs method to identify SDOH from social history sections of 1,405 EHR clinical notes, also achieving competitive performance. Our experiments show that the ensemble LLMs can outperform individual LLM even the best commercial one, and the method reduces hallucination errors. From the research, we found that (1) the ensemble LLMs method significantly reduces the time and effort required for labeling large-scale EHR data, automating the process with high accuracy and quality; (2) the method generalizes well to other text data labeling tasks, as shown by its application to SDOH identification; (3) the ensemble of a group of diverse LLMs can outperform or match the performance of the best individual LLM; and (4) the ensemble method substantially reduces hallucination errors. This approach provides a scalable and efficient solution to data-labeling challenges.
CVJun 3, 2019Code
Learning to Self-Train for Semi-Supervised Few-Shot ClassificationXinzhe Li, Qianru Sun, Yaoyao Liu et al.
Few-shot classification (FSC) is challenging due to the scarcity of labeled training data (e.g. only one labeled data point per class). Meta-learning has shown to achieve promising results by learning to initialize a classification model for FSC. In this paper we propose a novel semi-supervised meta-learning method called learning to self-train (LST) that leverages unlabeled data and specifically meta-learns how to cherry-pick and label such unsupervised data to further improve performance. To this end, we train the LST model through a large number of semi-supervised few-shot tasks. On each task, we train a few-shot model to predict pseudo labels for unlabeled data, and then iterate the self-training steps on labeled and pseudo-labeled data with each step followed by fine-tuning. We additionally learn a soft weighting network (SWN) to optimize the self-training weights of pseudo labels so that better ones can contribute more to gradient descent optimization. We evaluate our LST method on two ImageNet benchmarks for semi-supervised few-shot classification and achieve large improvements over the state-of-the-art method. Code is at https://github.com/xinzheli1217/learning-to-self-train.
FLU-DYNDec 1, 2025
How do trout regulate patterns of muscle contraction to optimize propulsive efficiency during steady swimmingTao Li, Chunze Zhang, Weiwei Yao et al.
Understanding efficient fish locomotion offers insights for biomechanics, fluid dynamics, and engineering. Traditional studies often miss the link between neuromuscular control and whole-body movement. To explore energy transfer in carangiform swimming, we created a bio-inspired digital trout. This model combined multibody dynamics, Hill-type muscle modeling, and a high-fidelity fluid-structure interaction algorithm, accurately replicating a real trout's form and properties. Using deep reinforcement learning, the trout's neural system achieved hierarchical spatiotemporal control of muscle activation. We systematically examined how activation strategies affect speed and energy use. Results show that axial myomere coupling-with activation spanning over 0.5 body lengths-is crucial for stable body wave propagation. Moderate muscle contraction duration ([0.1,0.3] of a tail-beat cycle) lets the body and fluid act as a passive damping system, cutting energy use. Additionally, the activation phase lag of myomeres shapes the body wave; if too large, it causes antagonistic contractions that hinder thrust. These findings advance bio-inspired locomotion understanding and aid energy-efficient underwater system design.
68.6LGApr 15
Enhancing Reinforcement Learning for Radiology Report Generation with Evidence-aware Rewards and Self-correcting Preference LearningQin Zhou, Guoyan Liang, Qianyi Yang et al.
Recent reinforcement learning (RL) approaches have advanced radiology report generation (RRG), yet two core limitations persist: (1) report-level rewards offer limited evidence-grounded guidance for clinical faithfulness; and (2) current methods lack an explicit self-improving mechanism to align with clinical preference. We introduce clinically aligned Evidence-aware Self-Correcting Reinforcement Learning (ESC-RL), comprising two key components. First, a Group-wise Evidence-aware Alignment Reward (GEAR) delivers group-wise, evidence-aware feedback. GEAR reinforces consistent grounding for true positives, recovers missed findings for false negatives, and suppresses unsupported content for false positives. Second, a Self-correcting Preference Learning (SPL) strategy automatically constructs a reliable, disease-aware preference dataset from multiple noisy observations and leverages an LLM to synthesize refined reports without human supervision. ESC-RL promotes clinically faithful, disease-aligned reward and supports continual self-improvement during training. Extensive experiments on two public chest X-ray datasets demonstrate consistent gains and state-of-the-art performance.
CVMar 27, 2024
Attention Calibration for Disentangled Text-to-Image PersonalizationYanbing Zhang, Mengping Yang, Qin Zhou et al.
Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation, 3D and video composition. Further, personalized techniques enable appealing customized production of a novel concept given only several images as reference. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from one single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then, the classes are separated and strengthened following the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.
83.6ITApr 1
Secure Network Function Computation for General Target and Security FunctionsQin Zhou, Fang-Wei Fu
Secure network function computation is a critical research direction in network coding, which aims to ensure that the target function is correctly computed at the sink node while preventing the wiretapper from obtaining any information about the security function. In this paper, we focus on the general secure network function computation model, where the target function f and the security function ζ are arbitrary, and the wiretapper can eavesdrop on any subset of edges with size at most a given security level. Using information-theoretic techniques, we establish a nontrivial upper bound on the secure computing capacity, which is applicable to arbitrary networks, arbitrary target and security functions, and arbitrary security levels. This upper bound is shown to degenerate to the existing bounds in the literature when the target and security functions are specific forms. Furthermore, we consider two specific models: one where the target function is vector-linear and the security function is the identity function, and another where both functions are vector-linear. For the former, we derive a simplified form of the upper bound on the secure computing capacity via order-theoretic methods and propose an efficient algorithm to compute this bound with linear time complexity in the number of network edges. For the latter, we characterize the equivalent conditions for the computability and security of linear secure network codes, develop two constructive schemes for such codes, and derive an upper bound on the minimal finite field size required for the constructions, thereby obtaining a nontrivial lower bound on the secure computing capacity.
CVOct 3, 2025
GAS-MIL: Group-Aggregative Selection Multi-Instance Learning for Ensemble of Foundation Models in Digital Pathology Image AnalysisPeiran Quan, Zifan Gu, Zhuo Zhao et al.
Foundation models (FMs) have transformed computational pathology by providing powerful, general-purpose feature extractors. However, adapting and benchmarking individual FMs for specific diagnostic tasks is often time-consuming and resource-intensive, especially given their scale and diversity. To address this challenge, we introduce Group-Aggregative Selection Multi-Instance Learning (GAS-MIL), a flexible ensemble framework that seamlessly integrates features from multiple FMs, preserving their complementary strengths without requiring manual feature selection or extensive task-specific fine-tuning. Across classification tasks in three cancer datasets-prostate (PANDA), ovarian (UBC-OCEAN), and breast (TCGA-BrCa)-GAS-MIL consistently achieves superior or on-par performance relative to individual FMs and established MIL methods, demonstrating its robustness and generalizability. By enabling efficient integration of heterogeneous FMs, GAS-MIL streamlines model deployment for pathology and provides a scalable foundation for future multimodal and precision oncology applications.
CVJun 11, 2025
ELBO-T2IAlign: A Generic ELBO-Based Method for Calibrating Pixel-level Text-Image Alignment in Diffusion ModelsQin Zhou, Zhiyang Zhang, Jinglong Wang et al.
Diffusion models excel at image generation. Recent studies have shown that these models not only generate high-quality images but also encode text-image alignment information through attention maps or loss functions. This information is valuable for various downstream tasks, including segmentation, text-guided image editing, and compositional image generation. However, current methods heavily rely on the assumption of perfect text-image alignment in diffusion models, which is not the case. In this paper, we propose using zero-shot referring image segmentation as a proxy task to evaluate the pixel-level image and class-level text alignment of popular diffusion models. We conduct an in-depth analysis of pixel-text misalignment in diffusion models from the perspective of training data bias. We find that misalignment occurs in images with small sized, occluded, or rare object classes. Therefore, we propose ELBO-T2IAlign, a simple yet effective method to calibrate pixel-text alignment in diffusion models based on the evidence lower bound (ELBO) of likelihood. Our method is training-free and generic, eliminating the need to identify the specific cause of misalignment and works well across various diffusion model architectures. Extensive experiments on commonly used benchmark datasets on image segmentation and generation have verified the effectiveness of our proposed calibration approach.
CVDec 10, 2021
Seq-Masks: Bridging the gap between appearance and gait modeling for video-based person re-identificationZhigang Chang, Zhao Yang, Yongbiao Chen et al.
ideo-based person re-identification (Re-ID) aims to match person images in video sequences captured by disjoint surveillance cameras. Traditional video-based person Re-ID methods focus on exploring appearance information, thus, vulnerable against illumination changes, scene noises, camera parameters, and especially clothes/carrying variations. Gait recognition provides an implicit biometric solution to alleviate the above headache. Nonetheless, it experiences severe performance degeneration as camera view varies. In an attempt to address these problems, in this paper, we propose a framework that utilizes the sequence masks (SeqMasks) in the video to integrate appearance information and gait modeling in a close fashion. Specifically, to sufficiently validate the effectiveness of our method, we build a novel dataset named MaskMARS based on MARS. Comprehensive experiments on our proposed large wild video Re-ID dataset MaskMARS evidenced our extraordinary performance and generalization capability. Validations on the gait recognition metric CASIA-B dataset further demonstrated the capability of our hybrid model.
CVDec 1, 2021
Subtask-dominated Transfer Learning for Long-tail Person SearchChuang Liu, Hua Yang, Qin Zhou et al.
Person search unifies person detection and person re-identification (Re-ID) to locate query persons from the panoramic gallery images. One major challenge comes from the imbalanced long-tail person identity distributions, which prevents the one-step person search model from learning discriminative person features for the final re-identification. However, it is under-explored how to solve the heavy imbalanced identity distributions for the one-step person search. Techniques designed for the long-tail classification task, for example, image-level re-sampling strategies, are hard to be effectively applied to the one-step person search which jointly solves person detection and Re-ID subtasks with a detection-based multi-task framework. To tackle this problem, we propose a Subtask-dominated Transfer Learning (STL) method. The STL method solves the long-tail problem in the pretraining stage of the dominated Re-ID subtask and improves the one-step person search by transfer learning of the pretrained model. We further design a Multi-level RoI Fusion Pooling layer to enhance the discrimination ability of person features for the one-step person search. Extensive experiments on CUHK-SYSU and PRW datasets demonstrate the superiority and effectiveness of the proposed method.
CVAug 24, 2021
Making Person Search Enjoy the Merits of Person Re-identificationChuang Liu, Hua Yang, Qin Zhou et al.
Person search is an extended task of person re-identification (Re-ID). However, most existing one-step person search works have not studied how to employ existing advanced Re-ID models to boost the one-step person search performance due to the integration of person detection and Re-ID. To address this issue, we propose a faster and stronger one-step person search framework, the Teacher-guided Disentangling Networks (TDN), to make the one-step person search enjoy the merits of the existing Re-ID researches. The proposed TDN can significantly boost the person search performance by transferring the advanced person Re-ID knowledge to the person search model. In the proposed TDN, for better knowledge transfer from the Re-ID teacher model to the one-step person search model, we design a strong one-step person search base framework by partially disentangling the two subtasks. Besides, we propose a Knowledge Transfer Bridge module to bridge the scale gap caused by different input formats between the Re-ID model and one-step person search model. During testing, we further propose the Ranking with Context Persons strategy to exploit the context information in panoramic images for better retrieval. Experiments on two public person search datasets demonstrate the favorable performance of the proposed method.
LGJul 30, 2021
Random vector functional link neural network based ensemble deep learning for short-term load forecastingRuobin Gao, Liang Du, P. N. Suganthan et al.
Electricity load forecasting is crucial for the power systems' planning and maintenance. However, its un-stationary and non-linear characteristics impose significant difficulties in anticipating future demand. This paper proposes a novel ensemble deep Random Vector Functional Link (edRVFL) network for electricity load forecasting. The weights of hidden layers are randomly initialized and kept fixed during the training process. The hidden layers are stacked to enforce deep representation learning. Then, the model generates the forecasts by ensembling the outputs of each layer. Moreover, we also propose to augment the random enhancement features by empirical wavelet transformation (EWT). The raw load data is decomposed by EWT in a walk-forward fashion, not introducing future data leakage problems in the decomposition process. Finally, all the sub-series generated by the EWT, including raw data, are fed into the edRVFL for forecasting purposes. The proposed model is evaluated on twenty publicly available time series from the Australian Energy Market Operator of the year 2020. The simulation results demonstrate the proposed model's superior performance over eleven forecasting methods in three error metrics and statistical tests on electricity load forecasting tasks.
CVNov 17, 2019
Distribution Context Aware Loss for Person Re-identificationZhigang Chang, Qin Zhou, Mingyang Yu et al.
To learn the optimal similarity function between probe and gallery images in Person re-identification, effective deep metric learning methods have been extensively explored to obtain discriminative feature embedding. However, existing metric loss like triplet loss and its variants always emphasize pair-wise relations but ignore the distribution context in feature space, leading to inconsistency and sub-optimal. In fact, the similarity of one pair not only decides the match of this pair, but also has potential impacts on other sample pairs. In this paper, we propose a novel Distribution Context Aware (DCA) loss based on triplet loss to combine both numerical similarity and relation similarity in feature space for better clustering. Extensive experiments on three benchmarks including Market-1501, DukeMTMC-reID and MSMT17, evidence the favorable performance of our method against the corresponding baseline and other state-of-the-art methods.
CVMay 15, 2018
Robust and Efficient Graph Correspondence Transfer for Person Re-identificationQin Zhou, Heng Fan, Hua Yang et al.
Spatial misalignment caused by variations in poses and viewpoints is one of the most critical issues that hinders the performance improvement in existing person re-identification (Re-ID) algorithms. To address this problem, in this paper, we present a robust and efficient graph correspondence transfer (REGCT) approach for explicit spatial alignment in Re-ID. Specifically, we propose to establish the patch-wise correspondences of positive training pairs via graph matching. By exploiting both spatial and visual contexts of human appearance in graph matching, meaningful semantic correspondences can be obtained. To circumvent the cumbersome \emph{on-line} graph matching in testing phase, we propose to transfer the \emph{off-line} learned patch-wise correspondences from the positive training pairs to test pairs. In detail, for each test pair, the training pairs with similar pose-pair configurations are selected as references. The matching patterns (i.e., the correspondences) of the selected references are then utilized to calculate the patch-wise feature distances of this test pair. To enhance the robustness of correspondence transfer, we design a novel pose context descriptor to accurately model human body configurations, and present an approach to measure the similarity between a pair of pose context descriptors. Meanwhile, to improve testing efficiency, we propose a correspondence template ensemble method using the voting mechanism, which significantly reduces the amount of patch-wise matchings involved in distance calculation. With aforementioned strategies, the REGCT model can effectively and efficiently handle the spatial misalignment problem in Re-ID. Extensive experiments on five challenging benchmarks, including VIPeR, Road, PRID450S, 3DPES and CUHK01, evidence the superior performance of REGCT over other state-of-the-art approaches.
CVApr 1, 2018
Graph Correspondence Transfer for Person Re-identificationQin Zhou, Heng Fan, Shibao Zheng et al.
In this paper, we propose a graph correspondence transfer (GCT) approach for person re-identification. Unlike existing methods, the GCT model formulates person re-identification as an off-line graph matching and on-line correspondence transferring problem. In specific, during training, the GCT model aims to learn off-line a set of correspondence templates from positive training pairs with various pose-pair configurations via patch-wise graph matching. During testing, for each pair of test samples, we select a few training pairs with the most similar pose-pair configurations as references, and transfer the correspondences of these references to test pair for feature distance calculation. The matching score is derived by aggregating distances from different references. For each probe image, the gallery image with the highest matching score is the re-identifying result. Compared to existing algorithms, our GCT can handle spatial misalignment caused by large variations in view angles and human poses owing to the benefits of patch-wise graph matching. Extensive experiments on five benchmarks including VIPeR, Road, PRID450S, 3DPES and CUHK01 evidence the superior performance of GCT model over other state-of-the-art methods.
CVMar 22, 2018
Weighted Bilinear Coding over Salient Body Parts for Person Re-identificationZhigang Chang, Qin Zhou, Heng Fan et al.
Deep convolutional neural networks (CNNs) have demonstrated dominant performance in person re-identification (Re-ID). Existing CNN based methods utilize global average pooling (GAP) to aggregate intermediate convolutional features for Re-ID. However, this strategy only considers the first-order statistics of local features and treats local features at different locations equally important, leading to sub-optimal feature representation. To deal with these issues, we propose a novel weighted bilinear coding (WBC) framework for local feature aggregation in CNN networks to pursue more representative and discriminative feature representations, which can adapt to other state-of-the-art methods and improve their performance. In specific, bilinear coding is used to encode the channel-wise feature correlations to capture richer feature interactions. Meanwhile, a weighting scheme is applied on the bilinear coding to adaptively adjust the weights of local features at different locations based on their importance in recognition, further improving the discriminability of feature aggregation. To handle the spatial misalignment issue, we use a salient part net (spatial attention module) to derive salient body parts, and apply the WBC model on each part. The final representation, formed by concatenating the WBC encoded features of each part, is both discriminative and resistant to spatial misalignment. Experiments on three benchmarks including Market-1501, DukeMTMC-reID and CUHK03 evidence the favorable performance of our method against other outstanding methods.