CVDec 29, 2022
A Unified Object Counting Network with Object Occupation PriorShengqin Jiang, Qing Wang, Fengna Cheng et al.
The counting task, which plays a fundamental role in numerous applications (e.g., crowd counting, traffic statistics), aims to predict the number of objects with various densities. Existing object counting tasks are designed for a single object class. However, it is inevitable to encounter newly coming data with new classes in our real world. We name this scenario as \textit{evolving object counting}. In this paper, we build the first evolving object counting dataset and propose a unified object counting network as the first attempt to address this task. The proposed model consists of two key components: a class-agnostic mask module and a class-incremental module. The class-agnostic mask module learns generic object occupation prior via predicting a class-agnostic binary mask (e.g., 1 denotes there exists an object at the considering position in an image and 0 otherwise). The class-incremental module is used to handle new coming classes and provides discriminative class guidance for density map prediction. The combined outputs of class-agnostic mask module and image feature extractor are used to predict the final density map. When new classes come, we first add new neural nodes into the last regression and classification layers of class-incremental module. Then, instead of retraining the model from scratch, we utilize knowledge distillation to help the model remember what have already learned about previous object classes. We also employ a support sample bank to store a small number of typical training samples of each class, which are used to prevent the model from forgetting key information of old data. With this design, our model can efficiently and effectively adapt to new coming classes while keeping good performance on already seen data without large-scale retraining. Extensive experiments on the collected dataset demonstrate the favorable performance.
CVMar 18, 2023
Remote Sensing Object Counting with Online Knowledge LearningShengqin Jiang, Yuan Gao, Bowen Li et al.
Efficient models for remote sensing object counting are urgently required for applications in scenarios with limited computing resources, such as drones or embedded systems. A straightforward yet powerful technique to achieve this is knowledge distillation, which steers the learning of student networks by leveraging the experience of already-trained teacher networks. However, it faces a pair of challenges: Firstly, due to its two-stage training nature, a longer training period is essential, especially as the training samples increase. Secondly, despite the proficiency of teacher networks in transmitting assimilated knowledge, they tend to overlook the latent insights gained during their learning process. To address these challenges, we introduce an online distillation learning method for remote sensing object counting. It builds an end-to-end training framework that seamlessly integrates two distinct networks into a unified one. It comprises a shared shallow module, a teacher branch, and a student branch. The shared module serving as the foundation for both branches is dedicated to learning some primitive information. The teacher branch utilizes prior knowledge to reduce the difficulty of learning and guides the student branch in online learning. In parallel, the student branch achieves parameter reduction and rapid inference capabilities by means of channel reduction. This design empowers the student branch not only to receive privileged insights from the teacher branch but also to tap into the latent reservoir of knowledge held by the teacher branch during the learning process. Moreover, we propose a relation-in-relation distillation method that allows the student branch to effectively comprehend the evolution of the relationship of intra-layer teacher features among different inter-layer features. Extensive experiments demonstrate the effectiveness of our method.
CVJun 1, 2023
Teacher Agent: A Knowledge Distillation-Free Framework for Rehearsal-based Video Incremental LearningShengqin Jiang, Yaoyu Fang, Haokui Zhang et al.
Rehearsal-based video incremental learning often employs knowledge distillation to mitigate catastrophic forgetting of previously learned data. However, this method faces two major challenges for video task: substantial computing resources from loading teacher model and limited replay capability from performance-limited teacher model. To address these problems, we first propose a knowledge distillation-free framework for rehearsal-based video incremental learning called \textit{Teacher Agent}. Instead of loading parameter-heavy teacher networks, we introduce an agent generator that is either parameter-free or uses only a few parameters to obtain accurate and reliable soft labels. This method not only greatly reduces the computing requirement but also circumvents the problem of knowledge misleading caused by inaccurate predictions of the teacher model. Moreover, we put forward a self-correction loss which provides an effective regularization signal for the review of old knowledge, which in turn alleviates the problem of catastrophic forgetting. Further, to ensure that the samples in the memory buffer are memory-efficient and representative, we introduce a unified sampler for rehearsal-based video incremental learning to mine fixed-length key video frames. Interestingly, based on the proposed strategies, the network exhibits a high level of robustness against spatial resolution reduction when compared to the baseline. Extensive experiments demonstrate the advantages of our method, yielding significant performance improvements while utilizing only half the spatial resolution of video clips as network inputs in the incremental phases.
CVFeb 5
Unlocking Prototype Potential: An Efficient Tuning Framework for Few-Shot Class-Incremental LearningShengqin Jiang, Xiaoran Feng, Yuankai Qi et al.
Few-shot class-incremental learning (FSCIL) seeks to continuously learn new classes from very limited samples while preserving previously acquired knowledge. Traditional methods often utilize a frozen pre-trained feature extractor to generate static class prototypes, which suffer from the inherent representation bias of the backbone. While recent prompt-based tuning methods attempt to adapt the backbone via minimal parameter updates, given the constraint of extreme data scarcity, the model's capacity to assimilate novel information and substantively enhance its global discriminative power is inherently limited. In this paper, we propose a novel shift in perspective: freezing the feature extractor while fine-tuning the prototypes. We argue that the primary challenge in FSCIL is not feature acquisition, but rather the optimization of decision regions within a static, high-quality feature space. To this end, we introduce an efficient prototype fine-tuning framework that evolves static centroids into dynamic, learnable components. The framework employs a dual-calibration method consisting of class-specific and task-aware offsets. These components function synergistically to improve the discriminative capacity of prototypes for ongoing incremental classes. Extensive results demonstrate that our method attains superior performance across multiple benchmarks while requiring minimal learnable parameters.
CVFeb 4
SALAD-Pan: Sensor-Agnostic Latent Adaptive Diffusion for Pan-SharpeningJunjie Li, Congyang Ou, Haokui Zhang et al.
Recently, diffusion models bring novel insights for Pan-sharpening and notably boost fusion precision. However, most existing models perform diffusion in the pixel space and train distinct models for different multispectral (MS) imagery, suffering from high latency and sensor-specific limitations. In this paper, we present SALAD-Pan, a sensor-agnostic latent space diffusion method for efficient pansharpening. Specifically, SALAD-Pan trains a band-wise single-channel VAE to encode high-resolution multispectral (HRMS) into compact latent representations, supporting MS images with various channel counts and establishing a basis for acceleration. Then spectral physical properties, along with PAN and MS images, are injected into the diffusion backbone through unidirectional and bidirectional interactive control structures respectively, achieving high-precision fusion in the diffusion process. Finally, a lightweight cross-spectral attention module is added to the central layer of diffusion model, reinforcing spectral connections to boost spectral consistency and further elevate fusion precision. Experimental results on GaoFen-2, QuickBird, and WorldView-3 demonstrate that SALAD-Pan outperforms state-of-the-art diffusion-based methods across all three datasets, attains a 2-3x inference speedup, and exhibits robust zero-shot (cross-sensor) capability.
CVNov 15, 2025
Teaching Prompts to Coordinate: Hierarchical Layer-Grouped Prompt Tuning for Continual LearningShengqin Jiang, Tianqi Kong, Yuankai Qi et al.
Prompt-based continual learning methods fine-tune only a small set of additional learnable parameters while keeping the pre-trained model's parameters frozen. It enables efficient adaptation to new tasks while mitigating the risk of catastrophic forgetting. These methods typically attach one independent task-specific prompt to each layer of pre-trained models to locally modulate its features, ensuring that the layer's representation aligns with the requirements of the new task. However, although introducing learnable prompts independently at each layer provides high flexibility for adapting to new tasks, this overly flexible tuning could make certain layers susceptible to unnecessary updates. As all prompts till the current task are added together as a final prompt for all seen tasks, the model may easily overwrite feature representations essential to previous tasks, which increases the risk of catastrophic forgetting. To address this issue, we propose a novel hierarchical layer-grouped prompt tuning method for continual learning. It improves model stability in two ways: (i) Layers in the same group share roughly the same prompts, which are adjusted by position encoding. This helps preserve the intrinsic feature relationships and propagation pathways of the pre-trained model within each group. (ii) It utilizes a single task-specific root prompt to learn to generate sub-prompts for each layer group. In this way, all sub-prompts are conditioned on the same root prompt, enhancing their synergy and reducing independence. Extensive experiments across four benchmarks demonstrate that our method achieves favorable performance compared with several state-of-the-art methods.
CVApr 18, 2025
ProgRoCC: A Progressive Approach to Rough Crowd CountingShengqin Jiang, Linfei Li, Haokui Zhang et al.
As the number of individuals in a crowd grows, enumeration-based techniques become increasingly infeasible and their estimates increasingly unreliable. We propose instead an estimation-based version of the problem: we label Rough Crowd Counting that delivers better accuracy on the basis of training data that is easier to acquire. Rough crowd counting requires only rough annotations of the number of targets in an image, instead of the more traditional, and far more expensive, per-target annotations. We propose an approach to the rough crowd counting problem based on CLIP, termed ProgRoCC. Specifically, we introduce a progressive estimation learning strategy that determines the object count through a coarse-to-fine approach. This approach delivers answers quickly, outperforms the state-of-the-art in semi- and weakly-supervised crowd counting. In addition, we design a vision-language matching adapter that optimizes key-value pairs by mining effective matches of two modalities to refine the visual features, thereby improving the final performance. Extensive experimental results on three widely adopted crowd counting datasets demonstrate the effectiveness of our method.
CVDec 18, 2018
Mask-aware networks for crowd countingShengqin Jiang, Xiaobo Lu, Yinjie Lei et al.
Crowd counting problem aims to count the number of objects within an image or a frame in the videos and is usually solved by estimating the density map generated from the object location annotations. The values in the density map, by nature, take two possible states: zero indicating no object around, a non-zero value indicating the existence of objects and the value denoting the local object density. In contrast to traditional methods which do not differentiate the density prediction of these two states, we propose to use a dedicated network branch to predict the object/non-object mask and then combine its prediction with the input image to produce the density map. Our rationale is that the mask prediction could be better modeled as a binary segmentation problem and the difficulty of estimating the density could be reduced if the mask is known. A key to the proposed scheme is the strategy of incorporating the mask prediction into the density map estimator. To this end, we study five possible solutions, and via analysis and experimental validation we identify the most effective one. Through extensive experiments on five public datasets, we demonstrate the superior performance of the proposed approach over the baselines and show that our network could achieve the state-of-the-art performance.