Can Peng

h-index: 25

4 papers

153 citations

Novelty: 38%

AI Score: 36

Ranked #99,426 of 194,257 authors (top 51%); #33,413 in CV (top 56%)

4 Papers

27.7 · CV · Jul 30, 2022 · Code

Few-Shot Class-Incremental Learning from an Open-Set Perspective

Can Peng, Kun Zhao, Tianren Wang et al.

The continual appearance of new objects in the visual world poses considerable challenges for current deep learning methods in real-world deployments. The challenge of new task learning is often exacerbated by the scarcity of data for the new categories due to rarity or cost. Here we explore the important task of Few-Shot Class-Incremental Learning (FSCIL) and its extreme data scarcity condition of one-shot. An ideal FSCIL model needs to perform well on all classes, regardless of their presentation order or paucity of data. It also needs to be robust to open-set real-world conditions and be easily adapted to the new tasks that always arise in the field. In this paper, we first reevaluate the current task setting and propose a more comprehensive and practical setting for the FSCIL task. Then, inspired by the similarity of the goals for FSCIL and modern face recognition systems, we propose our method -- Augmented Angular Loss Incremental Classification or ALICE. In ALICE, instead of the commonly used cross-entropy loss, we propose to use the angular penalty loss to obtain well-clustered features. As the obtained features not only need to be compactly clustered but also diverse enough to maintain generalization for future incremental classes, we further discuss how class augmentation, data augmentation, and data balancing affect classification performance. Experiments on benchmark datasets, including CIFAR100, miniImageNet, and CUB200, demonstrate the improved performance of ALICE over the state-of-the-art FSCIL methods.
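To make the contrast with cross-entropy concrete, the following is a minimal sketch of one common angular penalty loss, the additive cosine margin (CosFace-style) form. The abstract does not say which variant ALICE uses, so the function names, the `scale` and `margin` values, and the margin form itself are illustrative assumptions rather than the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_margin_loss(feature, class_weights, label, scale=16.0, margin=0.35):
    """Additive cosine margin loss for a single sample (illustrative only).

    scale and margin are hypothetical hyperparameters, not values from the paper.
    """
    f = l2_normalize(feature)
    # Cosine similarity between the feature and every class prototype.
    cosines = [sum(a * b for a, b in zip(f, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the true class only, which forces tighter clusters.
    logits = [scale * (c - margin) if i == label else scale * c
              for i, c in enumerate(cosines)]
    # Numerically stable softmax cross-entropy over the adjusted logits.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]
```

Setting `margin=0.0` reduces this to cross-entropy over scaled cosine logits, so the margin is the only part that penalizes features lying near a class boundary.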

4.8 · IV · Dec 20, 2022

Unified Framework for Histopathology Image Augmentation and Classification via Generative Models

Meng Li, Chaoyi Li, Can Peng et al.

Deep learning techniques have become widely utilized in histopathology image classification due to their superior performance. However, this success heavily relies on the availability of substantial labeled data, which necessitates extensive and costly manual annotation by domain experts. To address this challenge, researchers have recently employed generative models to synthesize data for augmentation, thereby enhancing classification model performance. Traditionally, this involves generating synthetic data first and then training the classification model with both synthetic and real data, which creates a two-stage, time-consuming workflow. To overcome this limitation, we propose an innovative unified framework that integrates the data generation and model training stages into a single process. Our approach utilizes a pure Vision Transformer (ViT)-based conditional Generative Adversarial Network (cGAN) model to handle both image synthesis and classification simultaneously. An additional classification head is incorporated into the cGAN model to classify histopathology images. To improve training stability and enhance the quality of generated data, we introduce a conditional class projection technique that helps maintain class separation during the generation process. We also employ a dynamic multi-loss weighting mechanism to effectively balance the losses of the classification tasks. Furthermore, our selective augmentation mechanism actively selects the most suitable generated images for data augmentation to further improve performance. Extensive experiments on histopathology datasets show that our unified synthetic augmentation framework consistently enhances the performance of histopathology image classification models.
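The abstract does not describe how the dynamic multi-loss weights are computed. As an illustration only, the sketch below uses Dynamic Weight Averaging, a widely used scheme that is swapped in here and is not claimed to be the authors' mechanism. The function name and the `temperature` default are assumptions.

```python
import math

def dynamic_weight_average(prev_losses, prev_prev_losses, temperature=2.0):
    """Return one weight per task from the two most recent loss values.

    Tasks whose loss is falling more slowly receive a larger weight.
    The weights always sum to the number of tasks.
    """
    # Rate of change for each task: values near 1.0 mean little recent progress.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    k = len(ratios)
    return [k * e / total for e in exps]

# Example: task 0 halved its loss, task 1 stalled, so task 1 is up-weighted.
weights = dynamic_weight_average([1.0, 1.0], [2.0, 1.0])
```

The combined objective would then be the weighted sum of the per-task losses, with the weights recomputed after each epoch.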

8.4 · CV · Jun 1, 2025 · Code

AuralSAM2: Enabling SAM2 Hear Through Pyramid Audio-Visual Feature Prompting

Yuyuan Liu, Yuanhong Chen, Chong Wang et al.

Segment Anything Model 2 (SAM2) exhibits strong generalisation for promptable segmentation in video clips; however, its integration with the audio modality remains underexplored. Existing approaches mainly follow two directions: (1) injecting adapters into the image encoder to receive audio signals, which incurs efficiency costs during prompt engineering, and (2) leveraging additional foundation models to generate visual prompts for the sounding objects, which are often imprecisely localised and therefore misguide SAM2. Moreover, these methods overlook the rich semantic interplay between hierarchical visual features and other modalities, resulting in suboptimal cross-modal fusion. In this work, we propose AuralSAM2, comprising the novel AuralFuser module, which attaches externally to SAM2 to integrate features from different modalities and generate feature-level prompts that guide SAM2's decoder in segmenting sounding targets. This integration is facilitated by a feature pyramid, which further refines semantic understanding and enhances object awareness in multimodal scenarios. Additionally, audio-guided contrastive learning is introduced to explicitly align audio and visual representations and to mitigate biases caused by dominant visual patterns. Results on public benchmarks show that our approach achieves remarkable improvements over previous methods in the field. Code is available at https://github.com/yyliu01/AuralSAM2.
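The idea of aligning audio and visual representations contrastively can be sketched with a generic InfoNCE-style loss. The abstract does not give the exact formulation used in AuralSAM2, so the function names, the `temperature` value, and the use of plain cosine similarity are assumptions for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def info_nce(audio, visual_candidates, positive_index, temperature=0.1):
    """InfoNCE loss for one audio embedding against several visual embeddings.

    The visual embedding at positive_index is treated as the sounding object;
    all other candidates act as negatives.
    """
    logits = [cosine(audio, v) / temperature for v in visual_candidates]
    # Numerically stable log-softmax of the positive entry.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[positive_index]
```

Minimizing this loss pulls the audio embedding toward the visual feature of the sounding object and pushes it away from visually dominant but silent regions, which is one plausible reading of how such a term could reduce visual bias.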

3.6 · CV · Feb 27, 2025 · Code

WalnutData: A UAV Remote Sensing Dataset of Green Walnuts and Model Evaluation

Mingjie Wu, Chenggui Yang, Huihua Wang et al.

UAV technology is gradually maturing and can provide powerful support for smart agriculture and precise monitoring. Currently, there is no dataset of green walnuts in the field of agricultural computer vision. To promote algorithm design in this field, we used UAVs to collect remote-sensing data from 8 walnut sample plots. Because green walnuts are subject to varied lighting conditions and occlusion, we constructed WalnutData, a large-scale dataset with finer-grained target features. The dataset contains a total of 30,240 images and 706,208 instances in 4 target categories: illuminated by frontal light and unoccluded (A1), backlit and unoccluded (A2), illuminated by frontal light and occluded (B1), and backlit and occluded (B2). We then evaluated many mainstream algorithms on WalnutData and used the results as baselines. The dataset and all evaluation results can be obtained at https://github.com/1wuming/WalnutData.
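The four categories form a 2x2 grid of lighting and occlusion, which the short sketch below encodes along with the average instance density implied by the stated totals. The function and constant names are hypothetical and are not taken from the dataset's repository.

```python
# Totals as stated in the abstract.
NUM_IMAGES = 30_240
NUM_INSTANCES = 706_208

def walnut_category(frontal_light: bool, occluded: bool) -> str:
    """Map lighting and occlusion flags to the A1/A2/B1/B2 labels."""
    letter = "B" if occluded else "A"       # A = unoccluded, B = occluded
    number = "1" if frontal_light else "2"  # 1 = frontal light, 2 = backlit
    return letter + number

# Average number of annotated walnuts per image.
instances_per_image = NUM_INSTANCES / NUM_IMAGES
```

This works out to roughly 23 instances per image, so detectors evaluated on the dataset face fairly dense scenes.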