CV | Oct 3, 2023
CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation
Jialei Chen, Daisuke Deguchi, Chenkai Zhang et al.
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under the supervision of only the seen ones. To tackle this, existing methods adopt large-scale Vision Language Models (VLMs), which obtain outstanding zero-shot performance. However, as VLMs are designed for classification tasks, directly adapting them may lead to sub-optimal performance. Consequently, we propose CLIP-ZSS (Zero-shot Semantic Segmentation), a simple but effective training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks at test time without combining it with VLMs or inserting new modules. CLIP-ZSS consists of two key modules: the Global Learning Module (GLM) and the Pixel Learning Module (PLM). GLM probes the knowledge of the CLIP visual encoder by pulling together the CLS token and the dense features from the image encoder of the same image and pushing others apart. Moreover, to enhance the ability to discriminate unseen categories, PLM, which consists of pseudo label generation and pseudo weight generation, is designed. To generate semantically discriminative pseudo labels, a multi-scale K-Means with mask fusion operating on the dense tokens is proposed. For pseudo weight generation, a synthesizer that generates pseudo semantic features for the unannotated area is introduced. Experiments on three benchmarks show large performance gains compared with SOTA methods.
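The pull/push objective described for GLM can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes the CLS tokens come from the CLIP visual encoder, that the dense features are mean-pooled into one vector per image, and that a temperature of 0.07 is used. The function names (`global_contrastive_loss`, `mean_pool`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mean_pool(dense):
    """Average a list of token vectors into one vector (assumed pooling choice)."""
    count = len(dense)
    return [sum(tok[d] for tok in dense) / count for d in range(len(dense[0]))]


def global_contrastive_loss(cls_tokens, dense_feats, temperature=0.07):
    """InfoNCE-style loss: the pooled dense feature of image i is pulled toward
    the CLS token of image i and pushed away from CLS tokens of other images.

    cls_tokens:  one vector per image (assumed to come from the CLIP encoder)
    dense_feats: one list of token vectors per image (from the trained encoder)
    """
    teachers = [l2_normalize(c) for c in cls_tokens]
    students = [l2_normalize(mean_pool(d)) for d in dense_feats]
    total = 0.0
    for i, student in enumerate(students):
        logits = [sum(a * b for a, b in zip(student, t)) / temperature
                  for t in teachers]
        # Numerically stable log-sum-exp over all images in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(students)
```

With a batch in which each pooled dense feature already points in the direction of its own CLS token, the loss is close to zero; swapping the pairing makes it large.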
CV | Jun 27, 2021 | Code
SDOF-Tracker: Fast and Accurate Multiple Human Tracking by Skipped-Detection and Optical-Flow
Hitoshi Nishimura, Satoshi Komorita, Yasutomo Kawanishi et al.
Multiple human tracking is a fundamental problem for scene understanding. Although both accuracy and speed are required in real-world applications, recent tracking methods based on deep learning have focused on accuracy and require substantial running time. This study aims to improve running speed by performing human detection at a certain frame interval because it accounts for most of the running time. The question is how to maintain accuracy while skipping human detection. In this paper, we propose a method that complements the detection results with optical flow, based on the fact that someone's appearance does not change much between adjacent frames. To maintain the tracking accuracy, we introduce robust interest point selection within human regions and a tracking termination metric calculated by the distribution of the interest points. On the MOT20 dataset in the MOTChallenge, the proposed SDOF-Tracker achieved the best performance in terms of the total running speed while maintaining the MOTA metric. Our code is available at https://github.com/hitottiez/sdof-tracker.
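The skipped-detection idea can be sketched as a loop that runs the detector only every few frames and shifts boxes with optical flow in between. This is a simplified illustration, not the code from the linked repository: the `detect` and `flow` callables, the `interval` parameter, and the use of the median point displacement are all assumptions, and the interest point selection, data association, and termination metric from the abstract are not modeled.

```python
from statistics import median


def propagate_box(box, points_prev, points_next):
    """Shift a box (x1, y1, x2, y2) by the median displacement of the
    interest points tracked between two adjacent frames."""
    dx = median(q[0] - p[0] for p, q in zip(points_prev, points_next))
    dy = median(q[1] - p[1] for p, q in zip(points_prev, points_next))
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


def track(frames, detect, flow, interval=3):
    """Run `detect` on every `interval`-th frame; on skipped frames, move the
    previous boxes with optical flow instead.

    detect(frame) -> list of boxes                      (hypothetical detector)
    flow(prev_frame, frame, box) -> (points_prev, points_next)
                                                        (hypothetical flow step)
    """
    boxes_per_frame = []
    boxes = []
    for idx, frame in enumerate(frames):
        if idx % interval == 0:
            boxes = detect(frame)  # expensive step, done only occasionally
        else:
            boxes = [propagate_box(b, *flow(frames[idx - 1], frame, b))
                     for b in boxes]
        boxes_per_frame.append(boxes)
    return boxes_per_frame
```

The median is used here so that a few badly tracked points do not drag the whole box; any robust statistic would serve the same illustrative purpose.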
CV | Sep 18, 2019 | Code
Multiple Human Tracking using Multi-Cues including Primitive Action Features
Hitoshi Nishimura, Kazuyuki Tasaka, Yasutomo Kawanishi et al.
In this paper, we propose a Multiple Human Tracking method using multi-cues including Primitive Action Features (MHT-PAF). MHT-PAF can perform accurate human tracking in dynamic aerial videos captured by a drone. PAF employs a global context, rich information from multi-label actions, and a mid-level feature. The accurate human tracking results obtained with PAF help multi-frame-based action recognition. In the experiments, we verified the effectiveness of the proposed method using the Okutama-Action dataset. Our code is available online.
CV | May 8, 2025
Split Matching for Inductive Zero-shot Semantic Segmentation
Jialei Chen, Xu Zheng, Dongyue Li et al.
Zero-shot Semantic Segmentation (ZSS) aims to segment categories that are not annotated during training. While fine-tuning vision-language models has achieved promising results, these models often overfit to seen categories due to the lack of supervision for unseen classes. As an alternative to fully supervised approaches, query-based segmentation has shown great potential in ZSS, as it enables object localization without relying on explicit labels. However, conventional Hungarian matching, a core component in query-based frameworks, needs full supervision and often misclassifies unseen categories as background in the ZSS setting. To address this issue, we propose Split Matching (SM), a novel assignment strategy that decouples Hungarian matching into two components: one for seen classes in annotated regions and another for latent classes in unannotated regions (referred to as unseen candidates). Specifically, we partition the queries into seen and candidate groups, enabling each to be optimized independently according to its available supervision. To discover unseen candidates, we cluster CLIP dense features to generate pseudo masks and extract region-level embeddings using CLS tokens. Matching is then conducted separately for the two groups based on both class-level similarity and mask-level consistency. Additionally, we introduce a Multi-scale Feature Enhancement (MFE) module that refines decoder features through residual multi-scale aggregation, improving the model's ability to capture spatial details across resolutions. SM is the first to introduce decoupled Hungarian matching under the inductive ZSS setting, and achieves state-of-the-art performance on two standard benchmarks.
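The decoupling step can be illustrated as two independent minimum-cost assignments, one per query group. The sketch below is not the authors' implementation: it swaps in a brute-force search over permutations in place of the Hungarian algorithm (it gives the same optimal assignment but is only practical for tiny examples), and it assumes the cost matrices, which would combine class-level similarity and mask-level consistency, are already computed. The names `min_cost_assignment` and `split_matching` are hypothetical.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Optimal one-to-one assignment of targets to queries by brute force.

    cost[q][t] is the cost of matching query q to target t; there must be at
    least as many queries as targets. Returns a list of (query, target) pairs.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    best_pairs, best_total = [], float("inf")
    for rows in permutations(range(num_queries), num_targets):
        total = sum(cost[q][t] for t, q in enumerate(rows))
        if total < best_total:
            best_total = total
            best_pairs = [(q, t) for t, q in enumerate(rows)]
    return best_pairs


def split_matching(cost_seen, cost_candidate, num_seen_queries):
    """Match the seen-query group and the candidate-query group separately.

    cost_seen:      costs between seen queries and annotated seen targets
    cost_candidate: costs between candidate queries and pseudo-mask targets
    Candidate query indices are offset so both results index one query list.
    """
    seen_pairs = min_cost_assignment(cost_seen)
    candidate_pairs = [(q + num_seen_queries, t)
                       for q, t in min_cost_assignment(cost_candidate)]
    return seen_pairs, candidate_pairs
```

Because each group is matched against its own targets, a candidate query can never be assigned to a seen-class target or the other way round, which is the property the decoupling is meant to provide.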
CV | Feb 21, 2024
Generalizable Semantic Vision Query Generation for Zero-shot Panoptic and Semantic Segmentation
Jialei Chen, Daisuke Deguchi, Chenkai Zhang et al.
Zero-shot Panoptic Segmentation (ZPS) aims to recognize foreground instances and background stuff without images containing unseen categories in training. Due to the visual data sparsity and the difficulty of generalizing from seen to unseen categories, this task remains challenging. To better generalize to unseen classes, we propose Conditional tOken aligNment and Cycle trAnsiTion (CONCAT) to produce generalizable semantic vision queries. First, a feature extractor is trained by CON to link vision and semantics for providing target queries. Formally, CON is proposed to align the semantic queries with the CLIP visual CLS token extracted from complete and masked images. To address the lack of unseen categories, a generator is required. However, one of the gaps in synthesizing pseudo vision queries, i.e., vision queries for unseen categories, is describing fine-grained visual details through semantic embeddings. Therefore, we propose CAT to train the generator in semantic-vision and vision-semantic manners. In semantic-vision, visual query contrast is proposed to model the high granularity of vision by pulling the pseudo vision queries toward the corresponding targets containing segments while pushing them away from those without segments. To ensure the generated queries retain semantic information, in vision-semantic, the pseudo vision queries are mapped back to the semantic space and supervised by real semantic embeddings. Experiments on ZPS show a 5.2% hPQ increase over SOTA. We also examine inductive ZPS and open-vocabulary semantic segmentation and obtain comparable results while being 2 times faster in testing.
CV | Jun 27, 2025
Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation
Jialei Chen, Xu Zheng, Danda Pani Paudel et al.
Zero-shot Semantic Segmentation (ZSS) aims to segment both seen and unseen classes using supervision from only seen classes. Beyond adaptation-based methods, distillation-based approaches transfer the vision-language alignment of a vision-language model, e.g., CLIP, to segmentation models. However, such knowledge transfer remains challenging due to: (1) the difficulty of aligning vision-based features with the textual space, which requires combining spatial precision with vision-language alignment; and (2) the semantic gap between CLIP's global representations and the local, fine-grained features of segmentation models. To address challenge (1), we propose Chimera-Seg, which integrates a segmentation backbone as the body and a CLIP-based semantic head as the head, like the Chimera in Greek mythology, combining spatial precision with vision-language alignment. Specifically, Chimera-Seg comprises a trainable segmentation model and a CLIP Semantic Head (CSH), which maps dense features into the CLIP-aligned space. The CSH incorporates a frozen subnetwork and fixed projection layers from the CLIP visual encoder, along with lightweight trainable components. The partial module from the CLIP visual encoder, paired with the segmentation model, retains segmentation capability while easing the mapping to CLIP's semantic space. To address challenge (2), we propose Selective Global Distillation (SGD), which distills knowledge from dense features exhibiting high similarity to the CLIP CLS token, while gradually reducing the number of features used for alignment as training progresses. In addition, we use a Semantic Alignment Module (SAM) to further align dense visual features with semantic embeddings extracted from the frozen CLIP text encoder. Experiments on two benchmarks show improvements of 0.9% and 1.2% in hIoU.
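The selection rule behind Selective Global Distillation can be sketched as a top-k choice over cosine similarities with a k that shrinks during training. The abstract does not give the schedule or the loss form, so everything concrete below is an assumption made for illustration: a linear decay of the keep ratio from 1.0 to 0.25, a loss of one minus cosine similarity averaged over the kept features, and the hypothetical names `keep_count` and `selective_global_distillation`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def keep_count(num_tokens, step, total_steps, start_ratio=1.0, end_ratio=0.25):
    """Number of dense features kept at a training step.

    Assumed schedule: the keep ratio decays linearly from start_ratio to
    end_ratio; at least one feature is always kept.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    ratio = start_ratio + (end_ratio - start_ratio) * progress
    return max(1, round(num_tokens * ratio))


def selective_global_distillation(dense, cls_token, step, total_steps):
    """Average (1 - cosine similarity) over the dense features that are most
    similar to the CLS token; fewer features are used as training progresses."""
    sims = sorted((cosine(tok, cls_token) for tok in dense), reverse=True)
    k = keep_count(len(dense), step, total_steps)
    return sum(1.0 - s for s in sims[:k]) / k
```

Early in training every dense feature contributes to the loss; late in training only the features already closest to the global CLS token do, so background or off-object features are no longer forced toward the global representation.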
CV | Jun 4, 2025
BiXFormer: A Robust Framework for Maximizing Modality Effectiveness in Multi-Modal Semantic Segmentation
Jialei Chen, Xu Zheng, Danda Pani Paudel et al.
Utilizing multi-modal data enhances scene understanding by providing complementary semantic and geometric information. Existing methods fuse features or distill knowledge from multiple modalities into a unified representation, improving robustness but restricting each modality's ability to fully leverage its strengths in different situations. We reformulate multi-modal semantic segmentation as a mask-level classification task and propose BiXFormer, which integrates Unified Modality Matching (UMM) and Cross Modality Alignment (CMA) to maximize modality effectiveness and handle missing modalities. Specifically, BiXFormer first categorizes multi-modal inputs into RGB and X, where X represents any non-RGB modality, e.g., depth, allowing separate processing for each. This design leverages the well-established pretraining for RGB, while addressing the relative lack of attention to X modalities. Then, we propose UMM, which includes Modality Agnostic Matching (MAM) and Complementary Matching (CM). MAM assigns labels to features from all modalities without considering modality differences, leveraging each modality's strengths. CM then reassigns unmatched labels to remaining unassigned features within their respective modalities, ensuring that each available modality contributes to the final prediction and mitigating the impact of missing modalities. Moreover, to further facilitate UMM, we introduce CMA, which enhances the weaker queries assigned in CM by aligning them with optimally matched queries from MAM. Experiments on both synthetic and real-world multi-modal benchmarks demonstrate the effectiveness of our method, achieving significant improvements in mIoU of +2.75% and +22.74% over prior art.
CV | Mar 11, 2025
CQVPR: Landmark-aware Contextual Queries for Visual Place Recognition
Dongyue Li, Daisuke Deguchi, Hiroshi Murase
Visual Place Recognition (VPR) aims to estimate the location of the given query image within a database of geo-tagged images. To identify the exact location in an image, detecting landmarks is crucial. However, in some scenarios, such as urban environments, there are numerous landmarks, such as various modern buildings, and the landmarks in different cities often exhibit high visual similarity. Therefore, it is essential not only to leverage the landmarks but also to consider the contextual information surrounding them, such as whether there are trees, roads, or other features around the landmarks. We propose the Contextual Query VPR (CQVPR), which integrates contextual information with detailed pixel-level visual features. By leveraging a set of learnable contextual queries, our method automatically learns the high-level contexts with respect to landmarks and their surrounding areas. Heatmaps depicting regions that each query attends to serve as context-aware features, offering cues that could enhance the understanding of each scene. We further propose a query matching loss to supervise the extraction process of contextual queries. Extensive experiments on several datasets demonstrate that the proposed method outperforms other state-of-the-art methods, especially in challenging scenarios.
CV | May 9, 2021
Interaction Detection Between Vehicles and Vulnerable Road Users: A Deep Generative Approach with Attention
Hao Cheng, Li Feng, Hailong Liu et al.
Intersections where vehicles are permitted to turn and interact with vulnerable road users (VRUs) like pedestrians and cyclists are among the most challenging locations for automated and accurate recognition of road users' behavior. In this paper, we propose a deep conditional generative model for interaction detection at such locations. It aims to automatically analyze massive video data about the continuity of road users' behavior. This task is essential for many intelligent transportation systems such as traffic safety control and self-driving cars that depend on the understanding of road users' locomotion. A Conditional Variational Auto-Encoder based model with Gaussian latent variables is trained to encode road users' behavior and perform probabilistic and diverse predictions of interactions. The model takes as input the information of road users' type, position and motion automatically extracted by a deep learning object detector and optical flow from videos, and generates frame-wise probabilities that represent the dynamics of interactions between a turning vehicle and any VRUs involved. The model's efficacy was validated by testing on real-world datasets acquired from two different intersections. It achieved an F1-score above 0.96 at a right-turn intersection in Germany and 0.89 at a left-turn intersection in Japan, both with very busy traffic flows.
HC | Mar 2, 2020
What Timing for an Automated Vehicle to Make Pedestrians Understand Its Driving Intentions for Improving Their Perception of Safety?
Hailong Liu, Takatsugu Hirayama, Luis Yoichi Morales et al.
Although automated driving systems have been used frequently, they are still unpopular in society. To increase the popularity of automated vehicles (AVs), assisting pedestrians to accurately understand the driving intentions and improving their perception of safety when interacting with AVs are considered effective. Therefore, the AV should send information about its driving intention to pedestrians when they interact with each other. However, the following questions should be answered regarding how the AV sends the information to them: 1) What timing for an AV to make pedestrians understand its driving intentions after being noticed by them? 2) What timing for an AV to make pedestrians feel safe after being noticed by them? Thirteen participants were invited to interact with a manually driven vehicle and an AV in an experiment. The participants' gaze information and a subjective evaluation of their understanding of the driving intention as well as their perception of safety were collected. By analyzing the participants' gaze duration on the vehicle with their subjective evaluations, we found that the AV should enable the pedestrian to accurately understand its driving intention within 0.5 to 6.5 s and make the pedestrian feel safe within 0.5 to 8.0 s while the pedestrian is gazing at it.
HC | Jan 6, 2020
What Is the Gaze Behavior of Pedestrians in Interactions with an Automated Vehicle When They Do Not Understand Its Intentions?
Hailong Liu, Takatsugu Hirayama, Luis Yoichi Morales et al.
Interactions between pedestrians and automated vehicles (AVs) will increase significantly with the popularity of AVs. However, pedestrians often do not have enough trust in AVs, particularly when they are confused about an AV's intention in an interaction. This study seeks to evaluate whether pedestrians clearly understand the driving intentions of AVs in interactions and presents experimental research on the relationship between gaze behaviors of pedestrians and their understanding of the intentions of the AV. The hypothesis investigated in this study was that the less the pedestrian understands the driving intentions of the AV, the longer the duration of their gazing behavior will be. A pedestrian-vehicle interaction experiment was designed to verify the proposed hypothesis. A robotic wheelchair was used as the manual driving vehicle (MV) and AV for interacting with pedestrians while pedestrians' gaze data and their subjective evaluation of the driving intentions were recorded. The experimental results supported our hypothesis as there was a negative correlation between the pedestrians' gaze duration on the AV and their understanding of the driving intentions of the AV. Moreover, the gaze duration of most of the pedestrians on the MV was shorter than that on an AV. Therefore, we conclude with two recommendations to designers of external human-machine interfaces (eHMI): (1) when a pedestrian is engaged in an interaction with an AV, the driving intentions of the AV should be provided; (2) if the pedestrian still gazes at the AV after the AV displays its driving intentions, the AV should provide clearer information about its driving intentions.