Kunyu Peng

CV
h-index61
88papers
1,661citations
Novelty53%
AI Score64

88 Papers

CVMar 2, 2022Code
Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation

Jiaming Zhang, Kailun Yang, Chaoxiang Ma et al.

Panoramic images with their 360-degree directional view encompass exhaustive information about the surrounding space, providing a rich foundation for scene understanding. To unfold this potential in the form of robust panoramic segmentation models, large quantities of expensive, pixel-wise annotations are crucial for success. Such annotations are available, but predominantly for narrow-angle, pinhole-camera images which, off the shelf, serve as sub-optimal resources for training panoramic models. Distortions and the distinct image-feature distribution in 360-degree panoramas impede the transfer from the annotation-rich pinhole domain and therefore come with a big dent in performance. To get around this domain difference and bring together semantic annotations from pinhole- and 360-degree surround-visuals, we propose to learn object deformations and panoramic image distortions in the Deformable Patch Embedding (DPE) and Deformable MLP (DMLP) components which blend into our Transformer for PAnoramic Semantic Segmentation (Trans4PASS) model. Finally, we tie together shared semantics in pinhole- and panoramic feature embeddings by generating multi-scale prototype features and aligning them in our Mutual Prototypical Adaptation (MPA) for unsupervised domain adaptation. On the indoor Stanford2D3D dataset, our Trans4PASS with MPA maintains comparable performance to fully-supervised state-of-the-arts, cutting the need for over 1,400 labeled panoramas. On the outdoor DensePASS dataset, we break state-of-the-art by 14.39% mIoU and set the new bar at 56.38%. Code will be made publicly available at https://github.com/jamycheung/Trans4PASS.

CVJul 25, 2022Code
Behind Every Domain There is a Shift: Adapting Distortion-aware Vision Transformers for Panoramic Semantic Segmentation

Jiaming Zhang, Kailun Yang, Hao Shi et al.

In this paper, we address panoramic semantic segmentation which is under-explored due to two critical challenges: (1) image distortions and object deformations on panoramas; (2) lack of semantic annotations in the 360° imagery. To tackle these problems, first, we propose the upgraded Transformer for Panoramic Semantic Segmentation, i.e., Trans4PASS+, equipped with Deformable Patch Embedding (DPE) and Deformable MLP (DMLPv2) modules for handling object deformations and image distortions whenever (before or after adaptation) and wherever (shallow or deep levels). Second, we enhance the Mutual Prototypical Adaptation (MPA) strategy via pseudo-label rectification for unsupervised domain adaptive panoramic segmentation. Third, aside from Pinhole-to-Panoramic (Pin2Pan) adaptation, we create a new dataset (SynPASS) with 9,080 panoramic images, facilitating Synthetic-to-Real (Syn2Real) adaptation scheme in 360° imagery. Extensive experiments are conducted, which cover indoor and outdoor scenarios, and each of them is investigated with Pin2Pan and Syn2Real regimens. Trans4PASS+ achieves state-of-the-art performances on four domain adaptive panoramic semantic segmentation benchmarks. Code is available at https://github.com/jamycheung/Trans4PASS.

CVJul 15, 2023Code
Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

Ruiping Liu, Jiaming Zhang, Kunyu Peng et al.

Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus on the application of GSR in assisting people with visual impairments (PVI). However, precise localization information of detected objects is often required to navigate their surroundings confidently and make informed decisions. For the first time, we propose an Open Scene Understanding (OpenSU) system that aims to generate pixel-wise dense segmentation masks of involved entities instead of bounding boxes. Specifically, we build our OpenSU system on top of GSR by additionally adopting an efficient Segment Anything Model (SAM). Furthermore, to enhance the feature extraction and interaction between the encoder-decoder structure, we construct our OpenSU system using a solid pure transformer backbone to improve the performance of GSR. In order to accelerate the convergence, we replace all the activation functions within the GSR decoders with GELU, thereby reducing the training duration. In quantitative analysis, our model achieves state-of-the-art performance on the SWiG dataset. Moreover, through field testing on dedicated assistive technology datasets and application demonstrations, the proposed OpenSU system can be used to enhance scene understanding and facilitate the independent mobility of people with visual impairments. Our code will be available at https://github.com/RuipingL/OpenSU.

CVMar 2, 2022Code
TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration

Kunyu Peng, Alina Roitberg, Kailun Yang et al.

Traditional video-based human activity recognition has experienced remarkable progress linked to the rise of deep learning, but this effect was slower as it comes to the downstream task of driver behavior understanding. Understanding the situation inside the vehicle cabin is essential for Advanced Driving Assistant System (ADAS) as it enables identifying distraction, predicting driver's intent and leads to more convenient human-vehicle interaction. At the same time, driver observation systems face substantial obstacles as they need to capture different granularities of driver states, while the complexity of such secondary activities grows with the rising automation and increased driver freedom. Furthermore, a model is rarely deployed under conditions identical to the ones in the training set, as sensor placements and types vary from vehicle to vehicle, constituting a substantial obstacle for real-life deployment of data-driven models. In this work, we present a novel vision-based framework for recognizing secondary driver behaviours based on visual transformers and an additional augmented feature distribution calibration module. This module operates in the latent feature-space enriching and diversifying the training set at feature-level in order to improve generalization to novel data appearances, (e.g., sensor changes) and general feature quality. Our framework consistently leads to better recognition rates, surpassing previous state-of-the-art results of the public Drive&Act benchmark on all granularity levels. Our code is publicly available at https://github.com/KPeng9510/TransDARC.

CVSep 21, 2023Code
Elevating Skeleton-Based Action Recognition with Efficient Multi-Modality Self-Supervision

Yiping Wei, Kunyu Peng, Alina Roitberg et al.

Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most of the existing works are based on skeleton data while using a multi-modality setup. These works overlooked the differences in performance among modalities, which led to the propagation of erroneous knowledge between modalities while only three fundamental modalities, i.e., joints, bones, and motions are used, hence no additional modalities are explored. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM) which alleviates the propagation of erroneous knowledge between low-performance modalities. Then, we further propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework to distill the knowledge from the secondary modalities into the mandatory modalities considering the relationship constrained by anchors, positives, and negatives, named relational cross-modality knowledge distillation. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at https://github.com/desehuileng0o0/IKEM.

CVMar 24, 2023Code
FishDreamer: Towards Fisheye Semantic Completion via Unified Image Outpainting and Segmentation

Hao Shi, Yu Li, Kailun Yang et al.

This paper raises the new task of Fisheye Semantic Completion (FSC), where dense texture, structure, and semantics of a fisheye image are inferred even beyond the sensor field-of-view (FoV). Fisheye cameras have larger FoV than ordinary pinhole cameras, yet its unique special imaging model naturally leads to a blind area at the edge of the image plane. This is suboptimal for safety-critical applications since important perception tasks, such as semantic segmentation, become very challenging within the blind zone. Previous works considered the out-FoV outpainting and in-FoV segmentation separately. However, we observe that these two tasks are actually closely coupled. To jointly estimate the tightly intertwined complete fisheye image and scene semantics, we introduce the new FishDreamer which relies on successful ViTs enhanced with a novel Polar-aware Cross Attention module (PCA) to leverage dense context and guide semantically-consistent content generation while considering different polar distributions. In addition to the contribution of the novel task and architecture, we also derive Cityscapes-BF and KITTI360-BF datasets to facilitate training and evaluation of this new track. Our experiments demonstrate that the proposed FishDreamer outperforms methods solving each task in isolation and surpasses alternative approaches on the Fisheye Semantic Completion. Code and datasets are publicly available at https://github.com/MasterHow/FishDreamer.

CVMar 19, 2022Code
Towards Robust Semantic Segmentation of Accident Scenes via Multi-Source Mixed Sampling and Meta-Learning

Xinyu Luo, Jiaming Zhang, Kailun Yang et al.

Autonomous vehicles utilize urban scene segmentation to understand the real world like a human and react accordingly. Semantic segmentation of normal scenes has experienced a remarkable rise in accuracy on conventional benchmarks. However, a significant portion of real-life accidents features abnormal scenes, such as those with object deformations, overturns, and unexpected traffic behaviors. Since even small mis-segmentation of driving scenes can lead to serious threats to human lives, the robustness of such models in accident scenarios is an extremely important factor in ensuring safety of intelligent transportation systems. In this paper, we propose a Multi-source Meta-learning Unsupervised Domain Adaptation (MMUDA) framework, to improve the generalization of segmentation transformers to extreme accident scenes. In MMUDA, we make use of Multi-Domain Mixed Sampling to augment the images of multiple-source domains (normal scenes) with the target data appearances (abnormal scenes). To train our model, we intertwine and study a meta-learning strategy in the multi-source setting for robustifying the segmentation results. We further enhance the segmentation backbone (SegFormer) with a HybridASPP decoder design, featuring large window attention spatial pyramid pooling and strip pooling, to efficiently aggregate long-range contextual dependencies. Our approach achieves a mIoU score of 46.97% on the DADA-seg benchmark, surpassing the previous state-of-the-art model by more than 7.50%. Code will be made publicly available at https://github.com/xinyu-laura/MMUDA.

CVJul 13, 2022Code
Trans4Map: Revisiting Holistic Bird's-Eye-View Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers

Chang Chen, Jiaming Zhang, Kailun Yang et al.

Humans have an innate ability to sense their surroundings, as they can extract the spatial representation from the egocentric perception and form an allocentric semantic map via spatial transformation and memory updating. However, endowing mobile agents with such a spatial sensing ability is still a challenge, due to two difficulties: (1) the previous convolutional models are limited by the local receptive field, thus, struggling to capture holistic long-range dependencies during observation; (2) the excessive computational budgets required for success, often lead to a separation of the mapping pipeline into stages, resulting the entire mapping process inefficient. To address these issues, we propose an end-to-end one-stage Transformer-based framework for Mapping, termed Trans4Map. Our egocentric-to-allocentric mapping process includes three steps: (1) the efficient transformer extracts the contextual features from a batch of egocentric images; (2) the proposed Bidirectional Allocentric Memory (BAM) module projects egocentric features into the allocentric memory; (3) the map decoder parses the accumulated memory and predicts the top-down semantic segmentation map. In contrast, Trans4Map achieves state-of-the-art results, reducing 67.2% parameters, yet gaining a +3.25% mIoU and a +4.09% mBF1 improvements on the Matterport3D dataset. Code at: https://github.com/jamycheung/Trans4Map.

CVAug 23, 2023Code
Towards Privacy-Supporting Fall Detection via Deep Unsupervised RGB2Depth Adaptation

Hejun Xiao, Kunyu Peng, Xiangsheng Huang et al.

Fall detection is a vital task in health monitoring, as it allows the system to trigger an alert and therefore enabling faster interventions when a person experiences a fall. Although most previous approaches rely on standard RGB video data, such detailed appearance-aware monitoring poses significant privacy concerns. Depth sensors, on the other hand, are better at preserving privacy as they merely capture the distance of objects from the sensor or camera, omitting color and texture information. In this paper, we introduce a privacy-supporting solution that makes the RGB-trained model applicable in depth domain and utilizes depth data at test time for fall detection. To achieve cross-modal fall detection, we present an unsupervised RGB to Depth (RGB2Depth) cross-modal domain adaptation approach that leverages labelled RGB data and unlabelled depth data during training. Our proposed pipeline incorporates an intermediate domain module for feature bridging, modality adversarial loss for modality discrimination, classification loss for pseudo-labeled depth data and labeled source data, triplet loss that considers both source and target domains, and a novel adaptive loss weight adjustment method for improved coordination among various losses. Our approach achieves state-of-the-art results in the unsupervised RGB2Depth domain adaptation task for fall detection. Code is available at https://github.com/1015206533/privacy_supporting_fall_detection.

CVJul 28, 2023Code
OAFuser: Towards Omni-Aperture Fusion for Light Field Semantic Segmentation

Fei Teng, Jiaming Zhang, Kunyu Peng et al.

Light field cameras are capable of capturing intricate angular and spatial details. This allows for acquiring complex light patterns and details from multiple angles, significantly enhancing the precision of image semantic segmentation. However, two significant issues arise: (1) The extensive angular information of light field cameras contains a large amount of redundant data, which is overwhelming for the limited hardware resources of intelligent agents. (2) A relative displacement difference exists in the data collected by different micro-lenses. To address these issues, we propose an Omni-Aperture Fusion model (OAFuser) that leverages dense context from the central view and extracts the angular information from sub-aperture images to generate semantically consistent results. To simultaneously streamline the redundant information from the light field cameras and avoid feature loss during network propagation, we present a simple yet very effective Sub-Aperture Fusion Module (SAFM). This module efficiently embeds sub-aperture images in angular features, allowing the network to process each sub-aperture image with a minimal computational demand of only (around 1GFlops). Furthermore, to address the mismatched spatial information across viewpoints, we present a Center Angular Rectification Module (CARM) to realize feature resorting and prevent feature occlusion caused by misalignment. The proposed OAFuser achieves state-of-the-art performance on four UrbanLF datasets in terms of all evaluation metrics and sets a new record of 84.93% in mIoU on the UrbanLF-Real Extended dataset, with a gain of +3.69%. The source code for OAFuser is available at https://github.com/FeiBryantkit/OAFuser.

LGSep 26, 2024Code
Advancing Open-Set Domain Generalization Using Evidential Bi-Level Hardest Domain Scheduler

Kunyu Peng, Di Wen, Kailun Yang et al.

In Open-Set Domain Generalization (OSDG), the model is exposed to both new variations of data appearance (domains) and open-set conditions, where both known and novel categories are present at test time. The challenges of this task arise from the dual need to generalize across diverse domains and accurately quantify category novelty, which is critical for applications in dynamic environments. Recently, meta-learning techniques have demonstrated superior results in OSDG, effectively orchestrating the meta-train and -test tasks by employing varied random categories and predefined domain partition strategies. These approaches prioritize a well-designed training schedule over traditional methods that focus primarily on data augmentation and the enhancement of discriminative feature learning. The prevailing meta-learning models in OSDG typically utilize a predefined sequential domain scheduler to structure data partitions. However, a crucial aspect that remains inadequately explored is the influence brought by strategies of domain schedulers during training. In this paper, we observe that an adaptive domain scheduler benefits more in OSDG compared with prefixed sequential and random domain schedulers. We propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve an adaptive domain scheduler. This method strategically sequences domains by assessing their reliabilities in utilizing a follower network, trained with confidence scores learned in an evidential manner, regularized by max rebiasing discrepancy, and optimized in a bi-level manner. The results show that our method substantially improves OSDG performance and achieves more discriminative embeddings for both the seen and unseen categories. The source code is publicly available at https://github.com/KPeng9510/EBiL-HaDS.

CVJul 2, 2024Code
Referring Atomic Video Action Recognition

Kunyu Peng, Jia Fu, Kailun Yang et al.

We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion which amplify the most relevant information across these streams and consistently surpasses standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR.

CVMar 2, 2023
Delivering Arbitrary-Modal Semantic Segmentation

Jiaming Zhang, Ruiping Liu, Hao Shi et al.

Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, we provide this dataset in four severe weather conditions as well as five sensor failure cases to exploit modal complementarity and resolve partial outages. To make this possible, we present the arbitrary cross-modal segmentation model CMNeXt. It encompasses a Self-Query Hub (SQ-Hub) designed to extract effective information from any modality for subsequent fusion with the RGB representation and adds only negligible amounts of parameters (~0.01M) per additional modality. On top, to efficiently and flexibly harvest discriminative cues from the auxiliary modalities, we introduce the simple Parallel Pooling Mixer (PPX). With extensive experiments on a total of six benchmarks, our CMNeXt achieves state-of-the-art performance on the DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS datasets, allowing to scale from 1 to 81 modalities. On the freshly collected DeLiVER, the quad-modal CMNeXt reaches up to 66.30% in mIoU with a +9.10% gain as compared to the mono-modal baseline. The DeLiVER dataset and our code are at: https://jamycheung.github.io/DELIVER.html.

CVMar 17, 2022
MatchFormer: Interleaving Attention in Transformers for Feature Matching

Qing Wang, Jiaming Zhang, Kailun Yang et al.

Local feature matching is a computationally intensive task at the subpixel level. While detector-based methods coupled with feature descriptors struggle in low-texture scenes, CNN-based methods with a sequential extract-to-match pipeline, fail to make use of the matching capacity of the encoder and tend to overburden the decoder for matching. In contrast, we propose a novel hierarchical extract-and-match transformer, termed as MatchFormer. Inside each stage of the hierarchical encoder, we interleave self-attention for feature extraction and cross-attention for feature matching, yielding a human-intuitive extract-and-match scheme. Such a match-aware encoder releases the overloaded decoder and makes the model highly efficient. Further, combining self- and cross-attention on multi-scale features in a hierarchical architecture improves matching robustness, particularly in low-texture indoor scenes or with less outdoor training data. Thanks to such a strategy, MatchFormer is a multi-win solution in efficiency, robustness, and precision. Compared to the previous best method in indoor pose estimation, our lite MatchFormer has only 45% GFLOPs, yet achieves a +1.3% precision gain and a 41% running speed boost. The large MatchFormer reaches state-of-the-art on four different benchmarks, including indoor pose estimation (ScanNet), outdoor pose estimation (MegaDepth), homography estimation and image matching (HPatch), and visual localization (InLoc).

CVJul 2, 2024Code
Occlusion-Aware Seamless Segmentation

Yihong Cao, Jiaming Zhang, Hao Shi et al.

Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can deepen the understanding of the scene, and domain adaptation can transfer across viewing domains. In this work, we introduce a novel task, Occlusion-Aware Seamless Segmentation (OASS), which simultaneously tackles all these three challenges. For benchmarking OASS, we establish a new human-annotated dataset for Blending Panoramic Amodal Seamless Segmentation, i.e., BlendPASS. Besides, we propose the first solution UnmaskFormer, aiming at unmasking the narrow FoV, occlusions, and domain gaps all at once. Specifically, UnmaskFormer includes the crucial designs of Unmasking Attention (UA) and Amodal-oriented Mix (AoMix). Our method achieves state-of-the-art performance on the BlendPASS dataset, reaching a remarkable mAPQ of 26.58% and mIoU of 43.66%. On public panoramic semantic segmentation datasets, i.e., SynPASS and DensePASS, our method outperforms previous methods and obtains 45.34% and 48.08% in mIoU, respectively. The fresh BlendPASS dataset and our source code are available at https://github.com/yihong-97/OASS.

CVApr 12Code
IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly

Di Wen, Zeyun Zhong, David Schneider et al.

We introduce IMPACT, a synchronized five-view RGB-D dataset for deployment-oriented industrial procedural understanding, built around real assembly and disassembly of a commercial angle grinder with professional-grade tools. To our knowledge, IMPACT is the first real industrial assembly benchmark that jointly provides synchronized ego-exo RGB-D capture, decoupled bimanual annotation, compliance-aware state tracking, and explicit anomaly--recovery supervision within a single real industrial workflow. It comprises 112 trials from 13 participants totaling 39.5 hours, with multi-route execution governed by a partial-order prerequisite graph, a six-category anomaly taxonomy, and operator cognitive load measured via NASA-TLX. The annotation hierarchy links hand-specific atomic actions to coarse procedural steps, component assembly states, and per-hand compliance phases, with synchronized null spans across views to decouple perceptual limitations from algorithmic failure. Systematic baselines reveal fundamental limitations that remain invisible to single-task benchmarks, particularly under realistic deployment conditions that involve incomplete observations, flexible execution paths, and corrective behavior. The full dataset, annotations, and evaluation code are available at https://github.com/Kratos-Wen/IMPACT.

LGApr 11Code
Towards Multi-Source Domain Generalization for Sleep Staging with Noisy Labels

Kening Wang, Di Wen, Yufan Chen et al.

Automatic sleep staging is a multimodal learning problem involving heterogeneous physiological signals such as EEG and EOG, which often suffer from domain shifts across institutions, devices, and populations. In practice, these data are also affected by noisy annotations, yet label-noise-robust multi-source domain generalization remains underexplored. We present the first benchmark for Noisy Labels in Multi-Source Domain-Generalized Sleep Staging (NL-DGSS) and show that existing noisy-label learning methods degrade substantially when domain shifts and label noise coexist. To address this challenge, we propose FF-TRUST, a domain-invariant multimodal sleep staging framework with Joint Time-Frequency Early Learning Regularization (JTF-ELR). By jointly exploiting temporal and spectral consistency together with confidence-diversity regularization, FF-TRUST improves robustness under noisy supervision. Experiments on five public datasets demonstrate consistent state-of-the-art performance under diverse symmetric and asymmetric noise settings. The benchmark and code will be made publicly available at https://github.com/KNWang970918/FF-TRUST.git.

CVJul 13, 2022
Multi-modal Depression Estimation based on Sub-attentional Fusion

Ping-Cheng Wei, Kunyu Peng, Alina Roitberg et al.

Failure to timely diagnose and effectively treat depression leads to over 280 million people suffering from this psychological disorder worldwide. The information cues of depression can be harvested from diverse heterogeneous resources, e.g., audio, visual, and textual data, raising demand for new effective multi-modal fusion approaches for automatic estimation. In this work, we tackle the task of automatically identifying depression from multi-modal data and introduce a sub-attention mechanism for linking heterogeneous information while leveraging Convolutional Bidirectional LSTM as our backbone. To validate this idea, we conduct extensive experiments on the public DAIC-WOZ benchmark for depression assessment featuring different evaluation modes and taking gender-specific biases into account. The proposed model yields effective results with 0.89 precision and 0.70 F1-score in detecting major depression and 4.92 MAE in estimating the severity. Our attention-based fusion module consistently outperforms conventional late fusion approaches and achieves competitive performance compared to the previously published depression estimation frameworks, while learning to diagnose the disorder end-to-end and relying on far fewer preprocessing steps.

CVSep 21, 2023Code
Exploring Self-supervised Skeleton-based Action Recognition in Occluded Environments

Yifei Chen, Kunyu Peng, Alina Roitberg et al.

To integrate action recognition into autonomous robotic systems, it is essential to address challenges such as person occlusions-a common yet often overlooked scenario in existing self-supervised skeleton-based action recognition methods. In this work, we propose IosPSTL, a simple and effective self-supervised learning framework designed to handle occlusions. IosPSTL combines a cluster-agnostic KNN imputer with an Occluded Partial Spatio-Temporal Learning (OPSTL) strategy. First, we pre-train the model on occluded skeleton sequences. Then, we introduce a cluster-agnostic KNN imputer that performs semantic grouping using k-means clustering on sequence embeddings. It imputes missing skeleton data by applying K-Nearest Neighbors in the latent space, leveraging nearby sample representations to restore occluded joints. This imputation generates more complete skeleton sequences, which significantly benefits downstream self-supervised models. To further enhance learning, the OPSTL module incorporates Adaptive Spatial Masking (ASM) to make better use of intact, high-quality skeleton sequences during training. Our method achieves state-of-the-art performance on the occluded versions of the NTU-60 and NTU-120 datasets, demonstrating its robustness and effectiveness under challenging conditions. Code is available at https://github.com/cyfml/OPSTL.

CVApr 22Code
IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

Weitong Kong, Di Wen, Kunyu Peng et al.

Correcting errors in long-video understanding is disproportionately costly: existing multimodal pipelines produce opaque, end-to-end outputs that expose no intermediate state for inspection, forcing annotators to revisit raw video and reconstruct temporal logic from scratch. The core bottleneck is not generation quality alone, but the absence of a supervisory interface through which human effort can be proportional to the scope of each error. We present IMPACT-CYCLE, a supervisory multi-agent system that reformulates long-video understanding as iterative claim-level maintenance of a shared semantic memory -- a structured, versioned state encoding typed claims, a claim dependency graph, and a provenance log. Role-specialized agents operating under explicit authority contracts decompose verification into local object-relation correctness, cross-temporal consistency, and global semantic coherence, with corrections confined to structurally dependent claims. When automated evidence is insufficient, the system escalates to human arbitration as the supervisory authority with final override rights; dependency-closure re-verification then ensures correction cost remains proportional to error scope. Experiments on VidOR show substantially improved downstream reasoning (VQA: 0.71 to 0.79) and a 4.8x reduction in human arbitration cost, with workload significantly lower than manual annotation. Code will be released at https://github.com/MKong17/IMPACT_CYCLE.

ROJul 15, 2023
Tightly-Coupled LiDAR-Visual SLAM Based on Geometric Features for Mobile Agents

Ke Cao, Ruiping Liu, Ze Wang et al.

The mobile robot relies on SLAM (Simultaneous Localization and Mapping) to provide autonomous navigation and task execution in complex and unknown environments. However, it is hard to develop a dedicated algorithm for mobile robots due to dynamic and challenging situations, such as poor lighting conditions and motion blur. To tackle this issue, we propose a tightly-coupled LiDAR-visual SLAM based on geometric features, which includes two sub-systems (LiDAR and monocular visual SLAM) and a fusion framework. The fusion framework associates the depth and semantics of the multi-modal geometric features to complement the visual line landmarks and to add direction optimization in Bundle Adjustment (BA). This further constrains visual odometry. On the other hand, the entire line segment detected by the visual subsystem overcomes the limitation of the LiDAR subsystem, which can only perform the local calculation for geometric features. It adjusts the direction of linear feature points and filters out outliers, leading to a higher accurate odometry system. Finally, we employ a module to detect the subsystem's operation, providing the LiDAR subsystem's output as a complementary trajectory to our system while visual subsystem tracking fails. The evaluation results on the public dataset M2DGR, gathered from ground robots across various indoor and outdoor scenarios, show that our system achieves more accurate and robust pose estimation compared to current state-of-the-art multi-modal methods.

CVMar 2, 2023Code
Towards Activated Muscle Group Estimation in the Wild

Kunyu Peng, David Schneider, Alina Roitberg et al.

In this paper, we tackle the new task of video-based Activated Muscle Group Estimation (AMGE) aiming at identifying active muscle regions during physical activity in the wild. To this intent, we provide the MuscleMap dataset featuring >15K video clips with 135 different activities and 20 labeled muscle groups. This dataset opens the vistas to multiple video-based applications in sports and rehabilitation medicine under flexible environment constraints. The proposed MuscleMap dataset is constructed with YouTube videos, specifically targeting High-Intensity Interval Training (HIIT) physical exercise in the wild. To make the AMGE model applicable in real-life situations, it is crucial to ensure that the model can generalize well to numerous types of physical activities not present during training and involving new combinations of activated muscles. To achieve this, our benchmark also covers an evaluation setting where the model is exposed to activity types excluded from the training set. Our experiments reveal that the generalizability of existing architectures adapted for the AMGE task remains a challenge. Therefore, we also propose a new approach, TransM3E, which employs a multi-modality feature fusion mechanism between both the video transformer model and the skeleton-based graph convolution model with novel cross-modal knowledge distillation executed on multi-classification tokens. The proposed method surpasses all popular video classification models when dealing with both, previously seen and new types of physical activities. The database and code can be found at https://github.com/KPeng9510/MuscleMap.

CVNov 1, 2025Code
OmniTrack++: Omnidirectional Multi-Object Tracking by Learning Large-FoV Trajectory Feedback

Kai Luo, Hao Shi, Kunyu Peng et al.

This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize unsatisfactorily under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.

CVApr 10, 2022
A Comparative Analysis of Decision-Level Fusion for Multimodal Driver Behaviour Understanding

Alina Roitberg, Kunyu Peng, Zdravko Marinov et al.

Visual recognition inside the vehicle cabin leads to safer driving and more intuitive human-vehicle interaction but such systems face substantial obstacles as they need to capture different granularities of driver behaviour while dealing with highly limited body visibility and changing illumination. Multimodal recognition mitigates a number of such issues: prediction outcomes of different sensors complement each other due to different modality-specific strengths and weaknesses. While several late fusion methods have been considered in previously published frameworks, they constantly feature different architecture backbones and building blocks making it very hard to isolate the role of the chosen late fusion strategy itself. This paper presents an empirical evaluation of different paradigms for decision-level late fusion in video-based driver observation. We compare seven different mechanisms for joining the results of single-modal classifiers which have been both popular, (e.g. score averaging) and not yet considered (e.g. rank-level fusion) in the context of driver observation evaluating them based on different criteria and benchmark settings. This is the first systematic study of strategies for fusing outcomes of multimodal predictors inside the vehicles, conducted with the goal to provide guidance for fusion scheme selection.

CVMar 21, 2023
360BEV: Panoramic Semantic Mapping for Indoor Bird's-Eye View

Zhifeng Teng, Jiaming Zhang, Kailun Yang et al.

Seeing only a tiny part of the whole is not knowing the full circumstance. Bird's-eye-view (BEV) perception, a process of obtaining allocentric maps from egocentric views, is restricted when using a narrow Field of View (FoV) alone. In this work, mapping from 360° panoramas to BEV semantics, the 360BEV task, is established for the first time to achieve holistic representations of indoor scenes in a top-down view. Instead of relying on narrow-FoV image sequences, a panoramic image with depth information is sufficient to generate a holistic BEV semantic map. To benchmark 360BEV, we present two indoor datasets, 360BEV-Matterport and 360BEV-Stanford, both of which include egocentric panoramic images and semantic segmentation labels, as well as allocentric semantic maps. Besides delving deep into different mapping paradigms, we propose a dedicated solution for panoramic semantic mapping, namely 360Mapper. Through extensive experiments, our methods achieve 44.32% and 45.78% in mIoU on both datasets respectively, surpassing previous counterparts with gains of +7.60% and +9.70% in mIoU. Code and datasets are available at the project page: https://jamycheung.github.io/360BEV.html.

CVNov 10, 2023Code
Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments

Calvin Tanama, Kunyu Peng, Zdravko Marinov et al.

Deep learning-based models are at the forefront of most driver observation benchmarks due to their remarkable accuracies but are also associated with high computational costs. This is challenging, as resources are often limited in real-world driving scenarios. This paper introduces a lightweight framework for resource-efficient driver activity recognition. The framework enhances 3D MobileNet, a neural architecture optimized for speed in video classification, by incorporating knowledge distillation and model quantization to balance model accuracy and computational efficiency. Knowledge distillation helps maintain accuracy while reducing the model size by leveraging soft labels from a larger teacher model (I3D), instead of relying solely on original ground truth data. Model quantization significantly lowers memory and computation demands by using lower precision integers for model weights and activations. Extensive testing on a public dataset for in-vehicle monitoring during autonomous driving demonstrates that this new framework achieves a threefold reduction in model size and a 1.4-fold improvement in inference time, compared to an already optimized architecture. The code for this study is available at https://github.com/calvintanama/qd-driver-activity-reco.

CVMar 13Code
InterEdit: Navigating Text-Guided Multi-Human 3D Motion Editing

Yebin Yang, Di Wen, Lei Qi et al.

Text-guided 3D motion editing has seen success in single-person scenarios, but its extension to multi-person settings is less explored due to limited paired data and the complexity of inter-person interactions. We introduce the task of multi-person 3D motion editing, where a target motion is generated from a source and a text instruction. To support this, we propose InterEdit3D, a new dataset with manual two-person motion change annotations, and a Text-guided Multi-human Motion Editing (TMME) benchmark. We present InterEdit, a synchronized classifier-free conditional diffusion model for TMME. It introduces Semantic-Aware Plan Token Alignment with learnable tokens to capture high-level interaction cues and an Interaction-Aware Frequency Token Alignment strategy using DCT and energy pooling to model periodic motion dynamics. Experiments show that InterEdit improves text-to-motion consistency and edit fidelity, achieving state-of-the-art TMME performance. The dataset and code will be released at https://github.com/YNG916/InterEdit.

CVApr 3, 2022
Indoor Navigation Assistance for Visually Impaired People via Dynamic SLAM and Panoptic Segmentation with an RGB-D Sensor

Wenyan Ou, Jiaming Zhang, Kunyu Peng et al.

Exploring an unfamiliar indoor environment and avoiding obstacles is challenging for visually impaired people. Currently, several approaches achieve the avoidance of static obstacles based on the mapping of indoor scenes. To solve the issue of distinguishing dynamic obstacles, we propose an assistive system with an RGB-D sensor to detect dynamic information of a scene. Once the system captures an image, panoptic segmentation is performed to obtain the prior dynamic object information. With sparse feature points extracted from images and the depth information, poses of the user can be estimated. After the ego-motion estimation, the dynamic object can be identified and tracked. Then, poses and speed of tracked dynamic objects can be estimated, which are passed to the users through acoustic feedback.

ROFeb 28, 2023
MateRobot: Material Recognition in Wearable Robotics for People with Visual Impairments

Junwei Zheng, Jiaming Zhang, Kailun Yang et al.

People with Visual Impairments (PVI) typically recognize objects through haptic perception. Knowing objects and materials before touching is desired by the target users but under-explored in the field of human-centered robotics. To fill this gap, in this work, a wearable vision-based robotic system, MateRobot, is established for PVI to recognize materials and object categories beforehand. To address the computational constraints of mobile platforms, we propose a lightweight yet accurate model MateViT to perform pixel-wise semantic segmentation, simultaneously recognizing both objects and materials. Our methods achieve respective 40.2% and 51.1% of mIoU on COCOStuff-10K and DMS datasets, surpassing the previous method with +5.7% and +7.0% gains. Moreover, on the field test with participants, our wearable system reaches a score of 28 in the NASA-Task Load Index, indicating low cognitive demands and ease of use. Our MateRobot demonstrates the feasibility of recognizing material property through visual cues and offers a promising step towards improving the functionality of wearable robots for PVI. The source code has been made publicly available at https://junweizheng93.github.io/publications/MATERobot/MATERobot.html.

CVMar 29Code
RHO: Robust Holistic OSM-Based Metric Cross-View Geo-Localization

Junwei Zheng, Ruize Dai, Ruiping Liu et al.

Metric Cross-View Geo-Localization (MCVGL) aims to estimate the 3-DoF camera pose (position and heading) by matching ground and satellite images. In this work, instead of pinhole and satellite images, we study robust MCVGL using holistic panoramas and OpenStreetMap (OSM). To this end, we establish a large-scale MCVGL benchmark dataset, CV-RHO, with over 2.7M images under different weather and lighting conditions, as well as sensor noise. Furthermore, we propose a model termed RHO with a two-branch Pin-Pan architecture for accurate visual localization. A Split-Undistort-Merge (SUM) module is introduced to address the panoramic distortion, and a Position-Orientation Fusion (POF) mechanism is designed to enhance the localization accuracy. Extensive experiments prove the value of our CV-RHO dataset and the effectiveness of the RHO model, with a significant performance gain up to 20% compared with the state-of-the-art baselines. Project page: https://github.com/InSAI-Lab/RHO.

CVApr 10, 2022
Is my Driver Observation Model Overconfident? Input-guided Calibration Networks for Reliable and Interpretable Confidence Estimates

Alina Roitberg, Kunyu Peng, David Schneider et al.

Driver observation models are rarely deployed under perfect conditions. In practice, illumination, camera placement and type differ from the ones present during training and unforeseen behaviours may occur at any time. While observing the human behind the steering wheel leads to more intuitive human-vehicle-interaction and safer driving, it requires recognition algorithms which do not only predict the correct driver state, but also determine their prediction quality through realistic and interpretable confidence measures. Reliable uncertainty estimates are crucial for building trust and are a serious obstacle for deploying activity recognition networks in real driving systems. In this work, we for the first time examine how well the confidence values of modern driver observation models indeed match the probability of the correct outcome and show that raw neural network-based approaches tend to significantly overestimate their prediction quality. To correct this misalignment between the confidence values and the actual uncertainty, we consider two strategies. First, we enhance two activity recognition models often used for driver observation with temperature scaling-an off-the-shelf method for confidence calibration in image classification. Then, we introduce Calibrated Action Recognition with Input Guidance (CARING)-a novel approach leveraging an additional neural network to learn scaling the confidences depending on the video representation. Extensive experiments on the Drive&Act dataset demonstrate that both strategies drastically improve the quality of model confidences, while our CARING model out-performs both, the original architectures and their temperature scaling enhancement, leading to best uncertainty estimates.

CVMar 16Code
Seeing Beyond: Extrapolative Domain Adaptive Panoramic Segmentation

Yuanfan Zheng, Kunyu Peng, Xu Zheng et al.

Cross-domain panoramic semantic segmentation has attracted growing interest as it enables comprehensive 360° scene understanding for real-world applications. However, it remains particularly challenging due to severe geometric Field of View (FoV) distortions and inconsistent open-set semantics across domains. In this work, we formulate an open-set domain adaptation setting, and propose Extrapolative Domain Adaptive Panoramic Segmentation (EDA-PSeg) framework that trains on local perspective views and tests on full 360° panoramic images, explicitly tackling both geometric FoV shifts across domains and semantic uncertainty arising from previously unseen classes. To this end, we propose the Euler-Margin Attention (EMA), which introduces an angular margin to enhance viewpoint-invariant semantic representation, while performing amplitude and phase modulation to improve generalization toward unseen classes. Additionally, we design the Graph Matching Adapter (GMA), which builds high-order graph relations to align shared semantics across FoV shifts while effectively separating novel categories through structural adaptation. Extensive experiments on four benchmark datasets under camera-shift, weather-condition, and open-set scenarios demonstrate that EDA-PSeg achieves state-of-the-art performance, robust generalization to diverse viewing geometries, and resilience under varying environmental conditions. The code is available at https://github.com/zyfone/EDA-PSeg.

CVMar 10Code
$M^2$-Occ: Resilient 3D Semantic Occupancy Prediction for Autonomous Driving with Incomplete Camera Inputs

Kaixin Lin, Kunyu Peng, Di Wen et al.

Semantic occupancy prediction enables dense 3D geometric and semantic understanding for autonomous driving. However, existing camera-based approaches implicitly assume complete surround-view observations, an assumption that rarely holds in real-world deployment due to occlusion, hardware malfunction, or communication failures. We study semantic occupancy prediction under incomplete multi-camera inputs and introduce $M^2$-Occ, a framework designed to preserve geometric structure and semantic coherence when views are missing. $M^2$-Occ addresses two complementary challenges. First, a Multi-view Masked Reconstruction (MMR) module leverages the spatial overlap among neighboring cameras to recover missing-view representations directly in the feature space. Second, a Feature Memory Module (FMM) introduces a learnable memory bank that stores class-level semantic prototypes. By retrieving and integrating these global priors, the FMM refines ambiguous voxel features, ensuring semantic consistency even when observational evidence is incomplete. We introduce a systematic missing-view evaluation protocol on the nuScenes-based SurroundOcc benchmark, encompassing both deterministic single-view failures and stochastic multi-view dropout scenarios. Under the safety-critical missing back-view setting, $M^2$-Occ improves the IoU by 4.93%. As the number of missing cameras increases, the robustness gap further widens; for instance, under the setting with five missing views, our method boosts the IoU by 5.01%. These gains are achieved without compromising full-view performance. The source code will be publicly released at https://github.com/qixi7up/M2-Occ.

CVJul 2, 2024
Open Panoramic Segmentation

Junwei Zheng, Ruiping Liu, Yufan Chen et al.

Panoramic images, capturing a 360° field of view (FoV), encompass omnidirectional spatial information crucial for scene understanding. However, it is not only costly to obtain training-sufficient dense-annotated panoramas but also application-restricted when training models in a close-vocabulary setting. To tackle this problem, in this work, we define a new task termed Open Panoramic Segmentation (OPS), where models are trained with FoV-restricted pinhole images in the source domain in an open-vocabulary setting while evaluated with FoV-open panoramic images in the target domain, enabling the zero-shot open panoramic semantic segmentation ability of models. Moreover, we propose a model named OOOPS with a Deformable Adapter Network (DAN), which significantly improves zero-shot panoramic semantic segmentation performance. To further enhance the distortion-aware modeling ability from the pinhole source domain, we propose a novel data augmentation method called Random Equirectangular Projection (RERP) which is specifically designed to address object deformations in advance. Surpassing other state-of-the-art open-vocabulary semantic segmentation approaches, a remarkable performance boost on three panoramic datasets, WildPASS, Stanford2D3D, and Matterport3D, proves the effectiveness of our proposed OOOPS model with RERP on the OPS task, especially +2.2% on outdoor WildPASS and +2.4% mIoU on indoor Stanford2D3D. The source code is publicly available at https://junweizheng93.github.io/publications/OPS/OPS.html.

CVMay 18Code
Seeing Together:Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models

Kunyu Peng, Zhikun Zhou, Kailun Yang et al.

Multimodal Large Language Models (MLLMs) have made substantial progress in egocentric video understanding, but their ability to reason cooperatively from multiple embodied viewpoints remains largely unexplored. We study this problem through multi-robot cooperative dynamic spatial reasoning, where a model must answer spatial, temporal, visibility, and coordination questions by integrating synchronized egocentric videos from a team of moving robots. To support this setting, we introduce CoopSR, the first benchmark for this task, together with EgoTeam, a multi-robot egocentric QA dataset. EgoTeam contains 114,227 QA pairs spanning 19 question types, four difficulty tiers, and three team sizes in Habitat and iGibson, along with a real-world test set of around 2,326 QAs collected using two quadruped robots. We further propose SP-CoR (Spectral and Physics-Informed Cooperative Reasoner), an MLLM framework for fine-grained cooperative spatial reasoning. SP-CoR combines dynamics-aware multi-robot frame sampling, spectral- and physics-guided view fusion, and physics-aligned prompt distillation, enabling the model to benefit from privileged robot-pose supervision during training while requiring only egocentric videos at test time. Across 22 MLLM baselines, SP-CoR consistently improves cooperative reasoning, outperforming the strongest fine-tuned baseline by +3.87% on Habitat and +7.12% on iGibson. It also shows stronger generalization to unseen team sizes and real-world robot tests. Code can be found at https://github.com/KPeng9510/seeing-together.git.

CVApr 12
Rethinking Video Human-Object Interaction: Set Prediction over Time for Unified Detection and Anticipation

Yuanhao Luo, Di Wen, Kunyu Peng et al.

Video-based human-object interaction (HOI) understanding requires both detecting ongoing interactions and anticipating their future evolution. However, existing methods usually treat anticipation as a downstream forecasting task built on externally constructed human-object pairs, limiting joint reasoning between detection and prediction. In addition, sparse keyframe annotations in current benchmarks can temporally misalign nominal future labels from actual future dynamics, reducing the reliability of anticipation evaluation. To address these issues, we introduce DETAnt-HOI, a temporally corrected benchmark derived from VidHOI and Action Genome for more faithful multi-horizon evaluation, and HOI-DA, a pair-centric framework that jointly performs subject-object localization, present HOI detection, and future anticipation by modeling future interactions as residual transitions from current pair states. Experiments show consistent improvements in both detection and anticipation, with larger gains at longer horizons. Our results highlight that anticipation is most effective when learned jointly with detection as a structural constraint on pair-level video representation learning. Benchmark and code will be publicly available.

CVJul 22, 2024
Probing Fine-Grained Action Understanding and Cross-View Generalization of Foundation Models

Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg

Foundation models (FMs) are large neural networks trained on broad datasets, excelling in downstream tasks with minimal fine-tuning. Human activity recognition in video has advanced with FMs, driven by competition among different architectures. However, high accuracies on standard benchmarks can draw an artificially rosy picture, as they often overlook real-world factors like changing camera perspectives. Popular benchmarks, mostly from YouTube or movies, offer diverse views but only coarse actions, which are insufficient for use-cases needing fine-grained, domain-specific actions. Domain-specific datasets (e.g., for industrial assembly) typically use data from limited static perspectives. This paper empirically evaluates how perspective changes affect different FMs in fine-grained human activity recognition. We compare multiple backbone architectures and design choices, including image- and video- based models, and various strategies for temporal information fusion, including commonly used score averaging and more novel attention-based temporal aggregation mechanisms. This is the first systematic study of different foundation models and specific design choices for human activity recognition from unknown views, conducted with the goal to provide guidance for backbone- and temporal- fusion scheme selection. Code and models will be made publicly available to the community.

CVMar 1
VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning

Mingkang Dong, Hongyi Cai, Jie Li et al.

The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the LLaVA-665K dataset selected by VisNec achieves 100.2% of full-data performance. On the smaller Vision-Flan-186K dataset, our selection not only further reduces data size but also surpasses full-data training by 15.8%. These results demonstrate that measuring and leveraging visual necessity provides an effective solution for both efficient and robust multimodal instruction tuning. Codes and selected subsets will be released upon acceptance.

CVMay 3Code
IMPACT-HOI: Supervisory Control for Onset-Anchored Partial HOI Event Construction

Haoshen Zhang, Di Wen, Kunyu Peng et al.

We present IMPACT-HOI, a mixed-initiative framework for annotating egocentric procedural video by constructing structured event graphs for Human-Object Interactions (HOI), motivated by the need for high-quality structured supervision for learning robot manipulation from human demonstration. IMPACT-HOI frames this task as the incremental resolution of a partially specified, onset-anchored event state. A trust-calibrated controller selects among direct queries, human-confirmed suggestions, and conservative completions based on empirical annotator behavior and evidence quality. A risk-bounded execution protocol, utilizing atomic rollback, ensures that human-confirmed decisions are preserved against conflicting automated updates. A user study with 9 participants shows a 13.5% reduction in manual annotation actions, a 46.67% event match rate, and zero confirmed-field violations under the studied protocol. The code will be made publicly available at https://github.com/541741106/IMPACT_HOI.

CVMay 3Code
IMPACT-Scribe: Interactive Temporal Action Segmentation with Boundary Scribbles and Query Planning

Qian Yin, Di Wen, Kunyu Peng et al.

Dense temporal annotation of procedural activity videos is vital for action understanding and embodied intelligence but remains labor-intensive due to reactive tools. Each correction is treated as an isolated edit, limiting reuse of information on annotator uncertainty and model reliability. We introduce IMPACT-Scribe, a correction-driven framework for dense labeling that uses each correction to improve future human-machine collaboration. IMPACT-Scribe combines uncertainty-aware boundary scribble supervision, local proposal modeling, cost-aware query planning, structured propagation, and correction-driven adaptation. Experiments and a human study show that this closed-loop design improves labeling quality per effort, enhances boundary accuracy, and fosters better human-machine interaction over time. The code will be made publicly available at https://github.com/BanzQians/IMPACT_AS.

ROMar 20
Not an Obstacle for Dog, but a Hazard for Human: A Co-Ego Navigation System for Guide Dog Robots

Ruiping Liu, Jingqi Zhang, Junwei Zheng et al.

Guide dogs offer independence to Blind and Low-Vision (BLV) individuals, yet their limited availability leaves the vast majority of BLV users without access. Quadruped robotic guide dogs present a promising alternative, but existing systems rely solely on the robot's ground-level sensors for navigation, overlooking a critical class of hazards: obstacles that are transparent to the robot yet dangerous at human body height, such as bent branches. We term this the viewpoint asymmetry problem and present the first system to explicitly address it. Our Co-Ego system adopts a dual-branch obstacle avoidance framework that integrates the robot-centric ground sensing with the user's elevated egocentric perspective to ensure comprehensive navigation safety. Deployed on a quadruped robot, the system is evaluated in a controlled user study with sighted participants under blindfold across three conditions: unassisted, single-view, and cross-view fusion. Results demonstrate that cross-view fusion significantly reduces collision times and cognitive load, verifying the necessity of viewpoint complementarity for safe robotic guide dog navigation.

CVDec 11, 2023Code
Navigating Open Set Scenarios for Skeleton-based Action Recognition

Kunyu Peng, Cheng Yin, Junwei Zheng et al.

In real-world scenarios, human actions often fall outside the distribution of training data, making it crucial for models to recognize known actions and reject unknown ones. However, using pure skeleton data in such open-set conditions poses challenges due to the lack of visual background cues and the distinct sparse structure of body pose sequences. In this paper, we tackle the unexplored Open-Set Skeleton-based Action Recognition (OS-SAR) task and formalize the benchmark on three skeleton-based datasets. We assess the performance of seven established open-set approaches on our task and identify their limits and critical generalization issues when dealing with skeleton information. To address these challenges, we propose a distance-based cross-modality ensemble method that leverages the cross-modal alignment of skeleton joints, bones, and velocities to achieve superior open-set recognition performance. We refer to the key idea as CrossMax - an approach that utilizes a novel cross-modality mean max discrepancy suppression mechanism to align latent spaces during training and a cross-modality distance-based logits refinement method during testing. CrossMax outperforms existing approaches and consistently yields state-of-the-art results across all datasets and backbones. The benchmark, code, and models will be released at https://github.com/KPeng9510/OS-SAR.

CVFeb 28, 2024Code
EchoTrack: Auditory Referring Multi-Object Tracking for Autonomous Driving

Jiacheng Lin, Jiajun Chen, Kunyu Peng et al.

This paper introduces the task of Auditory Referring Multi-Object Tracking (AR-MOT), which dynamically tracks specific objects in a video sequence based on audio expressions and appears as a challenging problem in autonomous driving. Due to the lack of semantic modeling capacity in audio and video, existing works have mainly focused on text-based multi-object tracking, which often comes at the cost of tracking quality, interaction efficiency, and even the safety of assistance systems, limiting the application of such methods in autonomous driving. In this paper, we delve into the problem of AR-MOT from the perspective of audio-video fusion and audio-video tracking. We put forward EchoTrack, an end-to-end AR-MOT framework with dual-stream vision transformers. The dual streams are intertwined with our Bidirectional Frequency-domain Cross-attention Fusion Module (Bi-FCFM), which bidirectionally fuses audio and video features from both frequency- and spatiotemporal domains. Moreover, we propose the Audio-visual Contrastive Tracking Learning (ACTL) regime to extract homogeneous semantic features between expressions and visual objects by learning homogeneous features between different audio and video objects effectively. Aside from the architectural design, we establish the first set of large-scale AR-MOT benchmarks, including Echo-KITTI, Echo-KITTI+, and Echo-BDD. Extensive experiments on the established benchmarks demonstrate the effectiveness of the proposed EchoTrack and its components. The source code and datasets are available at https://github.com/lab206/EchoTrack.

CVJan 30, 2024Code
Fourier Prompt Tuning for Modality-Incomplete Scene Segmentation

Ruiping Liu, Jiaming Zhang, Kunyu Peng et al.

Integrating information from multiple modalities enhances the robustness of scene perception systems in autonomous vehicles, providing a more comprehensive and reliable sensory framework. However, the modality incompleteness in multi-modal segmentation remains under-explored. In this work, we establish a task called Modality-Incomplete Scene Segmentation (MISS), which encompasses both system-level modality absence and sensor-level modality errors. To avoid the predominant modality reliance in multi-modal fusion, we introduce a Missing-aware Modal Switch (MMS) strategy to proactively manage missing modalities during training. Utilizing bit-level batch-wise sampling enhances the model's performance in both complete and incomplete testing scenarios. Furthermore, we introduce the Fourier Prompt Tuning (FPT) method to incorporate representative spectral information into a limited number of learnable prompts that maintain robustness against all MISS scenarios. Akin to fine-tuning effects but with fewer tunable parameters (1.1%). Extensive experiments prove the efficacy of our proposed approach, showcasing an improvement of 5.84% mIoU over the prior state-of-the-art parameter-efficient methods in modality missing. The source code is publicly available at https://github.com/RuipingL/MISS.

CVMay 18
EgoExoMem: Cross-View Memory Reasoning over Synchronized Egocentric and Exocentric Videos

Ruiping Liu, Junwei Zheng, Yufan Chen et al.

Egocentric memory is widely used in embodied intelligence, but it may be insufficient for comprehensive spatial-temporal reasoning. Inspired by human recall from both field and observer perspectives, we introduce EgoExoMem, the first benchmark for cross-view memory reasoning over synchronized egocentric and exocentric videos. EgoExoMem contains $2.6K$ high-quality MCQs across eight temporal, spatial, and cross-view QA types. To support dual-view retrieval, we propose E$^2$-Select, a training-free frame selection method for synchronized ego-exo videos. It combines relevance-based budget allocation with per-view k-DPP sampling to handle view asymmetry and cross-view temporal consistency. Experiments show that ego and exo views provide complementary memory cues, while existing MLLMs remain far from solving the benchmark: the best model reaches only $55.3\%$. E$^2$-Select achieves state-of-the-art performance of $58.2\%$ over frame-selection and RAG-based memory baselines. Further analysis reveals systematic view-preference conflicts between question framing and answer grounding, underscoring the novelty and challenge of cross-view memory reasoning.

CVMar 15, 2024Code
Skeleton-Based Human Action Recognition with Noisy Labels

Yi Xu, Kunyu Peng, Di Wen et al.

Understanding human actions from body poses is critical for assistive robots sharing space with humans in order to make informed and safe decisions about the next interaction. However, precise temporal localization and annotation of activity sequences is time-consuming and the resulting labels are often noisy. If not effectively addressed, label noise negatively affects the model's training, resulting in lower recognition quality. Despite its importance, addressing label noise for skeleton-based action recognition has been overlooked so far. In this study, we bridge this gap by implementing a framework that augments well-established skeleton-based human action recognition methods with label-denoising strategies from various research areas to serve as the initial benchmark. Observations reveal that these baselines yield only marginal performance when dealing with sparse skeleton data. Consequently, we introduce a novel methodology, NoiseEraSAR, which integrates global sample selection, co-teaching, and Cross-Modal Mixture-of-Experts (CM-MOE) strategies, aimed at mitigating the adverse impacts of label noise. Our proposed approach demonstrates better performance on the established benchmark, setting new state-of-the-art standards. The source code for this study is accessible at https://github.com/xuyizdby/NoiseEraSAR.

LGMay 4, 2024Code
Position: Quo Vadis, Unsupervised Time Series Anomaly Detection?

M. Saquib Sarfraz, Mei-Yen Chen, Lukas Layer et al.

The current state of machine learning scholarship in Timeseries Anomaly Detection (TAD) is plagued by the persistent use of flawed evaluation metrics, inconsistent benchmarking practices, and a lack of proper justification for the choices made in novel deep learning-based model designs. Our paper presents a critical analysis of the status quo in TAD, revealing the misleading track of current research and highlighting problematic methods, and evaluation practices. Our position advocates for a shift in focus from solely pursuing novel model designs to improving benchmarking practices, creating non-trivial datasets, and critically evaluating the utility of complex methods against simpler baselines. Our findings demonstrate the need for rigorous evaluation protocols, the creation of simple baselines, and the revelation that state-of-the-art deep anomaly detection models effectively learn linear mappings. These findings suggest the need for more exploration and development of simple and interpretable TAD methods. The increment of model complexity in the state-of-the-art deep-learning based models unfortunately offers very little improvement. We offer insights and suggestions for the field to move forward. Code: https://github.com/ssarfraz/QuoVadisTAD

ROMar 16, 2025Code
VISO-Grasp: Vision-Language Informed Spatial Object-centric 6-DoF Active View Planning and Grasping in Clutter and Invisibility

Yitian Shi, Di Wen, Guanqi Chen et al.

We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of $87.5\%$ in target-oriented grasping with the fewest grasp attempts outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact

CVJul 12, 2025Code
RoHOI: Robustness Benchmark for Human-Object Interaction Detection

Di Wen, Kunyu Peng, Kailun Yang et al.

Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. However, models trained on clean datasets degrade in real-world conditions due to unforeseen corruptions, leading to inaccurate predictions. To address this, we introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Despite advances, current models struggle with environmental variability, occlusions, and noise. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric. We systematically analyze existing models in the HOI field, revealing significant performance drops under corruptions. To improve robustness, we propose a Semantic-Aware Masking-based Progressive Learning (SAMPL) strategy to guide the model to be optimized based on holistic and partial cues, thus dynamically adjusting the model's optimization to enhance robust feature learning. Extensive experiments show that our approach outperforms state-of-the-art methods, setting a new standard for robust HOI detection. Benchmarks, datasets, and code are available at https://github.com/KratosWen/RoHOI.

CVDec 24, 2024Code
Mitigating Label Noise using Prompt-Based Hyperbolic Meta-Learning in Open-Set Domain Generalization

Kunyu Peng, Di Wen, Sarfraz M. Saquib et al.

Open-Set Domain Generalization (OSDG) is a challenging task requiring models to accurately predict familiar categories while minimizing confidence for unknown categories to effectively reject them in unseen domains. While the OSDG field has seen considerable advancements, the impact of label noise--a common issue in real-world datasets--has been largely overlooked. Label noise can mislead model optimization, thereby exacerbating the challenges of open-set recognition in novel domains. In this study, we take the first step towards addressing Open-Set Domain Generalization under Noisy Labels (OSDG-NL) by constructing dedicated benchmarks derived from widely used OSDG datasets, including PACS and DigitsDG. We evaluate baseline approaches by integrating techniques from both label denoising and OSDG methodologies, highlighting the limitations of existing strategies in handling label noise effectively. To address these limitations, we propose HyProMeta, a novel framework that integrates hyperbolic category prototypes for label noise-aware meta-learning alongside a learnable new-category agnostic prompt designed to enhance generalization to unseen classes. Our extensive experiments demonstrate the superior performance of HyProMeta compared to state-of-the-art methods across the newly established benchmarks. The source code of this work is released at https://github.com/KPeng9510/HyProMeta.