CV · Oct 26, 2022 · Code
SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation
Junliang Chen, Xiaodong Zhao, Cheng Luo et al.
Recent mainstream weakly supervised semantic segmentation (WSSS) approaches are mainly based on the Class Activation Map (CAM) generated by a CNN (Convolutional Neural Network) based image classifier. In this paper, we propose a novel transformer-based framework, named Semantic Guided Activation Transformer (SemFormer), for WSSS. We design a transformer-based Class-Aware AutoEncoder (CAAE) to extract the class embeddings for the input image and learn class semantics for all classes of the dataset. The class embeddings and learned class semantics are then used to guide the generation of activation maps with four losses: class-foreground, class-background, activation suppression, and activation complementation losses. Experimental results show that our SemFormer achieves 74.3% mIoU and surpasses many recent mainstream WSSS approaches by a large margin on the PASCAL VOC 2012 dataset. Code will be available at https://github.com/JLChen-C/SemFormer.
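For readers unfamiliar with the mIoU figure quoted above, the following is a minimal sketch of how mean intersection-over-union can be computed for a single pair of label maps. The function name `mean_iou` and the toy label maps are illustrative assumptions, not taken from the SemFormer code; benchmark implementations typically accumulate counts over the whole dataset and handle ignored pixels, which this sketch does not.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in pred or target.

    pred and target are equally sized 2D lists of integer class ids.
    (Hypothetical helper for illustration only.)
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # A correctly labeled pixel counts once in both sets.
                intersection[p] += 1
                union[p] += 1
            else:
                # A mislabeled pixel enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious)


# Toy 2x3 label maps with two classes (0 = background, 1 = object).
pred = [[0, 0, 1],
        [0, 1, 1]]
target = [[0, 1, 1],
          [0, 1, 1]]
score = mean_iou(pred, target, num_classes=2)
print(f"mIoU = {score:.4f}")
```

In this toy case class 0 has IoU 2/3 and class 1 has IoU 3/4, so the mean is their average.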
CV · Mar 25, 2022
Contrastive learning of Class-agnostic Activation Map for Weakly Supervised Object Localization and Semantic Segmentation
Jinheng Xie, Jianfeng Xiang, Junliang Chen et al.
While the class activation map (CAM) generated by an image classification network has been widely used for weakly supervised object localization (WSOL) and semantic segmentation (WSSS), such classifiers usually focus on discriminative object regions. In this paper, we propose Contrastive learning for Class-agnostic Activation Map (C$^2$AM) generation using only unlabeled image data, without the involvement of image-level supervision. The core idea comes from the observation that i) the semantic information of foreground objects usually differs from that of their backgrounds; ii) foreground objects with similar appearance, or backgrounds with similar color/texture, have similar representations in the feature space. We form the positive and negative pairs based on the above relations and force the network to disentangle foreground and background with a class-agnostic activation map using a novel contrastive loss. As the network is guided to discriminate cross-image foreground-background, the class-agnostic activation maps learned by our approach generate more complete object regions. From C$^2$AM, we extract class-agnostic object bounding boxes for object localization and background cues to refine the CAM generated by a classification network for semantic segmentation. Extensive experiments on the CUB-200-2011, ImageNet-1K, and PASCAL VOC2012 datasets show that both WSOL and WSSS can benefit from the proposed C$^2$AM.
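As background for the CAM mentioned in this abstract, here is a minimal sketch of the standard classifier-based class activation map: a weighted sum of the final feature maps using the classifier weights of one class. It is not the C$^2$AM method. The function name, the nested-list layout, and the ReLU plus max normalization step (a common post-processing choice, assumed here) are illustrative.

```python
def class_activation_map(features, class_weights):
    """Standard CAM for one class.

    features: C x H x W nested lists (final convolutional feature maps).
    class_weights: length-C list (classifier weights for the chosen class).
    Returns an H x W map scaled to [0, 1].
    (Hypothetical helper for illustration only.)
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    # Weighted sum over channels at every spatial location.
    for channel, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    # Assumed post-processing: clip negatives, then divide by the peak value.
    cam = [[max(value, 0.0) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Two 2x2 feature channels and the weights of one class.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 2.0], [0.0, 1.0]],
]
cam = class_activation_map(features, class_weights=[1.0, 0.5])
print(cam)
```

Because the weights come from a classifier trained to tell classes apart, high values tend to concentrate on the most discriminative parts of an object, which is the limitation the abstract refers to.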
CV · Aug 7, 2022
Sample hardness based gradient loss for long-tailed cervical cell detection
Minmin Liu, Xuechen Li, Xiangbo Gao et al.
Because cancer samples are difficult to collect and annotate, cervical cancer datasets usually exhibit a long-tailed data distribution. When training a detector to detect cancer cells in a whole slide image (WSI) captured from a TCT (ThinPrep Cytology Test) specimen, head categories (e.g. normal cells and inflammatory cells) typically have a much larger number of samples than tail categories (e.g. cancer cells). Most existing state-of-the-art long-tailed learning methods in object detection focus on category distribution statistics to solve the problem in the long-tailed scenario without considering the "hardness" of each sample. To address this problem, in this work we propose a Grad-Libra Loss that leverages the gradients to dynamically calibrate the degree of hardness of each sample for different categories, and re-balance the gradients of positive and negative samples. Our loss can thus help the detector put more emphasis on hard samples in both head and tail categories. Extensive experiments on a long-tailed TCT WSI dataset show that mainstream detectors, e.g. RepPoints, FCOS, ATSS, and YOLOF, trained using our proposed Grad-Libra Loss achieve much higher mAP (by 7.8%) than those trained using cross-entropy classification loss.
CV · Jun 16, 2022
Delving into the Scale Variance Problem in Object Detection
Junliang Chen, Xiaodong Zhao, Linlin Shen
Object detection has made substantial progress in the last decade, due to the capability of convolution in extracting local context of objects. However, the scales of objects are diverse and current convolution can only process single-scale input. The capability of traditional convolution with a fixed receptive field in dealing with such a scale variance problem is thus limited. Multi-scale feature representation has been proven to be an effective way to mitigate the scale variance problem. Recent work mainly adopts partial connection with certain scales, or aggregates features from all scales and focuses on the global information across the scales. However, the information across spatial and depth dimensions is ignored. Inspired by this, we propose the multi-scale convolution (MSConv) to handle this problem. Taking into consideration scale, spatial and depth information at the same time, MSConv is able to process multi-scale input more comprehensively. MSConv is effective and computationally efficient, with only a small increase in computational cost. For most single-stage object detectors, replacing the traditional convolutions with MSConvs in the detection head can bring more than 2.5% improvement in AP (on the COCO 2017 dataset), with only a 3% increase in FLOPs. MSConv is also flexible and effective for two-stage object detectors. When extended to mainstream two-stage object detectors, MSConv can bring up to 3.0% improvement in AP. Our best model under single-scale testing achieves 48.9% AP on the COCO 2017 test-dev split, which surpasses many state-of-the-art methods.
CV · Jun 16, 2022
Selective Multi-Scale Learning for Object Detection
Junliang Chen, Weizeng Lu, Linlin Shen
Pyramidal networks are standard methods for multi-scale object detection. Current research on feature pyramid networks usually adopts layer connections to collect features from certain levels of the feature hierarchy, and does not consider the significant differences among them. We propose a better architecture of feature pyramid networks, named selective multi-scale learning (SMSL), to address this issue. SMSL is efficient and general, and can be integrated into both single-stage and two-stage detectors to boost detection performance, with nearly no extra inference cost. RetinaNet combined with SMSL obtains a 1.8% improvement in AP (from 39.1% to 40.9%) on the COCO dataset. When integrated with SMSL, two-stage detectors can get around 1.0% improvement in AP.
CV · Oct 22, 2022
SLAMs: Semantic Learning based Activation Map for Weakly Supervised Semantic Segmentation
Junliang Chen, Xiaodong Zhao, Minmin Liu et al.
Recent mainstream weakly-supervised semantic segmentation (WSSS) approaches mainly rely on image-level classification learning, which has limited representation capacity. In this paper, we propose a novel semantic learning based framework, named SLAMs (Semantic Learning based Activation Map), for WSSS.
CV · May 8, 2024 · Code
A Survey on Occupancy Perception for Autonomous Driving: The Information Fusion Perspective
Huaiyuan Xu, Junliang Chen, Shiyu Meng et al.
3D occupancy perception technology aims to observe and understand dense 3D environments for autonomous vehicles. Owing to its comprehensive perception capability, this technology is emerging as a trend in autonomous driving perception systems, and is attracting significant attention from both industry and academia. Similar to traditional bird's-eye view (BEV) perception, 3D occupancy perception has the nature of multi-source input and the necessity for information fusion. However, the difference is that it captures vertical structures that are ignored by 2D BEV. In this survey, we review the most recent works on 3D occupancy perception, and provide in-depth analyses of methodologies with various input modalities. Specifically, we summarize general network pipelines, highlight information fusion techniques, and discuss effective network training. We evaluate and analyze the occupancy perception performance of the state-of-the-art on the most popular datasets. Furthermore, challenges and future research directions are discussed. We hope this paper will inspire the community and encourage more research work on 3D occupancy perception. A comprehensive list of studies in this survey is publicly available in an active repository that continuously collects the latest work: https://github.com/HuaiyuanXu/3D-Occupancy-Perception.
CV · Feb 21, 2025 · Code
OccProphet: Pushing Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with Observer-Forecaster-Refiner Framework
Junliang Chen, Huaiyuan Xu, Yi Wang et al.
Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, i.e., OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while improving forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D multi-frame voxels using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces 58% to 78% of the computational cost with a 2.6× speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4% to 18% relatively higher forecasting accuracy. Code and models are publicly available at https://github.com/JLChen-C/OccProphet.
80.1 · RO · May 17
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation
Sixu Lin, Junliang Chen, Huaiyuan Xu et al.
Planning and acting in 3D environments is a fundamental capability for robotic manipulation in the real world. Although prior work has explored predictive flow planners to guide 3D manipulation, existing approaches often rely on modular pipelines stacking multiple submodels, resulting in high computational overhead and limited real-time performance. To address these challenges, we introduce RoboFlow4D, a lightweight flow world model that unifies perception and planning by estimating temporal motion in physical 3D space. As an end-to-end framework, RoboFlow4D directly predicts multi-frame 3D flows from visual observations and textual instructions, providing explicit flow-based planning to guide action generation. This design allows seamless integration with general action policies, forming an efficient observation-planning-execution closed loop. Through slow-fast collaboration between flow prediction and action control, RoboFlow4D enables real-time and resource-efficient manipulation. Extensive experiments in both simulation and real-world settings demonstrate that RoboFlow4D consistently improves manipulation success rates and computational efficiency, advancing flow-guided planning for embodied intelligence.
CV · Mar 27, 2025 · Code
FaceBench: A Multi-View Multi-Level Facial Attribute VQA Dataset for Benchmarking Face Perception MLLMs
Xiaoqin Wang, Xusen Ma, Xianxu Hou et al.
Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in various tasks. However, effectively evaluating these MLLMs on face perception remains largely unexplored. To address this gap, we introduce FaceBench, a dataset featuring hierarchical multi-view and multi-level attributes specifically designed to assess the comprehensive face perception abilities of MLLMs. We first construct a hierarchical facial attribute structure, which encompasses five views with up to three levels of attributes, totaling over 210 attributes and 700 attribute values. Based on this structure, the proposed FaceBench consists of 49,919 visual question-answering (VQA) pairs for evaluation and 23,841 pairs for fine-tuning. Moreover, we develop a robust face perception MLLM baseline, Face-LLaVA, by training with our proposed face VQA data. Extensive experiments on various mainstream MLLMs and Face-LLaVA are conducted to test their face perception ability, with results also compared against human performance. The results reveal that existing MLLMs are far from satisfactory in understanding fine-grained facial attributes, while our Face-LLaVA significantly outperforms existing open-source models with a small amount of training data and is comparable to commercial ones like GPT-4o and Gemini. The dataset will be released at https://github.com/CVI-SZU/FaceBench.
67.5 · HC · Mar 27
"Law at Your Fingertips": Understanding Legal Information Seeking on Video-Sharing Platforms in China
Zhiyang Wu, Junliang Chen, Qian Wan et al.
Equipping laypeople with the capabilities to seek legal information has been an important goal for Legal Empowerment in modern society. However, unlike general information-seeking behaviors, legal information seeking is characterized by high stakes, urgency, and a critical need for emotional support, which traditional text-based searching platforms struggle to satisfy. In recent years, people have been increasingly turning to Video-Sharing Platforms (VSPs) for access to legal information and to fulfill their legal needs. Despite the importance of this shift, such VSP-mediated legal information-seeking practices remain underexplored. Through an observational analysis of legal content on two VSPs (Douyin and Bilibili) and interviews with 20 Chinese information seekers, this study examined the practices and challenges associated with seeking, comprehending, and evaluating legal information on VSPs. We further revealed the formation of trust and engagement on the VSP-based legal knowledge-sharing community, highlighting how VSP affordances helped mitigate seekers' epistemic discomfort and satisfy their needs for emotional support. In the discussion, we provided insights on balancing heuristic and systematic processing to encourage information cross-validation, and offered implications for designing trustworthy civic information systems and fostering an accessible, safe, and efficient information-seeking environment in digital space.
CV · Aug 2, 2025 · Code
DisFaceRep: Representation Disentanglement for Co-occurring Facial Components in Weakly Supervised Face Parsing
Xiaoqin Wang, Xianxu Hou, Meidan Ding et al.
Face parsing aims to segment facial images into key components such as eyes, lips, and eyebrows. While existing methods rely on dense pixel-level annotations, such annotations are expensive and labor-intensive to obtain. To reduce annotation cost, we introduce Weakly Supervised Face Parsing (WSFP), a new task setting that performs dense facial component segmentation using only weak supervision, such as image-level labels and natural language descriptions. WSFP introduces unique challenges due to the high co-occurrence and visual similarity of facial components, which lead to ambiguous activations and degraded parsing performance. To address this, we propose DisFaceRep, a representation disentanglement framework designed to separate co-occurring facial components through both explicit and implicit mechanisms. Specifically, we introduce a co-occurring component disentanglement strategy to explicitly reduce dataset-level bias, and a text-guided component disentanglement loss to guide component separation using language supervision implicitly. Extensive experiments on CelebAMask-HQ, LaPa, and Helen demonstrate the difficulty of WSFP and the effectiveness of DisFaceRep, which significantly outperforms existing weakly supervised semantic segmentation methods. The code will be released at https://github.com/CVI-SZU/DisFaceRep.
CE · Dec 16, 2025
Wearable-informed generative digital avatars predict task-conditioned post-stroke locomotion
Yanning Dai, Chenyu Tang, Ruizhi Zhang et al.
Dynamic prediction of locomotor capacity after stroke could enable more individualized rehabilitation, yet current assessments largely provide static impairment scores and do not indicate whether patients can perform specific tasks such as slope walking or stair climbing. Here, we present a wearable-informed data-physics hybrid generative framework that reconstructs a stroke survivor's locomotor control from wearable inertial sensing and predicts task-conditioned post-stroke locomotion in new environments. From a single 20 m level-ground walking trial recorded by five IMUs, the framework personalizes a physics-based digital avatar using a healthy-motion prior and hybrid imitation learning, generating dynamically feasible, patient-specific movements for inclined walking and stair negotiation. Across 11 stroke inpatients, predicted postures reached 82.2% similarity for slopes and 69.9% for stairs, substantially exceeding a physics-only baseline. In a multicentre pilot randomized study (n = 21; 28 days), access to scenario-specific locomotion predictions to support task selection and difficulty titration was associated with larger gains in Fugl-Meyer lower-extremity scores than standard care (mean change 6.0 vs 3.7 points; $p < 0.05$). These results suggest that wearable-informed generative digital avatars may augment individualized gait rehabilitation planning and provide a pathway toward dynamically personalized post-stroke motor recovery strategies.
CV · Nov 24, 2025
FineXtrol: Controllable Motion Generation via Fine-Grained Text
Keming Shen, Bizhu Wu, Junliang Chen et al.
Recent works have sought to enhance the controllability and precision of text-driven motion generation. Some approaches leverage large language models (LLMs) to produce more detailed texts, while others incorporate global 3D coordinate sequences as additional control signals. However, the former often introduces misaligned details and lacks explicit temporal cues, and the latter incurs significant computational cost when converting coordinates to standard motion representations. To address these issues, we propose FineXtrol, a novel control framework for efficient motion generation guided by temporally-aware, precise, user-friendly, and fine-grained textual control signals that describe specific body part movements over time. In support of this framework, we design a hierarchical contrastive learning module that encourages the text encoder to produce more discriminative embeddings for our novel control signals, thereby improving motion controllability. Quantitative results show that FineXtrol achieves strong performance in controllable motion generation, while qualitative analysis demonstrates its flexibility in directing specific body part movements.
HC · Nov 28, 2024
An AI-Driven Multimodal Smart Home Platform for Continuous Monitoring and Assistance in Post-Stroke Motor Impairment
Chenyu Tang, Ruizhi Zhang, Shuo Gao et al.
At-home rehabilitation for post-stroke patients presents significant challenges, as continuous, personalized care is often limited outside clinical settings. Moreover, the lack of integrated solutions capable of simultaneously monitoring motor recovery and providing intelligent assistance in home environments hampers rehabilitation outcomes. Here, we present a multimodal smart home platform designed for continuous, at-home rehabilitation of post-stroke patients, integrating wearable sensing, ambient monitoring, and adaptive automation. A plantar pressure insole equipped with a machine learning pipeline classifies users into motor recovery stages with up to 94% accuracy, enabling quantitative tracking of walking patterns during daily activities. An optional head-mounted eye-tracking module, together with ambient sensors such as cameras and microphones, supports seamless hands-free control of household devices with a 100% success rate and sub-second response time. These data streams are fused locally via a hierarchical Internet of Things (IoT) architecture, ensuring low latency and data privacy. An embedded large language model (LLM) agent, Auto-Care, continuously interprets multimodal data to provide real-time interventions: issuing personalized reminders, adjusting environmental conditions, and notifying caregivers. Implemented in a post-stroke context, this integrated smart home platform increased mean user satisfaction from 3.9 ± 0.8 in conventional home environments to 8.4 ± 0.6 with the full system (n = 20). Beyond stroke, the system offers a scalable, patient-centered framework with potential for long-term use in broader neurorehabilitation and aging-in-place applications.
CR · Apr 22, 2020
MobiGyges: A mobile hidden volume for preventing data loss, improving storage utilization, and avoiding device reboot
Wendi Feng, Chuanchang Liu, Zehua Guo et al.
Sensitive data protection is essential for mobile users. Plausibly Deniable Encryption (PDE) systems provide an effective way to protect sensitive data by hiding them on the device. However, existing PDE systems can lose data when the hidden volume is overwritten, waste physical storage because of the reserved area used for avoiding data loss, and require a device reboot when using the hidden volume. This paper presents MobiGyges, a hidden volume-based mobile PDE system, to fill the gap. MobiGyges addresses the problem of data loss by restricting each storage block to use by only one volume, and it improves storage utilization by eliminating the reserved area. MobiGyges can also avoid device reboot by mounting the hidden volume dynamically on demand with the Dynamic Mounting service. Moreover, we identify two novel PDE-oriented attacks, the capacity comparison attack and the fill-to-full attack. MobiGyges can defend against them by jointly leveraging the Shrunk U-disk method and multi-level deniability. We implement the MobiGyges proof-of-concept system on a real mobile phone, the Google Nexus 6P, with LineageOS 13. Experimental results show that MobiGyges prevents data loss, avoids device reboot, and improves storage utilization by over 30% with acceptable performance overhead compared with current works.
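The idea of giving every storage block a single owning volume, so that writes to one volume can never overwrite another, can be illustrated with a toy allocator. This is only a sketch of that exclusivity invariant: the class `ExclusiveBlockAllocator`, its methods, and the volume names are hypothetical, and it says nothing about how MobiGyges actually stores its block metadata or keeps it deniable.

```python
class ExclusiveBlockAllocator:
    """Toy allocator in which every storage block has at most one owner.

    Hypothetical illustration; not the MobiGyges implementation.
    """

    def __init__(self, total_blocks):
        self.owner = [None] * total_blocks  # None marks a free block
        self.data = [b""] * total_blocks

    def allocate(self, volume):
        # Hand out the first free block and record its single owner.
        for index, current_owner in enumerate(self.owner):
            if current_owner is None:
                self.owner[index] = volume
                return index
        raise RuntimeError("no free blocks left")

    def write(self, volume, index, payload):
        # Refuse writes to blocks owned by another volume (or by nobody).
        if self.owner[index] != volume:
            raise PermissionError(f"block {index} is not owned by {volume!r}")
        self.data[index] = payload

    def free_blocks(self):
        return sum(1 for current_owner in self.owner if current_owner is None)


allocator = ExclusiveBlockAllocator(total_blocks=4)
public_block = allocator.allocate("public")
hidden_block = allocator.allocate("hidden")
allocator.write("hidden", hidden_block, b"secret")
print(public_block, hidden_block, allocator.free_blocks())
```

Because blocks are handed out on demand from one shared pool, no fixed region has to be set aside for either volume, which is the intuition behind dropping a reserved area.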