CVMar 16, 2022Code
WeakM3D: Towards Weakly Supervised Monocular 3D Object DetectionLiang Peng, Senbo Yan, Boxi Wu et al.
Monocular 3D object detection is one of the most challenging tasks in 3D scene understanding. Due to the ill-posed nature of monocular imagery, existing monocular 3D detection methods highly rely on training with the manually annotated 3D box labels on the LiDAR point clouds. This annotation process is very laborious and expensive. To dispense with the reliance on 3D box labels, in this paper we explore the weakly supervised monocular 3D detection. Specifically, we first detect 2D boxes on the image. Then, we adopt the generated 2D boxes to select corresponding RoI LiDAR points as the weak supervision. Eventually, we adopt a network to predict 3D boxes which can tightly align with associated RoI LiDAR points. This network is learned by minimizing our newly-proposed 3D alignment loss between the 3D box estimates and the corresponding RoI LiDAR points. We will illustrate the potential challenges of the above learning problem and resolve these challenges by introducing several effective designs into our method. Codes will be available at https://github.com/SPengLiang/WeakM3D.
CVJul 18, 2022Code
DID-M3D: Decoupling Instance Depth for Monocular 3D Object DetectionLiang Peng, Xiaopei Wu, Zheng Yang et al.
Monocular 3D detection has drawn much attention from the community due to its low cost and setup simplicity. It takes an RGB image as input and predicts 3D boxes in the 3D space. The most challenging sub-task lies in the instance depth estimation. Previous works usually use a direct estimation method. However, in this paper we point out that the instance depth on the RGB image is non-intuitive. It is coupled by visual depth clues and instance attribute clues, making it hard to be directly learned in the network. Therefore, we propose to reformulate the instance depth to the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects' appearances and positions on the image. By contrast, the attribute depth relies on objects' inherent attributes, which are invariant to the object affine transformation on the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining different types of depths and associated uncertainties, we can obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited due to the physical nature, hindering the boost of performance. Based on the proposed instance depth disentanglement strategy, we can alleviate this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component in our method. The codes are released at https://github.com/SPengLiang/DID-M3D.
CVMar 19, 2022
CLRNet: Cross Layer Refinement Network for Lane DetectionTu Zheng, Yifei Huang, Yang Liu et al.
Lane is critical in the vision navigation system of the intelligent vehicle. Naturally, lane is a traffic sign with high-level semantics, whereas it owns the specific local pattern which needs detailed low-level features to localize accurately. Using different feature levels is of great importance for accurate lane detection, but it is still under-explored. In this work, we present Cross Layer Refinement Network (CLRNet) aiming at fully utilizing both high-level and low-level features in lane detection. In particular, it first detects lanes with high-level semantic features then performs refinement based on low-level features. In this way, we can exploit more contextual information to detect lanes while leveraging local detailed lane features to improve localization accuracy. We present ROIGather to gather global context, which further enhances the feature representation of lanes. In addition to our novel network design, we introduce Line IoU loss which regresses the lane line as a whole unit to improve the localization accuracy. Experiments demonstrate that the proposed method greatly outperforms the state-of-the-art lane detection approaches.
74.9LGMay 7
HCInfer: An Efficient Inference System via Error Compensation for Resource-Constrained DevicesShen Xu, Xiangwen Zhuge, Zhe Xu et al.
LLMs often struggle with memory-constrained deployment on consumer-grade hardware due to their massive parameter sizes. While existing solutions such as model compression and offloading improve deployment feasibility, they often suffer from substantial accuracy degradation or severe throughput bottlenecks. Recent error compensation methods recover accuracy through auxiliary LoRA-style branches, and we observe that these branches are inherently amenable to offloading: they require substantial parameter storage but access only a small subset of compensation parameters during each inference step. Motivated by this opportunity, we propose HCInfer, a heterogeneous inference system that offloads residual compensation to the CPU while executing the compressed backbone on the GPU, and further introduces an asynchronous compensation pipeline and sensitivity-aware dynamic rank allocation to hide compensation overhead and maximize accuracy recovery. Experimental results show that HCInfer achieves a maximum accuracy improvement of 5.2% on downstream tasks compared to compression model and sustaining a maximum speedup of 10.4x compared to full-precision model.
CVNov 29, 2023Code
SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video TuningLiang Peng, Haoran Cheng, Zheng Yang et al.
Recent one-shot video tuning methods, which fine-tune the network on a specific video based on pre-trained text-to-image models (e.g., Stable Diffusion), are popular in the community because of the flexibility. However, these methods often produce videos marred by incoherence and inconsistency. To address these limitations, this paper introduces a simple yet effective noise constraint across video frames. This constraint aims to regulate noise predictions across their temporal neighbors, resulting in smooth latents. It can be simply included as a loss term during the training phase. By applying the loss to existing one-shot video tuning methods, we significantly improve the overall consistency and smoothness of the generated videos. Furthermore, we argue that current video evaluation metrics inadequately capture smoothness. To address this, we introduce a novel metric that considers detailed features and their temporal dynamics. Experimental results validate the effectiveness of our approach in producing smoother videos on various one-shot video tuning baselines. The source codes and video demos are available at \href{https://github.com/SPengLiang/SmoothVideo}{https://github.com/SPengLiang/SmoothVideo}.
SPJun 20, 2022
Hands-on Wireless Sensing with Wi-Fi: A TutorialZheng Yang, Yi Zhang, Guoxuan Chi et al. · tsinghua
With the rapid development of wireless communication technology, wireless access points (AP) and internet of things (IoT) devices have been widely deployed in our surroundings. Various types of wireless signals (e.g., Wi-Fi, LoRa, LTE) are filling out our living and working spaces. Previous researches reveal the fact that radio waves are modulated by the spatial structure during the propagation process (e.g., reflection, diffraction, and scattering) and superimposed on the receiver. This observation allows us to reconstruct the surrounding environment based on received wireless signals, called "wireless sensing". Wireless sensing is an emerging technology that enables a wide range of applications, such as gesture recognition for human-computer interaction, vital signs monitoring for health care, and intrusion detection for security management. Compared with other sensing paradigms, such as vision-based and IMU-based sensing, wireless sensing solutions have unique advantages such as high coverage, pervasiveness, low cost, and robustness under adverse light and texture scenarios. Besides, wireless sensing solutions are generally lightweight in terms of both computation overhead and device size. This tutorial takes Wi-Fi sensing as an example. It introduces both the theoretical principles and the code implementation of data collection, signal processing, features extraction, and model design. In addition, this tutorial highlights state-of-the-art deep learning models (e.g., CNN, RNN, and adversarial learning models) and their applications in wireless sensing systems. We hope this tutorial will help people in other research fields to break into wireless sensing research and learn more about its theories, designs, and implementation skills, promoting prosperity in the wireless sensing research field.
CVNov 19, 2023Code
Pair-wise Layer Attention with Spatial Masking for Video PredictionPing Li, Chenhan Zhang, Zheng Yang et al.
Video prediction yields future frames by employing the historical frames and has exhibited its great potential in many applications, e.g., meteorological prediction, and autonomous driving. Previous works often decode the ultimate high-level semantic features to future frames without texture details, which deteriorates the prediction quality. Motivated by this, we develop a Pair-wise Layer Attention (PLA) module to enhance the layer-wise semantic dependency of the feature maps derived from the U-shape structure in Translator, by coupling low-level visual cues and high-level features. Hence, the texture details of predicted frames are enriched. Moreover, most existing methods capture the spatiotemporal dynamics by Translator, but fail to sufficiently utilize the spatial features of Encoder. This inspires us to design a Spatial Masking (SM) module to mask partial encoding features during pretraining, which adds the visibility of remaining feature pixels by Decoder. To this end, we present a Pair-wise Layer Attention with Spatial Masking (PLA-SM) framework for video prediction to capture the spatiotemporal dynamics, which reflect the motion trend. Extensive experiments and rigorous ablation studies on five benchmarks demonstrate the advantages of the proposed approach. The code is available at GitHub.
CVAug 1, 2023
NormKD: Normalized Logits for Knowledge DistillationZhihao Chi, Tu Zheng, Hengjia Li et al.
Logit based knowledge distillation gets less attention in recent years since feature based methods perform better in most cases. Nevertheless, we find it still has untapped potential when we re-investigate the temperature, which is a crucial hyper-parameter to soften the logit outputs. For most of the previous works, it was set as a fixed value for the entire distillation procedure. However, as the logits from different samples are distributed quite variously, it is not feasible to soften all of them to an equal degree by just a single temperature, which may make the previous work transfer the knowledge of each sample inadequately. In this paper, we restudy the hyper-parameter temperature and figure out its incapability to distill the knowledge from each sample sufficiently when it is a single value. To address this issue, we propose Normalized Knowledge Distillation (NormKD), with the purpose of customizing the temperature for each sample according to the characteristic of the sample's logit distribution. Compared to the vanilla KD, NormKD barely has extra computation or storage cost but performs significantly better on CIRAR-100 and ImageNet for image classification. Furthermore, NormKD can be easily applied to the other logit based methods and achieve better performance which can be closer to or even better than the feature based method.
CVOct 31, 2022
Improving Multi-generation Robustness of Learned Image CompressionLitian Li, Zheng Yang, Ronggang Wang
Benefit from flexible network designs and end-to-end joint optimization approach, learned image compression (LIC) has demonstrated excellent coding performance and practical feasibility in recent years. However, existing compression models suffer from serious multi-generation loss, which always occurs during image editing and transcoding. During the process of repeatedly encoding and decoding, the quality of the image will rapidly degrade, resulting in various types of distortion, which significantly limits the practical application of LIC. In this paper, a thorough analysis is carried out to determine the source of generative loss in successive image compression (SIC). We point out and solve the quantization drift problem that affects SIC, reversibility loss function as well as channel relaxation method are proposed to further reduce the generation loss. Experiments show that by using our proposed solutions, LIC can achieve comparable performance to the first compression of BPG even after 50 times reencoding without any change of the network structure.
CVMar 31, 2023
APPT : Asymmetric Parallel Point Transformer for 3D Point Cloud UnderstandingHengjia Li, Tu Zheng, Zhihao Chi et al.
Transformer-based networks have achieved impressive performance in 3D point cloud understanding. However, most of them concentrate on aggregating local features, but neglect to directly model global dependencies, which results in a limited effective receptive field. Besides, how to effectively incorporate local and global components also remains challenging. To tackle these problems, we propose Asymmetric Parallel Point Transformer (APPT). Specifically, we introduce Global Pivot Attention to extract global features and enlarge the effective receptive field. Moreover, we design the Asymmetric Parallel structure to effectively integrate local and global information. Combined with these designs, APPT is able to capture features globally throughout the entire network while focusing on local-detailed features. Extensive experiments show that our method outperforms the priors and achieves state-of-the-art on several benchmarks for 3D point cloud understanding, such as 3D semantic segmentation on S3DIS, 3D shape classification on ModelNet40, and 3D part segmentation on ShapeNet.
CVOct 30, 2023
Few-shot Hybrid Domain Adaptation of Image GeneratorsHengjia Li, Yang Liu, Linxuan Xia et al.
Can a pre-trained generator be adapted to the hybrid of multiple target domains and generate images with integrated attributes of them? In this work, we introduce a new task -- Few-shot Hybrid Domain Adaptation (HDA). Given a source generator and several target domains, HDA aims to acquire an adapted generator that preserves the integrated attributes of all target domains, without overriding the source domain's characteristics. Compared with Domain Adaptation (DA), HDA offers greater flexibility and versatility to adapt generators to more composite and expansive domains. Simultaneously, HDA also presents more challenges than DA as we have access only to images from individual target domains and lack authentic images from the hybrid domain. To address this issue, we introduce a discriminator-free framework that directly encodes different domains' images into well-separable subspaces. To achieve HDA, we propose a novel directional subspace loss comprised of a distance loss and a direction loss. Concretely, the distance loss blends the attributes of all target domains by reducing the distances from generated images to all target subspaces. The direction loss preserves the characteristics from the source domain by guiding the adaptation along the perpendicular to subspaces. Experiments show that our method can obtain numerous domain-specific attributes in a single adapted generator, which surpasses the baseline methods in semantic similarity, image fidelity, and cross-domain consistency.
CVDec 20, 2023Code
TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without TrainingYuqi Lin, Minghao Chen, Kaipeng Zhang et al.
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification. The class token in the image encoder is trained to capture the global features to distinguish different text descriptions supervised by contrastive loss, making it highly effective for single-label classification. However, it shows poor performance on multi-label datasets because the global feature tends to be dominated by the most prominent class and the contrastive nature of softmax operation aggravates it. In this study, we observe that the multi-label classification results heavily rely on discriminative local features but are overlooked by CLIP. As a result, we dissect the preservation of patch-wise spatial information in CLIP and proposed a local-to-global framework to obtain image tags. It comprises three steps: (1) patch-level classification to obtain coarse scores; (2) dual-masking attention refinement (DMAR) module to refine the coarse scores; (3) class-wise reidentification (CWR) module to remedy predictions from a global perspective. This framework is solely based on frozen CLIP and significantly enhances its multi-label classification performance on various benchmarks without dataset-specific training. Besides, to comprehensively assess the quality and practicality of generated tags, we extend their application to the downstream task, i.e., weakly supervised semantic segmentation (WSSS) with generated tags as image-level pseudo labels. Experiments demonstrate that this classify-then-segment paradigm dramatically outperforms other annotation-free segmentation methods and validates the effectiveness of generated tags. Our code is available at https://github.com/linyq2117/TagCLIP.
IRJul 12, 2024
Movie Recommendation with Poster Attention via Multi-modal Transformer Feature FusionLinhan Xia, Yicheng Yang, Ziou Chen et al.
Pre-trained models learn general representations from large datsets which can be fine-turned for specific tasks to significantly reduce training time. Pre-trained models like generative pretrained transformers (GPT), bidirectional encoder representations from transformers (BERT), vision transfomers (ViT) have become a cornerstone of current research in machine learning. This study proposes a multi-modal movie recommendation system by extract features of the well designed posters for each movie and the narrative text description of the movie. This system uses the BERT model to extract the information of text modality, the ViT model applied to extract the information of poster/image modality, and the Transformer architecture for feature fusion of all modalities to predict users' preference. The integration of pre-trained foundational models with some smaller data sets in downstream applications capture multi-modal content features in a more comprehensive manner, thereby providing more accurate recommendations. The efficiency of the proof-of-concept model is verified by the standard benchmark problem the MovieLens 100K and 1M datasets. The prediction accuracy of user ratings is enhanced in comparison to the baseline algorithm, thereby demonstrating the potential of this cross-modal algorithm to be applied for movie or video recommendation.
CVFeb 10, 2025Code
SAMRefiner: Taming Segment Anything Model for Universal Mask RefinementYuqi Lin, Hengjia Li, Wenqi Shao et al.
In this paper, we explore a principal way to enhance the quality of widely pre-existing coarse masks, enabling them to serve as reliable training data for segmentation models to reduce the annotation cost. In contrast to prior refinement techniques that are tailored to specific models or tasks in a close-world manner, we propose SAMRefiner, a universal and efficient approach by adapting SAM to the mask refinement task. The core technique of our model is the noise-tolerant prompting scheme. Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM (i.e., distance-guided points, context-aware elastic bounding boxes, and Gaussian-style masks) from initial coarse masks. These prompts can collaborate with each other to mitigate the effect of defects in coarse masks. In particular, considering the difficulty of SAM to handle the multi-object case in semantic segmentation, we introduce a split-then-merge (STM) pipeline. Additionally, we extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset. This step is self-boosted and requires no additional annotation. The proposed framework is versatile and can flexibly cooperate with existing segmentation methods. We evaluate our mask framework on a wide range of benchmarks under different settings, demonstrating better accuracy and efficiency. SAMRefiner holds significant potential to expedite the evolution of refinement tools. Our code is available at https://github.com/linyq2117/SAMRefiner.
CVFeb 3
SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token BypassChen Qian, Xinran Yu, Danyang Li et al.
Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method that performs pruning at model-specific layers with strong visual token selection capability, while enabling independent pruning decisions across layers. Experiments across multiple VLMs and benchmarks demonstrate that SwiftVLM consistently outperforms existing pruning strategies, achieving superior accuracy-efficiency trade-offs and more faithful visual token selection behavior.
LGSep 26, 2025Code
HEAPr: Hessian-based Efficient Atomic Expert Pruning in Output SpaceKe Li, Zheng Yang, Zhongbin Zhou et al.
Mixture-of-Experts (MoE) architectures in large language models (LLMs) deliver exceptional performance and reduced inference costs compared to dense LLMs. However, their large parameter counts result in prohibitive memory requirements, limiting practical deployment. While existing pruning methods primarily focus on expert-level pruning, this coarse granularity often leads to substantial accuracy degradation. In this work, we introduce HEAPr, a novel pruning algorithm that decomposes experts into smaller, indivisible atomic experts, enabling more precise and flexible atomic expert pruning. To measure the importance of each atomic expert, we leverage second-order information based on principles similar to Optimal Brain Surgeon (OBS) theory. To address the computational and storage challenges posed by second-order information, HEAPr exploits the inherent properties of atomic experts to transform the second-order information from expert parameters into that of atomic expert parameters, and further simplifies it to the second-order information of atomic expert outputs. This approach reduces the space complexity from $O(d^4)$, where d is the model's dimensionality, to $O(d^2)$. HEAPr requires only two forward passes and one backward pass on a small calibration set to compute the importance of atomic experts. Extensive experiments on MoE models, including DeepSeek MoE and Qwen MoE family, demonstrate that HEAPr outperforms existing expert-level pruning methods across a wide range of compression ratios and benchmarks. Specifically, HEAPr achieves nearly lossless compression at compression ratios of 20% ~ 25% in most models, while also reducing FLOPs nearly by 20%. The code can be found at \href{https://github.com/LLIKKE/HEAPr}{https://github.com/LLIKKE/HEAPr}.
CVAug 18, 2025Code
OpenMoCap: Rethinking Optical Motion Capture under Real-world OcclusionChen Qian, Danyang Li, Xinran Yu et al.
Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: https://github.com/qianchen214/OpenMoCap.
LGMay 15, 2025Code
SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained DevicesXiangwen Zhuge, Xu Shen, Zeyu Wang et al.
Efficient LLM inference on resource-constrained devices presents significant challenges in compute and memory utilization. Due to limited GPU memory, existing systems offload model weights to CPU memory, incurring substantial I/O overhead between the CPU and GPU. This leads to two major inefficiencies: (1) GPU cores are underutilized, often remaining idle while waiting for data to be loaded; and (2) GPU memory has low impact on performance, as reducing its capacity has minimal effect on overall throughput.In this paper, we propose SpecOffload, a high-throughput inference engine that embeds speculative decoding into offloading. Our key idea is to unlock latent GPU resources for storing and executing a draft model used for speculative decoding, thus accelerating inference at near-zero additional cost. To support this, we carefully orchestrate the interleaved execution of target and draft models in speculative decoding within the offloading pipeline, and propose a planner to manage tensor placement and select optimal parameters. Compared to the best baseline, SpecOffload improves GPU core utilization by 4.49x and boosts inference throughput by 2.54x. Our code is available at https://github.com/MobiSense/SpecOffload-public .
CVMay 25, 2023Code
Learning Occupancy for Monocular 3D Object DetectionLiang Peng, Junkai Xu, Haoran Cheng et al.
Monocular 3D detection is a challenging task due to the lack of accurate 3D information. Existing approaches typically rely on geometry constraints and dense depth estimates to facilitate the learning, but often fail to fully exploit the benefits of three-dimensional feature extraction in frustum and 3D space. In this paper, we propose \textbf{OccupancyM3D}, a method of learning occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space, leading to more discriminative and informative 3D features and representations. Specifically, by using synchronized raw sparse LiDAR point clouds, we define the space status and generate voxel-based occupancy labels. We formulate occupancy prediction as a simple classification problem and design associated occupancy losses. Resulting occupancy estimates are employed to enhance original frustum/3D features. As a result, experiments on KITTI and Waymo open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin. Codes and pre-trained models will be available at \url{https://github.com/SPengLiang/OccupancyM3D}.
CVApr 19, 2021Code
Lidar Point Cloud Guided Monocular 3D Object DetectionLiang Peng, Fei Liu, Zhengxu Yu et al.
Monocular 3D object detection is a challenging task in the self-driving and computer vision community. As a common practice, most previous works use manually annotated 3D box labels, where the annotating process is expensive. In this paper, we find that the precisely and carefully annotated labels may be unnecessary in monocular 3D detection, which is an interesting and counterintuitive finding. Using rough labels that are randomly disturbed, the detector can achieve very close accuracy compared to the one using the ground-truth labels. We delve into this underlying mechanism and then empirically find that: concerning the label accuracy, the 3D location part in the label is preferred compared to other parts of labels. Motivated by the conclusions above and considering the precise LiDAR 3D measurement, we propose a simple and effective framework, dubbed LiDAR point cloud guided monocular 3D object detection (LPCG). This framework is capable of either reducing the annotation costs or considerably boosting the detection accuracy without introducing extra annotation costs. Specifically, It generates pseudo labels from unlabeled LiDAR point clouds. Thanks to accurate LiDAR 3D measurements in 3D space, such pseudo labels can replace manually annotated labels in the training of monocular 3D detectors, since their 3D location information is precise. LPCG can be applied into any monocular 3D detector to fully use massive unlabeled data in a self-driving system. As a result, in KITTI benchmark, we take the first place on both monocular 3D and BEV (bird's-eye-view) detection with a significant margin. In Waymo benchmark, our method using 10% labeled data achieves comparable accuracy to the baseline detector using 100% labeled data. The codes are released at https://github.com/SPengLiang/LPCG.
CVMar 18, 2021Code
Suppress-and-Refine Framework for End-to-End 3D Object DetectionZili Liu, Guodong Xu, Honghui Yang et al.
3D object detector based on Hough voting achieves great success and derives many follow-up works. Despite constantly refreshing the detection accuracy, these works suffer from handcrafted components used to eliminate redundant boxes, and thus are non-end-to-end and time-consuming. In this work, we propose a suppress-and-refine framework to remove these handcrafted components. To fully utilize full-resolution information and achieve real-time speed, it directly consumes feature points and redundant 3D proposals. Specifically, it first suppresses noisy 3D feature points and then feeds them to 3D proposals for the following RoI-aware refinement. With the gating mechanism to build fine proposal features and the self-attention mechanism to model relationships, our method can produce high-quality predictions with a small computation budget in an end-to-end manner. To this end, we present the first fully end-to-end 3D detector, SRDet, on the basis of VoteNet. It achieves state-of-the-art performance on the challenging ScanNetV2 and SUN RGB-D datasets with the fastest speed ever. Our code will be available at https://github.com/ZJULearning/SRDet.
CVAug 31, 2020Code
RESA: Recurrent Feature-Shift Aggregator for Lane DetectionTu Zheng, Hao Fang, Yi Zhang et al.
Lane detection is one of the most important tasks in self-driving. Due to various complex scenarios (e.g., severe occlusion, ambiguous lanes, etc.) and the sparse supervisory signals inherent in lane annotations, lane detection task is still challenging. Thus, it is difficult for the ordinary convolutional neural network (CNN) to train in general scenes to catch subtle lane feature from the raw image. In this paper, we present a novel module named REcurrent Feature-Shift Aggregator (RESA) to enrich lane feature after preliminary feature extraction with an ordinary CNN. RESA takes advantage of strong shape priors of lanes and captures spatial relationships of pixels across rows and columns. It shifts sliced feature map recurrently in vertical and horizontal directions and enables each pixel to gather global information. RESA can conjecture lanes accurately in challenging scenarios with weak appearance clues by aggregating sliced feature map. Moreover, we propose a Bilateral Up-Sampling Decoder that combines coarse-grained and fine-detailed features in the up-sampling stage. It can recover the low-resolution feature map into pixel-wise prediction meticulously. Our method achieves state-of-the-art results on two popular lane detection benchmarks (CULane and Tusimple). Code has been made available at: https://github.com/ZJULearning/resa.
CVOct 26, 2019Code
Region Mutual Information Loss for Semantic SegmentationShuai Zhao, Yang Wang, Zheng Yang et al.
Semantic segmentation is a fundamental problem in computer vision. It is considered as a pixel-wise classification problem in practice, and most segmentation models use a pixel-wise loss as their optimization riterion. However, the pixel-wise loss ignores the dependencies between pixels in an image. Several ways to exploit the relationship between pixels have been investigated, \eg, conditional random fields (CRF) and pixel affinity based methods. Nevertheless, these methods usually require additional model branches, large extra memories, or more inference time. In this paper, we develop a region mutual information (RMI) loss to model the dependencies among pixels more simply and efficiently. In contrast to the pixel-wise loss which treats the pixels as independent samples, RMI uses one pixel and its neighbour pixels to represent this pixel. Then for each pixel in an image, we get a multi-dimensional point that encodes the relationship between pixels, and the image is cast into a multi-dimensional distribution of these high-dimensional points. The prediction and ground truth thus can achieve high order consistency through maximizing the mutual information (MI) between their multi-dimensional distributions. Moreover, as the actual value of the MI is hard to calculate, we derive a lower bound of the MI and maximize the lower bound to maximize the real value of the MI. RMI only requires a few extra computational resources in the training stage, and there is no overhead during testing. Experimental results demonstrate that RMI can achieve substantial and consistent improvements in performance on PASCAL VOC 2012 and CamVid datasets. The code is available at https://github.com/ZJULearning/RMI.
CVSep 2, 2019Code
Training-Time-Friendly Network for Real-Time Object DetectionZili Liu, Tu Zheng, Guodong Xu et al.
Modern object detectors can rarely achieve short training time, fast inference speed, and high accuracy at the same time. To strike a balance among them, we propose the Training-Time-Friendly Network (TTFNet). In this work, we start with light-head, single-stage, and anchor-free designs, which enable fast inference speed. Then, we focus on shortening training time. We notice that encoding more training samples from annotated boxes plays a similar role as increasing batch size, which helps enlarge the learning rate and accelerate the training process. To this end, we introduce a novel approach using Gaussian kernels to encode training samples. Besides, we design the initiative sample weights for better information utilization. Experiments on MS COCO show that our TTFNet has great advantages in balancing training time, inference speed, and accuracy. It has reduced training time by more than seven times compared to previous real-time detectors while maintaining state-of-the-art performances. In addition, our super-fast version of TTFNet-18 and TTFNet-53 can outperform SSD300 and YOLOv3 by less than one-tenth of their training time, respectively. The code has been made available at \url{https://github.com/ZJULearning/ttfnet}.
ROOct 16, 2017Code
GroundSLAM: A Robust Visual SLAM System for Warehouse Robots Using Ground TexturesKuan Xu, Zheng Yang, Lihua Xie et al.
A robust visual localization and mapping system is essential for warehouse robot navigation, as cameras offer a more cost-effective alternative to LiDAR sensors. However, existing forward-facing camera systems often encounter challenges in dynamic environments and open spaces, leading to significant performance degradation during deployment. To address these limitations, a localization system utilizing a single downward-facing camera to capture ground textures presents a promising solution. Nevertheless, existing feature-based ground-texture localization methods face difficulties when operating on surfaces with sparse features or repetitive patterns. To address this limitation, we propose GroundSLAM, a novel feature-free and ground-texture-based simultaneous localization and mapping (SLAM) system. GroundSLAM consists of three components: feature-free visual odometry, ground-texture-based loop detection and map optimization, and map reuse. Specifically, we introduce a kernel cross-correlator (KCC) for image-level pose tracking, loop detection, and map reuse to improve localization accuracy and robustness, and incorporate adaptive pruning strategies to enhance efficiency. Due to these specific designs, GroundSLAM is able to deliver efficient and stable localization across various ground surfaces such as those with sparse features and repetitive patterns. To advance research in this area, we introduce the first ground-texture dataset with precise ground-truth poses, consisting of 131k images collected from 10 kinds of indoor and outdoor ground surfaces. Extensive experimental results show that GroundSLAM outperforms state-of-the-art methods for both indoor and outdoor localization. We release our code and dataset at https://github.com/sair-lab/GroundSLAM.
LGAug 26, 2024
Category-Theoretical and Topos-Theoretical Frameworks in Machine Learning: A SurveyYiyang Jia, Guohong Peng, Zheng Yang et al.
In this survey, we provide an overview of category theory-derived machine learning from four mainstream perspectives: gradient-based learning, probability-based learning, invariance and equivalence-based learning, and topos-based learning. For the first three topics, we primarily review research in the past five years, updating and expanding on the previous survey by Shiebler et al.. The fourth topic, which delves into higher category theory, particularly topos theory, is surveyed for the first time in this paper. In certain machine learning methods, the compositionality of functors plays a vital role, prompting the development of specific categorical frameworks. However, when considering how the global properties of a network reflect in local structures and how geometric properties are expressed with logic, the topos structure becomes particularly significant and profound.
LGApr 14, 2024
RF-Diffusion: Radio Signal Generation via Time-Frequency DiffusionGuoxuan Chi, Zheng Yang, Chenshu Wu et al. · tsinghua
Along with AIGC shines in CV and NLP, its potential in the wireless domain has also emerged in recent years. Yet, existing RF-oriented generative solutions are ill-suited for generating high-quality, time-series RF data due to limited representation capabilities. In this work, inspired by the stellar achievements of the diffusion model in CV and NLP, we adapt it to the RF domain and propose RF-Diffusion. To accommodate the unique characteristics of RF signals, we first introduce a novel Time-Frequency Diffusion theory to enhance the original diffusion model, enabling it to tap into the information within the time, frequency, and complex-valued domains of RF signals. On this basis, we propose a Hierarchical Diffusion Transformer to translate the theory into a practical generative DNN through elaborated design spanning network architecture, functional block, and complex-valued operator, making RF-Diffusion a versatile solution to generate diverse, high-quality, and time-series RF data. Performance comparison with three prevalent generative models demonstrates the RF-Diffusion's superior performance in synthesizing Wi-Fi and FMCW signals. We also showcase the versatility of RF-Diffusion in boosting Wi-Fi sensing systems and performing channel estimation in 5G networks.
72.7CVApr 3
The Eleventh NTIRE 2026 Efficient Super-Resolution Challenge ReportBin Ren, Hang Guo, Yan Shu et al.
This paper reviews the NTIRE 2026 challenge on efficient single-image super-resolution with a focus on the proposed solutions and results. The aim of this challenge is to devise a network that reduces one or several aspects, such as runtime, parameters, and FLOPs, while maintaining PSNR of around 26.90 dB on the DIV2K_LSDIR_valid dataset, and 26.99 dB on the DIV2K_LSDIR_test dataset. The challenge had 95 registered participants, and 15 teams made valid submissions. They gauge the state-of-the-art results for efficient single-image super-resolution.
CVDec 14, 2023
Local Conditional Controlling for Text-to-Image Diffusion ModelsYibo Zhao, Liang Peng, Yang Yang et al.
Diffusion models have exhibited impressive prowess in the text-to-image task. Recent methods add image-level structure controls, e.g., edge and depth maps, to manipulate the generation process together with text prompts to obtain desired images. This controlling process is globally operated on the entire image, which limits the flexibility of control regions. In this paper, we explore a novel and practical task setting: local control. It focuses on controlling specific local region according to user-defined image conditions, while the remaining regions are only conditioned by the original text prompt. However, it is non-trivial to achieve local conditional controlling. The naive manner of directly adding local conditions may lead to the local control dominance problem, which forces the model to focus on the controlled region and neglect object generation in other regions. To mitigate this problem, we propose Regional Discriminate Loss to update the noised latents, aiming at enhanced object generation in non-control regions. Furthermore, the proposed Focused Token Response suppresses weaker attention scores which lack the strongest response to enhance object distinction and reduce duplication. Lastly, we adopt Feature Mask Constraint to reduce quality degradation in images caused by information differences across the local control region. All proposed strategies are operated at the inference stage. Extensive experiments demonstrate that our method can synthesize high-quality images aligned with the text prompt under local control conditions.
87.7CVMar 13
Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D WorldYuzhi Huang, Kairun Wen, Rongxin Gao et al.
Humans inhabit a physical 4D world where geometric structure and semantic content evolve over time, constituting a dynamic 4D reality (spatial with temporal dimension). While current Multimodal Large Language Models (MLLMs) excel in static visual understanding, can they also be adept at "thinking in dynamics", i.e., perceive, track and reason about spatio-temporal dynamics in evolving scenes? To systematically assess their spatio-temporal reasoning and localized dynamics perception capabilities, we introduce Dyn-Bench, a large-scale benchmark built from diverse real-world and synthetic video datasets, enabling robust and scalable evaluation of spatio-temporal understanding. Through multi-stage filtering from massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of dynamic scenes, comprising 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding pairs. We probe general, spatial and region-level MLLMs to express how they think in dynamics both linguistically and visually, and find that existing models cannot simultaneously maintain strong performance in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Notably, conventional prompting strategies (e.g., chain-of-thought or caption-based hints) provide limited improvement, whereas structured integration approaches, including Mask-Guided Fusion and Spatio-Temporal Textual Cognitive Map (ST-TCM), significantly enhance MLLMs' dynamics perception and spatio-temporal reasoning in the physical 4D world. Code and benchmark are available at https://dyn-bench.github.io/.
88.2NIApr 27
MatchRDMA: A Segmented and Rate-Matched Long-Haul RDMA Scheme for Geo-distributed LLM Training over OTNJun Dai, Xiaorun Wang, Xingde Li et al.
We propose MatchRDMA, a proactive, segmented, and rate-matched long-haul RDMA scheme for geo-distributed LLM training over OTN. By coordinating source and destination OTN rates, it improves inter-DC throughput by up to 20x compared with conventional RDMA, and reduces destination-OTN buffer occupancy by up to 62.7%.
CVApr 2, 2025
All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch LearningZheng Yang, Ruoxin Chen, Zhiyuan Yan et al. · tencent-ai
The exponential growth of AI-generated images (AIGIs) underscores the urgent need for robust and generalizable detection methods. In this paper, we establish two key principles for AIGI detection through systematic analysis: (1) All Patches Matter: Unlike conventional image classification where discriminative features concentrate on object-centric regions, each patch in AIGIs inherently contains synthetic artifacts due to the uniform generation process, suggesting that every patch serves as an important artifact source for detection. (2) More Patches Better: Leveraging distributed artifacts across more patches improves detection robustness by capturing complementary forensic evidence and reducing over-reliance on specific patches, thereby enhancing robustness and generalization. However, our counterfactual analysis reveals an undesirable phenomenon: naively trained detectors often exhibit a Few-Patch Bias, discriminating between real and synthetic images based on minority patches. We identify Lazy Learner as the root cause: detectors preferentially learn conspicuous artifacts in limited patches while neglecting broader artifact distributions. To address this bias, we propose the Panoptic Patch Learning (PPL) framework, involving: (1) Random Patch Replacement that randomly substitutes synthetic patches with real counterparts to compel models to identify artifacts in underutilized regions, encouraging the broader use of more patches; (2) Patch-wise Contrastive Learning that enforces consistent discriminative capability across all patches, ensuring uniform utilization of all patches. Extensive experiments across two different settings on several benchmarks verify the effectiveness of our approach.
CVDec 20, 2024
GCA-3D: Towards Generalized and Consistent Domain Adaptation of 3D GeneratorsHengjia Li, Yang Liu, Yibo Zhao et al.
Recently, 3D generative domain adaptation has emerged to adapt the pre-trained generator to other domains without collecting massive datasets and camera pose distributions. Typically, they leverage large-scale pre-trained text-to-image diffusion models to synthesize images for the target domain and then fine-tune the 3D model. However, they suffer from the tedious pipeline of data generation, which inevitably introduces pose bias between the source domain and synthetic dataset. Furthermore, they are not generalized to support one-shot image-guided domain adaptation, which is more challenging due to the more severe pose bias and additional identity bias introduced by the single image reference. To address these issues, we propose GCA-3D, a generalized and consistent 3D domain adaptation method without the intricate pipeline of data generation. Different from previous pipeline methods, we introduce multi-modal depth-aware score distillation sampling loss to efficiently adapt 3D generative models in a non-adversarial manner. This multi-modal loss enables GCA-3D in both text prompt and one-shot image prompt adaptation. Besides, it leverages per-instance depth maps from the volume rendering module to mitigate the overfitting problem and retain the diversity of results. To enhance the pose and identity consistency, we further propose a hierarchical spatial consistency loss to align the spatial structure between the generated images in the source and target domain. Experiments demonstrate that GCA-3D outperforms previous methods in terms of efficiency, generalization, pose accuracy, and identity consistency.
LGSep 18, 2025
Generative AI Meets Wireless Sensing: Towards Wireless Foundation ModelZheng Yang, Guoxuan Chi, Chenshu Wu et al. · tsinghua
Generative Artificial Intelligence (GenAI) has made significant advancements in fields such as computer vision (CV) and natural language processing (NLP), demonstrating its capability to synthesize high-fidelity data and improve generalization. Recently, there has been growing interest in integrating GenAI into wireless sensing systems. By leveraging generative techniques such as data augmentation, domain adaptation, and denoising, wireless sensing applications, including device localization, human activity recognition, and environmental monitoring, can be significantly improved. This survey investigates the convergence of GenAI and wireless sensing from two complementary perspectives. First, we explore how GenAI can be integrated into wireless sensing pipelines, focusing on two modes of integration: as a plugin to augment task-specific models and as a solver to directly address sensing tasks. Second, we analyze the characteristics of mainstream generative models, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models, and discuss their applicability and unique advantages across various wireless sensing tasks. We further identify key challenges in applying GenAI to wireless sensing and outline a future direction toward a wireless foundation model: a unified, pre-trained design capable of scalable, adaptable, and efficient signal understanding across diverse sensing tasks.
CVSep 10, 2025
Symmetry Interactive Transformer with CNN Framework for Diagnosis of Alzheimer's Disease Using Structural MRIZheng Yang, Yanteng Zhang, Xupeng Kou et al.
Structural magnetic resonance imaging (sMRI) combined with deep learning has achieved remarkable progress in the prediction and diagnosis of Alzheimer's disease (AD). Existing studies have used CNN and transformer to build a well-performing network, but most of them are based on pretraining or ignoring the asymmetrical character caused by brain disorders. We propose an end-to-end network for the detection of disease-based asymmetric induced by left and right brain atrophy which consist of 3D CNN Encoder and Symmetry Interactive Transformer (SIT). Following the inter-equal grid block fetch operation, the corresponding left and right hemisphere features are aligned and subsequently fed into the SIT for diagnostic analysis. SIT can help the model focus more on the regions of asymmetry caused by structural changes, thus improving diagnostic performance. We evaluated our method based on the ADNI dataset, and the results show that the method achieves better diagnostic accuracy (92.5\%) compared to several CNN methods and CNNs combined with a general transformer. The visualization results show that our network pays more attention in regions of brain atrophy, especially for the asymmetric pathological characteristics induced by AD, demonstrating the interpretability and effectiveness of the method.
CVAug 18, 2025
edgeVLM: Cloud-edge Collaborative Real-time VLM based on Context TransferChen Qian, Xinran Yu, Zewen Huang et al.
Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLMs inference. Based on this paradigm, we design edgeVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision-lanuage reasoning tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.
MNAug 16, 2025
Modeling GRNs with a Probabilistic Categorical FrameworkYiyang Jia, Zheng Wei, Zheng Yang et al.
Understanding the complex and stochastic nature of Gene Regulatory Networks (GRNs) remains a central challenge in systems biology. Existing modeling paradigms often struggle to effectively capture the intricate, multi-factor regulatory logic and to rigorously manage the dual uncertainties of network structure and kinetic parameters. In response, this work introduces the Probabilistic Categorical GRN(PC-GRN) framework. It is a novel theoretical approach founded on the synergistic integration of three core methodologies. Firstly, category theory provides a formal language for the modularity and composition of regulatory pathways. Secondly, Bayesian Typed Petri Nets (BTPNs) serve as an interpretable,mechanistic substrate for modeling stochastic cellular processes, with kinetic parameters themselves represented as probability distributions. The central innovation of PC-GRN is its end-to-end generative Bayesian inference engine, which learns a full posterior distribution over BTPN models (P (G, Θ|D)) directly from data. This is achieved by the novel interplay of a GFlowNet, which learns a policy to sample network topologies, and a HyperNetwork, which performs amortized inference to predict their corresponding parameter distributions. The resulting framework provides a mathematically rigorous, biologically interpretable, and uncertainty-aware representation of GRNs, advancing predictive modeling and systems-level analysis.
CVAug 15, 2025
Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object TrackingHaonan Zhang, Xinyao Wang, Boxi Wu et al.
3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.
CVJul 5, 2025
VISC: mmWave Radar Scene Flow Estimation using Pervasive Visual-Inertial SupervisionKezhong Liu, Yiwen Zhou, Mozi Chen et al.
This work proposes a mmWave radar's scene flow estimation framework supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of VI rigid transformation also degenerates the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.
CVJan 23, 2024
UniHDA: A Unified and Versatile Framework for Multi-Modal Hybrid Domain AdaptationHengjia Li, Yang Liu, Yuqi Lin et al.
Recently, generative domain adaptation has achieved remarkable progress, enabling us to adapt a pre-trained generator to a new target domain. However, existing methods simply adapt the generator to a single target domain and are limited to a single modality, either text-driven or image-driven. Moreover, they cannot maintain well consistency with the source domain, which impedes the inheritance of the diversity. In this paper, we propose UniHDA, a \textbf{unified} and \textbf{versatile} framework for generative hybrid domain adaptation with multi-modal references from multiple domains. We use CLIP encoder to project multi-modal references into a unified embedding space and then linearly interpolate the direction vectors from multiple target domains to achieve hybrid domain adaptation. To ensure \textbf{consistency} with the source domain, we propose a novel cross-domain spatial structure (CSS) loss that maintains detailed spatial structure information between source and target generator. Experiments show that the adapted generator can synthesise realistic images with various attribute compositions. Additionally, our framework is generator-agnostic and versatile to multiple generators, e.g., StyleGAN, EG3D, and Diffusion Models.
CRApr 16, 2021
Transparent Electricity Pricing with PrivacyDaniel Reijsbergen, Zheng Yang, Aung Maw et al.
Smart grids leverage data from smart meters to improve operations management and to achieve cost reductions. The fine-grained meter data also enable pricing schemes that simultaneously benefit electricity retailers and users. Our goal is to design a practical dynamic pricing protocol for smart grids in which the rate charged by a retailer depends on the total demand among its users. Realizing this goal is challenging because neither the retailer nor the users are trusted. The first challenge is to design a pricing scheme that incentivizes consumption behavior that leads to lower costs for both the users and the retailer. The second challenge is to prevent the retailer from tampering with the data, for example, by claiming that the total consumption is much higher than its real value. The third challenge is data privacy, that is, how to hide the meter data from adversarial users. To address these challenges, we propose a scheme in which peak rates are charged if either the total or the individual consumptions exceed some thresholds. We formally define a privacy-preserving transparent pricing scheme (PPTP) that allows honest users to detect tampering at the retailer while ensuring data privacy. We present two instantiations of PPTP, and prove their security. Both protocols use secure commitments and zero-knowledge proofs. We implement and evaluate the protocols on server and edge hardware, demonstrating that PPTP has practical performance at scale.
CVJul 13, 2020
Gender Classification and Bias Mitigation in Facial ImagesWenying Wu, Pavlos Protopapas, Zheng Yang et al.
Gender classification algorithms have important applications in many domains today such as demographic research, law enforcement, as well as human-computer interaction. Recent research showed that algorithms trained on biased benchmark databases could result in algorithmic bias. However, to date, little research has been carried out on gender classification algorithms' bias towards gender minorities subgroups, such as the LGBTQ and the non-binary population, who have distinct characteristics in gender expression. In this paper, we began by conducting surveys on existing benchmark databases for facial recognition and gender classification tasks. We discovered that the current benchmark databases lack representation of gender minority subgroups. We worked on extending the current binary gender classifier to include a non-binary gender class. We did that by assembling two new facial image databases: 1) a racially balanced inclusive database with a subset of LGBTQ population 2) an inclusive-gender database that consists of people with non-binary gender. We worked to increase classification accuracy and mitigate algorithmic biases on our baseline model trained on the augmented benchmark database. Our ensemble model has achieved an overall accuracy score of 90.39%, which is a 38.72% increase from the baseline binary gender classifier trained on Adience. While this is an initial attempt towards mitigating bias in gender classification, more work is needed in modeling gender as a continuum by assembling more inclusive databases.
CVApr 1, 2020
Boundary-Aware Dense Feature Indicator for Single-Stage 3D Object Detection from Point CloudsGuodong Xu, Wenxiao Wang, Zili Liu et al.
3D object detection based on point clouds has become more and more popular. Some methods propose localizing 3D objects directly from raw point clouds to avoid information loss. However, these methods come with complex structures and significant computational overhead, limiting its broader application in real-time scenarios. Some methods choose to transform the point cloud data into compact tensors first and leverage off-the-shelf 2D detectors to propose 3D objects, which is much faster and achieves state-of-the-art results. However, because of the inconsistency between 2D and 3D data, we argue that the performance of compact tensor-based 3D detectors is restricted if we use 2D detectors without corresponding modification. Specifically, the distribution of point clouds is uneven, with most points gather on the boundary of objects, while detectors for 2D data always extract features evenly. Motivated by this observation, we propose DENse Feature Indicator (DENFI), a universal module that helps 3D detectors focus on the densest region of the point clouds in a boundary-aware manner. Moreover, DENFI is lightweight and guarantees real-time speed when applied to 3D object detectors. Experiments on KITTI dataset show that DENFI improves the performance of the baseline single-stage detector remarkably, which achieves new state-of-the-art performance among previous 3D detectors, including both two-stage and multi-sensor fusion methods, in terms of mAP with a 34FPS detection speed.
CVNov 14, 2019
PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion ModuleLiang Xie, Chao Xiang, Zhengxu Yu et al.
LIDAR point clouds and RGB-images are both extremely essential for 3D object detection. So many state-of-the-art 3D detection algorithms dedicate in fusing these two types of data effectively. However, their fusion methods based on Birds Eye View (BEV) or voxel format are not accurate. In this paper, we propose a novel fusion approach named Point-based Attentive Cont-conv Fusion(PACF) module, which fuses multi-sensor features directly on 3D points. Except for continuous convolution, we additionally add a Point-Pooling and an Attentive Aggregation to make the fused features more expressive. Moreover, based on the PACF module, we propose a 3D multi-sensor multi-task network called Pointcloud-Image RCNN(PI-RCNN as brief), which handles the image segmentation and 3D object detection tasks. PI-RCNN employs a segmentation sub-network to extract full-resolution semantic feature maps from images and then fuses the multi-sensor features via powerful PACF module. Beneficial from the effectiveness of the PACF module and the expressive semantic features from the segmentation module, PI-RCNN can improve much in 3D object detection. We demonstrate the effectiveness of the PACF module and PI-RCNN on the KITTI 3D Detection benchmark, and our method can achieve state-of-the-art on the metric of 3D AP.
CVJul 16, 2019
Improving Semantic Segmentation via Dilated AffinityBoxi Wu, Shuai Zhao, Wenqing Chu et al.
Introducing explicit constraints on the structural predictions has been an effective way to improve the performance of semantic segmentation models. Existing methods are mainly based on insufficient hand-crafted rules that only partially capture the image structure, and some methods can also suffer from the efficiency issue. As a result, most of the state-of-the-art fully convolutional networks did not adopt these techniques. In this work, we propose a simple, fast yet effective method that exploits structural information through direct supervision with minor additional expense. To be specific, our method explicitly requires the network to predict semantic segmentation as well as dilated affinity, which is a sparse version of pair-wise pixel affinity. The capability of telling the relationships between pixels are directly built into the model and enhance the quality of segmentation in two stages. 1) Joint training with dilated affinity can provide robust feature representations and thus lead to finer segmentation results. 2) The extra output of affinity information can be further utilized to refine the original segmentation with a fast propagation process. Consistent improvements are observed on various benchmark datasets when applying our framework to the existing state-of-the-art model. Codes will be released soon.
HCMay 17, 2018
Interpretable Parallel Recurrent Neural Networks with Convolutional Attentions for Multi-Modality Activity ModelingKaixuan Chen, Lina Yao, Xianzhi Wang et al.
Multimodal features play a key role in wearable sensor-based human activity recognition (HAR). Selecting the most salient features adaptively is a promising way to maximize the effectiveness of multimodal sensor data. In this regard, we propose a "collect fully and select wisely" principle as well as an interpretable parallel recurrent model with convolutional attentions to improve the recognition performance. We first collect modality features and the relations between each pair of features to generate activity frames, and then introduce an attention mechanism to select the most prominent regions from activity frames precisely. The selected frames not only maximize the utilization of valid features but also reduce the number of features to be computed effectively. We further analyze the accuracy and interpretability of the proposed model based on extensive experiments. The results show that our model achieves competitive performance on two benchmarked datasets and works well in real life scenarios.
CLJan 27, 2018
Exploration on Generating Traditional Chinese Medicine Prescription from Symptoms with an End-to-End methodWei Li, Zheng Yang, Xu Sun
Traditional Chinese Medicine (TCM) is an influential form of medical treatment in China and surrounding areas. In this paper, we propose a TCM prescription generation task that aims to automatically generate a herbal medicine prescription based on textual symptom descriptions. Sequence-to-sequence (seq2seq) model has been successful in dealing with sequence generation tasks. We explore a potential end-to-end solution to the TCM prescription generation task using seq2seq models. However, experiments show that directly applying seq2seq model leads to unfruitful results due to the repetition problem. To solve the problem, we propose a novel decoder with coverage mechanism and a novel soft loss function. The experimental results demonstrate the effectiveness of the proposed approach. Judged by professors who excel in TCM, the generated prescriptions are rated 7.3 out of 10. It shows that the model can indeed help with the prescribing procedure in real life.
CLNov 6, 2017
Distributed Representation for Traditional Chinese Medicine Herb via Deep Learning ModelsWei Li, Zheng Yang
Traditional Chinese Medicine (TCM) has accumulated a big amount of precious resource in the long history of development. TCM prescriptions that consist of TCM herbs are an important form of TCM treatment, which are similar to natural language documents, but in a weakly ordered fashion. Directly adapting language modeling style methods to learn the embeddings of the herbs can be problematic as the herbs are not strictly in order, the herbs in the front of the prescription can be connected to the very last ones. In this paper, we propose to represent TCM herbs with distributed representations via Prescription Level Language Modeling (PLLM). In one of our experiments, the correlation between our calculated similarity between medicines and the judgment of professionals achieves a Spearman score of 55.35 indicating a strong correlation, which surpasses human beginners (TCM related field bachelor student) by a big margin (over 10%).
LGJun 6, 2017
DeepKey: An EEG and Gait Based Dual-Authentication SystemXiang Zhang, Lina Yao, Chaoran Huang et al.
Biometric authentication involves various technologies to identify individuals by exploiting their unique, measurable physiological and behavioral characteristics. However, traditional biometric authentication systems (e.g., face recognition, iris, retina, voice, and fingerprint) are facing an increasing risk of being tricked by biometric tools such as anti-surveillance masks, contact lenses, vocoder, or fingerprint films. In this paper, we design a multimodal biometric authentication system named Deepkey, which uses both Electroencephalography (EEG) and gait signals to better protect against such risk. Deepkey consists of two key components: an Invalid ID Filter Model to block unauthorized subjects and an identification model based on attention-based Recurrent Neural Network (RNN) to identify a subject`s EEG IDs and gait IDs in parallel. The subject can only be granted access while all the components produce consistent evidence to match the user`s proclaimed identity. We implement Deepkey with a live deployment in our university and conduct extensive empirical experiments to study its technical feasibility in practice. DeepKey achieves the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) of 0 and 1.0%, respectively. The preliminary results demonstrate that Deepkey is feasible, show consistent superior performance compared to a set of methods, and has the potential to be applied to the authentication deployment in real world settings.