2.6CVNov 2, 2022Code
Hypergraph Convolutional Network based Weakly Supervised Point Cloud Semantic Segmentation with Scene-Level AnnotationsZhuheng Lu, Peng Zhang, Yuewei Dai et al.
Point cloud segmentation with scene-level annotations is a promising but challenging task. Currently, the most popular way is to employ the class activation map (CAM) to locate discriminative regions and then generate point-level pseudo labels from scene-level annotations. However, these methods always suffer from the point imbalance among categories, as well as the sparse and incomplete supervision from CAM. In this paper, we propose a novel weighted hypergraph convolutional network-based method, called WHCN, to confront the challenges of learning point-wise labels from scene-level annotations. Firstly, in order to simultaneously overcome the point imbalance among different categories and reduce the model complexity, superpoints of a training point cloud are generated by exploiting the geometrically homogeneous partition. Then, a hypergraph is constructed based on the high-confidence superpoint-level seeds which are converted from scene-level annotations. Secondly, the WHCN takes the hypergraph as input and learns to predict high-precision point-level pseudo labels by label propagation. Besides the backbone network consisting of spectral hypergraph convolution blocks, a hyperedge attention module is learned to adjust the weights of hyperedges in the WHCN. Finally, a segmentation network is trained by these pseudo point cloud labels. We comprehensively conduct experiments on the ScanNet and S3DIS segmentation datasets. Experimental results demonstrate that the proposed WHCN is effective to predict the point labels with scene annotations, and yields state-of-the-art results in the community. The source code is available at http://zhiyongsu.github.io/Project/WHCN.html.
1.4CVNov 2, 2022
AS-PD: An Arbitrary-Size Downsampling Framework for Point CloudsPeng Zhang, Ruoyin Xie, Jinsheng Sun et al.
Point cloud downsampling is a crucial pre-processing operation to downsample points in order to unify data size and reduce computational cost, to name a few. Recent research on point cloud downsampling has achieved great success which concentrates on learning to sample in a task-aware way. However, existing learnable samplers can not directly perform arbitrary-size downsampling, and assume the input size is fixed. In this paper, we introduce the AS-PD, a novel task-aware sampling framework that directly downsamples point clouds to any smaller size based on a sample-to-refine strategy. Given an input point cloud of arbitrary size, we first perform a task-agnostic pre-sampling on the input point cloud to a specified sample size. Then, we obtain the sampled set by refining the pre-sampled set to make it task-aware, driven by downstream task losses. The refinement is realized by adding each pre-sampled point with a small offset predicted by point-wise multi-layer perceptrons (MLPs). With the density encoding and proper training scheme, the framework can learn to adaptively downsample point clouds of different input sizes to arbitrary sample sizes. We evaluate sampled results for classification and registration tasks, respectively. The proposed AS-PD surpasses the state-of-the-art method in terms of downstream performance. Further experiments also show that our AS-PD exhibits better generality to unseen task models, implying that the proposed sampler is optimized to the task rather than a specified task model.
18.5CVOct 12, 2022
Human Joint Kinematics Diffusion-Refinement for Stochastic Motion PredictionDong Wei, Huaijiang Sun, Bin Li et al.
Stochastic human motion prediction aims to forecast multiple plausible future motions given a single pose sequence from the past. Most previous works focus on designing elaborate losses to improve the accuracy, while the diversity is typically characterized by randomly sampling a set of latent variables from the latent prior, which is then decoded into possible motions. This joint training of sampling and decoding, however, suffers from posterior collapse as the learned latent variables tend to be ignored by a strong decoder, leading to limited diversity. Alternatively, inspired by the diffusion process in nonequilibrium thermodynamics, we propose MotionDiff, a diffusion probabilistic model to treat the kinematics of human joints as heated particles, which will diffuse from original states to a noise distribution. This process offers a natural way to obtain the "whitened" latents without any trainable parameters, and human motion prediction can be regarded as the reverse diffusion process that converts the noise distribution into realistic future motions conditioned on the observed sequence. Specifically, MotionDiff consists of two parts: a spatial-temporal transformer-based diffusion network to generate diverse yet plausible motions, and a graph convolutional network to further refine the outputs. Experimental results on two datasets demonstrate that our model yields the competitive performance in terms of both accuracy and diversity.
5.0CVApr 10, 2023
DeFeeNet: Consecutive 3D Human Motion Prediction with Deviation FeedbackXiaoning Sun, Huaijiang Sun, Bin Li et al.
Let us rethink the real-world scenarios that require human motion prediction techniques, such as human-robot collaboration. Current works simplify the task of predicting human motions into a one-off process of forecasting a short future sequence (usually no longer than 1 second) based on a historical observed one. However, such simplification may fail to meet practical needs due to the neglect of the fact that motion prediction in real applications is not an isolated ``observe then predict'' unit, but a consecutive process composed of many rounds of such unit, semi-overlapped along the entire sequence. As time goes on, the predicted part of previous round has its corresponding ground truth observable in the new round, but their deviation in-between is neither exploited nor able to be captured by existing isolated learning fashion. In this paper, we propose DeFeeNet, a simple yet effective network that can be added on existing one-off prediction models to realize deviation perception and feedback when applied to consecutive motion prediction task. At each prediction round, the deviation generated by previous unit is first encoded by our DeFeeNet, and then incorporated into the existing predictor to enable a deviation-aware prediction manner, which, for the first time, allows for information transmit across adjacent prediction units. We design two versions of DeFeeNet as MLP-based and GRU-based, respectively. On Human3.6M and more complicated BABEL, experimental results indicate that our proposed network improves consecutive human motion prediction performance regardless of the basic model.
5.0CVApr 13, 2023
Meta-Auxiliary Learning for Adaptive Human Pose PredictionQiongjie Cui, Huaijiang Sun, Jianfeng Lu et al.
Predicting high-fidelity future human poses, from a historically observed sequence, is decisive for intelligent robots to interact with humans. Deep end-to-end learning approaches, which typically train a generic pre-trained model on external datasets and then directly apply it to all test samples, emerge as the dominant solution to solve this issue. Despite encouraging progress, they remain non-optimal, as the unique properties (e.g., motion style, rhythm) of a specific sequence cannot be adapted. More generally, at test-time, once encountering unseen motion categories (out-of-distribution), the predicted poses tend to be unreliable. Motivated by this observation, we propose a novel test-time adaptation framework that leverages two self-supervised auxiliary tasks to help the primary forecasting network adapt to the test sequence. In the testing phase, our model can adjust the model parameters by several gradient updates to improve the generation quality. However, due to catastrophic forgetting, both auxiliary tasks typically tend to the low ability to automatically present the desired positive incentives for the final prediction performance. For this reason, we also propose a meta-auxiliary learning scheme for better adaptation. In terms of general setup, our approach obtains higher accuracy, and under two new experimental designs for out-of-distribution data (unseen subjects and categories), achieves significant improvements.
5.7CVAug 2, 2022
Overlooked Poses Actually Make Sense: Distilling Privileged Knowledge for Human Motion PredictionXiaoning Sun, Qiongjie Cui, Huaijiang Sun et al.
Previous works on human motion prediction follow the pattern of building a mapping relation between the sequence observed and the one to be predicted. However, due to the inherent complexity of multivariate time series data, it still remains a challenge to find the extrapolation relation between motion sequences. In this paper, we present a new prediction pattern, which introduces previously overlooked human poses, to implement the prediction task from the view of interpolation. These poses exist after the predicted sequence, and form the privileged sequence. To be specific, we first propose an InTerPolation learning Network (ITP-Network) that encodes both the observed sequence and the privileged sequence to interpolate the in-between predicted sequence, wherein the embedded Privileged-sequence-Encoder (Priv-Encoder) learns the privileged knowledge (PK) simultaneously. Then, we propose a Final Prediction Network (FP-Network) for which the privileged sequence is not observable, but is equipped with a novel PK-Simulator that distills PK learned from the previous network. This simulator takes as input the observed sequence, but approximates the behavior of Priv-Encoder, enabling FP-Network to imitate the interpolation process. Extensive experimental results demonstrate that our prediction pattern achieves state-of-the-art performance on benchmarked H3.6M, CMU-Mocap and 3DPW datasets in both short-term and long-term predictions.
3.7CVNov 2, 2022
Joint Data and Feature Augmentation for Self-Supervised Representation Learning on Point CloudsZhuheng Lu, Yuewei Dai, Weiqing Li et al.
To deal with the exhausting annotations, self-supervised representation learning from unlabeled point clouds has drawn much attention, especially centered on augmentation-based contrastive methods. However, specific augmentations hardly produce sufficient transferability to high-level tasks on different datasets. Besides, augmentations on point clouds may also change underlying semantics. To address the issues, we propose a simple but efficient augmentation fusion contrastive learning framework to combine data augmentations in Euclidean space and feature augmentations in feature space. In particular, we propose a data augmentation method based on sampling and graph generation. Meanwhile, we design a data augmentation network to enable a correspondence of representations by maximizing consistency between augmented graph pairs. We further design a feature augmentation network that encourages the model to learn representations invariant to the perturbations using an encoder perturbation. We comprehensively conduct extensive object classification experiments and object part segmentation experiments to validate the transferability of the proposed framework. Experimental results demonstrate that the proposed framework is effective to learn the point cloud representation in a self-supervised manner, and yields state-of-the-art results in the community. The source code is publicly available at: https://zhiyongsu.github.io/Project/AFSRL.html.
2.0CVJul 31, 2024
Fine-grained Metrics for Point Cloud Semantic SegmentationZhuheng Lu, Ting Wu, Yuewei Dai et al.
Two forms of imbalances are commonly observed in point cloud semantic segmentation datasets: (1) category imbalances, where certain objects are more prevalent than others; and (2) size imbalances, where certain objects occupy more points than others. Because of this, the majority of categories and large objects are favored in the existing evaluation metrics. This paper suggests fine-grained mIoU and mAcc for a more thorough assessment of point cloud segmentation algorithms in order to address these issues. Richer statistical information is provided for models and datasets by these fine-grained metrics, which also lessen the bias of current semantic segmentation metrics towards large objects. The proposed metrics are used to train and assess various semantic segmentation algorithms on three distinct indoor and outdoor semantic segmentation datasets.
5.9DCApr 3, 2025
FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware SchedulingWeiqing Li, Guochao Jiang, Xiangyong Ding et al.
Disaggregated inference has become an essential framework that separates the prefill (P) and decode (D) stages in large language model inference to improve throughput. However, the KV cache transfer faces significant delays between prefill and decode nodes. The block-wise calling method and discontinuous KV cache memory allocation increase the number of calls to the transmission kernel. Additionally, existing frameworks often fix the roles of P and D nodes, leading to computational imbalances. In this paper, we propose FlowKV, a novel disaggregated inference framework, which reduces the average transmission latency of KV cache by 96%, from 0.944s to 0.053s, almost eliminating the transfer time relative to the total request latency by optimizing the KV cache transfer. FlowKV introduces the Load-Aware Scheduler for balanced request scheduling and flexible PD node allocation. This design maximizes hardware resource utilization, achieving peak system throughput across various scenarios, including normal, computational imbalance, and extreme overload conditions. Experimental results demonstrate that FlowKV significantly accelerates inference by 15.2%-48.9% on LongBench dataset compared to the baseline and supports applications with heterogeneous GPUs.
3.6CVAug 7, 2025
Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop FrameworkPeng Zhang, Songru Yang, Jinsheng Sun et al.
Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. However, existing methods rely on resource-intensive offline incremental learning or densely annotated support data, limiting their practicality. To address these limitations, we propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. Specifically, we construct class prototypes, the fundamental segmentation units, directly on the query data, avoiding the prototype bias caused by intra-class distribution shifts between the support and query data. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes. Considering the lack of granularity of initial prototypes, we introduce a hierarchical prototype disambiguation mechanism to refine ambiguous prototypes, which correspond to annotations of different classes. To further enrich contextual awareness, we employ a dense conditional random field (CRF) upon the refined prototypes to optimize their label assignments. Through iterative human feedback, HOW-Seg dynamically improves its predictions, achieving high-quality segmentation for both base and novel classes. Experiments demonstrate that with sparse annotations (e.g., one-novel-class-one-click), HOW-Seg matches or surpasses the state-of-the-art generalized few-shot segmentation (GFS-Seg) method under the 5-shot setting. When using advanced backbones (e.g., Stratified Transformer) and denser annotations (e.g., 10 clicks per sub-scene), HOW-Seg achieves 85.27% mIoU on S3DIS and 66.37% mIoU on ScanNetv2, significantly outperforming alternatives.
No-reference geometry quality assessment for colorless point clouds via list-wise rank learningZheng Li, Bingxu Xie, Chao Chu et al.
Geometry quality assessment (GQA) of colorless point clouds is crucial for evaluating the performance of emerging point cloud-based solutions (e.g., watermarking, compression, and 3-Dimensional (3D) reconstruction). Unfortunately, existing objective GQA approaches are traditional full-reference metrics, whereas state-of-the-art learning-based point cloud quality assessment (PCQA) methods target both color and geometry distortions, neither of which are qualified for the no-reference GQA task. In addition, the lack of large-scale GQA datasets with subjective scores, which are always imprecise, biased, and inconsistent, also hinders the development of learning-based GQA metrics. Driven by these limitations, this paper proposes a no-reference geometry-only quality assessment approach based on list-wise rank learning, termed LRL-GQA, which comprises of a geometry quality assessment network (GQANet) and a list-wise rank learning network (LRLNet). The proposed LRL-GQA formulates the no-reference GQA as a list-wise rank problem, with the objective of directly optimizing the entire quality ordering. Specifically, a large dataset containing a variety of geometry-only distortions is constructed first, named LRL dataset, in which each sample is label-free but coupled with quality ranking information. Then, the GQANet is designed to capture intrinsic multi-scale patch-wise geometric features in order to predict a quality index for each point cloud. After that, the LRLNet leverages the LRL dataset and a likelihood loss to train the GQANet and ranks the input list of degraded point clouds according to their distortion levels. In addition, the pre-trained GQANet can be fine-tuned further to obtain absolute quality scores. Experimental results demonstrate the superior performance of the proposed no-reference LRL-GQA method compared with existing full-reference GQA metrics.
The Worse The Better: Content-Aware Viewpoint Generation Network for Projection-related Point Cloud Quality AssessmentZhiyong Su, Bingxu Xie, Zheng Li et al.
Through experimental studies, however, we observed the instability of final predicted quality scores, which change significantly over different viewpoint settings. Inspired by the "wooden barrel theory", given the default content-independent viewpoints of existing projection-related PCQA approaches, this paper presents a novel content-aware viewpoint generation network (CAVGN) to learn better viewpoints by taking the distribution of geometric and attribute features of degraded point clouds into consideration. Firstly, the proposed CAVGN extracts multi-scale geometric and texture features of the entire input point cloud, respectively. Then, for each default content-independent viewpoint, the extracted geometric and texture features are refined to focus on its corresponding visible part of the input point cloud. Finally, the refined geometric and texture features are concatenated to generate an optimized viewpoint. To train the proposed CAVGN, we present a self-supervised viewpoint ranking network (SSVRN) to select the viewpoint with the worst quality projected image to construct a default-optimized viewpoint dataset, which consists of thousands of paired default viewpoints and corresponding optimized viewpoints. Experimental results show that the projection-related PCQA methods can achieve higher performance using the viewpoints generated by the proposed CAVGN.
2.0CVJan 26, 2024
VN-Net: Vision-Numerical Fusion Graph Convolutional Network for Sparse Spatio-Temporal Meteorological ForecastingYutong Xiong, Xun Zhu, Ming Wu et al.
Sparse meteorological forecasting is indispensable for fine-grained weather forecasting and deserves extensive attention. Recent studies have highlighted the potential of spatio-temporal graph convolutional networks (ST-GCNs) in predicting numerical data from ground weather stations. However, as one of the highest fidelity and lowest latency data, the application of the vision data from satellites in ST-GCNs remains unexplored. There are few studies to demonstrate the effectiveness of combining the above multi-modal data for sparse meteorological forecasting. Towards this objective, we introduce Vision-Numerical Fusion Graph Convolutional Network (VN-Net), which mainly utilizes: 1) Numerical-GCN (N-GCN) to adaptively model the static and dynamic patterns of spatio-temporal numerical data; 2) Vision-LSTM Network (V-LSTM) to capture multi-scale joint channel and spatial features from time series satellite images; 4) a GCN-based decoder to generate hourly predictions of specified meteorological factors. As far as we know, VN-Net is the first attempt to introduce GCN method to utilize multi-modal data for better handling sparse spatio-temporal meteorological forecasting. Our experiments on Weather2k dataset show VN-Net outperforms state-of-the-art by a significant margin on mean absolute error (MAE) and root mean square error (RMSE) for temperature, relative humidity, and visibility forecasting. Furthermore, we conduct interpretation analysis and design quantitative evaluation metrics to assess the impact of incorporating vision data.
9.1CVMay 23, 2023
Enhanced Fine-grained Motion Diffusion for Text-driven Human Motion SynthesisDong Wei, Xiaoning Sun, Huaijiang Sun et al.
The emergence of text-driven motion synthesis technique provides animators with great potential to create efficiently. However, in most cases, textual expressions only contain general and qualitative motion descriptions, while lack fine depiction and sufficient intensity, leading to the synthesized motions that either (a) semantically compliant but uncontrollable over specific pose details, or (b) even deviates from the provided descriptions, bringing animators with undesired cases. In this paper, we propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with KeyFrames Collaborated, enabling realistic generation with collaborative and efficient dual-level control: coarse guidance at semantic level, with only few keyframes for direct and fine-grained depiction down to body posture level. Unlike existing inference-editing diffusion models that incorporate conditions without training, our conditional diffusion model is explicitly trained and can fully exploit correlations among texts, keyframes and the diffused target frames. To preserve the control capability of discrete and sparse keyframes, we customize dilated mask attention modules where only partial valid tokens participate in local-to-global attention, indicated by the dilated keyframe mask. Additionally, we develop a simple yet effective smoothness prior, which steers the generated frames towards seamless keyframe transitions at inference. Extensive experiments show that our model not only achieves state-of-the-art performance in terms of semantic fidelity, but more importantly, is able to satisfy animator requirements through fine-grained guidance without tedious labor.