49.9CVMay 28
DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and GroundingLuzhou Ge, Xiangyu Zhu, Jinyan Liu et al.
Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind
IVJul 7, 2023
Thoracic Cartilage Ultrasound-CT Registration using Dense Skeleton GraphZhongliang Jiang, Chenyang Li, Xuesong Li et al.
Autonomous ultrasound (US) imaging has gained increased interest recently, and it has been seen as a potential solution to overcome the limitations of free-hand US examinations, such as inter-operator variations. However, it is still challenging to accurately map planned paths from a generic atlas to individual patients, particularly for thoracic applications with high acoustic-impedance bone structures under the skin. To address this challenge, a graph-based non-rigid registration is proposed to enable transferring planned paths from the atlas to the current setup by explicitly considering subcutaneous bone surface features instead of the skin surface. To this end, the sternum and cartilage branches are segmented using a template matching to assist coarse alignment of US and CT point clouds. Afterward, a directed graph is generated based on the CT template. Then, the self-organizing map using geographical distance is successively performed twice to extract the optimal graph representations for CT and US point clouds, individually. To evaluate the proposed approach, five cartilage point clouds from distinct patients are employed. The results demonstrate that the proposed graph-based registration can effectively map trajectories from CT to the current setup for displaying US views through limited intercostal space. The non-rigid registration results in terms of Hausdorff distance (Mean$\pm$SD) is 9.48$\pm$0.27 mm and the path transferring error in terms of Euclidean distance is 2.21$\pm$1.11 mm.
CVSep 19, 2023
Cross-modal and Cross-domain Knowledge Transfer for Label-free 3D SegmentationJingyu Zhang, Huitong Yang, Dai-Jie Wu et al.
Current state-of-the-art point cloud-based perception methods usually rely on large-scale labeled data, which requires expensive manual annotations. A natural option is to explore the unsupervised methodology for 3D perception tasks. However, such methods often face substantial performance-drop difficulties. Fortunately, we found that there exist amounts of image-based datasets and an alternative can be proposed, i.e., transferring the knowledge in the 2D images to 3D point clouds. Specifically, we propose a novel approach for the challenging cross-modal and cross-domain adaptation task by fully exploring the relationship between images and point clouds and designing effective feature alignment strategies. Without any 3D labels, our method achieves state-of-the-art performance for 3D point cloud semantic segmentation on SemanticKITTI by using the knowledge of KITTI360 and GTA5, compared to existing unsupervised and weakly-supervised baselines.
CVMar 1
RnG: A Unified Transformer for Complete 3D Modeling from Partial ObservationsMochu Xiang, Zhelun Shen, Xuesong Li et al.
Human perceive the 3D world through 2D observations from limited viewpoints. While recent feed-forward generalizable 3D reconstruction models excel at recovering 3D structures from sparse images, their representations are often confined to observed regions, leaving unseen geometry un-modeled. This raises a key, fundamental challenge: Can we infer a complete 3D structure from partial 2D observations? We present RnG (Reconstruction and Generation), a novel feed-forward Transformer that unifies these two tasks by predicting an implicit, complete 3D representation. At the core of RnG, we propose a reconstruction-guided causal attention mechanism that separates reconstruction and generation at the attention level, and treats the KV-cache as an implicit 3D representation. Then, arbitrary poses can efficiently query this cache to render high-fidelity, novel-view RGBD outputs. As a result, RnG not only accurately reconstructs visible geometry but also generates plausible, coherent unseen geometry and appearance. Our method achieves state-of-the-art performance in both generalizable 3D reconstruction and novel view generation, while operating efficiently enough for real-time interactive applications. Project page: https://npucvr.github.io/RnG
CVSep 14, 2024
Registration between Point Cloud Streams and Sequential Bounding Boxes via Gradient DescentXuesong Li, Xinge Zhu, Yuexin Ma et al.
In this paper, we propose an algorithm for registering sequential bounding boxes with point cloud streams. Unlike popular point cloud registration techniques, the alignment of the point cloud and the bounding box can rely on the properties of the bounding box, such as size, shape, and temporal information, which provides substantial support and performance gains. Motivated by this, we propose a new approach to tackle this problem. Specifically, we model the registration process through an overall objective function that includes the final goal and all constraints. We then optimize the function using gradient descent. Our experiments show that the proposed method performs remarkably well with a 40\% improvement in IoU and demonstrates more robust registration between point cloud streams and sequential bounding boxes
IRJul 30, 2024
RevGNN: Negative Sampling Enhanced Contrastive Graph Learning for Academic Reviewer RecommendationWeibin Liao, Yifan Zhu, Yanyan Li et al.
Acquiring reviewers for academic submissions is a challenging recommendation scenario. Recent graph learning-driven models have made remarkable progress in the field of recommendation, but their performance in the academic reviewer recommendation task may suffer from a significant false negative issue. This arises from the assumption that unobserved edges represent negative samples. In fact, the mechanism of anonymous review results in inadequate exposure of interactions between reviewers and submissions, leading to a higher number of unobserved interactions compared to those caused by reviewers declining to participate. Therefore, investigating how to better comprehend the negative labeling of unobserved interactions in academic reviewer recommendations is a significant challenge. This study aims to tackle the ambiguous nature of unobserved interactions in academic reviewer recommendations. Specifically, we propose an unsupervised Pseudo Neg-Label strategy to enhance graph contrastive learning (GCL) for recommending reviewers for academic submissions, which we call RevGNN. RevGNN utilizes a two-stage encoder structure that encodes both scientific knowledge and behavior using Pseudo Neg-Label to approximate review preference. Extensive experiments on three real-world datasets demonstrate that RevGNN outperforms all baselines across four metrics. Additionally, detailed further analyses confirm the effectiveness of each component in RevGNN.
77.6ROMay 21
EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot ControlChushan Zhang, Ruihan Lu, Jinguang Tong et al.
Chunked vision-language-action (VLA) policies predict multi-step robot controls, conditioning each update on the current visual observation alone. Yet robot actions cause contact, occlusion, and object motion, and the geometry that later decisions depend on can change before the next visual update arrives. Spatial VLAs improve current-frame geometry. Temporal VLAs aggregate past frames. Neither maintains an action-updated scene prior across chunks. We argue for a persistent action-updated scene state across control calls, and introduce EvoScene-VLA. Its recurrent scene prefix carries a geometry-aware scene state across chunks. At each vision-language model (VLM) call, the VLM combines scene information from the current observation with the action-updated prior from the previous chunk; the action decoder outputs both the next action chunk and a compact scene update. This update becomes the next prior, which the VLM corrects against the new observation when the next call arrives. Each control call therefore starts from a scene prior that reflects both recent actions and fresh visual evidence. During training, \textbf{Scene Predictor} supplies future scene-token targets, and Geometric Anchor aligns scene slots with frozen depth and 3D teachers. We discard both modules at deployment. On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. On the Galaxea R1-Lite real robot, EvoScene-VLA outperforms all baselines.
44.7CVMay 6
From Diffusion to Rectified Flow: Rethinking Text-Based SegmentationZishen Qu, Xuesong Li, Haijian Gu et al.
Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope compared to traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, leading to studies of using diffusion models as feature extractors for segmentation tasks. Such methods, however, inherit the generative natures of diffusion models that are harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially on zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even on a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to model structure, thus reveals promising application potential and significant research value.
CVJul 25, 2024
SSTD: Stripe-Like Space Target Detection Using Single-Point Weak SupervisionZijian Zhu, Ali Zia, Xuesong Li et al.
Stripe-like space target detection (SSTD) plays a key role in enhancing space situational awareness and assessing spacecraft behaviour. This domain faces three challenges: the lack of publicly available datasets, interference from stray light and stars, and the variability of stripe-like targets, which makes manual labeling both inaccurate and labor-intensive. In response, we introduces `AstroStripeSet', a pioneering dataset designed for SSTD, aiming to bridge the gap in academic resources and advance research in SSTD. Furthermore, we propose a novel teacher-student label evolution framework with single-point weak supervision, providing a new solution to the challenges of manual labeling. This framework starts with generating initial pseudo-labels using the zero-shot capabilities of the Segment Anything Model (SAM) in a single-point setting. After that, the fine-tuned StripeSAM serves as the teacher and the newly developed StripeNet as the student, consistently improving segmentation performance through label evolution, which iteratively refines these labels. We also introduce `GeoDice', a new loss function customized for the linear characteristics of stripe-like targets. Extensive experiments show that our method matches fully supervised approaches, exhibits strong zero-shot generalization for diverse space-based and ground-based real-world images, and sets a new state-of-the-art (SOTA) benchmark. Our AstroStripeSet dataset and code will be made publicly available.
99.4CVMar 16
MVHOI: Bridge Multi-view Condition to Complex Human-Object Interaction Video Reenactment via 3D Foundation ModelJinguang Tong, Jinbo Wu, Kaisiyuan Wang et al.
Human-Object Interaction (HOI) video reenactment with realistic motion remains a frontier in expressive digital human creation. Existing approaches primarily handle simple image-plane motion (e.g., in-plane translations), struggling with complex non-planar manipulations like out-of-plane reorientation. In this paper, we propose MVHOI, a two-stage HOI video reenactment framework that bridges multi-view reference conditions and video foundation models via a 3D Foundation Model (3DFM). The 3DFM first produces view-consistent object priors conditioned on implicit motion dynamics across novel viewpoints. A controllable video generation model then synthesizes high-fidelity object texture by incorporating multi-view reference images, ensuring appearance consistency via a reasonable retrieval mechanism. By enabling these two stages to mutually reinforce one another during the inference phase, our framework shows superior performance in generating long-duration HOI videos with intricate object manipulations. Extensive experiments show substantial improvements over prior approaches, especially for HOI with complex 3D object manipulations.
52.2CVMay 19
Structural Energy Guidance for View-Consistent Text-to-3D GenerationQing Zhang, Jinguang Tong, Jing Zhang et al.
Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.
CVMar 27, 2024Code
Backpropagation-free Network for 3D Test-time AdaptationYanshuo Wang, Ali Cheraghian, Zeeshan Hayder et al.
Real-world systems often encounter new data over time, which leads to experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods tend to apply computationally heavy and memory-intensive backpropagation-based approaches to handle this. Here, we propose a novel method that uses a backpropagation-free approach for TTA for the specific case of 3D data. Our model uses a two-stream architecture to maintain knowledge about the source domain as well as complementary target-domain-specific information. The backpropagation-free property of our model helps address the well-known forgetting problem and mitigates the error accumulation issue. The proposed method also eliminates the need for the usually noisy process of pseudo-labeling and reliance on costly self-supervised training. Moreover, our method leverages subspace learning, effectively reducing the distribution variance between the two domains. Furthermore, the source-domain-specific and the target-domain-specific streams are aligned using a novel entropy-based adaptive fusion strategy. Extensive experiments on popular benchmarks demonstrate the effectiveness of our method. The code will be available at \url{https://github.com/abie-e/BFTT3D}.
CVDec 16, 2025
Quality-Driven and Diversity-Aware Sample Expansion for Robust Marine Obstacle SegmentationMiaohua Zhang, Mohammad Ali Armin, Xuesong Li et al.
Marine obstacle detection demands robust segmentation under challenging conditions, such as sun glitter, fog, and rapidly changing wave patterns. These factors degrade image quality, while the scarcity and structural repetition of marine datasets limit the diversity of available training data. Although mask-conditioned diffusion models can synthesize layout-aligned samples, they often produce low-diversity outputs when conditioned on low-entropy masks and prompts, limiting their utility for improving robustness. In this paper, we propose a quality-driven and diversity-aware sample expansion pipeline that generates training data entirely at inference time, without retraining the diffusion model. The framework combines two key components:(i) a class-aware style bank that constructs high-entropy, semantically grounded prompts, and (ii) an adaptive annealing sampler that perturbs early conditioning, while a COD-guided proportional controller regulates this perturbation to boost diversity without compromising layout fidelity. Across marine obstacle benchmarks, augmenting training data with these controlled synthetic samples consistently improves segmentation performance across multiple backbones and increases visual variation in rare and texture-sensitive classes.
CVAug 9, 2024
Collaborative Static-Dynamic Teaching: A Semi-Supervised Framework for Stripe-Like Space Target DetectionZijian Zhu, Ali Zia, Xuesong Li et al.
Stripe-like space target detection (SSTD) is crucial for space situational awareness. Traditional unsupervised methods often fail in low signal-to-noise ratio and variable stripe-like space targets scenarios, leading to weak generalization. Although fully supervised learning methods improve model generalization, they require extensive pixel-level labels for training. In the SSTD task, manually creating these labels is often inaccurate and labor-intensive. Semi-supervised learning (SSL) methods reduce the need for these labels and enhance model generalizability, but their performance is limited by pseudo-label quality. To address this, we introduce an innovative Collaborative Static-Dynamic Teacher (CSDT) SSL framework, which includes static and dynamic teacher models as well as a student model. This framework employs a customized adaptive pseudo-labeling (APL) strategy, transitioning from initial static teaching to adaptive collaborative teaching, guiding the student model's training. The exponential moving average (EMA) mechanism further enhances this process by feeding new stripe-like knowledge back to the dynamic teacher model through the student model, creating a positive feedback loop that continuously enhances the quality of pseudo-labels. Moreover, we present MSSA-Net, a novel SSTD network featuring a multi-scale dual-path convolution (MDPC) block and a feature map weighted attention (FMWA) block, designed to extract diverse stripe-like features within the CSDT SSL training framework. Extensive experiments verify the state-of-the-art performance of our framework on the AstroStripeSet and various ground-based and space-based real-world datasets.
CVJun 16, 2025Code
GS-2DGS: Geometrically Supervised 2DGS for Reflective Object ReconstructionJinguang Tong, Xuesong li, Fahira Afzal Maken et al.
3D modeling of highly reflective objects remains challenging due to strong view-dependent appearances. While previous SDF-based methods can recover high-quality meshes, they are often time-consuming and tend to produce over-smoothed surfaces. In contrast, 3D Gaussian Splatting (3DGS) offers the advantage of high speed and detailed real-time rendering, but extracting surfaces from the Gaussians can be noisy due to the lack of geometric constraints. To bridge the gap between these approaches, we propose a novel reconstruction method called GS-2DGS for reflective objects based on 2D Gaussian Splatting (2DGS). Our approach combines the rapid rendering capabilities of Gaussian Splatting with additional geometric information from foundation models. Experimental results on synthetic and real datasets demonstrate that our method significantly outperforms Gaussian-based techniques in terms of reconstruction and relighting and achieves performance comparable to SDF-based methods while being an order of magnitude faster. Code is available at https://github.com/hirotong/GS2DGS
86.5CVApr 21Code
EgoSelf: From Memory to Personalized Egocentric AssistantYanshuo Wang, Yuan Xu, Xuesong Li et al.
Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at \href{https://abie-e.github.io/egoself_project/}{https://abie-e.github.io/egoself\_project/}.
CVFeb 6
Adaptive and Balanced Re-initialization for Long-timescale Continual Test-time Domain AdaptationYanshuo Wang, Jinguang Tong, Jun Lan et al.
Continual test-time domain adaptation (CTTA) aims to adjust models so that they can perform well over time across non-stationary environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: Can the model adapt to continually changing environments over a long time? In this work, we explore facilitating better CTTA in the long run using a re-initialization (or reset) based method. First, we observe that the long-term performance is associated with the trajectory pattern in label flip. Based on this observed correlation, we propose a simple yet effective policy, Adaptive-and-Balanced Re-initialization (ABR), towards preserving the model's long-term performance. In particular, ABR performs weight re-initialization using adaptive intervals. The adaptive interval is determined based on the change in label flip. The proposed method is validated on extensive CTTA benchmarks, achieving superior performance.
72.2CVMay 11
Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy MaximizationMengqi He, Xinyu Tian, Xin Shen et al.
Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model without enforcing a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and non-refusal tokens already carry substantial probability mass among the top-ranked candidates before attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization(UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
56.1ROMar 16
ReMAP-DP: Reprojected Multi-view Aligned PointMaps for Diffusion PolicyXinzhang Yang, Renjun Wu, Jinyan Liu et al.
Generalist robot policies built upon 2D visual representations excel at semantic reasoning but inherently lack the explicit 3D spatial awareness required for high-precision tasks. Existing 3D integration methods struggle to bridge this gap due to the structural irregularity of sparse point clouds and the geometric distortion introduced by multi-view orthographic rendering. To overcome these barriers, we present ReMAP-DP, a novel framework synergizing standardized perspective reprojection with a structure-aware dual-stream diffusion policy. By coupling the re-projected views with pixel-aligned PointMaps, our dual-stream architecture leverages learnable modality embeddings to fuse frozen semantic features and explicit geometric descriptors, ensuring precise implicit patch-level alignment. Extensive experiments across simulation and real-world environments demonstrate ReMAP-DP's superior performance in diverse manipulation tasks. On RoboTwin 2.0, it attains a 59.3% average success rate, outperforming the DP3 baseline by +6.6%. On ManiSkill 3, our method yields a 28% improvement over DP3 on the geometrically challenging Stack Cube task. Furthermore, ReMAP-DP exhibits remarkable real-world robustness, executing high-precision and dynamic manipulations with superior data efficiency from only a handful of demonstrations. Project page is available at: https://icr-lab.github.io/ReMAP-DP/
CVFeb 24
Probing and Bridging Geometry-Interaction Cues for Affordance Reasoning in Vision Foundation ModelsQing Zhang, Xuesong Li, Jing Zhang
What does it mean for a visual system to truly understand affordance? We argue that this understanding hinges on two complementary capacities: geometric perception, which identifies the structural parts of objects that enable interaction, and interaction perception, which models how an agent's actions engage with those parts. To test this hypothesis, we conduct a systematic probing of Visual Foundation Models (VFMs). We find that models like DINO inherently encode part-level geometric structures, while generative models like Flux contain rich, verb-conditioned spatial attention maps that serve as implicit interaction priors. Crucially, we demonstrate that these two dimensions are not merely correlated but are composable elements of affordance. By simply fusing DINO's geometric prototypes with Flux's interaction maps in a training-free and zero-shot manner, we achieve affordance estimation competitive with weakly-supervised methods. This final fusion experiment confirms that geometric and interaction perception are the fundamental building blocks of affordance understanding in VFMs, providing a mechanistic account of how perception grounds action.
CVApr 6, 2024
Latent-based Diffusion Model for Long-tailed RecognitionPengxiao Han, Changkun Ye, Jieming Zhou et al.
Long-tailed imbalance distribution is a common issue in practical computer vision applications. Previous works proposed methods to address this problem, which can be categorized into several classes: re-sampling, re-weighting, transfer learning, and feature augmentation. In recent years, diffusion models have shown an impressive generation ability in many sub-problems of deep computer vision. However, its powerful generation has not been explored in long-tailed problems. We propose a new approach, the Latent-based Diffusion Model for Long-tailed Recognition (LDMLR), as a feature augmentation method to tackle the issue. First, we encode the imbalanced dataset into features using the baseline model. Then, we train a Denoising Diffusion Implicit Model (DDIM) using these encoded features to generate pseudo-features. Finally, we train the classifier using the encoded and pseudo-features from the previous two steps. The model's accuracy shows an improvement on the CIFAR-LT and ImageNet-LT datasets by using the proposed method.
CVJan 22, 2025
Neural Radiance Fields for the Real World: A SurveyWenhui Xiao, Remi Chierchia, Rodrigo Santa Cruz et al.
Neural Radiance Fields (NeRFs) have remodeled 3D scene representation since release. NeRFs can effectively reconstruct complex 3D scenes from 2D images, advancing different fields and applications such as scene understanding, 3D content generation, and robotics. Despite significant research progress, a thorough review of recent innovations, applications, and challenges is lacking. This survey compiles key theoretical advancements and alternative representations and investigates emerging challenges. It further explores applications on reconstruction, highlights NeRFs' impact on computer vision and robotics, and reviews essential datasets and toolkits. By identifying gaps in the literature, this survey discusses open challenges and offers directions for future research.
CVDec 5, 2024
DGNS: Deformable Gaussian Splatting and Dynamic Neural Surface for Monocular Dynamic 3D ReconstructionXuesong Li, Jinguang Tong, Jie Hong et al.
Dynamic scene reconstruction from monocular video is essential for real-world applications. We introduce DGNS, a hybrid framework integrating \underline{D}eformable \underline{G}aussian Splatting and Dynamic \underline{N}eural \underline{S}urfaces, effectively addressing dynamic novel-view synthesis and 3D geometry reconstruction simultaneously. During training, depth maps generated by the deformable Gaussian splatting module guide the ray sampling for faster processing and provide depth supervision within the dynamic neural surface module to improve geometry reconstruction. Conversely, the dynamic neural surface directs the distribution of Gaussian primitives around the surface, enhancing rendering quality. In addition, we propose a depth-filtering approach to further refine depth supervision. Extensive experiments conducted on public datasets demonstrate that DGNS achieves state-of-the-art performance in 3D reconstruction, along with competitive results in novel-view synthesis.
CVOct 30, 2024
NeFF-BioNet: Crop Biomass Prediction from Point Cloud to Drone ImageryXuesong Li, Zeeshan Hayder, Ali Zia et al.
Crop biomass offers crucial insights into plant health and yield, making it essential for crop science, farming systems, and agricultural research. However, current measurement methods, which are labor-intensive, destructive, and imprecise, hinder large-scale quantification of this trait. To address this limitation, we present a biomass prediction network (BioNet), designed for adaptation across different data modalities, including point clouds and drone imagery. Our BioNet, utilizing a sparse 3D convolutional neural network (CNN) and a transformer-based prediction module, processes point clouds and other 3D data representations to predict biomass. To further extend BioNet for drone imagery, we integrate a neural feature field (NeFF) module, enabling 3D structure reconstruction and the transformation of 2D semantic features from vision foundation models into the corresponding 3D surfaces. For the point cloud modality, BioNet demonstrates superior performance on two public datasets, with an approximate 6.1% relative improvement (RI) over the state-of-the-art. In the RGB image modality, the combination of BioNet and NeFF achieves a 7.9% RI. Additionally, the NeFF-based approach utilizes inexpensive, portable drone-mounted cameras, providing a scalable solution for large field applications.
CVApr 15, 2025
Enhancing Features in Long-tailed Data Using Large Vision ModelPengxiao Han, Changkun Ye, Jinguang Tong et al.
Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network's map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.
CVDec 24, 2024
FlameGS: Reconstruct flame light field via Gaussian SplattingYunhao Shui, Fuhao Zhang, Can Gao et al.
To address the time-consuming and computationally intensive issues of traditional ART algorithms for flame combustion diagnosis, inspired by flame simulation technology, we propose a novel representation method for flames. By modeling the luminous process of flames and utilizing 2D projection images for supervision, our experimental validation shows that this model achieves an average structural similarity index of 0.96 between actual images and predicted 2D projections, along with a Peak Signal-to-Noise Ratio of 39.05. Additionally, it saves approximately 34 times the computation time and about 10 times the memory compared to traditional algorithms.
CVAug 23, 2025
Structural Energy-Guided Sampling for View-Consistent Text-to-3DQing Zhang, Jinguang Tong, Jie Hong et al.
Text-to-3D generation often suffers from the Janus problem, where objects look correct from the front but collapse into duplicated or distorted geometry from other angles. We attribute this failure to viewpoint bias in 2D diffusion priors, which propagates into 3D optimization. To address this, we propose Structural Energy-Guided Sampling (SEGS), a training-free, plug-and-play framework that enforces multi-view consistency entirely at sampling time. SEGS defines a structural energy in a PCA subspace of intermediate U-Net features and injects its gradients into the denoising trajectory, steering geometry toward the intended viewpoint while preserving appearance fidelity. Integrated seamlessly into SDS/VSD pipelines, SEGS significantly reduces Janus artifacts, achieving improved geometric alignment and viewpoint consistency without retraining or weight modification.
ROAug 9, 2025
Vibration-Based Energy Metric for Restoring Needle Alignment in Autonomous Robotic UltrasoundZhongyu Chen, Chenyang Li, Xuesong Li et al.
Precise needle alignment is essential for percutaneous needle insertion in robotic ultrasound-guided procedures. However, inherent challenges such as speckle noise, needle-like artifacts, and low image resolution make robust needle detection difficult, particularly when visibility is reduced or lost. In this paper, we propose a method to restore needle alignment when the ultrasound imaging plane and the needle insertion plane are misaligned. Unlike many existing approaches that rely heavily on needle visibility in ultrasound images, our method uses a more robust feature by periodically vibrating the needle using a mechanical system. Specifically, we propose a vibration-based energy metric that remains effective even when the needle is fully out of plane. Using this metric, we develop a control strategy to reposition the ultrasound probe in response to misalignments between the imaging plane and the needle insertion plane in both translation and rotation. Experiments conducted on ex-vivo porcine tissue samples using a dual-arm robotic ultrasound-guided needle insertion system demonstrate the effectiveness of the proposed approach. The experimental results show the translational error of 0.41$\pm$0.27 mm and the rotational error of 0.51$\pm$0.19 degrees.
IVJul 9, 2025
Speckle2Self: Self-Supervised Ultrasound Speckle Reduction Without Clean DataXuesong Li, Nassir Navab, Zhongliang Jiang
Image denoising is a fundamental task in computer vision, particularly in medical ultrasound (US) imaging, where speckle noise significantly degrades image quality. Although recent advancements in deep neural networks have led to substantial improvements in denoising for natural images, these methods cannot be directly applied to US speckle noise, as it is not purely random. Instead, US speckle arises from complex wave interference within the body microstructure, making it tissue-dependent. This dependency means that obtaining two independent noisy observations of the same scene, as required by pioneering Noise2Noise, is not feasible. Additionally, blind-spot networks also cannot handle US speckle noise due to its high spatial dependency. To address this challenge, we introduce Speckle2Self, a novel self-supervised algorithm for speckle reduction using only single noisy observations. The key insight is that applying a multi-scale perturbation (MSP) operation introduces tissue-dependent variations in the speckle pattern across different scales, while preserving the shared anatomical structure. This enables effective speckle suppression by modeling the clean image as a low-rank signal and isolating the sparse noise component. To demonstrate its effectiveness, Speckle2Self is comprehensively compared with conventional filter-based denoising algorithms and SOTA learning-based methods, using both realistic simulated US images and human carotid US images. Additionally, data from multiple US machines are employed to evaluate model generalization and adaptability to images from unseen domains. Project page: https://noseefood.github.io/us-speckle2self/
CVJun 24, 2025
Semantic Scene Graph for Ultrasound Image Explanation and Scanning GuidanceXuesong Li, Dianye Huang, Yameng Zhang et al.
Understanding medical ultrasound imaging remains a long-standing challenge due to significant visual variability caused by differences in imaging and acquisition parameters. Recent advancements in large language models (LLMs) have been used to automatically generate terminology-rich summaries orientated to clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary and provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for ordinary, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential in guiding ultrasound scanning toward missing anatomies within the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images from the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to maximally democratize ultrasound by enhancing its interpretability and usability for ordinaries.
CVJun 1, 2025
A Review on Coarse to Fine-Grained Animal Action RecognitionAli Zia, Renuka Sharma, Abdelwahed Khamis et al.
This review provides an in-depth exploration of the field of animal action recognition, focusing on coarse-grained (CG) and fine-grained (FG) techniques. The primary aim is to examine the current state of research in animal behaviour recognition and to elucidate the unique challenges associated with recognising subtle animal actions in outdoor environments. These challenges differ significantly from those encountered in human action recognition due to factors such as non-rigid body structures, frequent occlusions, and the lack of large-scale, annotated datasets. The review begins by discussing the evolution of human action recognition, a more established field, highlighting how it progressed from broad, coarse actions in controlled settings to the demand for fine-grained recognition in dynamic environments. This shift is particularly relevant for animal action recognition, where behavioural variability and environmental complexity present unique challenges that human-centric models cannot fully address. The review then underscores the critical differences between human and animal action recognition, with an emphasis on high intra-species variability, unstructured datasets, and the natural complexity of animal habitats. Techniques like spatio-temporal deep learning frameworks (e.g., SlowFast) are evaluated for their effectiveness in animal behaviour analysis, along with the limitations of existing datasets. By assessing the strengths and weaknesses of current methodologies and introducing a recently-published dataset, the review outlines future directions for advancing fine-grained action recognition, aiming to improve accuracy and generalisability in behaviour analysis across species.
IVMar 25, 2025
Prompt-Guided Dual-Path UNet with Mamba for Medical Image SegmentationShaolei Zhang, Jinyan Liu, Tianyi Qian et al.
Convolutional neural networks (CNNs) and transformers are widely employed in constructing UNet architectures for medical image segmentation tasks. However, CNNs struggle to model long-range dependencies, while transformers suffer from quadratic computational complexity. Recently, Mamba, a type of State Space Models, has gained attention for its exceptional ability to model long-range interactions while maintaining linear computational complexity. Despite the emergence of several Mamba-based methods, they still present the following limitations: first, their network designs generally lack perceptual capabilities for the original input data; second, they primarily focus on capturing global information, while often neglecting local details. To address these challenges, we propose a prompt-guided CNN-Mamba dual-path UNet, termed PGM-UNet, for medical image segmentation. Specifically, we introduce a prompt-guided residual Mamba module that adaptively extracts dynamic visual prompts from the original input data, effectively guiding Mamba in capturing global information. Additionally, we design a local-global information fusion network, comprising a local information extraction module, a prompt-guided residual Mamba module, and a multi-focus attention fusion module, which effectively integrates local and global information. Furthermore, inspired by Kolmogorov-Arnold Networks (KANs), we develop a multi-scale information extraction module to capture richer contextual information without altering the resolution. We conduct extensive experiments on the ISIC-2017, ISIC-2018, DIAS, and DRIVE. The results demonstrate that the proposed method significantly outperforms state-of-the-art approaches in multiple medical image segmentation tasks.
CVDec 28, 2024
Maintain Plasticity in Long-timescale Continual Test-time AdaptationYanshuo Wang, Xuesong Li, Jinguang Tong et al.
Continual test-time domain adaptation (CTTA) aims to adjust pre-trained source models to perform well over time across non-stationary target environments. While previous methods have made considerable efforts to optimize the adaptation process, a crucial question remains: can the model adapt to continually-changing environments with preserved plasticity over a long time? The plasticity refers to the model's capability to adjust predictions in response to non-stationary environments continually. In this work, we explore plasticity, this essential but often overlooked aspect of continual adaptation to facilitate more sustained adaptation in the long run. First, we observe that most CTTA methods experience a steady and consistent decline in plasticity during the long-timescale continual adaptation phase. Moreover, we find that the loss of plasticity is strongly associated with the change in label flip. Based on this correlation, we propose a simple yet effective policy, Adaptive Shrink-Restore (ASR), towards preserving the model's plasticity. In particular, ASR does the weight re-initialization by the adaptive intervals. The adaptive interval is determined based on the change in label flipping. Our method is validated on extensive CTTA benchmarks, achieving excellent performance.
CVDec 3, 2024
Improving Viewpoint Consistency in 3D Generation via Structure Feature and CLIP GuidanceQing Zhang, Jinguang Tong, Jing Zhang et al.
Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising generation speed, establishing ACG as an efficient, plug-and-play component for existing text-to-3D frameworks.
CVJun 6, 2024
Class-Aware Cartilage Segmentation for Autonomous US-CT Registration in Robotic Intercostal Ultrasound ImagingZhongliang Jiang, Yunfeng Kang, Yuan Bi et al.
Ultrasound imaging has been widely used in clinical examinations owing to the advantages of being portable, real-time, and radiation-free. Considering the potential of extensive deployment of autonomous examination systems in hospitals, robotic US imaging has attracted increased attention. However, due to the inter-patient variations, it is still challenging to have an optimal path for each patient, particularly for thoracic applications with limited acoustic windows, e.g., intercostal liver imaging. To address this problem, a class-aware cartilage bone segmentation network with geometry-constraint post-processing is presented to capture patient-specific rib skeletons. Then, a dense skeleton graph-based non-rigid registration is presented to map the intercostal scanning path from a generic template to individual patients. By explicitly considering the high-acoustic impedance bone structures, the transferred scanning path can be precisely located in the intercostal space, enhancing the visibility of internal organs by reducing the acoustic shadow. To evaluate the proposed approach, the final path mapping performance is validated on five distinct CTs and two volunteer US data, resulting in ten pairs of CT-US combinations. Results demonstrate that the proposed graph-based registration method can robustly and precisely map the path from CT template to individual patients (Euclidean error: $2.21\pm1.11~mm$).
CVApr 17, 2024
MMCBE: Multi-modality Dataset for Crop Biomass Prediction and BeyondXuesong Li, Zeeshan Hayder, Ali Zia et al.
Crop biomass, a critical indicator of plant growth, health, and productivity, is invaluable for crop breeding programs and agronomic research. However, the accurate and scalable quantification of crop biomass remains inaccessible due to limitations in existing measurement methods. One of the obstacles impeding the advancement of current crop biomass prediction methodologies is the scarcity of publicly available datasets. Addressing this gap, we introduce a new dataset in this domain, i.e. Multi-modality dataset for crop biomass estimation (MMCBE). Comprising 216 sets of multi-view drone images, coupled with LiDAR point clouds, and hand-labelled ground truth, MMCBE represents the first multi-modality one in the field. This dataset aims to establish benchmark methods for crop biomass quantification and foster the development of vision-based approaches. We have rigorously evaluated state-of-the-art crop biomass estimation methods using MMCBE and ventured into additional potential applications, such as 3D crop reconstruction from drone imagery and novel-view rendering. With this publication, we are making our comprehensive dataset available to the broader community.
CVJan 26, 2024
Spatial Transcriptomics Analysis of Zero-shot Gene Expression PredictionYan Yang, Md Zakir Hossain, Xuesong Li et al.
Spatial transcriptomics (ST) captures gene expression within distinct regions (i.e., windows) of a tissue slide. Traditional supervised learning frameworks applied to model ST are constrained to predicting expression from slide image windows for gene types seen during training, failing to generalize to unseen gene types. To overcome this limitation, we propose a semantic guided network (SGN), a pioneering zero-shot framework for predicting gene expression from slide image windows. Considering a gene type can be described by functionality and phenotype, we dynamically embed a gene type to a vector per its functionality and phenotype, and employ this vector to project slide image windows to gene expression in feature space, unleashing zero-shot expression prediction for unseen gene types. The gene type functionality and phenotype are queried with a carefully designed prompt from a pre-trained large language model (LLM). On standard benchmark datasets, we demonstrate competitive zero-shot performance compared to past state-of-the-art supervised learning approaches.
IVMay 14, 2023
Skeleton Graph-based Ultrasound-CT Non-rigid RegistrationZhongliang Jiang, Xuesong Li, Chenyu Zhang et al.
Autonomous ultrasound (US) scanning has attracted increased attention, and it has been seen as a potential solution to overcome the limitations of conventional US examinations, such as inter-operator variations. However, it is still challenging to autonomously and accurately transfer a planned scan trajectory on a generic atlas to the current setup for different patients, particularly for thorax applications with limited acoustic windows. To address this challenge, we proposed a skeleton graph-based non-rigid registration to adapt patient-specific properties using subcutaneous bone surface features rather than the skin surface. To this end, the self-organization mapping is successively used twice to unify the input point cloud and extract the key points, respectively. Afterward, the minimal spanning tree is employed to generate a tree graph to connect all extracted key points. To appropriately characterize the rib cartilage outline to match the source and target point cloud, the path extracted from the tree graph is optimized by maximally maintaining continuity throughout each rib. To validate the proposed approach, we manually extract the US cartilage point cloud from one volunteer and seven CT cartilage point clouds from different patients. The results demonstrate that the proposed graph-based registration is more effective and robust in adapting to the inter-patient variations than the ICP (distance error mean/SD: 5.0/1.9 mm vs 8.6/6.7 mm on seven CTs).
LGOct 16, 2021
Heterogeneous Graph-Based Multimodal Brain Network LearningGen Shi, Yifan Zhu, Wenjin Liu et al.
Graph neural networks (GNNs) provide powerful insights for brain neuroimaging technology from the view of graphical networks. However, most existing GNN-based models assume that the neuroimaging-produced brain connectome network is a homogeneous graph with single types of nodes and edges. In fact, emerging studies have reported and emphasized the significance of heterogeneity among human brain activities, especially between the two cerebral hemispheres. Thus, homogeneous-structured brain network-based graph methods are insufficient for modelling complicated cerebral activity states. To overcome this problem, in this paper, we present a heterogeneous graph neural network (HebrainGNN) for multimodal brain neuroimaging fusion learning. We first model the brain network as a heterogeneous graph with multitype nodes (i.e., left and right hemispheric nodes) and multitype edges (i.e., intra- and interhemispheric edges). Then, we propose a self-supervised pretraining strategy based on a heterogeneous brain network to address the potential overfitting problem caused by the conflict between a large parameter size and a small medical data sample size. Our results show the superiority of the proposed model over other existing methods in brain-related disease prediction tasks. Ablation experiments show that our heterogeneous graph-based model attaches more importance to hemispheric connections that may be neglected due to their low strength by previous homogeneous graph models. Other experiments also indicate that our proposed model with a pretraining strategy alleviates the problem of limited labelled data and yields a significant improvement in accuracy.
IVOct 2, 2020
Weight Encode Reconstruction Network for Computed Tomography in a Semi-Case-Wise and Learning-Based WayHujie Pan, Xuesong Li, Min Xu
Classic algebraic reconstruction technology (ART) for computed tomography requires pre-determined weights of the voxels for projecting pixel values. However, such weight cannot be accurately obtained due to the limitation of the physical understanding and computation resources. In this study, we propose a semi-case-wise learning-based method named Weight Encode Reconstruction Network (WERNet) to tackle the issues mentioned above. The model is trained in a self-supervised manner without the label of a voxel set. It contains two branches, including the voxel weight encoder and the voxel attention part. Using gradient normalization, we are able to co-train the encoder and voxel set numerically stably. With WERNet, the reconstructed result was obtained with a cosine similarity greater than 0.999 with the ground truth. Moreover, the model shows the extraordinary capability of denoising comparing to the classic ART method. In the generalization test of the model, the encoder is transferable from a voxel set with complex structure to the unseen cases without the deduction of the accuracy.
CVJul 4, 2020
Efficient and accurate object detection with simultaneous classification and trackingXuesong Li, Jose Guivant
Interacting with the environment, such as object detection and tracking, is a crucial ability of mobile robots. Besides high accuracy, efficiency in terms of processing effort and energy consumption are also desirable. To satisfy both requirements, we propose a detection framework based on simultaneous classification and tracking in the point stream. In this framework, a tracker performs data association in sequences of the point cloud, guiding the detector to avoid redundant processing (i.e. classifying already-known objects). For objects whose classification is not sufficiently certain, a fusion model is designed to fuse selected key observations that provide different perspectives across the tracking span. Therefore, performance (accuracy and efficiency of detection) can be enhanced. This method is particularly suitable for detecting and tracking moving objects, a process that would require expensive computations if solved using conventional procedures. Experiments were conducted on the benchmark dataset, and the results showed that the proposed method outperforms original tracking-by-detection approaches in both efficiency and accuracy.
CVMar 24, 2020
Real-time 3D object proposal generation and classification under limited processing resourcesXuesong Li, Jose Guivant, Subhan Khan
The task of detecting 3D objects is important to various robotic applications. The existing deep learning-based detection techniques have achieved impressive performance. However, these techniques are limited to run with a graphics processing unit (GPU) in a real-time environment. To achieve real-time 3D object detection with limited computational resources for robots, we propose an efficient detection method consisting of 3D proposal generation and classification. The proposal generation is mainly based on point segmentation, while the proposal classification is performed by a lightweight convolution neural network (CNN) model. To validate our method, KITTI datasets are utilized. The experimental results demonstrate the capability of proposed real-time 3D object detection method from the point cloud with a competitive performance of object recall and classification.
CVDec 10, 2019
Context-Aware Dynamic Feature Extraction for 3D Object Detection in Point CloudsYonglin Tian, Lichao Huang, Xuesong Li et al.
Varying density of point clouds increases the difficulty of 3D detection. In this paper, we present a context-aware dynamic network (CADNet) to capture the variance of density by considering both point context and semantic context. Point-level contexts are generated from original point clouds to enlarge the effective receptive filed. They are extracted around the voxelized pillars based on our extended voxelization method and processed with the context encoder in parallel with the pillar features. With a large perception range, we are able to capture the variance of features for potential objects and generate attentive spatial guidance to help adjust the strengths for different regions. In the region proposal network, considering the limited representation ability of traditional convolution where same kernels are shared among different samples and positions, we propose a decomposable dynamic convolutional layer to adapt to the variance of input features by learning from local semantic context. It adaptively generates the position-dependent coefficients for multiple fixed kernels and combines them to convolve with local feature windows. Based on our dynamic convolution, we design a dual-path convolution block to further improve the representation ability. We conduct experiments with our Network on KITTI dataset and achieve good performance on 3D detection task for both precision and speed. Our one-stage detector outperforms SECOND and PointPillars by a large margin and achieves the speed of 30 FPS.
CVJan 24, 2019
Three-dimensional Backbone Network for 3D Object Detection in Traffic ScenesXuesong Li, Jose Guivant, Ngaiming Kwok et al.
The task of detecting 3D objects in traffic scenes has a pivotal role in many real-world applications. However, the performance of 3D object detection is lower than that of 2D object detection due to the lack of powerful 3D feature extraction methods. To address this issue, this study proposes a 3D backbone network to acquire comprehensive 3D feature maps for 3D object detection. It primarily consists of sparse 3D convolutional neural network operations in the point cloud. The 3D backbone network can inherently learn 3D features from the raw data without compressing the point cloud into multiple 2D images. The sparse 3D convolutional neural network takes full advantage of the sparsity in the 3D point cloud to accelerate computation and save memory, which makes the 3D backbone network feasible in a real-world application. Empirical experiments were conducted on the KITTI benchmark and comparable results were obtained with respect to the state-of-the-art performance for 3D object detection.