IV Aug 5, 2023
Landmark Detection using Transformer Toward Robot-assisted Nasal Airway Intubation
Tianhang Liu, Hechen Li, Long Bai et al.
Robot-assisted airway intubation requires high accuracy in locating targets and organs. Two vital landmarks, the nostrils and the glottis, can be detected during intubation to accommodate the stages of nasal intubation. Automated landmark detection can provide accurate localization and quantitative evaluation. The Detection Transformer (DeTR) leads object detectors to a new paradigm with long-range dependence. However, current DeTR requires many iterations to converge and does not perform well in detecting small objects. This paper proposes a transformer-based landmark detection solution with deformable DeTR and the semantic-aligned-matching module for detecting landmarks in robot-assisted intubation. The semantics aligner can effectively align the semantics of object queries and image features in the same embedding space using the most discriminative features. To evaluate the performance of our solution, we utilize a publicly accessible glottis dataset and automatically annotate a nostril detection dataset. The experimental results demonstrate that our solution achieves competitive detection accuracy. Our code is publicly accessible.
NA May 13
A refined CJ--SS--RR method with a reliable removal approach of spurious Ritz values for the Hermitian eigenvalue problem
Zhongxiao Jia, Tianhang Liu
Under the hypothesis that the deviations of the desired eigenvectors of the matrix $A$ from the underlying subspace tend to zero, the Ritz vectors may not converge and may have little or no accuracy. This phenomenon is not unusual and particularly occurs when the associated Ritz values are close, which is independent of the eigenvalue distribution of $A$. For the (block) SS--RR methods, there are possibly {\em more} Ritz values that converge to the same desired eigenvalue(s) counting multiplicity in the region of interest, meaning that some of the Ritz values must be spurious and the corresponding residual norms of the Ritz pairs may not be small. Consequently, the (block) SS--RR methods, including the CJ--SS--RR method, cannot rely on the corresponding residual norms to effectively identify whether the Ritz values in the region are genuine or spurious. This paper proposes refined SS--RR, abbreviated as SS--RRR, methods based on the refined Rayleigh--Ritz projection that compute the eigenpairs of large matrices with the eigenvalues located in the given region. We present a new approach to accurately implement the RRR methods more efficiently than ever before for a general subspace. Exploiting the unconditional convergence of the refined Ritz vectors when the subspace is sufficiently accurate, we propose a tune-free removal approach, supported by rigorous theory, to effectively remove spurious Ritz values, and develop a restarted CJ--SS--RRR algorithm. Numerical experiments show that the restarted CJ--SS--RRR algorithm is more efficient and effective than the restarted CJ--SS--RR algorithm.
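As a rough illustration of the refined Rayleigh--Ritz projection mentioned in this abstract, the sketch below uses the textbook definition: for a Ritz value $\theta$ and an orthonormal basis $V$ of the subspace, the refined Ritz vector is $Vy$, where $y$ is the right singular vector of $(A - \theta I)V$ belonging to its smallest singular value. This is a minimal dense NumPy sketch under that assumption, not the efficient implementation or the spurious-value removal approach proposed in the paper; the function names `ritz_pairs` and `refined_ritz_vector` are hypothetical.

```python
import numpy as np

def ritz_pairs(A, V):
    """Standard Rayleigh-Ritz: Ritz values and Ritz vectors of Hermitian A
    with respect to the subspace spanned by the orthonormal columns of V."""
    H = V.conj().T @ A @ V           # projected matrix
    theta, Y = np.linalg.eigh(H)     # Ritz values and primitive Ritz vectors
    return theta, V @ Y

def refined_ritz_vector(A, V, theta):
    """Refined Ritz vector for a given Ritz value theta.

    Minimizes ||(A - theta*I) V y|| over unit vectors y, which is solved by
    the right singular vector of (A - theta*I) V for the smallest singular
    value. Returns the unit-norm vector V y and its residual norm.
    """
    M = A @ V - theta * V
    _, s, Wh = np.linalg.svd(M, full_matrices=False)
    y = Wh[-1].conj()                # singular values are sorted descending
    return V @ y, s[-1]
```

By construction, the residual norm returned by `refined_ritz_vector` is never larger than that of the Ritz vector paired with the same `theta`, since the Ritz vector is one of the unit vectors in the subspace over which the minimum is taken.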
CV Aug 12, 2025
Adaptive Confidence-Wise Loss for Improved Lens Structure Segmentation in AS-OCT
Zunjie Xiao, Xiao Wu, Tianhang Liu et al.
Precise lens structure segmentation is essential for the design of intraocular lenses (IOLs) in cataract surgery. Existing deep segmentation networks typically weight all pixels equally under cross-entropy (CE) loss, overlooking the fact that sub-regions of lens structures are inhomogeneous (e.g., some regions perform better than others) and that boundary regions often suffer from poor segmentation calibration at the pixel level. Clinically, experts annotate different sub-regions of lens structures with varying confidence levels, considering factors such as sub-region proportions, ambiguous boundaries, and lens structure shapes. Motivated by this observation, we propose an Adaptive Confidence-Wise (ACW) loss that groups each lens structure sub-region into different confidence sub-regions via a confidence threshold from the unique region aspect, aiming to exploit the potential of the expert annotation confidence prior. Specifically, ACW clusters each target region into low-confidence and high-confidence groups and then applies a region-weighted loss to reweight each confidence group. Moreover, we design an adaptive confidence threshold optimization algorithm to adjust the confidence threshold of ACW dynamically. Additionally, to better quantify the miscalibration errors in boundary region segmentation, we propose a new metric, termed Boundary Expected Calibration Error (BECE). Extensive experiments on a clinical lens structure AS-OCT dataset and other multi-structure datasets demonstrate that our ACW significantly outperforms competitive segmentation loss methods across different deep segmentation networks (e.g., MedSAM). Notably, our method surpasses CE with a 6.13% IoU gain, a 4.33% DSC increase, and a 4.79% BECE reduction in lens structure segmentation under U-Net. The code of this paper is available at https://github.com/XiaoLing12138/Adaptive-Confidence-Wise-Loss.
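The grouping-and-reweighting idea in this abstract can be illustrated with a toy NumPy sketch. It is one possible reading of the abstract, not the authors' ACW implementation (see the linked repository for that). The assumptions here are that "confidence" means the predicted probability of the ground-truth class, that the threshold `tau` is fixed rather than adaptively optimized, and that the group weights `w_low` and `w_high` are hypothetical illustrative parameters.

```python
import numpy as np

def confidence_wise_ce(probs, labels, tau=0.5, w_low=2.0, w_high=1.0):
    """Toy confidence-wise cross-entropy (illustrative, not the paper's ACW).

    probs  : (N, C) array of per-pixel class probabilities (rows sum to 1)
    labels : (N,) array of integer ground-truth class ids
    tau    : assumed fixed confidence threshold
    w_low, w_high : assumed weights for the low/high-confidence groups
    """
    n = labels.shape[0]
    p_true = probs[np.arange(n), labels]           # confidence in the true class
    ce = -np.log(np.clip(p_true, 1e-12, 1.0))      # per-pixel cross-entropy
    classes = np.unique(labels)
    total = 0.0
    for c in classes:
        region = labels == c                       # one target region per class
        low = region & (p_true < tau)              # low-confidence group
        high = region & (p_true >= tau)            # high-confidence group
        if low.any():
            total += w_low * ce[low].mean()
        if high.any():
            total += w_high * ce[high].mean()
    return total / len(classes)
```

With `w_low > w_high`, pixels the model is unsure about inside each region contribute more to the loss than under plain CE, which treats every pixel equally.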
CV Oct 11, 2025
VividAnimator: An End-to-End Audio and Pose-driven Half-Body Human Animation Framework
Donglin Huang, Yongyuan Li, Tianhang Liu et al.
Existing audio- and pose-driven human animation methods often struggle with stiff head movements and blurry hands, primarily due to the weak correlation between audio and head movements and the structural complexity of hands. To address these issues, we propose VividAnimator, an end-to-end framework for generating high-quality, half-body human animations driven by audio and sparse hand pose conditions. Our framework introduces three key innovations. First, to overcome the instability and high cost of online codebook training, we pre-train a Hand Clarity Codebook (HCC) that encodes rich, high-fidelity hand texture priors, significantly mitigating hand degradation. Second, we design a Dual-Stream Audio-Aware Module (DSAA) to model lip synchronization and natural head pose dynamics separately while enabling interaction. Third, we introduce a Pose Calibration Trick (PCT) that refines and aligns pose conditions by relaxing rigid constraints, ensuring smooth and natural gesture transitions. Extensive experiments demonstrate that VividAnimator achieves state-of-the-art performance, producing videos with superior hand detail, gesture realism, and identity consistency, validated by both quantitative metrics and qualitative evaluations.
CV Sep 25, 2025
UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition
Guojun Lei, Rong Zhang, Chi Wang et al.
We propose UniTransfer, a novel architecture that introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture that supports fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/