CV Mar 27
Beyond Language: Grounding Referring Expressions with Hand Pointing in Egocentric Vision
Ling Li, Bowen Liu, Zinuo Zhan et al.
Traditional Visual Grounding (VG) predominantly relies on textual descriptions to localize objects, a paradigm that inherently struggles with linguistic ambiguity and often ignores non-verbal deictic cues prevalent in real-world interactions. In natural egocentric engagements, hand-pointing combined with speech forms the most intuitive referring mechanism. To bridge this gap, we introduce EgoPoint-Ground, the first large-scale multimodal dataset dedicated to egocentric deictic visual grounding. Comprising over 15k interactive samples in complex scenes, the dataset provides rich, multi-grained annotations including hand-target bounding box pairs and dense semantic captions. We establish a comprehensive benchmark for hand-pointing referring expression resolution, evaluating a wide spectrum of mainstream Multimodal Large Language Models (MLLMs) and state-of-the-art VG architectures. Furthermore, we propose SV-CoT, a novel baseline framework that reformulates grounding as a structured inference process, synergizing gestural and linguistic cues through a Visual Chain-of-Thought paradigm. Extensive experiments demonstrate that SV-CoT achieves an 11.7% absolute improvement over existing methods, effectively mitigating semantic ambiguity and advancing the capability of agents to comprehend multimodal physical intents. The dataset and code will be made publicly available.
MED-PH Feb 2, 2021
Single-Shell NODDI Using Dictionary Learner Estimated Isotropic Volume Fraction
Abrar Faiyaz, Marvin Doyley, Giovanni Schifitto et al.
Neurite orientation dispersion and density imaging (NODDI) enables the assessment of intracellular, extracellular and free water signals from multi-shell diffusion MRI data. It is an insightful approach to characterizing brain tissue microstructure. Previous studies have discouraged single-shell reconstruction of NODDI parameters because the fitting often fails, especially for the neurite density index (NDI). Here, we investigated the possibility of creating robust NODDI parameter maps from single-shell data, using the isotropic volume fraction (fISO) as a prior. Prior estimation was made independent of the NODDI model constraint using a dictionary learning approach. First, we used a stochastic sparse dictionary-based network (DictNet) to predict fISO; the network was trained on in vivo and simulated diffusion MRI data. In single-shell cases, the mean diffusivity (MD) and the raw T2 signal with no diffusion weighting (S0) were incorporated in the dictionary for the fISO estimation. Then, the NODDI framework was used with the known fISO to estimate the NDI and orientation dispersion index (ODI). The fISO estimated by our model was compared with other fISO estimators in the simulation. Further, using both synthetic data simulation and human data collected on a 3T scanner, we compared the performance of our dictionary-based learning prior NODDI (DLpN) with the original NODDI for both single-shell and multi-shell data. Our results suggest that DLpN-derived NDI and ODI parameters for single-shell protocols are comparable with those of the original multi-shell NODDI, and that the protocol with b = 2000 s/mm² performs best (error ~5% in white and grey matter). This may allow NODDI analysis of single-shell studies, provided that two subjects are scanned with a multi-shell protocol to train DictNet for fISO estimation.
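To illustrate why a known fISO simplifies the remaining fit, the following minimal sketch removes the free-water compartment from a normalized diffusion signal. It assumes the standard three-compartment NODDI signal form, A = (1 - fISO) * A_tissue + fISO * exp(-b * d_iso), with the free diffusivity fixed at 3.0e-3 mm²/s as in the usual NODDI setup. The function name `remove_free_water` and its parameters are hypothetical and are not taken from the paper; this is not the authors' DLpN implementation.

```python
import math

# Assumed free-water diffusivity (mm^2/s), the value conventionally fixed in NODDI.
D_ISO = 3.0e-3


def remove_free_water(s, s0, b, f_iso, d_iso=D_ISO):
    """Return the tissue-only normalized signal once fISO is known.

    Hypothetical helper: inverts A = (1 - f_iso) * A_tissue + f_iso * exp(-b * d_iso)
    for A_tissue, where A = s / s0 and b is in s/mm^2.
    """
    if not 0.0 <= f_iso < 1.0:
        raise ValueError("f_iso must lie in [0, 1)")
    if s0 <= 0.0:
        raise ValueError("s0 must be positive")
    a = s / s0                       # normalized signal
    a_iso = math.exp(-b * d_iso)     # isotropic (free-water) attenuation
    return (a - f_iso * a_iso) / (1.0 - f_iso)


# Forward-simulate one measurement, then recover the tissue signal.
b_value = 2000.0
tissue_signal = 0.4
f_iso_known = 0.2
measured = 1000.0 * ((1 - f_iso_known) * tissue_signal
                     + f_iso_known * math.exp(-b_value * D_ISO))
recovered = remove_free_water(measured, 1000.0, b_value, f_iso_known)
```

With the isotropic term removed, only the intracellular and extracellular compartments remain to be fitted, which is the step the abstract assigns to the NODDI framework for estimating NDI and ODI.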
CV Aug 17, 2017
High Efficient Reconstruction of Single-shot T2 Mapping from OverLapping-Echo Detachment Planar Imaging Based on Deep Residual Network
Congbo Cai, Yiqing Zeng, Chao Wang et al.
Purpose: An end-to-end deep convolutional neural network (CNN) based on a deep residual network (ResNet) was proposed to efficiently reconstruct reliable T2 mapping from single-shot OverLapping-Echo Detachment (OLED) planar imaging. Methods: The training dataset was obtained from simulations carried out with the SPROM software developed by our group. The relationship between the original OLED image containing two echo signals and the corresponding T2 mapping was learned by ResNet training. After the ResNet was trained, it was applied to reconstruct the T2 mapping from simulation and in vivo human brain data. Results: Although the ResNet was trained entirely on simulated data, the trained network generalized well to real human brain data. The results from simulation and in vivo human brain experiments show that the proposed method significantly outperformed the echo-detachment-based method. Reliable T2 mapping was achieved within tens of milliseconds once the network had been trained, whereas the echo-detachment-based OLED reconstruction method took minutes. Conclusion: The proposed method will greatly facilitate real-time dynamic and quantitative MR imaging via the OLED sequence, and ResNet has the potential to reconstruct images from complex MRI sequences efficiently.
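For context on why two echo signals carry T2 information, the sketch below computes T2 from two already separated echo amplitudes under a mono-exponential decay assumption, S(TE) = S0 * exp(-TE / T2). This is the textbook two-point relationship, not the ResNet reconstruction or the echo-detachment method described in the abstract; the function name `t2_from_two_echoes` and the example echo times are hypothetical.

```python
import math


def t2_from_two_echoes(s1, s2, te1, te2):
    """Two-point T2 estimate, assuming S(TE) = S0 * exp(-TE / T2).

    s1, s2: signal amplitudes at echo times te1 < te2 (same time unit as the result).
    """
    if not te2 > te1:
        raise ValueError("te2 must be greater than te1")
    if not s1 > s2 > 0.0:
        raise ValueError("signals must be positive and decaying (s1 > s2 > 0)")
    return (te2 - te1) / math.log(s1 / s2)


# Simulate two echoes for a voxel with T2 = 80 ms, then recover T2.
true_t2 = 0.080          # seconds
te_first, te_second = 0.030, 0.090
echo1 = math.exp(-te_first / true_t2)
echo2 = math.exp(-te_second / true_t2)
estimated_t2 = t2_from_two_echoes(echo1, echo2, te_first, te_second)
```

In OLED imaging the two echoes overlap in a single acquired image, so this closed form cannot be applied directly; separating them is the costly step, which is where the abstract reports the learned mapping saving time.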