Yanguang Zhao

CV
h-index: 29
Papers: 3
Citations: 4
Novelty: 42%
AI Score: 32

3 Papers

CV · Mar 4
Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang, Shengqiong Wu et al.

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

IV · Oct 24, 2024
Transferring Knowledge from High-Quality to Low-Quality MRI for Adult Glioma Diagnosis

Yanguang Zhao, Long Bai, Zhaoxi Zhang et al.

Glioma, a common and deadly brain tumor, requires early diagnosis for improved prognosis. However, low-quality Magnetic Resonance Imaging (MRI) technology in Sub-Saharan Africa (SSA) hinders accurate diagnosis. This paper presents our work in the BraTS Challenge on SSA Adult Glioma. We adopt the model from the BraTS-GLI 2021 winning solution and utilize it with three training strategies: (1) initially training on the BraTS-GLI 2021 dataset with fine-tuning on the BraTS-Africa dataset, (2) training solely on the BraTS-Africa dataset, and (3) training solely on the BraTS-Africa dataset with 2x super-resolution enhancement. Results show that initial training on the BraTS-GLI 2021 dataset followed by fine-tuning on the BraTS-Africa dataset yields the best results. This suggests the importance of high-quality datasets in providing prior knowledge during training. Our top-performing model achieves Dice scores of 0.882, 0.840, and 0.926, and Hausdorff Distance (95%) scores of 15.324, 37.518, and 13.971 for enhancing tumor, tumor core, and whole tumor, respectively, in the validation phase. In the final phase of the competition, our approach secured second place overall, reflecting the effectiveness of our model and training strategies. Our approach provides insights into improving glioma diagnosis in SSA, showing the potential of deep learning in resource-limited settings and the importance of transfer learning from high-quality datasets.

CV · Mar 29, 2025
Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery

Boyi Ma, Yanguang Zhao, Jie Wang et al.

The DeepSeek models have shown exceptional performance in general scene understanding, question-answering (QA), and text generation tasks, owing to their efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of the DeepSeek model in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our empirical study shows that, compared to existing general-purpose multimodal large language models, DeepSeek-VL2 performs better on complex understanding tasks in surgical scenes. Additionally, although DeepSeek-V3 is purely a language model, we find that when image tokens are directly inputted, the model demonstrates better performance on single-sentence QA tasks. However, overall, the DeepSeek models still fall short of meeting the clinical requirements for understanding surgical scenes. Under general prompts, DeepSeek models lack the ability to effectively analyze global surgical concepts and fail to provide detailed insights into surgical scenarios. Based on our observations, we argue that the DeepSeek models are not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.