Shijie Guo

h-index28

3papers

44citations

Novelty40%

AI Score38

Ranked #87,039 of 194,257 authors (top 45%)#16,405 in CL (top 53%)

3 Papers

23.8CVAug 16, 2025Code

Simple o3: Towards Interleaved Vision-Language Reasoning

Ye Wang, Qianglong Chen, Zejun Li et al.

Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI's o3 model, which emulates human-like ''thinking with image'' through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an ''observe-reason-act'' cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results demonstrate Simple o3's superior performance on diverse benchmarks, outperforming existing approaches. By combining enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. Remarkably, we provide the first in-depth analysis of different interleaved reasoning strategies, offering insights into their impact on model performance. We found that by introducing additional visual tokens for interleaved vision-language reasoning, reusing and magnifying the original image significantly improves the model's visual reasoning and fine-grained perception, while image cropping based on precise visual grounding allows the model to effectively focus on key entities or regions, further enhancing its capabilities.

12.6CLDec 13, 2024

Enhancing Nursing and Elderly Care with Large Language Models: An AI-Driven Framework

Qiao Sun, Jiexin Xie, Nanyang Ye et al.

This paper explores the application of large language models (LLMs) in nursing and elderly care, focusing on AI-driven patient monitoring and interaction. We introduce a novel Chinese nursing dataset and implement incremental pre-training (IPT) and supervised fine-tuning (SFT) techniques to enhance LLM performance in specialized tasks. Using LangChain, we develop a dynamic nursing assistant capable of real-time care and personalized interventions. Experimental results demonstrate significant improvements, paving the way for AI-driven solutions to meet the growing demands of healthcare in aging populations.

3.0ROAug 9, 2021

Organization and Understanding of a Tactile Information Dataset TacAct During Physical Human-Robot Interactions

Peng Wang, Jixiao Liu, Funing Hou et al.

Advanced service robots require superior tactile intelligence to guarantee human-contact safety and to provide essential supplements to visual and auditory information for human-robot interaction, especially when a robot is in physical contact with a human. Tactile intelligence is an essential capability of perception and recognition from tactile information, based on the learning from a large amount of tactile data and the understanding of the physical meaning behind the data. This report introduces a recently collected and organized dataset "TacAct" that encloses real-time pressure distribution when a human subject touches the arms of a nursing-care robot. The dataset consists of information from 50 subjects who performed a total of 24,000 touch actions. Furthermore, the details of the dataset are described, the data are preliminarily analyzed, and the validity of the collected information is tested through a convolutional neural network LeNet-5 classifying different types of touch actions. We believe that the TacAct dataset would be more than beneficial for the community of human interactive robots to understand the tactile profile under various circumstances.