96.9CLApr 9Code
OralAgent: Integrating Reasoning, Tools, and Knowledge for Interactive Dental Image AnalysisJing Hao, Siyuan Dai, Yongxin Zhang et al.
Dental image analysis plays a pivotal role in supporting accurate diagnosis and treatment planning in oral healthcare. Although recent advances have produced dental AI models for specific tasks and individual imaging modalities, their isolated designs limit practical use in real-world clinical workflows. In this paper, we present OralAgent, the first dental-specialized AI agent that unifies multimodal reasoning, tool-based decision-making, and knowledge-grounded retrieval within an end-to-end automated framework. It integrates 22 visual analysis tools and 368 widely-used classical dental textbooks, enabling autonomous reasoning, planning, tool use, knowledge retrieval, and multi-step workflow execution. Furthermore, we introduce OralCorpus, a large-scale, high-quality bilingual textual resource containing 134.8M tokens curated for dental retrieval-augmented generation (RAG). To evaluate models' multidisciplinary dental knowledge, we construct OralQA-ZH, a Chinese multiple-choice question benchmark consisting of 798 items across eleven oral subspecialties. Extensive experiments demonstrate that OralAgent achieves state-of-the-art performance on the MMOral-Uni, MMOral-OPG, and OralQA-ZH benchmarks, highlighting its effectiveness, interpretability, and adaptability in real-world clinical settings. The code and models are publicly available at https://github.com/isjinghao/OralAgent.
CVApr 1, 2024Code
T-Mamba: A unified framework with Long-Range Dependency in dual-domain for 2D & 3D Tooth SegmentationJing Hao, Yonghui Zhu, Lei He et al.
Tooth segmentation is a pivotal step in modern digital dentistry, essential for applications across orthodontic diagnosis and treatment planning. Despite its importance, this process is fraught with challenges due to the high noise and low contrast inherent in 2D and 3D tooth data. Both Convolutional Neural Networks (CNNs) and Transformers has shown promise in medical image segmentation, yet each method has limitations in handling long-range dependencies and computational complexity. To address this issue, this paper introduces T-Mamba, integrating frequency-based features and shared bi-positional encoding into vision mamba to address limitations in efficient global feature modeling. Besides, we design a gate selection unit to integrate two features in spatial domain and one feature in frequency domain adaptively. T-Mamba is the first work to introduce frequency-based features into vision mamba, and its flexibility allows it to process both 2D and 3D tooth data without the need for separate modules. Also, the TED3, a large-scale public tooth 2D dental X-ray dataset, has been presented in this paper. Extensive experiments demonstrate that T-Mamba achieves new SOTA results on a public tooth CBCT dataset and outperforms previous SOTA methods on TED3 dataset. The code and models are publicly available at: https://github.com/isbrycee/T-Mamba.
30.2CVApr 16
MetaDent: Labeling Clinical Images for Vision-Language Models in DentistryMeng-Xun Li, Wen-Hui Deng, Zhi-Xing Wu et al.
Vision-Language Models (VLMs) have demonstrated significant potential in medical image analysis, yet their application in intraoral photography remains largely underexplored due to the lack of fine-grained, annotated datasets and comprehensive benchmarks. To address this, we present MetaDent, a comprehensive resource that includes (1) a novel and large-scale dentistry image dataset collected from clinical, public, and web sources; (2) a semi-structured annotation framework designed to capture the hierarchical and clinically nuanced nature of dental photography; and (3) comprehensive benchmark suites for evaluating state-of-the-art VLMs on clinical image understanding. Our labeling approach combines a high-level image summary with point-by-point, free-text descriptions of abnormalities. This method enables rich, scalable, and task-agnostic representations. We curated 60,669 dental images from diverse sources and annotated a representative subset of 2,588 images using this meta-labeling scheme. Leveraging Large Language Models (LLMs), we derive standardized benchmarks: approximately 15K Visual Question Answering (VQA) pairs and an 18-class multi-label classification dataset, which we validated with human review and error analysis to justify that the LLM-driven transition reliably preserves fidelity and semantic accuracy. We then evaluate state-of-the-art VLMs across VQA, classification, and image captioning tasks. Quantitative results reveal that even the most advanced models struggle with a fine-grained understanding of intraoral scenes, achieving moderate accuracy and producing inconsistent or incomplete descriptions in image captioning. We publicly release our dataset, annotations, and tools to foster reproducible research and accelerate the development of vision-language systems for dental applications.