Juyoun Park

CV
h-index9
6papers
10citations
Novelty36%
AI Score42

6 Papers

CVNov 26, 2025
SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi, Tae Kyeong Jeong, Garam Kim et al.

Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

ROApr 15, 2025Code
A real-time anomaly detection method for robots based on a flexible and sparse latent space

Taewook Kang, Bum-Jae You, Juyoun Park et al.

The growing demand for robots to operate effectively in diverse environments necessitates the need for robust real-time anomaly detection techniques during robotic operations. However, deep learning-based models in robotics face significant challenges due to limited training data and highly noisy signal features. In this paper, we present Sparse Masked Autoregressive Flow-based Adversarial AutoEncoder model to address these problems. This approach integrates Masked Autoregressive Flow model into Adversarial AutoEncoders to construct a flexible latent space and utilize Sparse autoencoder to efficiently focus on important features, even in scenarios with limited feature space. Our experiments demonstrate that the proposed model achieves a 4.96% to 9.75% higher area under the receiver operating characteristic curve for pick-and-place robotic operations with randomly placed cans, compared to existing state-of-the-art methods. Notably, it showed up to 19.67% better performance in scenarios involving collisions with lightweight objects. Additionally, unlike the existing state-of-the-art model, our model performs inferences within 1 millisecond, ensuring real-time anomaly detection. These capabilities make our model highly applicable to machine learning-based robotic safety systems in dynamic environments. The code is available at https://github.com/twkang43/sparse-maf-aae.

CVSep 15, 2025Code
Microsurgical Instrument Segmentation for Robot-Assisted Surgery

Tae Kyeong Jeong, Garam Kim, Juyoun Park

Accurate segmentation of thin structures is critical for microsurgical scene understanding but remains challenging due to resolution loss, low contrast, and class imbalance. We propose Microsurgery Instrument Segmentation for Robotic Assistance(MISRA), a segmentation framework that augments RGB input with luminance channels, integrates skip attention to preserve elongated features, and employs an Iterative Feedback Module(IFM) for continuity restoration across multiple passes. In addition, we introduce a dedicated microsurgical dataset with fine-grained annotations of surgical instruments including thin objects, providing a benchmark for robust evaluation Dataset available at https://huggingface.co/datasets/KIST-HARILAB/MISAW-Seg. Experiments demonstrate that MISRA achieves competitive performance, improving the mean class IoU by 5.37% over competing methods, while delivering more stable predictions at instrument contacts and overlaps. These results position MISRA as a promising step toward reliable scene parsing for computer-assisted and robotic microsurgery.

CVSep 23, 2025
Surgical Video Understanding with Label Interpolation

Garam Kim, Tae Kyeong Jeong, Juyoun Park

Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.

CVJul 22, 2025
Comparative validation of surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopy: Results of the PhaKIR 2024 challenge

Tobias Rueckert, David Rauber, Raphaela Maerkl et al.

Reliable recognition and localization of surgical instruments in endoscopic video recordings are foundational for a wide range of applications in computer- and robot-assisted minimally invasive surgery (RAMIS), including surgical training, skill assessment, and autonomous assistance. However, robust performance under real-world conditions remains a significant challenge. Incorporating surgical context - such as the current procedural phase - has emerged as a promising strategy to improve robustness and interpretability. To address these challenges, we organized the Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) sub-challenge as part of the Endoscopic Vision (EndoVis) challenge at MICCAI 2024. We introduced a novel, multi-center dataset comprising thirteen full-length laparoscopic cholecystectomy videos collected from three distinct medical institutions, with unified annotations for three interrelated tasks: surgical phase recognition, instrument keypoint estimation, and instrument instance segmentation. Unlike existing datasets, ours enables joint investigation of instrument localization and procedural context within the same data while supporting the integration of temporal information across entire procedures. We report results and findings in accordance with the BIAS guidelines for biomedical image analysis challenges. The PhaKIR sub-challenge advances the field by providing a unique benchmark for developing temporally aware, context-driven methods in RAMIS and offers a high-quality resource to support future research in surgical scene understanding.

CVNov 19, 2024
Rethinking Text-Promptable Surgical Instrument Segmentation with Robust Framework

Tae-Min Choi, Juyoun Park

Surgical instrument segmentation is an essential component of computer-assisted and robotic surgery systems. Vision-based segmentation models typically produce outputs limited to a predefined set of instrument categories, which restricts their applicability in interactive systems and robotic task automation. Promptable segmentation methods allow selective predictions based on textual prompts. However, they often rely on the assumption that the instruments present in the scene are already known, and prompts are generated accordingly, limiting their ability to generalize to unseen or dynamically emerging instruments. In practical surgical environments, where instrument existence information is not provided, this assumption does not hold consistently, resulting in false-positive segmentation. To address these limitations, we formulate a new task called Robust text-promptable Surgical Instrument Segmentation (R-SIS). Under this setting, prompts are issued for all candidate categories without access to instrument presence information. R-SIS requires distinguishing which prompts refer to visible instruments and generating masks only when such instruments are explicitly present in the scene. This setting reflects practical conditions where uncertainty in instrument presence is inherent. We evaluate existing segmentation methods under the R-SIS protocol using surgical video datasets and observe substantial false-positive predictions in the absence of ground-truth instruments. These findings demonstrate a mismatch between current evaluation protocols and real-world use cases, and support the need for benchmarks that explicitly account for prompt uncertainty and instrument absence.