Roy Luo

h-index: 28
Papers: 2

2 Papers

CV | Oct 30, 2025
CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun et al.

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.

26.2 | LG | Mar 25
Memory-Guided Trust-Region Bayesian Optimization (MG-TuRBO) for High Dimensions

Abhilasha Saroj, Shaked Regev, Guanhao Xu et al.

Traffic simulation and digital-twin calibration is a challenging optimization problem with a limited simulation budget. Each trial requires an expensive simulation run, and the relationship between calibration inputs and model error is often nonconvex and noisy. The problem becomes more difficult as the number of calibration parameters increases. We compare a commonly used automatic calibration method, a genetic algorithm (GA), with Bayesian optimization methods (BOMs): classical Bayesian optimization (BO), Trust-Region BO (TuRBO), Multi-TuRBO, and a proposed Memory-Guided TuRBO (MG-TuRBO) method. We compare performance on two real-world traffic simulation calibration problems with 14 and 84 decision variables, representing lower- and higher-dimensional (14D and 84D) settings. For BOMs, we study two acquisition strategies: Thompson sampling and a novel adaptive strategy. We evaluate performance using final calibration quality, convergence behavior, and consistency across runs. The results show that BOMs reach good calibration targets much faster than GA in the lower-D problem. MG-TuRBO performs comparably in our 14D setting but demonstrates noticeable advantages in the 84D problem, particularly when paired with our adaptive strategy. Our results suggest that MG-TuRBO is especially useful for high-D traffic simulation calibration and potentially for high-D problems in general.
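For readers unfamiliar with trust-region BO, the sketch below illustrates the kind of expand/shrink bookkeeping that TuRBO-style methods commonly use to size a local search region. It is not the authors' MG-TuRBO and includes no memory mechanism, which the abstract does not detail. The class name `TrustRegionState`, the function `update_trust_region`, and all default thresholds are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TrustRegionState:
    """Bookkeeping for one trust region (all defaults are illustrative assumptions)."""
    length: float = 0.8            # current side length in a unit-scaled search space
    length_min: float = 0.5 ** 7   # below this the region is treated as collapsed
    length_max: float = 1.6        # upper bound on the side length
    success_tol: int = 3           # consecutive improvements before expanding
    failure_tol: int = 5           # consecutive non-improvements before shrinking
    successes: int = 0
    failures: int = 0
    best_error: float = float("inf")


def update_trust_region(state: TrustRegionState, new_error: float) -> bool:
    """Update the region after one simulation run.

    Returns True when the region has shrunk below length_min, which a
    caller would typically treat as a signal to restart elsewhere.
    """
    if new_error < state.best_error:
        # The new calibration error improves on the incumbent.
        state.best_error = new_error
        state.successes += 1
        state.failures = 0
    else:
        state.successes = 0
        state.failures += 1

    if state.successes >= state.success_tol:
        # Enough consecutive improvements: double the region, capped at length_max.
        state.length = min(2.0 * state.length, state.length_max)
        state.successes = 0
    elif state.failures >= state.failure_tol:
        # Enough consecutive failures: halve the region.
        state.length /= 2.0
        state.failures = 0

    return state.length < state.length_min
```

In a calibration loop, `update_trust_region` would be called once per simulation run with the observed model error, and candidate parameter sets would be sampled only within the current region around the best point found so far.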