LGJul 1, 2024Code
DistML.js: Installation-free Distributed Deep Learning Framework for Web BrowsersMasatoshi Hidaka, Tomohiro Hashimoto, Yuto Nishizawa et al.
We present "DistML.js", a library designed for training and inference of machine learning models within web browsers. Not only does DistML.js facilitate model training on local devices, but it also supports distributed learning through communication with servers. Its design and define-by-run API for deep learning model construction resemble PyTorch, thereby reducing the learning curve for prototyping. Matrix computations involved in model training and inference are executed on the backend utilizing WebGL, enabling high-speed calculations. We provide a comprehensive explanation of DistML.js's design, API, and implementation, alongside practical applications including data parallelism in learning. The source code is publicly available at https://github.com/mil-tokyo/distmljs.
CVDec 29, 2025
RealX3D: A Physically-Degraded 3D Benchmark for Multi-view Visual Restoration and ReconstructionShuhong Liu, Chenyu Bao, Ziteng Cui et al.
We introduce RealX3D, a real-capture benchmark for multi-view visual restoration and 3D reconstruction under diverse physical degradations. RealX3D groups corruptions into four families, including illumination, scattering, occlusion, and blurring, and captures each at multiple severity levels using a unified acquisition protocol that yields pixel-aligned LQ/GT views. Each scene includes high-resolution capture, RAW images, and dense laser scans, from which we derive world-scale meshes and metric depth. Benchmarking a broad range of optimization-based and feed-forward methods shows substantial degradation in reconstruction quality under physical corruptions, underscoring the fragility of current multi-view pipelines in real-world challenging environments.
84.6CVApr 5
NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge ResultsShuhong Liu, Chenyu Bao, Ziteng Cui et al.
This paper presents a comprehensive review of the NTIRE 2026 3D Restoration and Reconstruction (3DRR) Challenge, detailing the proposed methods and results. The challenge seeks to identify robust reconstruction pipelines that are robust under real-world adverse conditions, specifically extreme low-light and smoke-degraded environments, as captured by our RealX3D benchmark. A total of 279 participants registered for the competition, of whom 33 teams submitted valid results. We thoroughly evaluate the submitted approaches against state-of-the-art baselines, revealing significant progress in 3D reconstruction under adverse conditions. Our analysis highlights shared design principles among top-performing methods and provides insights into effective strategies for handling 3D scene degradation.
CVJan 18, 2024
Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question GenerationKohei Uehara, Nabarun Goswami, Hanqin Wang et al.
The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to develop a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we created, we fine-tuned an existing VLM. This training enabled the models to generate questions and perform iterative reasoning during inference. The results demonstrated a stride toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.