CV · Jul 19, 2024
Double-Shot 3D Shape Measurement with a Dual-Branch Network for Structured Light Projection Profilometry
Mingyang Lei, Jingfan Fan, Long Shao et al.
Structured light (SL)-based three-dimensional (3D) measurement techniques that use deep learning have been widely studied to improve measurement efficiency; fringe projection profilometry (FPP) and speckle projection profilometry (SPP) are two popular methods. However, they generally use a single projection pattern for reconstruction, resulting in fringe order ambiguity or poor reconstruction accuracy. To alleviate these problems, we propose a parallel dual-branch Convolutional Neural Network (CNN)-Transformer network (PDCNet) that takes advantage of convolutional operations and self-attention mechanisms to process different SL modalities. Within PDCNet, a Transformer branch captures global context in the fringe images, while a CNN branch collects local details in the speckle images. To fully integrate complementary features, we design a double-stream attention aggregation module (DAAM) that consists of a parallel attention subnetwork for aggregating multi-scale spatial structure information. This module dynamically retains local and global representations to the maximum extent. Moreover, we propose an adaptive mixture density head with a bimodal Gaussian distribution for learning a representation that is precise near discontinuities. Compared to the standard disparity regression strategy, this adaptive mixture head effectively improves performance at object boundaries. Extensive experiments demonstrate that our method reduces fringe order ambiguity while producing high-accuracy results on self-made datasets.
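The idea behind a bimodal mixture head can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: the per-pixel parameters (`pi`, `mu1`, `sigma1`, `mu2`, `sigma2`), the negative log-likelihood loss, and the rule of picking the mean of the heavier mode at inference are illustrative choices, not the authors' implementation.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bimodal_nll(target, pi, mu1, sigma1, mu2, sigma2, eps=1e-12):
    """Negative log-likelihood of `target` under a two-component mixture.

    `pi` is the weight of the first mode; the second mode gets 1 - pi.
    `eps` guards against log(0) when both densities underflow.
    """
    density = pi * gaussian_pdf(target, mu1, sigma1) \
        + (1.0 - pi) * gaussian_pdf(target, mu2, sigma2)
    return -math.log(density + eps)


def select_mode(pi, mu1, mu2):
    """Assumed inference rule: return the mean of the heavier mode."""
    return mu1 if pi >= 0.5 else mu2


def mixture_mean(pi, mu1, mu2):
    """Plain expectation of the mixture, shown only for comparison."""
    return pi * mu1 + (1.0 - pi) * mu2


# A pixel near a boundary: foreground at disparity 10, background at 50.
print(select_mode(0.9, 10.0, 50.0))   # stays on the foreground surface
print(mixture_mean(0.9, 10.0, 50.0))  # drifts toward the background
```

Selecting one mode rather than averaging the two is what keeps the estimate on a real surface at a depth discontinuity; a single regressed value tends to land between foreground and background.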
AI · Mar 12 · Code
Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks
Mei Chee Leong, Ying Gu, Hui Li Tan et al.
Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner. Validating and understanding the behavior of these models becomes important when applying them to new tasks. We propose an Explicit Logic Channel (ELC), in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection, and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded in explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection, and improvement of MLLMs with enhanced explainability and trustworthiness.
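A cross-channel consistency score of this kind might be sketched as follows. The abstract does not define the Consistency Rate, so this assumes the simplest reading: the fraction of samples on which the MLLM's answer (implicit channel) matches the answer derived by the explicit channel. The function names and the model names in the example are hypothetical.

```python
from typing import Dict, Hashable, Sequence


def consistency_rate(implicit_answers: Sequence[Hashable],
                     explicit_answers: Sequence[Hashable]) -> float:
    """Fraction of samples where the two channels give the same answer.

    Assumed definition; no ground-truth labels are involved.
    """
    if len(implicit_answers) != len(explicit_answers):
        raise ValueError("both channels must answer the same samples")
    if not implicit_answers:
        return 0.0
    agree = sum(a == b for a, b in zip(implicit_answers, explicit_answers))
    return agree / len(implicit_answers)


def select_model(answers_by_model: Dict[str, Sequence[Hashable]],
                 explicit_answers: Sequence[Hashable]) -> str:
    """Pick the model whose answers agree most with the explicit channel."""
    return max(
        answers_by_model,
        key=lambda name: consistency_rate(answers_by_model[name],
                                          explicit_answers),
    )


# Hypothetical multiple-choice answers on four unlabeled samples.
explicit = ["A", "B", "D", "D"]
candidates = {
    "model_x": ["A", "C", "C", "D"],
    "model_y": ["A", "B", "C", "D"],
}
print(select_model(candidates, explicit))
```

Because the score compares two channels with each other rather than with labels, it can be computed on a new task before any annotation exists.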
AI · May 7 · Code
Towards Annotation-Free Validation of MLLMs: A Vision-Language Logical Consistency Metric
Ying Gu, Mei Chee Leong, Hui Li Tan et al.
The dominant accuracy-based evaluation may reward unwarranted guessing by large language models, and it may not be applicable for validating models on novel tasks that lack ground-truth (gt) annotation. Based on basic logical principles, we propose a novel framework to evaluate the vision-language logical consistency of MLLMs on both sufficient and necessary cause-effect relations. We define the Vision-Language Logical Consistency Metric (VL-LCM) on traditional MC-VQA tests and on recent NaturalBench tests, without the need for gt annotation. Through systematic experiments on the representative VL benchmark MMMU and recent VL challenges like NaturalBench, we evaluated 11 recent open-source MLLMs from 4 frontier families. Our findings reveal that, despite the significant progress of recent MLLMs on accuracy, logical consistency lags significantly behind. Extensive evaluations of the correlations of VL-LCM with metrics on gt, the reliability of LCM, and the relation of VL-LCM with response distribution justify the validity and applicability of VL-LCM even without gt annotation. Our findings suggest that, beyond accuracy, logical consistency could be employed for both accuracy and reliability. VL-LCM can also be employed for MLLM selection, validation, and reliable answer justification in novel tasks without gt annotation.
CL · Mar 29
PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng et al.
AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University; each is grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
SE · Mar 19
From Human Interfaces to Agent Interfaces: Rethinking Software Design in the Age of AI-Native Systems
Shaolin Wang, Yi Mei, Haoyang Che et al.
Software systems have traditionally been designed for human interaction, emphasizing graphical user interfaces, usability, and cognitive alignment with end users. However, recent advances in large language model (LLM)-based agents are changing the primary consumers of software systems. Increasingly, software is no longer only used by humans, but also invoked autonomously by AI agents through structured interfaces. In this paper, we argue that software engineering is undergoing a paradigm shift from human-oriented interfaces to agent-oriented invocation systems. We formalize the notion of agent interfaces, introduce invocable capabilities as the fundamental building blocks of AI-oriented software, and outline design principles for such systems, including machine interpretability, composability, and invocation reliability. We then discuss architectural and organizational implications of this shift, highlighting a transition from monolithic applications to capability-based systems that can be dynamically composed by AI agents. The paper aims to provide a conceptual foundation for the emerging paradigm of AI-native software design.
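As a rough illustration of what an invocable capability could look like in code, the sketch below maps the three design principles named in the abstract onto a small class. The abstract gives no concrete formalization, so the `Capability` class, its fields, and the result format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Capability:
    """A hypothetical invocable capability exposed to an AI agent."""
    name: str
    description: str
    params: Dict[str, type]      # declared parameter names and types
    handler: Callable[..., Any]  # the function that does the work

    def describe(self) -> Dict[str, Any]:
        """Machine interpretability: a structured, self-describing spec."""
        return {
            "name": self.name,
            "description": self.description,
            "params": {k: t.__name__ for k, t in self.params.items()},
        }

    def invoke(self, **kwargs: Any) -> Dict[str, Any]:
        """Invocation reliability: validate inputs, return structured results."""
        missing = sorted(set(self.params) - set(kwargs))
        extra = sorted(set(kwargs) - set(self.params))
        if missing or extra:
            return {"ok": False,
                    "error": f"missing={missing} unexpected={extra}"}
        for key, expected in self.params.items():
            if not isinstance(kwargs[key], expected):
                return {"ok": False,
                        "error": f"{key} must be {expected.__name__}"}
        return {"ok": True, "result": self.handler(**kwargs)}


# Composability: the output of one capability feeds the next.
add = Capability("add", "Add two integers.", {"a": int, "b": int},
                 lambda a, b: a + b)
double = Capability("double", "Double an integer.", {"x": int},
                    lambda x: 2 * x)

first = add.invoke(a=2, b=3)
if first["ok"]:
    print(double.invoke(x=first["result"]))
```

Returning a structured error instead of raising lets a calling agent inspect the failure and retry with corrected arguments, which is one plausible reading of invocation reliability.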