91.4 CV Apr 16
Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
Xiaomin Li, Tala Wang, Zichen Zhong et al.
Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.
29.8 IT Apr 15
Towards Autonomous Driving with Short-Packet Rate Splitting: Age of Information Analysis and Optimization
Zirui Zheng, Yingyang Chen, Xinyue Pei et al.
To address the high mobility impacts and the ultra-reliable and low-latency communication (URLLC) requirements in autonomous driving scenarios, rate-splitting multiple access (RSMA) combined with short-packet communication (SPC) emerges as a promising solution. Autonomous vehicles rely on real-time information exchange to ensure safety and coordination, making information freshness essential. By jointly capturing transmission delays and packet errors, age of information (AoI) serves as a comprehensive metric for freshness. In this paper, we investigate short-packet rate splitting to enhance information freshness measured by the AoI. By splitting the unicast messages into common and private parts, encoding all common parts together with the multicast message into a common stream, and encoding each private part into a private stream, RSMA effectively manages interference and enables achieving lower AoI. By considering critical factors such as transmit power, vehicle velocity, blocklength, and the number of transmit antennas, we derive closed-form expressions for the average AoI (AAoI) of the common stream under partial decoding and the overall AAoI under complete decoding. To enhance the AAoI performance, we propose the multi-start two-step successive convex approximation (SCA) algorithm. This algorithm first optimizes the power allocation and subsequently optimizes the rate splitting under the quality of service (QoS) trade-off constraint. Simulation results demonstrate that our short-packet rate-splitting scheme significantly improves the AAoI performance while ensuring system fairness and enabling ultra-low AAoI through the common stream, meeting the requirements of autonomous driving applications. Moreover, the trade-off between the common and overall performance is revealed, indicating that the overall performance can be further enhanced while maintaining the advantages of the common stream.
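The link between blocklength, packet errors, and AoI can be illustrated with a minimal sketch. It is not the paper's RSMA derivation: it assumes a single AWGN link, a fresh packet generated at the start of every slot (generate-at-will), a slot length equal to one packet transmission, and the standard normal approximation for finite-blocklength error probability. All function names are hypothetical. Under these assumptions the average AoI is `slot * (1/2 + 1/(1 - eps))`, which the simulation checks numerically.

```python
import math
import random


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def block_error_prob(snr: float, blocklength: int, info_bits: int) -> float:
    """Normal approximation of the short-packet error probability on an
    AWGN link (assumption: single user, no fading, no rate splitting)."""
    capacity = math.log2(1.0 + snr)  # bits per channel use
    dispersion = (1.0 - (1.0 + snr) ** -2) * (math.log2(math.e) ** 2)
    rate = info_bits / blocklength
    return q_function((capacity - rate) / math.sqrt(dispersion / blocklength))


def average_aoi_closed_form(eps: float, slot: float = 1.0) -> float:
    """Average AoI for the toy model: one fresh packet per slot,
    each lost independently with probability eps."""
    return slot * (0.5 + 1.0 / (1.0 - eps))


def average_aoi_simulated(eps: float, num_slots: int,
                          slot: float = 1.0, seed: int = 0) -> float:
    """Time-average of the sawtooth age process over num_slots slots."""
    rng = random.Random(seed)
    age = slot  # age right after a successful delivery
    area = 0.0
    for _ in range(num_slots):
        # Age grows linearly from `age` to `age + slot` during the slot.
        area += slot * (age + slot / 2.0)
        if rng.random() >= eps:
            age = slot          # delivered: the packet is one slot old
        else:
            age += slot         # lost: the receiver's information keeps aging
    return area / (num_slots * slot)
```

In this toy model a longer blocklength lowers `eps` but lengthens each slot, which is the kind of trade-off the abstract's blocklength analysis addresses in a much richer setting.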
CL Nov 18, 2024
OASIS: Open Agent Social Interaction Simulations with One Million Agents
Ziyi Yang, Zaibin Zhang, Zirui Zheng et al.
There has been growing interest in enhancing rule-based agent-based models (ABMs) for social media platforms (e.g., X, Reddit) with more realistic large language model (LLM) agents, thereby allowing for a more nuanced study of complex systems. As a result, several LLM-based ABMs have been proposed in the past year. While they hold promise, each simulator is specifically designed to study a particular scenario, making it time-consuming and resource-intensive to explore other phenomena using the same ABM. Additionally, these models simulate only a limited number of agents, whereas real-world social media platforms involve millions of users. To this end, we propose OASIS, a generalizable and scalable social media simulator. OASIS is designed based on real-world social media platforms, incorporating dynamically updated environments (i.e., dynamic social networks and post information), diverse action spaces (e.g., following, commenting), and recommendation systems (i.e., interest-based and hot-score-based). Additionally, OASIS supports large-scale user simulations, capable of modeling up to one million users. With these features, OASIS can be easily extended to different social media platforms to study large-scale group phenomena and behaviors. We replicate various social phenomena, including information spreading, group polarization, and herd effects across X and Reddit platforms. Moreover, we provide observations of social phenomena at different agent group scales. We observe that larger agent groups lead to stronger group dynamics and to more diverse and helpful agent opinions. These findings demonstrate OASIS's potential as a powerful tool for studying complex systems in digital environments.
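A hot-score-based recommender can be sketched as follows. The abstract does not give the scoring rule, so this example assumes the "hot" ranking from Reddit's formerly open-sourced code (log-scaled net votes plus a recency term). The `Post` type and the `recommend` helper are hypothetical and are not OASIS's API.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timezone

# Reference timestamp used in Reddit's published ranking code.
REDDIT_EPOCH_OFFSET = 1134028003


@dataclass
class Post:  # hypothetical container, not an OASIS type
    post_id: str
    upvotes: int
    downvotes: int
    created: datetime


def hot_score(upvotes: int, downvotes: int, created: datetime) -> float:
    """Log-scaled net votes plus recency: a tenfold increase in net votes
    is worth the same as being 45000 seconds (12.5 hours) newer."""
    score = upvotes - downvotes
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else -1 if score < 0 else 0
    seconds = created.timestamp() - REDDIT_EPOCH_OFFSET
    return round(sign * order + seconds / 45000, 7)


def recommend(posts: list[Post], k: int) -> list[Post]:
    """Return the k posts with the highest hot score."""
    ranked = sorted(
        posts,
        key=lambda p: hot_score(p.upvotes, p.downvotes, p.created),
        reverse=True,
    )
    return ranked[:k]
```

An interest-based recommender would instead rank posts by similarity between a user profile and post content; that variant is not shown here.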
CV Sep 15, 2025
Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking
Zirui Zheng, Takashi Isobe, Tong Shen et al.
While autoregressive (AR) models have demonstrated remarkable success in image generation, extending them to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework for layout-to-image generation that effectively integrates spatial layout constraints into AR-based image generation. To equip the AR model with layout control, a specially designed structured masking strategy is applied to attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents mis-association between different regions and their descriptions while enabling sufficient injection of layout constraints into the generation process. To further enhance generation quality and layout accuracy, we incorporate a Group Relative Policy Optimization (GRPO)-based post-training scheme with specially designed layout reward functions for next-set-based AR models. Experimental results demonstrate that SMARLI is able to seamlessly integrate layout tokens with text and image tokens without compromising generation quality. It achieves superior layout-aware control while maintaining the structural simplicity and generation efficiency of AR models.
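One way to picture a structured attention mask over prompt, layout, and image tokens is the sketch below. The abstract does not specify SMARLI's masking rules, so every rule here is an assumption chosen for illustration: prompt tokens attend only to the prompt, each region's layout tokens attend only to their own region, and image tokens attend to the prompt, to earlier image tokens, and to the layout tokens of any region whose box covers them. The function name, the token ordering, and the plain token-by-token causal order (rather than next-set prediction) are likewise hypothetical.

```python
import numpy as np


def build_structured_mask(num_prompt, region_lens, boxes, grid_h, grid_w):
    """Boolean mask where mask[q, k] is True if query token q may attend to
    key token k. Assumed token order: [prompt | region 1 | region 2 | ... | image].
    Boxes are (x0, y0, x1, y1) in image-token grid cells, half-open."""
    num_layout = sum(region_lens)
    num_image = grid_h * grid_w
    total = num_prompt + num_layout + num_image
    mask = np.zeros((total, total), dtype=bool)

    # Assumption: prompt tokens see the whole prompt and nothing else.
    mask[:num_prompt, :num_prompt] = True

    img_start = num_prompt + num_layout
    # Assumption: image tokens see the prompt and earlier image tokens.
    mask[img_start:, :num_prompt] = True
    mask[img_start:, img_start:] = np.tril(
        np.ones((num_image, num_image), dtype=bool)
    )

    offset = num_prompt
    for length, (x0, y0, x1, y1) in zip(region_lens, boxes):
        span = slice(offset, offset + length)
        # Each region's tokens see only themselves, so descriptions of
        # different regions cannot mix.
        mask[span, span] = True
        # Image tokens inside the box see this region's tokens.
        inside = np.zeros((grid_h, grid_w), dtype=bool)
        inside[y0:y1, x0:x1] = True
        rows = img_start + np.flatnonzero(inside.ravel())
        mask[rows, span] = True
        offset += length
    return mask
```

Such a mask would typically be applied by setting disallowed attention logits to negative infinity before the softmax. Image tokens outside every box fall back to the global prompt alone.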