98.1AIMar 16
VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool ChainingXuanyu Zhu, Yuhao Dong, Rundong Wang et al.
Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench~(VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models' visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro only achieving 51\% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
CVApr 25, 2021
A Novel Binocular Eye-Tracking SystemWith Stereo Stimuli for 3D Gaze EstimationJinglin Sun, Zhipeng Wu, Han Wang et al.
Eye-tracking technologies have been widely used in applications like psychological studies and human computer interactions (HCI). However, most current eye trackers focus on 2D point of gaze (PoG) estimation and cannot provide accurate gaze depth.Concerning future applications such as HCI with 3D displays, we propose a novel binocular eye tracking device with stereo stimuli to provide highly accurate 3D PoG estimation. In our device, the 3D stereo imaging system can provide users with a friendly and immersive 3D visual experience without wearing any accessories. The eye capturing system can directly record the users eye movements under 3D stimuli without disturbance. A regression based 3D eye tracking model is built based on collected eye movement data under stereo stimuli. Our model estimates users 2D gaze with features defined by eye region landmarks and further estimates 3D PoG with a multi source feature set constructed by comprehensive eye movement features and disparity features from stereo stimuli. Two test stereo scenes with different depths of field are designed to verify the model effectiveness. Experimental results show that the average error for 2D gaze estimation was 0.66\degree and for 3D PoG estimation, the average errors are 1.85~cm/0.15~m over the workspace volume 50~cm $\times$ 30~cm $\times$ 75~cm/2.4~m $\times$ 4.0~m $\times$ 7.9~m separately.
IRJun 20, 2018
Optimization Matrix Factorization Recommendation Algorithm Based on Rating CentralityZhipeng Wu, Hui Tian, Xuzhen Zhu et al.
Matrix factorization (MF) is extensively used to mine the user preference from explicit ratings in recommender systems. However, the reliability of explicit ratings is not always consistent, because many factors may affect the user's final evaluation on an item, including commercial advertising and a friend's recommendation. Therefore, mining the reliable ratings of user is critical to further improve the performance of the recommender system. In this work, we analyze the deviation degree of each rating in overall rating distribution of user and item, and propose the notion of user-based rating centrality and item-based rating centrality, respectively. Moreover, based on the rating centrality, we measure the reliability of each user rating and provide an optimized matrix factorization recommendation algorithm. Experimental results on two popular recommendation datasets reveal that our method gets better performance compared with other matrix factorization recommendation algorithms, especially on sparse datasets.
CVMar 4, 2017
Looking at Outfit to Parse ClothingPongsate Tangseng, Zhipeng Wu, Kota Yamaguchi
This paper extends fully-convolutional neural networks (FCN) for the clothing parsing problem. Clothing parsing requires higher-level knowledge on clothing semantics and contextual cues to disambiguate fine-grained categories. We extend FCN architecture with a side-branch network which we refer outfit encoder to predict a consistent set of clothing labels to encourage combinatorial preference, and with conditional random field (CRF) to explicitly consider coherent label assignment to the given image. The empirical results using Fashionista and CFPD datasets show that our model achieves state-of-the-art performance in clothing parsing, without additional supervision during training. We also study the qualitative influence of annotation on the current clothing parsing benchmarks, with our Web-based tool for multi-scale pixel-wise annotation and manual refinement effort to the Fashionista dataset. Finally, we show that the image representation of the outfit encoder is useful for dress-up image retrieval application.