CL May 17, 2025 Code
ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing
Xuanle Zhao, Xuexin Liu, Haoyue Yang et al.
Although multimodal large language models (MLLMs) show promise in generating chart rendering code, editing charts via code presents a greater challenge. This task requires MLLMs to integrate chart understanding and reasoning capacities, which are labor-intensive. While many MLLMs claim such editing capabilities, current evaluations rely on limited case studies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a novel benchmark designed for chart editing tasks, featuring 1,405 diverse editing instructions applied to 233 real-world charts, each manually annotated and validated for accuracy. Using ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and with generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.
LG Nov 24, 2025 Code
TouchFormer: A Robust Transformer-based Framework for Multimodal Material Perception
Kailin Lyu, Long Xiao, Jianing Zeng et al.
Traditional vision-based material perception methods often experience substantial performance degradation under visually impaired conditions, thereby motivating the shift toward non-visual multimodal material perception. Despite this, existing approaches frequently perform naive fusion of multimodal inputs, overlooking key challenges such as modality-specific noise, missing modalities common in real-world scenarios, and the dynamically varying importance of each modality depending on the task. These limitations lead to suboptimal performance across several benchmark tasks. In this paper, we propose a robust multimodal fusion framework, TouchFormer. Specifically, we employ a Modality-Adaptive Gating (MAG) mechanism and intra- and inter-modality attention mechanisms to adaptively integrate cross-modal features, enhancing model robustness. Additionally, we introduce a Cross-Instance Embedding Regularization (CER) strategy, which significantly improves classification accuracy in fine-grained subcategory material recognition tasks. Experimental results demonstrate that, compared to existing non-visual methods, the proposed TouchFormer framework achieves classification accuracy improvements of 2.48% and 6.83% on SSMC and USMC tasks, respectively. Furthermore, real-world robotic experiments validate TouchFormer's effectiveness in enabling robots to better perceive and interpret their environment, paving the way for its deployment in safety-critical applications such as emergency response and industrial automation. The code and datasets will be open-sourced, and the videos are available in the supplementary materials.
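The gating idea in this abstract can be illustrated with a minimal sketch. It is not the TouchFormer implementation: the abstract does not specify how MAG computes its gates, so the softmax weighting, the function name `gated_fusion`, and the modality names are all assumptions made for illustration. The sketch only shows how a gate can give zero weight to a missing modality while weighting the rest by a relevance score.

```python
import math


def gated_fusion(features, gate_scores):
    """Fuse per-modality feature vectors with softmax gate weights.

    features:    dict mapping a modality name to a list of floats,
                 or to None when that modality is missing.
    gate_scores: dict mapping a modality name to a relevance logit
                 (hypothetical; in a real model these would be learned).
    Returns (fused_vector, weights_for_present_modalities).
    """
    present = [name for name, vec in features.items() if vec is not None]
    if not present:
        raise ValueError("at least one modality must be present")

    # Softmax over the present modalities only, so a missing modality
    # contributes nothing regardless of its gate score.
    max_score = max(gate_scores[name] for name in present)
    exp_scores = {name: math.exp(gate_scores[name] - max_score) for name in present}
    total = sum(exp_scores.values())
    weights = {name: exp_scores[name] / total for name in present}

    # Weighted sum of the present feature vectors, element by element.
    dim = len(features[present[0]])
    fused = [
        sum(weights[name] * features[name][i] for name in present)
        for i in range(dim)
    ]
    return fused, weights
```

Calling `gated_fusion({"tactile": [1.0, 3.0], "audio": None}, {"tactile": 0.2, "audio": 1.5})` would place all weight on the hypothetical `tactile` input, because the missing `audio` entry is excluded before normalization.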
CV Jan 27
DiffStyle3D: Consistent 3D Gaussian Stylization via Attention Optimization
Yitong Yang, Xuexin Liu, Yinglin Wang et al.
3D style transfer enables the creation of visually expressive 3D content, enriching the visual appearance of 3D scenes and objects. However, existing VGG- and CLIP-based methods struggle to model multi-view consistency within the model itself, while diffusion-based approaches can capture such consistency but rely on denoising directions, leading to unstable training. To address these limitations, we propose DiffStyle3D, a novel diffusion-based paradigm for 3DGS style transfer that directly optimizes in the latent space. Specifically, we introduce an Attention-Aware Loss that performs style transfer by aligning style features in the self-attention space, while preserving original content through content feature alignment. Inspired by the geometric invariance of 3D stylization, we propose a Geometry-Guided Multi-View Consistency method that integrates geometric information into self-attention to enable cross-view correspondence modeling. Based on geometric information, we additionally construct a geometry-aware mask to prevent redundant optimization in overlapping regions across views, which further improves multi-view consistency. Extensive experiments show that DiffStyle3D outperforms state-of-the-art methods, achieving higher stylization quality and visual realism.
65.0 AI Apr 7
OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward
Haoyue Yang, Xuanle Zhao, Xuexin Liu et al.
The paradigm of programmable diagram generation is evolving rapidly, playing a crucial role in structured visualization. However, most existing studies are confined to a narrow range of task formulations and language support, constraining their applicability to diverse diagram types. In this work, we propose OmniDiagram, a unified framework that incorporates diverse diagram code languages and task definitions. To address the challenge of aligning code logic with visual fidelity in Reinforcement Learning (RL), we introduce a novel visual feedback strategy named Visual Interrogation Verifies All (Viva). Unlike brittle syntax-based rules or pixel-level matching, Viva rewards the visual structure of rendered diagrams through a generative approach. Specifically, Viva actively generates targeted visual inquiries to scrutinize diagram visual fidelity and provides fine-grained feedback for optimization. This mechanism facilitates a self-evolving training process, effectively obviating the need for manually annotated ground truth code. Furthermore, we construct M3²Diagram, the first large-scale diagram code generation dataset, containing over 196k high-quality instances. Experimental results confirm that the combination of SFT and our Viva-based RL allows OmniDiagram to establish a new state-of-the-art (SOTA) across diagram code generation benchmarks.
CV May 13, 2021
Model Pruning Based on Quantified Similarity of Feature Maps
Zidu Wang, Xuexin Liu, Long Huang et al.
Convolutional Neural Networks (CNNs) have been applied in numerous Internet of Things (IoT) devices for a wide range of downstream tasks. However, with the increasing amount of data on edge devices, CNNs can hardly complete some tasks in time with limited computing and storage resources. Recently, filter pruning has been regarded as an effective technique to compress and accelerate CNNs, but existing methods rarely prune CNNs from the perspective of compressing high-dimensional tensors. In this paper, we propose a novel theory to find redundant information in three-dimensional tensors, namely Quantified Similarity between Feature Maps (QSFM), and use this theory to guide the filter pruning procedure. We evaluate QSFM on datasets (CIFAR-10, CIFAR-100 and ILSVRC-12) and edge devices, and demonstrate that the proposed method can effectively find redundant information in neural networks, with comparable compression and a tolerable drop in accuracy. Without any fine-tuning, QSFM can compress ResNet-56 on CIFAR-10 significantly (FLOPs reduced by 48.7% and parameters by 57.9%) with a loss of only 0.54% in top-1 accuracy. For practical application on edge devices, QSFM can accelerate MobileNet-V2 inference by 1.53 times with a loss of only 1.23% in ILSVRC-12 top-1 accuracy.
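The notion of pruning by feature-map similarity can be sketched in a few lines. This is an illustration under stated assumptions, not the QSFM procedure: the abstract does not say which similarity measure is used, so cosine similarity stands in here, and the function names, the flattened-list representation of a feature map, and the `threshold` value are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two flattened feature maps (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero map is treated as similar to nothing
    return dot / (norm_a * norm_b)


def select_redundant_maps(feature_maps, threshold=0.95):
    """Return indices of maps that nearly duplicate an earlier kept map.

    A greedy pass: each map is compared with the maps kept so far and is
    marked redundant if its similarity to any of them reaches `threshold`.
    The filters producing the returned indices would be pruning candidates.
    """
    kept, redundant = [], []
    for index, fmap in enumerate(feature_maps):
        is_duplicate = any(
            cosine_similarity(fmap, feature_maps[k]) >= threshold for k in kept
        )
        if is_duplicate:
            redundant.append(index)
        else:
            kept.append(index)
    return redundant
```

In this sketch a map that is a scaled copy of an earlier one is flagged, since cosine similarity ignores magnitude; a measure sensitive to scale would behave differently, which is one reason the choice of similarity matters.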