CVFeb 9Code
MOVA: Towards Scalable and Synchronized Video-Audio GenerationSII-OpenMOSS Team, Donghua Yu, Mingshu Chen et al.
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
76.3CVMay 19Code
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-DistillationYuanpei Zhao, Jie Lin, Chao Zhang et al.
Pairwise preferences and pointwise ratings are the two dominant annotation protocols in image aesthetic assessment (IAA), yet existing benchmarks adopt only one, leaving their complementarity unmeasured under controlled conditions. We introduce PPaint, a matched dual-protocol benchmark in which 15 domain experts, 5 per category, annotate 150 Chinese paintings under both protocols across five aesthetic dimensions, collecting 45,900 pairwise expert judgments through a locally dense preference design alongside the matched ratings. The matched design reveals complementary strengths: preferences yield more consistent ordinal rankings, while ratings anchor the absolute score scale. Fusing both signals via two independent preference-to-score methods yields a fused expert ground truth on which the two constructions converge to nearly identical scores. The same preference-to-score principle extends to label-free VLM training. PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via an Elo reference pool, and trains the same VLM with confidence-weighted ranking optimization to produce a single-pass aesthetic scorer. Trained on a single painting category, the distilled Qwen3-VL-8B improves mean SRCC from 0.504 to 0.709 across all three categories, outperforming all open-source baselines including the dedicated aesthetic model ArtiMuse and matching closed-source Gemini-3.1-Pro within 0.04 SRCC at single-pass inference cost, with cross-domain transfer further validated on APDDv2. We will release the full PPaint dataset and training code.
CVJul 14, 2023
CeRF: Convolutional Neural Radiance Fields for New View Synthesis with Derivatives of Ray ModelingXiaoyan Yang, Dingbo Lu, Yang Li et al.
In recent years, novel view synthesis has gained popularity in generating high-fidelity images. While demonstrating superior performance in the task of synthesizing novel views, the majority of these methods are still based on the conventional multi-layer perceptron for scene embedding. Furthermore, light field models suffer from geometric blurring during pixel rendering, while radiance field-based volume rendering methods have multiple solutions for a certain target of density distribution integration. To address these issues, we introduce the Convolutional Neural Radiance Fields to model the derivatives of radiance along rays. Based on 1D convolutional operations, our proposed method effectively extracts potential ray representations through a structured neural network architecture. Besides, with the proposed ray modeling, a proposed recurrent module is employed to solve geometric ambiguity in the fully neural rendering process. Extensive experiments demonstrate the promising results of our proposed model compared with existing state-of-the-art methods.
CVJul 30, 2023
InvVis: Large-Scale Data Embedding for Invertible VisualizationHuayuan Ye, Chenhui Li, Yang Li et al.
We present InvVis, a new approach for invertible visualization, which is reconstructing or further modifying a visualization from an image. InvVis allows the embedding of a significant amount of data, such as chart data, chart information, source code, etc., into visualization images. The encoded image is perceptually indistinguishable from the original one. We propose a new method to efficiently express chart data in the form of images, enabling large-capacity data embedding. We also outline a model based on the invertible neural network to achieve high-quality data concealing and revealing. We explore and implement a variety of application scenarios of InvVis. Additionally, we conduct a series of evaluation experiments to assess our method from multiple perspectives, including data embedding quality, data restoration accuracy, data encoding capacity, etc. The result of our experiments demonstrates the great potential of InvVis in invertible visualization.
AIDec 18, 2025
Probing Scientific General Intelligence of LLMs with Scientist-Aligned WorkflowsWanghan Xu, Yuhao Zhou, Yifan Zhou et al.
Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI)-the ability to autonomously conceive, investigate, and reason across scientific domains-remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10--20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answer. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.
CVFeb 25, 2025Code
OpenFly: A Comprehensive Platform for Aerial Vision-Language NavigationYunpeng Gao, Chenhui Li, Zhongrui You et al.
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. The potential reason is that outdoor aerial view encompasses vast areas, making data collection more challenging, which results in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. Firstly, we integrate diverse rendering engines and advanced techniques for environment simulation, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). Particularly, 3D GS supports real-to-sim rendering, further enhancing the realism of our environments. Secondly, we develop a highly automated toolchain for aerial VLN data collection, streamlining point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Thirdly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. Moreover, we propose OpenFly-Agent, a keyframe-aware VLN model emphasizing key observations during flight. For benchmarking, extensive experiments and analyses are conducted, evaluating several recent VLN methods and showcasing the superiority of our OpenFly platform and agent. The toolchain, dataset, and codes will be open-sourced.
CVDec 21, 2025
VizDefender: Unmasking Visualization Tampering through Proactive Localization and Intent InferenceSicheng Song, Yanjie Zhang, Zixin Chen et al.
The integrity of data visualizations is increasingly threatened by image editing techniques that enable subtle yet deceptive tampering. Through a formative study, we define this challenge and categorize tampering techniques into two primary types: data manipulation and visual encoding manipulation. To address this, we present VizDefender, a framework for tampering detection and analysis. The framework integrates two core components: 1) a semi-fragile watermark module that protects the visualization by embedding a location map to images, which allows for the precise localization of tampered regions while preserving visual quality, and 2) an intent analysis module that leverages Multimodal Large Language Models (MLLMs) to interpret manipulation, inferring the attacker's intent and misleading effects. Extensive evaluations and user studies demonstrate the effectiveness of our methods.
CVNov 10, 2025
MPJudge: Towards Perceptual Assessment of Music-Induced PaintingsShiqi Jiang, Tianyi Liang, Changbo Wang et al.
Music induced painting is a unique artistic practice, where visual artworks are created under the influence of music. Evaluating whether a painting faithfully reflects the music that inspired it poses a challenging perceptual assessment task. Existing methods primarily rely on emotion recognition models to assess the similarity between music and painting, but such models introduce considerable noise and overlook broader perceptual cues beyond emotion. To address these limitations, we propose a novel framework for music induced painting assessment that directly models perceptual coherence between music and visual art. We introduce MPD, the first large scale dataset of music painting pairs annotated by domain experts based on perceptual coherence. To better handle ambiguous cases, we further collect pairwise preference annotations. Building on this dataset, we present MPJudge, a model that integrates music features into a visual encoder via a modulation based fusion mechanism. To effectively learn from ambiguous cases, we adopt Direct Preference Optimization for training. Extensive experiments demonstrate that our method outperforms existing approaches. Qualitative results further show that our model more accurately identifies music relevant regions in paintings.
59.4GRApr 30
SandSim: Curve-Guided Gaussian Splatting for Reconstructing Sand Painting ProcessesYilin Wang, Haojie Huang, Chen Li et al.
Sand painting is a process-driven art where visual appearance emerges from granular accumulation. Given a single image, reconstructing a plausible sand painting process requires modeling coherent stroke structures and material-dependent effects. Existing methods, including stroke-based optimization and diffusion-based video synthesis, often lack structural coherence and material consistency, leading to unrealistic drawing sequences. We present SandSim, a framework that reconstructs a sand painting process from a single image. We introduce a curve-guided Gaussian representation that models strokes as sequences of anisotropic primitives along continuous trajectories, whose smooth kernels capture the soft boundaries of sand strokes and enable coherent stroke formation. We further adopt a subtractive compositing scheme to model light attenuation during sand accumulation. We incorporate a semantic-guided planning module for scene decomposition and drawing order inference. Our framework jointly optimizes stroke geometry and appearance and can be integrated with a physics-based simulator for interactive sand dynamics and editing. Experiments show that our method produces temporally coherent and visually realistic results, achieving improved reconstruction quality and perceptual fidelity compared to existing approaches.
AIJun 12, 2025
Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and ReasoningYuhao Zhou, Yiheng Wang, Xuming He et al.
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
CVNov 4, 2024
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language ModelYiming Sun, Fan Yu, Shaoxiang Chen et al.
Visual object tracking aims to locate a targeted object in a video sequence based on an initial bounding box. Recently, Vision-Language~(VL) trackers have proposed to utilize additional natural language descriptions to enhance versatility in various applications. However, VL trackers are still inferior to State-of-The-Art (SoTA) visual trackers in terms of tracking performance. We found that this inferiority primarily results from their heavy reliance on manual textual annotations, which include the frequent provision of ambiguous language descriptions. In this paper, we propose ChatTracker to leverage the wealth of world knowledge in the Multimodal Large Language Model (MLLM) to generate high-quality language descriptions and enhance tracking performance. To this end, we propose a novel reflection-based prompt optimization module to iteratively refine the ambiguous and inaccurate descriptions of the target with tracking feedback. To further utilize semantic information produced by MLLM, a simple yet effective VL tracking framework is proposed and can be easily integrated as a plug-and-play module to boost the performance of both VL and visual trackers. Experimental results show that our proposed ChatTracker achieves a performance comparable to existing methods.
CVApr 15, 2025
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCRYulong Zhang, Tianyi Liang, Xinyue Huang et al.
The Optical Character Recognition (OCR) task is important for evaluating Vision-Language Models (VLMs) and providing high-quality data sources for LLM training data. While state-of-the-art VLMs show improved average OCR accuracy, they still struggle with sample-level quality degradation and lack reliable automatic detection of low-quality outputs. We introduce Consensus Entropy (CE), a training-free post-inference method that quantifies OCR uncertainty by aggregating outputs from multiple VLMs. Our approach exploits a key insight: correct VLM OCR predictions converge in output space while errors diverge. We develop a lightweight multi-model framework that effectively identifies problematic samples, selects the best outputs and combines model strengths. Experiments across multiple OCR benchmarks and VLMs demonstrate that CE outperforms VLM-as-judge approaches and single-model baselines at the same cost and achieves state-of-the-art results across multiple metrics. For instance, our solution demonstrates: achieving 15.2% higher F1 scores than VLM-as-judge methods in quality verification, delivering 6.0% accuracy gains on mathematical calculation tasks, and requiring rephrasing only 7.3% of inputs while maintaining overall performance. Notably, the entire process requires neither training nor supervision while maintaining plug-and-play functionality throughout.
CVNov 25, 2024
Open-Vocabulary Octree-Graph for 3D Scene UnderstandingZhigang Wang, Yifei Su, Chenhui Li et al.
Open-vocabulary 3D scene understanding is indispensable for embodied agents. Recent works leverage pretrained vision-language models (VLMs) for object segmentation and project them to point clouds to build 3D maps. Despite progress, a point cloud is a set of unordered coordinates that requires substantial storage space and does not directly convey occupancy information or spatial relation, making existing methods inefficient for downstream tasks, e.g., path planning and complex text-based object retrieval. To address these issues, we propose Octree-Graph, a novel scene representation for open-vocabulary 3D scene understanding. Specifically, a Chronological Group-wise Segment Merging (CGSM) strategy and an Instance Feature Aggregation (IFA) algorithm are first designed to get 3D instances and corresponding semantic features. Subsequently, an adaptive-octree structure is developed that stores semantics and depicts the occupancy of an object adjustably according to its shape. Finally, the Octree-Graph is constructed where each adaptive-octree acts as a graph node, and edges describe the spatial relations among nodes. Extensive experiments on various tasks are conducted on several widely-used datasets, demonstrating the versatility and effectiveness of our method.
HCMar 6, 2024
SalienTime: User-driven Selection of Salient Time Steps for Large-Scale Geospatial Data VisualizationJuntong Chen, Haiwen Huang, Huayuan Ye et al.
The voluminous nature of geospatial temporal data from physical monitors and simulation models poses challenges to efficient data access, often resulting in cumbersome temporal selection experiences in web-based data portals. Thus, selecting a subset of time steps for prioritized visualization and pre-loading is highly desirable. Addressing this issue, this paper establishes a multifaceted definition of salient time steps via extensive need-finding studies with domain experts to understand their workflows. Building on this, we propose a novel approach that leverages autoencoders and dynamic programming to facilitate user-driven temporal selections. Structural features, statistical variations, and distance penalties are incorporated to make more flexible selections. User-specified priorities, spatial regions, and aggregations are used to combine different perspectives. We design and implement a web-based interface to enable efficient and context-aware selection of time steps and evaluate its efficacy and usability through case studies, quantitative evaluations, and expert interviews.
CVAug 25, 2025
VQualA 2025 Challenge on Face Image Quality Assessment: Methods and ResultsSizhuo Ma, Wei-Ting Chen, Qiang Gao et al.
Face images play a crucial role in numerous applications; however, real-world conditions frequently introduce degradations such as noise, blur, and compression artifacts, affecting overall image quality and hindering subsequent tasks. To address this challenge, we organized the VQualA 2025 Challenge on Face Image Quality Assessment (FIQA) as part of the ICCV 2025 Workshops. Participants created lightweight and efficient models (limited to 0.5 GFLOPs and 5 million parameters) for the prediction of Mean Opinion Scores (MOS) on face images with arbitrary resolutions and realistic degradations. Submissions underwent comprehensive evaluations through correlation metrics on a dataset of in-the-wild face images. This challenge attracted 127 participants, with 1519 final submissions. This report summarizes the methodologies and findings for advancing the development of practical FIQA approaches.
CLAug 6, 2025
IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable RewardsXu Guo, Tianyi Liang, Tong Jian et al.
Reinforcement Learning with Verifiable Rewards (RLVR) improves instruction following capabilities of large language models (LLMs), but suffers from training inefficiency due to inadequate difficulty assessment. Moreover, RLVR is prone to over-optimization, where LLMs exploit verification shortcuts without aligning to the actual intent of user instructions. We introduce Instruction Following Decorator (IFDecorator}, a framework that wraps RLVR training into a robust and sample-efficient pipeline. It consists of three components: (1) a cooperative-adversarial data flywheel that co-evolves instructions and hybrid verifications, generating progressively more challenging instruction-verification pairs; (2) IntentCheck, a bypass module enforcing intent alignment; and (3) trip wires, a diagnostic mechanism that detects reward hacking via trap instructions, which trigger and capture shortcut exploitation behaviors. Our Qwen2.5-32B-Instruct-IFDecorator achieves 87.43% accuracy on IFEval, outperforming larger proprietary models such as GPT-4o. Additionally, we demonstrate substantial improvements on FollowBench while preserving general capabilities. Our trip wires show significant reductions in reward hacking rates. We will release models, code, and data for future research.
CVMar 12, 2024
AACP: Aesthetics assessment of children's paintings based on self-supervised learningShiqi Jiang, Ning Li, Chen Shi et al.
The Aesthetics Assessment of Children's Paintings (AACP) is an important branch of the image aesthetics assessment (IAA), playing a significant role in children's education. This task presents unique challenges, such as limited available data and the requirement for evaluation metrics from multiple perspectives. However, previous approaches have relied on training large datasets and subsequently providing an aesthetics score to the image, which is not applicable to AACP. To solve this problem, we construct an aesthetics assessment dataset of children's paintings and a model based on self-supervised learning. 1) We build a novel dataset composed of two parts: the first part contains more than 20k unlabeled images of children's paintings; the second part contains 1.2k images of children's paintings, and each image contains eight attributes labeled by multiple design experts. 2) We design a pipeline that includes a feature extraction module, perception modules and a disentangled evaluation module. 3) We conduct both qualitative and quantitative experiments to compare our model's performance with five other methods using the AACP dataset. Our experiments reveal that our method can accurately capture aesthetic features and achieve state-of-the-art performance.
CVJul 12, 2025
PPJudge: Towards Human-Aligned Assessment of Artistic Painting ProcessShiqi Jiang, Xinpeng Li, Xi Mao et al.
Artistic image assessment has become a prominent research area in computer vision. In recent years, the field has witnessed a proliferation of datasets and methods designed to evaluate the aesthetic quality of paintings. However, most existing approaches focus solely on static final images, overlooking the dynamic and multi-stage nature of the artistic painting process. To address this gap, we propose a novel framework for human-aligned assessment of painting processes. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset comprising real and synthetic painting process images, annotated by domain experts across eight detailed attributes. Furthermore, we present PPJudge (Painting Process Judge), a Transformer-based model enhanced with temporally-aware positional encoding and a heterogeneous mixture-of-experts architecture, enabling effective assessment of the painting process. Experimental results demonstrate that our method outperforms existing baselines in accuracy, robustness, and alignment with human judgment, offering new insights into computational creativity and art education.
CVApr 18, 2024
TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image GenerationTianyi Liang, Jiangqi Liu, Yifei Huang et al.
Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where clear visual hierarchy between content and text is essential. Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds. We present TextCenGen, a training-free dynamic background adaptation in the blank region for text-friendly image generation. Instead of directly reducing attention in text areas, which degrades image quality, we relocate conflicting objects before background optimization. Our method analyzes cross-attention maps to identify conflicting objects overlapping with text regions and uses a force-directed graph approach to guide their relocation, followed by attention excluding constraints to ensure smooth backgrounds. Our method is plug-and-play, requiring no additional training while well balancing both semantic fidelity and visual quality. Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across four seed datasets, TextCenGen outperforms existing methods by achieving 23% lower saliency overlap in text regions while maintaining 98% of the semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).
LGAug 2, 2025
RelMap: Reliable Spatiotemporal Sensor Data Visualization via Imputative Spatial InterpolationJuntong Chen, Huayuan Ye, He Zhu et al.
Accurate and reliable visualization of spatiotemporal sensor data such as environmental parameters and meteorological conditions is crucial for informed decision-making. Traditional spatial interpolation methods, however, often fall short of producing reliable interpolation results due to the limited and irregular sensor coverage. This paper introduces a novel spatial interpolation pipeline that achieves reliable interpolation results and produces a novel heatmap representation with uncertainty information encoded. We leverage imputation reference data from Graph Neural Networks (GNNs) to enhance visualization reliability and temporal resolution. By integrating Principal Neighborhood Aggregation (PNA) and Geographical Positional Encoding (GPE), our model effectively learns the spatiotemporal dependencies. Furthermore, we propose an extrinsic, static visualization technique for interpolation-based heatmaps that effectively communicates the uncertainties arising from various sources in the interpolated map. Through a set of use cases, extensive evaluations on real-world datasets, and user studies, we demonstrate our model's superior performance for data imputation, the improvements to the interpolant with reference data, and the effectiveness of our visualization design in communicating uncertainties.
CVJul 19, 2025
VisGuard: Securing Visualization Dissemination through Tamper-Resistant Data RetrievalHuayuan Ye, Juntong Chen, Shenzhuo Zhang et al.
The dissemination of visualizations is primarily in the form of raster images, which often results in the loss of critical information such as source code, interactive features, and metadata. While previous methods have proposed embedding metadata into images to facilitate Visualization Image Data Retrieval (VIDR), most existing methods lack practicability since they are fragile to common image tampering during online distribution such as cropping and editing. To address this issue, we propose VisGuard, a tamper-resistant VIDR framework that reliably embeds metadata link into visualization images. The embedded data link remains recoverable even after substantial tampering upon images. We propose several techniques to enhance robustness, including repetitive data tiling, invertible information broadcasting, and an anchor-based scheme for crop localization. VisGuard enables various applications, including interactive chart reconstruction, tampering detection, and copyright protection. We conduct comprehensive experiments on VisGuard's superior performance in data retrieval accuracy, embedding capacity, and security against tampering and steganalysis, demonstrating VisGuard's competence in facilitating and safeguarding visualization dissemination and information conveyance.
CVSep 7, 2020
VisCode: Embedding Information in Visualization Images using Encoder-Decoder NetworkPeiying Zhang, Chenhui Li, Changbo Wang
We present an approach called VisCode for embedding information into visualization images. This technology can implicitly embed data information specified by the user into a visualization while ensuring that the encoded visualization image is not distorted. The VisCode framework is based on a deep neural network. We propose to use visualization images and QR codes data as training data and design a robust deep encoder-decoder network. The designed model considers the salient features of visualization images to reduce the explicit visual loss caused by encoding. To further support large-scale encoding and decoding, we consider the characteristics of information visualization and propose a saliency-based QR code layout algorithm. We present a variety of practical applications of VisCode in the context of information visualization and conduct a comprehensive evaluation of the perceptual quality of encoding, decoding success rate, anti-attack capability, time performance, etc. The evaluation results demonstrate the effectiveness of VisCode.