Wen Zhong

CV
h-index17
6papers
107citations
Novelty50%
AI Score51

6 Papers

18.5CVMay 26
CoilDrop-MRI: Self-supervised physics-guided MRI reconstruction with coil dropout

Tongxi Song, Ziyu Li, Zihan Li et al.

Self-supervised deep learning-based methods have shown great promise for accelerated magnetic resonance imaging (MRI) reconstruction, achieving high image quality without requiring fully sampled data for training. These methods typically partition the acquired data into two disjoint subsets to construct input-target pairs for optimizing the reconstruction network. However, existing approaches perform this partition exclusively within the spatial frequency (k-space) domain, leaving the coil dimension unexplored. To enforce full exploitation of signal correlation across receiver coils, we propose CoilDrop-MRI, which applies coil-wise dropout to the input and uses the dropped data as training targets in a self-supervised framework. This method is integrated into unrolled architectures in both image-domain (SENSE) and k-space (SPIRiT) formulations. We further demonstrate its versatility by extending CoilDrop-MRI to multi-shot, phase-corrected diffusion MRI (dMRI) reconstruction. CoilDrop-MRI is extensively validated on multi-site, multi-field-strength (0.3T, 0.55T, and 3T), and multi-modality (T1-weighted, T2-weighted, T2-FLAIR, and dMRI) datasets and consistently outperforms state-of-the-art self-supervised methods, achieving quality comparable to supervised reconstruction methods without requiring fully sampled reference training data. Moreover, CoilDrop-MRI exhibits strong data efficiency and robust generalization across imaging conditions, establishing it as a practical and versatile framework for self-supervised parallel MRI reconstruction.

89.4CVMay 5Code
VEBench:Benchmarking Large Multimodal Models for Real-World Video Editing

Andong Deng, Dawei Du, Zhenfang Chen et al.

Real-world video editing demands not only expert knowledge of cinematic techniques but also multimodal reasoning to select, align, and combine footage into coherent narratives. While recent Large Multimodal Models (LMMs) have shown remarkable progress in general video understanding, their abilities in multi-video reasoning and operational editing workflows remain largely unexplored. We introduce VEBENCH, the first comprehensive benchmark designed to evaluate both editing knowledge understanding and operational reasoning in realistic video editing scenarios. VEBENCH contains 3.9K high-quality edited videos (over 257 hours) and 3,080 human-verified QA pairs, built through a three-round human-AI collaborative annotation pipeline that ensures precise temporal labeling and semantic consistency. It features two complementary QA tasks: 1) Video Editing Technique Recognition, assessing models' ability to identify 7 editing techniques using multimodal cues; and 2) Video Editing Operation Simulation, modeling real-world editing workflows by requiring the selection and temporal localization of relevant clips from multiple candidates. Extensive experiments across proprietary (e.g., Gemini-2.5-Pro) and open-source LMMs reveal a large gap between current model performance and human-level editing cognition. These results highlight the urgent need for bridging video understanding with creative operational reasoning. We envision VEBENCH as a foundation for advancing intelligent video editing systems and driving future research on complex reasoning.

CVNov 24, 2025Code
Vidi2.5: Large Multimodal Models for Video Understanding and Creation

Vidi Team, Chia-Wen Kuo, Chuang Huang et al.

Video has emerged as the primary medium for communication and creativity on the Internet, driving strong demand for scalable, high-quality video production. Vidi models continue to evolve toward next-generation video creation and have achieved state-of-the-art performance in multimodal temporal retrieval (TR). In its second release, Vidi2 advances video understanding with fine-grained spatio-temporal grounding (STG) and extends its capability to video question answering (Video QA), enabling comprehensive multimodal reasoning. Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges. To enable comprehensive evaluation of STG, we introduce a new benchmark, VUE-STG, which offers critical improvements over existing STG datasets. In addition, we upgrade the previous VUE-TR benchmark to VUE-TR-V2, achieving a more balanced duration and query distribution. Remarkably, the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro Preview and GPT-5, on both VUE-TR-V2 and VUE-STG, while achieving competitive results with popular open-source models with similar scale on video QA benchmarks. The latest Vidi2.5 offers significantly stronger STG capability and slightly better TR and Video QA performance over Vidi2. This update also introduces a Vidi2.5-Think model to handle plot understanding with complex plot reasoning. To comprehensively evaluate the performance of plot understanding, we propose VUE-PLOT benchmark with two tracks, Character and Reasoning. Notably, Vidi2.5-Think outperforms Gemini 3 Pro Preview on fine-grained character understanding with comparable performance on complex plot reasoning. Furthermore, we demonstrate the effectiveness of Vidi2.5 on a challenging real-world application, video editing planning.

CVApr 22, 2025
Vidi: Large Multimodal Models for Video Understanding and Editing

Vidi Team, Celong Liu, Chia-Wen Kuo et al.

Humans naturally share information with those they are connected to, and video has become one of the dominant mediums for communication and expression on the Internet. To support the creation of high-quality large-scale video content, a modern pipeline requires a comprehensive understanding of both the raw input materials (e.g., the unedited footage captured by cameras) and the editing components (e.g., visual effects). In video editing scenarios, models must process multiple modalities (e.g., vision, audio, text) with strong background knowledge and handle flexible input lengths (e.g., hour-long raw videos), which poses significant challenges for traditional models. In this report, we introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understand editing scenarios. The first release focuses on temporal retrieval, i.e., identifying the time ranges within the input videos corresponding to a given text query, which plays a critical role in intelligent editing. The model is capable of processing hour-long videos with strong temporal understanding capability, e.g., retrieve time ranges for certain queries. To support a comprehensive evaluation in real-world scenarios, we also present the VUE-TR benchmark, which introduces five key advancements. 1) Video duration: significantly longer than videos of existing temporal retrival datasets, 2) Audio support: includes audio-based queries, 3) Query format: diverse query lengths/formats, 4) Annotation quality: ground-truth time ranges are manually annotated. 5) Evaluation metric: a refined IoU metric to support evaluation over multiple time ranges. Remarkably, Vidi significantly outperforms leading proprietary models, e.g., GPT-4o and Gemini, on the temporal retrieval task, indicating its superiority in video editing scenarios.

IRNov 21, 2016
MV-RNN: A Multi-View Recurrent Neural Network for Sequential Recommendation

Qiang Cui, Shu Wu, Qiang Liu et al.

Sequential recommendation is a fundamental task for network applications, and it usually suffers from the item cold start problem due to the insufficiency of user feedbacks. There are currently three kinds of popular approaches which are respectively based on matrix factorization (MF) of collaborative filtering, Markov chain (MC), and recurrent neural network (RNN). Although widely used, they have some limitations. MF based methods could not capture dynamic user's interest. The strong Markov assumption greatly limits the performance of MC based methods. RNN based methods are still in the early stage of incorporating additional information. Based on these basic models, many methods with additional information only validate incorporating one modality in a separate way. In this work, to make the sequential recommendation and deal with the item cold start problem, we propose a Multi-View Recurrent Neural Network (MV-RNN}) model. Given the latent feature, MV-RNN can alleviate the item cold start problem by incorporating visual and textual information. First, At the input of MV-RNN, three different combinations of multi-view features are studied, like concatenation, fusion by addition and fusion by reconstructing the original multi-modal data. MV-RNN applies the recurrent structure to dynamically capture the user's interest. Second, we design a separate structure and a united structure on the hidden state of MV-RNN to explore a more effective way to handle multi-view features. Experiments on two real-world datasets show that MV-RNN can effectively generate the personalized ranking list, tackle the missing modalities problem and significantly alleviate the item cold start problem.

LGSep 18, 2016
Sequential Ensemble Learning for Outlier Detection: A Bias-Variance Perspective

Shebuti Rayana, Wen Zhong, Leman Akoglu

Ensemble methods for classification and clustering have been effectively used for decades, while ensemble learning for outlier detection has only been studied recently. In this work, we design a new ensemble approach for outlier detection in multi-dimensional point data, which provides improved accuracy by reducing error through both bias and variance. Although classification and outlier detection appear as different problems, their theoretical underpinnings are quite similar in terms of the bias-variance trade-off [1], where outlier detection is considered as a binary classification task with unobserved labels but a similar bias-variance decomposition of error. In this paper, we propose a sequential ensemble approach called CARE that employs a two-phase aggregation of the intermediate results in each iteration to reach the final outcome. Unlike existing outlier ensembles which solely incorporate a parallel framework by aggregating the outcomes of independent base detectors to reduce variance, our ensemble incorporates both the parallel and sequential building blocks to reduce bias as well as variance by ($i$) successively eliminating outliers from the original dataset to build a better data model on which outlierness is estimated (sequentially), and ($ii$) combining the results from individual base detectors and across iterations (parallelly). Through extensive experiments on sixteen real-world datasets mainly from the UCI machine learning repository [2], we show that CARE performs significantly better than or at least similar to the individual baselines. We also compare CARE with the state-of-the-art outlier ensembles where it also provides significant improvement when it is the winner and remains close otherwise.