Weihan Xu

CV
h-index15
4papers
5citations
Novelty36%
AI Score31

4 Papers

SDAug 13, 2024
A New Dataset, Notation Software, and Representation for Computational Schenkerian Analysis

Stephen Ni-Hahn, Weihan Xu, Jerry Yin et al.

Schenkerian Analysis (SchA) is a uniquely expressive method of music analysis, combining elements of melody, harmony, counterpoint, and form to describe the hierarchical structure supporting a work of music. However, despite its powerful analytical utility and potential to improve music understanding and generation, SchA has rarely been utilized by the computer music community. This is in large part due to the paucity of available high-quality data in a computer-readable format. With a larger corpus of Schenkerian data, it may be possible to infuse machine learning models with a deeper understanding of musical structure, thus leading to more "human" results. To encourage further research in Schenkerian analysis and its potential benefits for music informatics and generation, this paper presents three main contributions: 1) a new and growing dataset of SchAs, the largest in human- and computer-readable formats to date (>140 excerpts), 2) a novel software for visualization and collection of SchA data, and 3) a novel, flexible representation of SchA as a heterogeneous-edge graph data structure.

CVDec 14, 2025
Schrodinger Audio-Visual Editor: Object-Level Audiovisual Removal

Weihan Xu, Kan Jen Cheng, Koichi Saito et al.

Joint editing of audio and visual content is crucial for precise and controllable content creation. This new task poses challenges due to the limitations of paired audio-visual data before and after targeted edits, and the heterogeneity across modalities. To address the data and modeling challenges in joint audio-visual editing, we introduce SAVEBench, a paired audiovisual dataset with text and mask conditions to enable object-grounded source-to-target learning. With SAVEBench, we train the Schrodinger Audio-Visual Editor (SAVE), an end-to-end flow-matching model that edits audio and video in parallel while keeping them aligned throughout processing. SAVE incorporates a Schrodinger Bridge that learns a direct transport from source to target audiovisual mixtures. Our evaluation demonstrates that the proposed SAVE model is able to remove the target objects in audio and visual content while preserving the remaining content, with stronger temporal synchronization and audiovisual semantic correspondence compared with pairwise combinations of an audio editor and a video editor.

CVMay 24, 2025
REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing

Weihan Xu, Yimeng Ma, Jingyue Huang et al.

Short videos are an effective tool for promoting contents and improving knowledge accessibility. While existing extractive video summarization methods struggle to produce a coherent narrative, existing abstractive methods cannot `quote' from the input videos, i.e., inserting short video clips in their outputs. In this work, we explore novel video editing models for generating shorts that feature a coherent narrative with embedded video insertions extracted from a long input video. We propose a novel retrieval-embedded generation framework that allows a large language model to quote multimodal resources while maintaining a coherent narrative. Our proposed REGen system first generates the output story script with quote placeholders using a finetuned large language model, and then uses a novel retrieval model to replace the quote placeholders by selecting a video clip that best supports the narrative from a pool of candidate quotable video clips. We examine the proposed method on the task of documentary teaser generation, where short interview insertions are commonly used to support the narrative of a documentary. Our objective evaluations show that the proposed method can effectively insert short video clips while maintaining a coherent narrative. In a subjective survey, we show that our proposed method outperforms existing abstractive and extractive approaches in terms of coherence, alignment, and realism in teaser generation.

AIDec 2, 2024
Recurrent Neural Network on PICTURE Model

Weihan Xu

Intensive Care Units (ICUs) provide critical care and life support for most severely ill and injured patients in the hospital. With the need for ICUs growing rapidly and unprecedentedly, especially during COVID-19, accurately identifying the most critical patients helps hospitals to allocate resources more efficiently and save more lives. The Predicting Intensive Care Transfers and Other Unforeseen Events (PICTURE) model predicts patient deterioration by separating those at high risk for imminent intensive care unit transfer, respiratory failure, or death from those at lower risk. This study aims to implement a deep learning model to benchmark the performance from the XGBoost model, an existing model which has competitive results on prediction.