Ye Tao

CV
h-index: 7
12 papers
31 citations
Novelty: 52%
AI Score: 53

12 Papers

83.4 SD Jun 2
Foley-Omni: A Unified Multimodal Generation Model from Task-Level Audio Synthesis to Complete Video Soundtrack Generation

Ye Tao, Lupeng Liu, Xuenan Xu et al.

Recent unified audio generation models can support diverse tasks across speech, sound effects, and music, but most of them still focus on isolated task-level synthesis. However, real video production often requires multiple components of a complete audio track to be generated jointly and consistently for the same video. We present Foley-Omni, a unified multimodal audio generation model that extends isolated task-level synthesis to complete video soundtrack generation by jointly modeling speech, sound effects, and music within a shared latent generation process. To support training and reproducible evaluation, we develop an audiovisual data curation pipeline and introduce V2ST-Bench, a benchmark for holistic video soundtrack generation evaluation. Experiments show that Foley-Omni achieves competitive performance with expert systems on individual synthesis tasks, while improving speech intelligibility, audiovisual consistency and perceptual quality for mixed soundtrack generation.

73.1 CV Jun 1
MotionDreamer: Universal Skeletal Motion Generation for 3D Rigged Shapes

Ye Tao, Yuxin Yao, Kendong Liu et al.

Motion generation for rigged shapes is vital for scalable 4D asset production. However, template-based methods are limited by specific topologies and fail to generalize across diverse morphologies. Conversely, per-case optimization is computationally expensive, susceptible to local optima, and highly sensitive to viewpoint-induced ambiguities. In this paper, we present MotionDreamer, a diffusion-based framework designed for category-agnostic skeletal animation generation from 2D video guidance. To overcome the scarcity of high-quality training data, we have curated a large-scale dynamic dataset comprising approximately 20,000 diverse 3D models, each featuring complete textures, skeletal rigging, and a wide array of comprehensive animation sequences. To bridge the kinematic gap between 2D visual motion cues and heterogeneous 3D skeletal structures, we propose a structural-semantic injection mechanism. Our model integrates texture and semantic attributes directly into skeletal joint representations. This allows it to map perceived visual dynamics to specific joint hierarchies and their functional roles. This enables MotionDreamer to synthesize high-fidelity animations that maintain anatomical consistency across a vast range of unseen categories, from existing biological species to fantastical beings. Extensive experiments demonstrate that our approach significantly outperforms existing methods, setting a new state-of-the-art benchmark for robust and efficient 4D asset generation. The code will be made publicly available upon acceptance.

LG Jul 29, 2023
An Automata-Theoretic Approach to Synthesizing Binarized Neural Networks

Ye Tao, Wanwei Liu, Fu Song et al.

Deep neural networks (DNNs, a.k.a. NNs) have been widely used in various tasks and have been proven to be successful. However, the accompanying expensive computing and storage costs make deployment on resource-constrained devices a significant concern. To solve this issue, quantization has emerged as an effective way to reduce the costs of DNNs with little accuracy degradation by quantizing floating-point numbers to low-width fixed-point representations. Quantized neural networks (QNNs) have been developed, with binarized neural networks (BNNs) restricted to binary values as a special case. Another concern about neural networks is their vulnerability and lack of interpretability. Despite active research on the trustworthiness of DNNs, few approaches have been proposed for QNNs. To this end, this paper presents an automata-theoretic approach to synthesizing BNNs that meet designated properties. More specifically, we define a temporal logic, called BLTL, as the specification language. We show that each BLTL formula can be transformed into an automaton on finite words. To deal with the state-explosion problem, we provide a tableau-based approach in the actual implementation. For the synthesis procedure, we utilize SMT solvers to detect the existence of a model (i.e., a BNN) in the construction process. Notably, synthesis provides a way to determine the hyper-parameters of the network before training. Moreover, we experimentally evaluate our approach and demonstrate its effectiveness in improving the individual fairness and local robustness of BNNs while maintaining accuracy to a great extent.

AI Sep 26, 2025 Code
Not only a helper, but also a teacher: Interactive LLM Cascade

Yu Wu, Shuo Wu, Ye Tao et al.

Large Language Models (LLMs) vary widely in their capabilities, with larger models often having better performance but higher cost: choosing an LLM often involves trading off performance and cost. The LLM Cascade is a paradigm that defers difficult queries from weak/cheap to strong/expensive models. This approach is nonadaptive: the deferral decision is trained offline. When confronted with similar or repeated queries, the LLM Cascade may then repeatedly consult the expensive model and incur higher cost. To improve cascading efficiency, we propose Inter-Cascade, an online and interactive LLM Cascade that extends the role of the strong model from a backup helper to a long-term teacher. In our system, when a strong model resolves a difficult query, it also distills its solution into a generalized, reusable problem-solving strategy that boosts the weak model on subsequent queries. Adding strategies to queries enables the weak model to dynamically improve its performance over time, avoiding computationally expensive and time-intensive fine-tuning. Empirically, compared with standard LLM Cascade baselines across multiple benchmarks, Inter-Cascade significantly improves the accuracy of the weak model (by up to 33.06 absolute percentage points) and the overall system (by up to 5.53 absolute percentage points), while reducing the calls to strong models (by up to 48.05% relative reduction) and saving the corresponding fees (by up to 49.63% relative reduction). Inter-Cascade demonstrates effective in-context knowledge transfer between LLMs and provides a general, scalable framework applicable to both open-source and API-based LLMs.
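
The deferral-plus-teaching loop described in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical: the `InterCascadeSketch` class, the confidence threshold, the two toy models, and the choice to reuse the most recent strategies (the abstract does not say how strategies are selected or stored).

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
# a weak model returns (answer, confidence in [0, 1]);
# a strong model returns (answer, reusable strategy text).
WeakModel = Callable[[str], Tuple[str, float]]
StrongModel = Callable[[str], Tuple[str, str]]


class InterCascadeSketch:
    """Toy cascade in which the strong model also teaches the weak one."""

    def __init__(self, weak: WeakModel, strong: StrongModel,
                 threshold: float = 0.7, max_strategies: int = 3) -> None:
        self.weak = weak
        self.strong = strong
        self.threshold = threshold          # assumed deferral rule
        self.max_strategies = max_strategies
        self.strategies: List[str] = []     # strategies distilled so far
        self.strong_calls = 0

    def _augment(self, query: str) -> str:
        # Assumption: prepend the most recent strategies to the query.
        recent = self.strategies[-self.max_strategies:]
        if not recent:
            return query
        hints = "\n".join(f"Strategy: {s}" for s in recent)
        return f"{hints}\nQuery: {query}"

    def answer(self, query: str) -> str:
        ans, confidence = self.weak(self._augment(query))
        if confidence >= self.threshold:
            return ans                      # weak model handles the query
        self.strong_calls += 1              # defer to the strong model
        ans, strategy = self.strong(query)
        self.strategies.append(strategy)    # keep the lesson for later queries
        return ans


def toy_weak(prompt: str) -> Tuple[str, float]:
    # Stand-in model: confident only when a strategy hint is present.
    if "Strategy:" in prompt:
        return "42", 0.9
    return "unsure", 0.2


def toy_strong(query: str) -> Tuple[str, str]:
    # Stand-in model: always answers and returns a reusable strategy.
    return "42", "break the problem into smaller steps"
```

With these toy models, the first query is deferred and the second is answered by the weak model using the stored strategy, so the strong model is called only once.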

CL Dec 18, 2023
Knowledge Graphs and Pre-trained Language Models enhanced Representation Learning for Conversational Recommender Systems

Zhangchi Qiu, Ye Tao, Shirui Pan et al.

Conversational recommender systems (CRS) utilize natural language interactions and dialogue history to infer user preferences and provide accurate recommendations. Due to the limited conversation context and background knowledge, existing CRSs rely on external sources such as knowledge graphs to enrich the context and model entities based on their inter-relations. However, these methods ignore the rich intrinsic information within entities. To address this, we introduce the Knowledge-Enhanced Entity Representation Learning (KERL) framework, which leverages both the knowledge graph and a pre-trained language model to improve the semantic understanding of entities for CRS. In our KERL framework, entity textual descriptions are encoded via a pre-trained language model, while a knowledge graph helps reinforce the representation of these entities. We also employ positional encoding to effectively capture the temporal information of entities in a conversation. The enhanced entity representation is then used to develop a recommender component that fuses both entity and contextual representations for more informed recommendations, as well as a dialogue component that generates informative entity-related information in the response text. A high-quality knowledge graph with aligned entity descriptions is constructed to facilitate our study, namely the Wiki Movie Knowledge Graph (WikiMKG). The experimental results show that KERL achieves state-of-the-art results in both recommendation and response generation tasks.

22.5 DC Apr 24
$O(K)$-Approximation Coflow Scheduling in $K$-Core Optical Circuit Switching Networks

Xin Wang, Hong Shen, Hui Tian et al.

Coflow has emerged as a fundamental application-layer abstraction in distributed systems, representing communication dependencies and enabling collaborative management of related flows to enhance job completion efficiency. To meet the increasing bandwidth demands of modern data center networks (DCNs), optical circuit switches are widely deployed due to their high capacity and energy efficiency. Simultaneously, DCN deployments are evolving towards heterogeneous parallel architectures, where multiple independent optical circuit switching (OCS) cores operate concurrently to facilitate bandwidth expansion and incremental upgrades. However, existing research on coflow scheduling in multi-core switching fabrics primarily focuses on electrical packet switching (EPS) networks, with only a few known results on OCS networks, which offer either no performance guarantee or a poor one. This paper studies the coflow scheduling problem in multi-core OCS networks under the not-all-stop (i.e., asynchronous) reconfiguration model, focusing on two major challenges: overcoming cross-core coupling for inter-core traffic allocation, and satisfying the constraints of port exclusivity and reconfiguration overhead for intra-core circuit scheduling. To minimize total weighted coflow completion time (CCT), we propose an efficient algorithm that integrates linear programming-guided (LP-guided) global coflow ordering, inter-core flow allocation and intra-core circuit scheduling, and that achieves approximation ratios of 8K and 8K+1 for zero and arbitrary release times of coflows, respectively, where K is the number of OCS cores. This framework is also applicable to H-core EPS networks, providing approximation guarantees of 4H and 4H+1 for zero-time and arbitrary-time release, respectively.

10.2 CR Mar 12
Functional Approximation Methods for Differentially Private Distribution Estimation

Ye Tao, Anand D. Sarwate

The cumulative distribution function (CDF) is fundamental for characterizing random variables, making it essential in applications that require privacy-preserving data analysis. This paper introduces a novel framework for constructing differentially private CDFs inspired by functional analysis and the functional mechanism. We develop two variants: a polynomial projection method, which projects the empirical CDF into a polynomial space, and a sparse approximation method via matching pursuit, which projects it into arbitrary function spaces constructed from dictionaries. In both cases, the empirical CDF is approximated within the chosen space, and the corresponding coefficients are privatized to guarantee differential privacy. Compared with existing approaches such as histogram queries and adaptive quantiles, our methods achieve comparable or superior performance. Our methods are particularly well-suited to decentralized settings and scenarios where CDFs must be efficiently updated with newly collected or streaming data. In addition, we investigate the influence of parameters such as dictionary size and systematically evaluate different dictionary constructions, including Legendre polynomials, B-splines, and distribution-based functions. Overall, our contributions advance the development of practical and reliable methods for privacy-preserving CDF estimation.
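
The polynomial projection variant described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's mechanism: data are assumed to lie in [-1, 1], the basis is Legendre polynomials, neighbouring datasets differ by replacing one record, and the Laplace scale uses a conservative L1 sensitivity bound of (d+1)^2 / n derived for this sketch only (each coefficient k moves by at most (2k+1)/n when one record changes). All function names are hypothetical.

```python
import bisect
import random
from typing import List, Sequence


def legendre(k: int, x: float) -> float:
    """Evaluate the Legendre polynomial P_k(x) with Bonnet's recurrence."""
    p_prev, p = 1.0, x
    if k == 0:
        return p_prev
    for n in range(1, k):
        p_prev, p = p, ((2 * n + 1) * x * p - n * p_prev) / (n + 1)
    return p


def project_cdf(data: Sequence[float], degree: int, grid: int = 2000) -> List[float]:
    """Project the empirical CDF of data in [-1, 1] onto P_0..P_degree.

    Coefficients are c_k = (2k+1)/2 * integral of F_n(x) P_k(x) over [-1, 1],
    approximated with a midpoint rule on `grid` cells.
    """
    xs = sorted(data)
    n = len(xs)
    width = 2.0 / grid
    coeffs = [0.0] * (degree + 1)
    for j in range(grid):
        x = -1.0 + (j + 0.5) * width
        f = bisect.bisect_right(xs, x) / n      # empirical CDF at x
        for k in range(degree + 1):
            coeffs[k] += (2 * k + 1) / 2.0 * f * legendre(k, x) * width
    return coeffs


def privatize(coeffs: Sequence[float], n: int, epsilon: float,
              rng: random.Random) -> List[float]:
    """Add Laplace noise to the coefficients (assumed sensitivity bound)."""
    scale = len(coeffs) ** 2 / (n * epsilon)    # (d+1)^2 / (n * epsilon)
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(scale).
    return [c + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
            for c in coeffs]


def evaluate(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the polynomial CDF estimate, clamped to [0, 1]."""
    value = sum(c * legendre(k, x) for k, c in enumerate(coeffs))
    return min(1.0, max(0.0, value))
```

Only the coefficient vector is released, so a new batch of data can be handled by recomputing and re-privatizing a few numbers rather than a whole histogram. The sketch does not address floating-point issues in noise sampling.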

CV Apr 11, 2025
Light-YOLOv8-Flame: A Lightweight High-Performance Flame Detection Algorithm

Jiawei Lan, Ye Tao, Zhibiao Wang et al.

Fire detection algorithms, particularly those based on computer vision, encounter significant challenges such as high computational costs and delayed response times, which hinder their application in real-time systems. To address these limitations, this paper introduces Light-YOLOv8-Flame, a lightweight flame detection algorithm specifically designed for fast and efficient real-time deployment. The proposed model enhances the YOLOv8 architecture through the substitution of the original C2f module with the FasterNet Block module. This new block combines Partial Convolution (PConv) and Convolution (Conv) layers, reducing both computational complexity and model size. A dataset comprising 7,431 images, representing both flame and non-flame scenarios, was collected and augmented for training purposes. Experimental findings indicate that the modified YOLOv8 model achieves a 0.78% gain in mean average precision (mAP) and a 2.05% boost in recall, while reducing the parameter count by 25.34%, with only a marginal decrease in precision by 0.82%. These findings highlight that Light-YOLOv8-Flame offers enhanced detection performance and speed, making it well-suited for real-time fire detection on resource-constrained devices.
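
To illustrate why Partial Convolution reduces computation, the sketch below counts multiply-accumulate operations for a dense convolution and for a PConv that convolves only a fraction of the channels and passes the rest through unchanged. The function names and the 1/4 channel ratio are assumptions for illustration; the abstract does not state the ratio used in Light-YOLOv8-Flame.

```python
def conv_flops(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates of a dense k x k convolution on an h x w map
    (stride 1, output size equal to input size)."""
    return h * w * k * k * c_in * c_out


def split_channels(channels: int, ratio: float = 0.25) -> tuple:
    """Return (convolved channels, untouched channels) for a PConv layer."""
    c_p = int(channels * ratio)
    return c_p, channels - c_p


def pconv_flops(h: int, w: int, k: int, channels: int, ratio: float = 0.25) -> int:
    """Multiply-accumulates of a partial convolution: only the first
    `ratio` share of channels is convolved, the rest are copied as-is."""
    c_p, _ = split_channels(channels, ratio)
    return conv_flops(h, w, k, c_p, c_p)
```

For a 40 x 40 feature map with 64 channels and a 3 x 3 kernel, the dense layer costs 16 times as many multiply-accumulates as the partial one at a 1/4 ratio, since the cost scales with the square of the number of convolved channels.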

CV Aug 26, 2025
Clustering-based Feature Representation Learning for Oracle Bone Inscriptions Detection

Ye Tao, Xinran Fu, Honglin Pang et al.

Oracle Bone Inscriptions (OBIs) play a crucial role in understanding ancient Chinese civilization. The automated detection of OBIs from rubbing images represents a fundamental yet challenging task in digital archaeology, primarily due to various degradation factors, including noise and cracks, that limit the effectiveness of conventional detection networks. To address these challenges, we propose a novel clustering-based feature space representation learning method. Our approach uniquely leverages the Oracle Bones Character (OBC) font library dataset as prior knowledge to enhance feature extraction in the detection network through clustering-based representation learning. The method incorporates a specialized loss function derived from clustering results to optimize feature representation, which is then integrated into the total network loss. We validate the effectiveness of our method by conducting experiments on two OBI detection datasets using three mainstream detection frameworks: Faster R-CNN, DETR, and Sparse R-CNN. Through extensive experimentation, all frameworks demonstrate significant performance improvements.

CV Aug 6, 2025
Slice or the Whole Pie? Utility Control for AI Models

Ye Tao

Training deep neural networks (DNNs) has become an increasingly resource-intensive task, requiring large volumes of labeled data, substantial computational power, and considerable fine-tuning efforts to achieve optimal performance across diverse use cases. Although pre-trained models offer a useful starting point, adapting them to meet specific user needs often demands extensive customization and infrastructure overhead. This challenge grows when a single model must support diverse applications with differing performance requirements. Traditional solutions often involve training multiple model versions to meet varying requirements, which can be inefficient and difficult to maintain. To overcome this challenge, we propose NNObfuscator, a novel utility control mechanism that enables AI models to dynamically modify their performance according to predefined conditions. Unlike traditional methods that need a separate model for each user, NNObfuscator allows a single model to be adapted in real time, giving users controlled access to multiple levels of performance. This mechanism enables model owners to set up tiered access, ensuring that free-tier users receive a baseline level of performance while premium users benefit from enhanced capabilities. The approach improves resource allocation, reduces unnecessary computation, and supports sustainable business models in AI deployment. To validate our approach, we conducted experiments on multiple tasks, including image classification, semantic segmentation, and text-to-image generation, using well-established models such as ResNet, DeepLab, VGG16, FCN and Stable Diffusion. Experimental results show that NNObfuscator successfully makes models more adaptable, so that a single trained model can handle a broad range of tasks without requiring extensive changes.

CV Mar 8, 2025
GSV3D: Gaussian Splatting-based Geometric Distillation with Stable Video Diffusion for Single-Image 3D Object Generation

Ye Tao, Jiawei Zhang, Yahao Shi et al.

Image-based 3D generation has vast applications in robotics and gaming, where high-quality, diverse outputs and consistent 3D representations are crucial. However, existing methods have limitations: 3D diffusion models are limited by dataset scarcity and the absence of strong pre-trained priors, while 2D diffusion-based approaches struggle with geometric consistency. We propose a method that leverages 2D diffusion models' implicit 3D reasoning ability while ensuring 3D consistency via Gaussian-splatting-based geometric distillation. Specifically, the proposed Gaussian Splatting Decoder enforces 3D consistency by transforming SV3D latent outputs into an explicit 3D representation. Unlike SV3D, which relies only on implicit 2D representations for video generation, Gaussian Splatting explicitly encodes spatial and appearance attributes, enabling multi-view consistency through geometric constraints. These constraints correct view inconsistencies, ensuring robust geometric consistency. As a result, our approach simultaneously generates high-quality, multi-view-consistent images and accurate 3D models, providing a scalable solution for single-image-based 3D generation and bridging the gap between 2D diffusion diversity and 3D structural coherence. Experimental results demonstrate state-of-the-art multi-view consistency and strong generalization across diverse datasets. The code will be made publicly available upon acceptance.

IR Sep 11, 2020
TRec: Sequential Recommender Based On Latent Item Trend Information

Ye Tao, Can Wang, Lina Yao et al.

Recommendation systems play an important role in online web applications. Sequential recommenders further model short-term user preferences by exploiting information from the latest user-item interaction history. Most sequential recommendation methods neglect the importance of ever-changing item popularity. Our model is motivated by the intuition that items with the most user interactions may have been popular in the past but could have gone out of fashion recently. To this end, this paper proposes a novel sequential recommendation approach dubbed TRec. TRec learns item trend information from implicit user interaction history and incorporates it into next-item recommendation tasks. A self-attention mechanism is then used to learn better node representations. Our model is trained via pairwise rank-based optimization. We conduct extensive experiments with seven baseline methods on four benchmark datasets. The empirical results show that our approach outperforms other state-of-the-art methods while maintaining a remarkably low runtime cost. Our study demonstrates the importance of item trend information in recommendation system design, and the efficiency of our method makes it practical in real-world scenarios.
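
The abstract says only that training uses pairwise rank-based optimization. One common instance of that idea is the Bayesian Personalized Ranking (BPR) loss, sketched below as an assumption rather than as TRec's confirmed objective. The function name `bpr_loss` is hypothetical.

```python
import math


def bpr_loss(pos_score: float, neg_score: float) -> float:
    """Pairwise ranking loss -log(sigmoid(pos_score - neg_score)).

    The loss is small when the interacted (positive) item is scored above
    the sampled negative item, and large when the order is reversed.
    It is computed as softplus(-diff) in a numerically stable form.
    """
    diff = pos_score - neg_score
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))
```

Because the loss depends only on the score difference within a pair, it optimizes the relative order of items rather than their absolute scores, which matches the next-item ranking setting.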