Xiao Sha


2 Papers

CV · Mar 2 · Code

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Hebeizi Li, Zihao Liang, Benyuan Sun et al.

While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
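The shared self-attention idea in the abstract can be illustrated with a minimal sketch: audio and video latent tokens are concatenated into one sequence, so every token attends across both modalities. This is not the UniTalking implementation. The function name `joint_self_attention`, the single head, and the identity query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_self_attention(audio_tokens, video_tokens):
    """Single-head self-attention over the concatenated token sequence.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real block would use learned projections.
    """
    tokens = audio_tokens + video_tokens  # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        # Scaled dot-product scores against every token of both modalities.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * value[i] for w, value in zip(weights, tokens)) for i in range(dim)]
        )
    n_audio = len(audio_tokens)
    # Split back into per-modality streams.
    return outputs[:n_audio], outputs[n_audio:]


audio = [[1.0, 0.0], [0.0, 1.0]]
video = [[1.0, 1.0]]
audio_out, video_out = joint_self_attention(audio, video)
```

Because the attention weights are computed over the joint sequence, each updated audio token carries information from the video tokens and vice versa, which is the property the abstract relies on for lip synchronization.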

IR · Oct 18, 2019

Hierarchical Attentive Knowledge Graph Embedding for Personalized Recommendation

Xiao Sha, Zhu Sun, Jie Zhang

Knowledge graphs (KGs) have proven to be effective for high-quality recommendation, where the connectivities between users and items provide rich and complementary information to user-item interactions. Most existing methods, however, are insufficient to exploit the KGs for capturing user preferences, as they either represent the user-item connectivities via paths with limited expressiveness or implicitly model them by propagating information over the entire KG with inevitable noise. In this paper, we design a novel hierarchical attentive knowledge graph embedding (HAKG) framework to exploit the KGs for effective recommendation. Specifically, HAKG first extracts the expressive subgraphs that link user-item pairs to characterize their connectivities, which accommodate both the semantics and topology of KGs. The subgraphs are then encoded via a hierarchical attentive subgraph encoding to generate effective subgraph embeddings for enhanced user preference prediction. Extensive experiments show the superiority of HAKG against state-of-the-art recommendation methods, as well as its potential in alleviating the data sparsity issue.
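The first step described above, extracting a subgraph that links a user-item pair, can be sketched as follows. The abstract does not specify how HAKG builds its subgraphs, so this sketch assumes a simple rule: collect all simple paths of bounded length between the user and the item, then keep every triple whose endpoints both lie on such a path. The names `extract_subgraph` and `max_hops`, and the toy triples, are hypothetical. The hierarchical attentive encoder is not sketched here.

```python
from collections import defaultdict


def build_adjacency(triples):
    """Map each entity to its neighbours, ignoring edge direction (assumption)."""
    adjacency = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)
        adjacency[tail].add(head)
    return adjacency


def extract_subgraph(triples, user, item, max_hops=3):
    """Return the triples induced by all simple user-item paths of at most max_hops edges."""
    adjacency = build_adjacency(triples)
    nodes_on_paths = set()

    def walk(node, path):
        if node == item:
            nodes_on_paths.update(path)  # the whole path connects user and item
            return
        if len(path) - 1 >= max_hops:  # path already uses max_hops edges
            return
        for neighbour in adjacency[node]:
            if neighbour not in path:  # keep paths simple (no revisits)
                walk(neighbour, path + [neighbour])

    walk(user, [user])
    return [
        (h, r, t) for h, r, t in triples if h in nodes_on_paths and t in nodes_on_paths
    ]


# Toy knowledge graph (hypothetical data).
triples = [
    ("u1", "watched", "m1"),
    ("m1", "directed_by", "d1"),
    ("m2", "directed_by", "d1"),
    ("m3", "has_genre", "g9"),
]
subgraph = extract_subgraph(triples, "u1", "m2")
```

Here `subgraph` holds the first three triples, which form the path u1, m1, d1, m2; the unrelated `m3` triple is excluded. Unlike a single path, the induced subgraph keeps both the relation semantics and the local topology around the pair.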