IR Feb 20, 2025
External Large Foundation Model: How to Efficiently Serve Trillions of Parameters for Online Ads Recommendation
Mingfu Liang, Xi Liu, Rong Jin et al.
Ads recommendation is a prominent service of online advertising systems and has been actively studied. Recent studies indicate that scaling up the recommendation model and refining its design can bring significant performance improvements. However, as model scale grows, such studies increasingly diverge from industry practice because they often neglect two fundamental challenges in industrial-scale applications. First, the training and inference budgets of the served model are restricted; exceeding them may add latency and impair user experience. Second, large volumes of data arrive in a streaming mode with dynamically shifting distributions, as new users/ads join and existing users/ads leave the system. We propose the External Large Foundation Model (ExFM) framework to address these overlooked challenges. Specifically, we develop external distillation and a data augmentation system (DAS) to control the computational cost of training/inference while maintaining high performance. We design the teacher in the manner of a foundation model (FM) that can serve multiple students as vertical models (VMs), amortizing its building cost. We propose Auxiliary Head and Student Adapter to mitigate the data distribution gap between the FM and VMs caused by streaming data. Comprehensive experiments on internal industrial-scale applications and public datasets demonstrate significant performance gains from ExFM.
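The teacher-student setup described in this abstract can be illustrated with a generic distillation loss. This is a minimal sketch of standard knowledge distillation for a binary click label, not the ExFM implementation: the function names, the `alpha` mixing weight, and the idea of reading the teacher's prediction from a logged record are all assumptions made for illustration, and the Auxiliary Head and Student Adapter are not modeled here.

```python
import math


def binary_cross_entropy(target, prob, eps=1e-7):
    """Cross-entropy between a target in [0, 1] and a predicted probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(label, teacher_prob, student_prob, alpha=0.5):
    """Mix the loss on the true label with the loss on the teacher's soft label.

    alpha is a hypothetical mixing weight: 0 ignores the teacher,
    1 trains the student on the teacher's prediction only.
    """
    hard = binary_cross_entropy(label, student_prob)
    soft = binary_cross_entropy(teacher_prob, student_prob)
    return (1.0 - alpha) * hard + alpha * soft


# Hypothetical logged records: (true label, teacher prediction).
# The teacher is run once and its outputs are reused by every student,
# which is one way its cost could be amortized.
logged = [(1.0, 0.9), (0.0, 0.2)]
student_preds = [0.7, 0.4]
losses = [distillation_loss(y, t, s) for (y, t), s in zip(logged, student_preds)]
```

Because the student only consumes stored teacher predictions, the large model does not need to run inside the student's training or serving path, which matches the cost constraint the abstract describes.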
SD Jun 11, 2021
Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling
Jingbei Li, Yi Meng, Chenyi Li et al.
Compared with traditional text-to-speech (TTS) systems, conversational TTS systems are required to synthesize speech with a proper speaking style that conforms to the conversational context. However, state-of-the-art context modeling methods in conversational TTS only model the textual information in the context with a recurrent neural network (RNN). Such methods have limited ability to model the inter-speaker influence in conversations, and they also neglect the speaking styles and the intra-speaker inertia of each speaker. Inspired by DialogueGCN and its superiority over RNN-based approaches in modeling such conversational influences, we propose a graph-based multi-modal context modeling method and apply it to conversational TTS to enhance the speaking styles of synthesized speech. Both the textual and speaking style information in the context are extracted and processed by DialogueGCN to model the inter- and intra-speaker influence in conversations. The outputs of DialogueGCN are then summarized by an attention mechanism and converted to the enhanced speaking style for the current utterance. An English conversation corpus is collected and annotated for our research and released to the public. Experimental results on this corpus demonstrate the effectiveness of our proposed approach, which outperforms the state-of-the-art context modeling method in conversational TTS in both MOS and ABX preference rate.
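The step in which per-utterance graph outputs are summarized by attention can be sketched as follows. The abstract does not specify the attention form, so scaled dot-product scoring is assumed here; `attention_summary`, the query vector, and the toy node vectors are hypothetical and stand in for the DialogueGCN outputs and the current-utterance representation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_summary(query, node_outputs):
    """Pool per-utterance vectors into one context vector.

    query:        vector for the current utterance (assumed input)
    node_outputs: one vector per context utterance, e.g. graph node outputs
    Scoring uses a scaled dot product, which is an assumption of this sketch.
    """
    dim = len(query)
    scores = [
        sum(q * h for q, h in zip(query, node)) / math.sqrt(dim)
        for node in node_outputs
    ]
    weights = softmax(scores)
    summary = [
        sum(w * node[i] for w, node in zip(weights, node_outputs))
        for i in range(dim)
    ]
    return weights, summary


# Toy example: three context utterances with 2-dimensional outputs.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attention_summary([1.0, 0.0], nodes)
```

In a full system the pooled vector would be mapped to a speaking-style embedding by a further learned layer; that mapping is omitted here.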