69.6IRMar 18
Deploying Semantic ID-based Generative Retrieval for Large-Scale Podcast Discovery at SpotifyEdoardo D'Amico, Marco De Nadai, Praveen Chandar et al.
Podcast listening is often grounded in a set of favorite shows, while listener intent can evolve over time. This combination of stable preferences and changing intent motivates recommendation approaches that support both familiarity and exploration. Traditional recommender systems typically emphasize long-term interaction patterns, and are less explicitly designed to incorporate rich contextual signals or flexible, intent-aware discovery objectives. In this setting, models that can jointly reason over semantics, context, and user state offer a promising direction. Large Language Models (LLMs) provide strong semantic reasoning and contextual conditioning for discovery-oriented recommendation, but deploying them in production introduces challenges in catalog grounding, user-level personalization, and latency-critical serving. We address these challenges with GLIDE, a production-scale generative recommender for podcast discovery at Spotify. GLIDE formulates recommendation as an instruction-following task over a discretized catalog using Semantic IDs, enabling grounded generation over a large inventory. The model conditions on recent listening history and lightweight user context, while injecting long-term user embeddings as soft prompts to capture stable preferences under strict inference constraints. We evaluate GLIDE using offline retrieval metrics, human judgments, and LLM-based evaluation, and validate its impact through large-scale online A/B testing. Across experiments involving millions of users, GLIDE increases non-habitual podcast streaming on Spotify home surface by up to 5.4% and new-show discovery by up to 14.3%, while meeting production cost and latency constraints.
IRApr 23, 2019Code
Sequential modeling of Sessions using Recurrent Neural Networks for Skip PredictionSainath Adapa
Recommender systems play an essential role in music streaming services, prominently in the form of personalized playlists. Exploring the user interactions within these listening sessions can be beneficial to understanding the user preferences in the context of a single session. In the 'Spotify Sequential Skip Prediction Challenge', WSDM, and Spotify are challenging people to understand the way users sequentially interact with music. We describe our solution approach in this paper and also state proposals for further improvements to the model. The proposed model initially generates a fixed vector representation of the session, and this additional information is incorporated into an Encoder-Decoder style architecture. This method achieved the seventh position in the competition, with a mean average accuracy of 0.604 on the test set. The solution code is available at https://github.com/sainathadapa/spotify-sequential-skip-prediction.
SDNov 16, 2019
Music theme recognition using CNN and self-attentionManoj Sukhavasi, Sainath Adapa
We present an efficient architecture to detect mood/themes in music tracks on autotagging-moodtheme subset of the MTG-Jamendo dataset. Our approach consists of two blocks, a CNN block based on MobileNetV2 architecture and a self-attention block from Transformer architecture to capture long term temporal characteristics. We show that our proposed model produces a significant improvement over the baseline model. Our model (team name: AMLAG) achieves 4th place on PR-AUC-macro Leaderboard in MediaEval 2019: Emotion and Theme Recognition in Music Using Jamendo.
SDSep 27, 2019
Urban Sound Tagging using Convolutional Neural NetworksSainath Adapa
In this paper, we propose a framework for environmental sound classification in a low-data context (less than 100 labeled examples per class). We show that using pre-trained image classification models along with the usage of data augmentation techniques results in higher performance over alternative approaches. We applied this system to the task of Urban Sound Tagging, part of the DCASE 2019. The objective was to label different sources of noise from raw audio data. A modified form of MobileNetV2, a convolutional neural network (CNN) model was trained to classify both coarse and fine tags jointly. The proposed model uses log-scaled Mel-spectrogram as the representation format for the audio data. Mixup, Random erasing, scaling, and shifting are used as data augmentation techniques. A second model that uses scaled labels was built to account for human errors in the annotations. The proposed model achieved the first rank on the leaderboard with Micro-AUPRC values of 0.751 and 0.860 on fine and coarse tags, respectively.