LG Feb 27, 2024
Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations
Jiaqi Zhai, Lucy Liao, Xing Liu et al.
Large-scale recommendation systems are characterized by their reliance on high cardinality, heterogeneous features and the need to handle tens of billions of user actions on a daily basis. Despite being trained on a huge volume of data with thousands of features, most Deep Learning Recommendation Models (DLRMs) in industry fail to scale with compute. Inspired by the success achieved by Transformers in language and vision domains, we revisit fundamental design choices in recommendation systems. We reformulate recommendation problems as sequential transduction tasks within a generative modeling framework ("Generative Recommenders"), and propose a new architecture, HSTU, designed for high cardinality, non-stationary streaming recommendation data. HSTU outperforms baselines over synthetic and public datasets by up to 65.8% in NDCG, and is 5.3x to 15.2x faster than FlashAttention2-based Transformers on 8192 length sequences. HSTU-based Generative Recommenders, with 1.5 trillion parameters, improve metrics in online A/B tests by 12.4% and have been deployed on multiple surfaces of a large internet platform with billions of users. More importantly, the model quality of Generative Recommenders empirically scales as a power-law of training compute across three orders of magnitude, up to GPT-3/LLaMa-2 scale, which reduces the carbon footprint needed for future model developments, and further paves the way for the first foundational models in recommendations.
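For readers unfamiliar with the NDCG metric cited in the abstract above, the following is a minimal sketch of NDCG@k using the common linear-gain, log2-discount definition. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, and the abstract does not state which NDCG variant or cutoff the paper uses, so this should not be read as the authors' evaluation code.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k ranked relevance scores."""
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all: define NDCG as 0
    return dcg_at_k(relevances, k) / ideal_dcg
```

A ranking that places the only relevant item first scores 1.0; placing it second scores 1/log2(3), roughly 0.63.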
LG Nov 26, 2025
Generative Early Stage Ranking
Juhee Hong, Meng Liu, Shengzhi Wang et al.
Large-scale recommendation systems commonly adopt a multi-stage cascading ranking paradigm to balance effectiveness and efficiency. Early Stage Ranking (ESR) systems utilize the "user-item decoupling" approach, where independently learned user and item representations are only combined at the final layer. While efficient, this design is limited in effectiveness, as it struggles to capture fine-grained user-item affinities and cross-signals. To address these limitations, we propose the Generative Early Stage Ranking (GESR) paradigm, introducing the Mixture of Attention (MoA) module which leverages diverse attention mechanisms to bridge the effectiveness gap: the Hard Matching Attention (HMA) module encodes explicit cross-signals by computing raw match counts between user and item features; the Target-Aware Self Attention module generates target-aware user representations conditioned on the item, enabling more personalized learning; and the Cross Attention modules facilitate early and more enriched interactions between user-item features. MoA's specialized attention encodings are further refined in the final layer through a Multi-Logit Parameterized Gating (MLPG) module, which integrates the newly learned embeddings via gating and produces secondary logits that are fused with the primary logit. To address the efficiency and latency challenges, we have introduced a comprehensive suite of optimization techniques. These span from custom kernels that maximize the capabilities of the latest hardware to efficient serving solutions powered by caching mechanisms. The proposed GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks, as validated by both offline and online experiments. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.
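The idea of "raw match counts between user and item features" can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `raw_match_counts`, the representation of user history as a list of feature dictionaries, and the feature names are hypothetical, and the abstract does not describe how HMA is actually implemented.

```python
def raw_match_counts(user_history, item_features):
    """For each candidate-item feature, count the history events sharing its value.

    user_history: list of dicts mapping feature name -> categorical value
    item_features: dict mapping feature name -> categorical value
    """
    counts = {name: 0 for name in item_features}
    for event in user_history:
        for name, value in item_features.items():
            # A match requires the event to carry the feature with the same value.
            if name in event and event[name] == value:
                counts[name] += 1
    return counts


# Hypothetical example: three past events and one candidate item.
history = [
    {"category": "sports", "creator": "a"},
    {"category": "sports", "creator": "b"},
    {"category": "news", "creator": "a"},
]
candidate = {"category": "sports", "creator": "a"}
match_counts = raw_match_counts(history, candidate)
```

In this toy example the candidate matches two past events on `category` and two on `creator`; such counts are explicit cross-signals that a decoupled user-item model cannot compute before its final layer.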
IR Jul 24, 2025
Request-Only Optimization for Recommendation Systems
Liang Guo, Wei Li, Lucy Liao et al.
Deep Learning Recommendation Models (DLRMs) represent one of the largest machine learning applications on the planet. Industry-scale DLRMs are trained with petabytes of recommendation data to serve billions of users every day. To utilize the rich user signals in the long user history, DLRMs have been scaled up to unprecedented complexity, up to trillions of floating-point operations (TFLOPs) per example. This scale, coupled with the huge amount of training data, necessitates new storage and training algorithms to efficiently improve the quality of these complex recommendation systems. In this paper, we present a Request-Only Optimizations (ROO) training and modeling paradigm. ROO simultaneously improves the storage and training efficiency as well as the model quality of recommendation systems. We holistically approach this challenge through co-designing data (i.e., request-only data), infrastructure (i.e., request-only based data processing pipeline), and model architecture (i.e., request-only neural architectures). Our ROO training and modeling paradigm treats a user request as a unit of the training data. First, compared with the established practice of treating a user impression as a unit, our new design achieves native feature deduplication in data logging, consequently saving data storage. Second, by de-duplicating computations and communications across multiple impressions in a request, this new paradigm enables highly scaled-up neural network architectures to better capture user interest signals, such as Generative Recommenders (GRs) and other request-only friendly architectures.
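The difference between impression-level and request-level training units can be sketched as follows. This is a toy illustration under stated assumptions, not the ROO pipeline: the function `group_by_request` and the record fields (`request_id`, `user_features`, `item_id`, `label`) are hypothetical, and it assumes all impressions in a request share identical user features.

```python
def group_by_request(impressions):
    """Collapse impression-level rows into one record per request.

    User features are stored once per request instead of once per impression,
    which is the deduplication idea described in the abstract.
    """
    requests = {}
    for imp in impressions:
        record = requests.setdefault(
            imp["request_id"],
            {"user_features": imp["user_features"], "items": []},
        )
        record["items"].append({"item_id": imp["item_id"], "label": imp["label"]})
    return requests


# Hypothetical impression log: request r1 produced two impressions.
impression_log = [
    {"request_id": "r1", "user_features": [3, 7, 9], "item_id": 10, "label": 1},
    {"request_id": "r1", "user_features": [3, 7, 9], "item_id": 11, "label": 0},
    {"request_id": "r2", "user_features": [4, 2], "item_id": 12, "label": 0},
]
request_log = group_by_request(impression_log)
```

With this layout a model can encode the user side once per request and reuse it for every candidate item in that request.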
IR Aug 4, 2025
Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment
Dai Li, Kevin Course, Wei Li et al.
While scaling laws promise significant performance gains for recommender systems, efficiently deploying hyperscale models remains a major unsolved challenge. In contrast to fields where foundation models (FMs) are already widely adopted, such as natural language processing and computer vision, progress in recommender systems is hindered by unique challenges including the need to learn from online streaming data under shifting data distributions, the need to adapt to different recommendation surfaces with a wide diversity in their downstream tasks and their input distributions, and stringent latency and computational constraints. To bridge this gap, we propose to leverage the Foundation-Expert Paradigm: a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific "expert" models via target-aware embeddings, allowing them to adapt to local data distributions and optimization goals with minimal overhead. To meet our training, inference and development needs, we built HyperCast, a production-grade infrastructure system that re-engineers training, serving, logging and iteration to power this decoupled paradigm. Our approach is now deployed at Meta serving tens of billions of user requests daily, demonstrating online metric improvements over our previous one-stage production system while improving developer velocity and maintaining infrastructure efficiency. To the best of our knowledge, this work represents the first successful deployment of a Foundation-Expert paradigm at this scale, offering a proven, compute-efficient, and developer-friendly blueprint to realize the promise of scaling laws in recommender systems.
CV Nov 23, 2020
Adversarial Refinement Network for Human Motion Prediction
Xianjin Chao, Yanrui Bin, Wenqing Chu et al.
Human motion prediction aims to predict future 3D skeletal sequences given a limited human motion sequence as input. Two popular methods, recurrent neural networks and feed-forward deep networks, are able to predict rough motion trends, but motion details such as limb movement may be lost. To predict more accurate future human motion, we propose an Adversarial Refinement Network (ARNet) following a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequences and the coarse prediction as input to our cascaded refinement network to predict refined human motion, and strengthen the refinement network with adversarial error augmentation. During training, we deliberately introduce the error distribution by learning through the adversarial mechanism among different subjects. In testing, our cascaded refinement network alleviates the prediction error from the coarse predictor, robustly yielding a finer prediction. This adversarial error augmentation provides rich error cases as input to our refinement network, leading to better generalization performance on the testing dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed ARNet outperforms other state-of-the-art methods, especially on challenging aperiodic actions in both short-term and long-term predictions.
CV Aug 3, 2020
Adversarial Semantic Data Augmentation for Human Pose Estimation
Yanrui Bin, Xuan Cao, Xinya Chen et al.
Human pose estimation is the task of localizing body keypoints from still images. The state-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion and nearby persons. To increase the number of challenging cases, previous methods augmented images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearance and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict a tailored pasting configuration. Given an off-the-shelf pose estimation network as the discriminator, the generator seeks the most confusing transformation to increase the loss of the discriminator, while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks.
CV May 3, 2019
Anti-Confusing: Region-Aware Network for Human Pose Estimation
Xuan Cao, Yanhao Ge, Ying Tai et al.
In this work, we propose a novel framework named Region-Aware Network (RANet), which learns to avoid confusion in cases of heavy occlusion, nearby persons and symmetric appearance, for human pose estimation. Specifically, the proposed method addresses three key aspects: data augmentation, feature learning and prediction fusion. First, we propose Parsing-based Data Augmentation (PDA) to generate abundant data that synthesizes confusing textures. Second, we not only propose a Feature Pyramid Stem (FPS) to learn stronger low-level features in the lower stage, but also incorporate an Effective Region Extraction (ERE) module to excavate better target-specific features. Third, we introduce Cascade Voting Fusion (CVF) to explicitly exclude the inferior predictions and fuse the remaining effective predictions for the final pose estimation. Extensive experimental results on two popular benchmarks, i.e., MPII and LSP, demonstrate the effectiveness of our method against the state-of-the-art competitors. Our method makes significant improvements especially on easily confused joints.
CV Nov 29, 2017
Sparse Photometric 3D Face Reconstruction Guided by Morphable Models
Xuan Cao, Zhang Chen, Anpei Chen et al.
We present a novel 3D face reconstruction technique that leverages sparse photometric stereo (PS) and the latest advances in face registration/modeling from a single image. We observe that the 3D morphable face approach provides a reasonable geometry proxy for light position calibration. Specifically, we develop a robust optimization technique that can calibrate per-pixel lighting direction and illumination at a very high precision without assuming uniform surface albedos. Next, we apply semantic segmentation on input images and the geometry proxy to refine hairy vs. bare skin regions using tailored filters. Experiments on synthetic and real data show that by using a very small set of images, our technique is able to reconstruct fine geometric details such as wrinkles, eyebrows, whelks, pores, etc., comparable to and sometimes surpassing movie quality productions.
CV Aug 2, 2017
A Learning-based Framework for Hybrid Depth-from-Defocus and Stereo Matching
Zhang Chen, Xinqing Guo, Siyuan Li et al.
Depth from defocus (DfD) and stereo matching are the two most studied passive depth sensing schemes. The techniques are essentially complementary: DfD can robustly handle repetitive textures that are problematic for stereo matching, whereas stereo matching is insensitive to defocus blurs and can handle a large depth range. In this paper, we present a unified learning-based technique to conduct hybrid DfD and stereo matching. Our input is image triplets: a stereo pair and a defocused image of one of the stereo views. We first apply depth-guided light field rendering to construct a comprehensive training dataset for such hybrid sensing setups. Next, we adopt the hourglass network architecture to separately conduct depth inference from DfD and stereo. Finally, we exploit different connection methods between the two separate networks for integrating them into a unified solution to produce high fidelity 3D disparity maps. Comprehensive experiments on real and synthetic data show that our new learning-based hybrid 3D sensing technique can significantly improve accuracy and robustness in 3D reconstruction.