IRNov 16, 2023
Scaling User Modeling: Large-scale Online User Representations for Ads Personalization in MetaWei Zhang, Dai Li, Chen Liang et al.
Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory, often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representation across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed SUM Online Asynchronous Platform (SOAP), a latency free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily, yielding significant online metric gains and improved infrastructure efficiency.
LGSep 19, 2024
Enhancing Performance and Scalability of Large-Scale Recommendation Systems with Jagged Flash AttentionRengan Xu, Junjie Yang, Yifan Xu et al.
The integration of hardware accelerators has significantly advanced the capabilities of modern recommendation systems, enabling the exploration of complex ranking paradigms previously deemed impractical. However, the GPU-based computational costs present substantial challenges. In this paper, we demonstrate our development of an efficiency-driven approach to explore these paradigms, moving beyond traditional reliance on native PyTorch modules. We address the specific challenges posed by ranking models' dependence on categorical features, which vary in length and complicate GPU utilization. We introduce Jagged Feature Interaction Kernels, a novel method designed to extract fine-grained insights from long categorical features through efficient handling of dynamically sized tensors. We further enhance the performance of attention mechanisms by integrating Jagged tensors with Flash Attention. Our novel Jagged Flash Attention achieves up to 9x speedup and 22x memory reduction compared to dense attention. Notably, it also outperforms dense flash attention, with up to 3x speedup and 53% more memory efficiency. In production models, we observe 10% QPS improvement and 18% memory savings, enabling us to scale our recommendation systems with longer features and more complex architectures.
CLJun 25, 2025
ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language ModelsDai Li, Linzhuo Li, Huilian Sophie Qiu
Large language models (LLMs) in the form of chatbots like ChatGPT and Llama are increasingly proposed as "silicon samples" for simulating human opinions. This study examines this notion, arguing that LLMs may misrepresent population-level opinions. We identify two fundamental challenges: a failure in structural consistency, where response accuracy doesn't hold across demographic aggregation levels, and homogenization, an underrepresentation of minority opinions. To investigate these, we prompted ChatGPT (GPT-4) and Meta's Llama 3.1 series (8B, 70B, 405B) with questions on abortion and unauthorized immigration from the American National Election Studies (ANES) 2020. Our findings reveal significant structural inconsistencies and severe homogenization in LLM responses compared to human data. We propose an "accuracy-optimization hypothesis," suggesting homogenization stems from prioritizing modal responses. These issues challenge the validity of using LLMs, especially chatbots AI, as direct substitutes for human survey data, potentially reinforcing stereotypes and misinforming policy.
IRAug 4, 2025
Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model DeploymentDai Li, Kevin Course, Wei Li et al.
While scaling laws promise significant performance gains for recommender systems, efficiently deploying hyperscale models remains a major unsolved challenge. In contrast to fields where FMs are already widely adopted such as natural language processing and computer vision, progress in recommender systems is hindered by unique challenges including the need to learn from online streaming data under shifting data distributions, the need to adapt to different recommendation surfaces with a wide diversity in their downstream tasks and their input distributions, and stringent latency and computational constraints. To bridge this gap, we propose to leverage the Foundation-Expert Paradigm: a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific "expert" models via target-aware embeddings, allowing them to adapt to local data distributions and optimization goals with minimal overhead. To meet our training, inference and development needs, we built HyperCast, a production-grade infrastructure system that re-engineers training, serving, logging and iteration to power this decoupled paradigm. Our approach is now deployed at Meta serving tens of billions of user requests daily, demonstrating online metric improvements over our previous one-stage production system while improving developer velocity and maintaining infrastructure efficiency. To the best of our knowledge, this work represents the first successful deployment of a Foundation-Expert paradigm at this scale, offering a proven, compute-efficient, and developer-friendly blueprint to realize the promise of scaling laws in recommender systems.
IRJun 9, 2024
Async Learned User Embeddings for Ads Delivery OptimizationMingwei Tang, Meng Liu, Hong Li et al.
In recommendation systems, high-quality user embeddings can capture subtle preferences, enable precise similarity calculations, and adapt to changing preferences over time to maintain relevance. The effectiveness of recommendation systems depends on the quality of user embedding. We propose to asynchronously learn high fidelity user embeddings for billions of users each day from sequence based multimodal user activities through a Transformer-like large scale feature learning module. The async learned user representations embeddings (ALURE) are further converted to user similarity graphs through graph learning and then combined with user realtime activities to retrieval highly related ads candidates for the ads delivery system. Our method shows significant gains in both offline and online experiments.
CRFeb 17, 2022
MeNTT: A Compact and Efficient Processing-in-Memory Number Theoretic Transform (NTT) AcceleratorDai Li, Akhil Pakala, Kaiyuan Yang
Lattice-based cryptography (LBC) exploiting Learning with Errors (LWE) problems is a promising candidate for post-quantum cryptography. Number theoretic transform (NTT) is the latency- and energy- dominant process in the computation of LWE problems. This paper presents a compact and efficient in-MEmory NTT accelerator, named MeNTT, which explores optimized computation in and near a 6T SRAM array. Specifically-designed peripherals enable fast and efficient modular operations. Moreover, a novel mapping strategy reduces the data flow between NTT stages into a unique pattern, which greatly simplifies the routing among processing units (i.e., SRAM column in this work), reducing energy and area overheads. The accelerator achieves significant latency and energy reductions over prior arts.
CRJul 7, 2021
A Dual-Port 8-T CAM-Based Network Intrusion Detection Engine for IoTDai Li, Kaiyuan Yang
This letter presents an energy- and memory-efficient pattern-matching engine for a network intrusion detection system (NIDS) in the Internet of Things. Tightly coupled architecture and circuit co-designs are proposed to fully exploit the statistical behaviors of NIDS pattern matching. The proposed engine performs pattern matching in three phases, where the phase-1 prefix matching employs reconfigurable pipelined automata processing to minimize memory footprint without loss of throughput and efficiency. The processing elements utilize 8-T content-addressable memory (CAM) cells for dual-port search by leveraging proposed fixed-1s encoding. A 65-nm prototype demonstrates best-in-class 1.54-fJ energy per search per pattern byte and 0.9-byte memory usage per pattern byte.
SPJul 7, 2021
A Self-Regulated and Reconfigurable CMOS Physically Unclonable Function Featuring Zero-Overhead StabilizationDai Li, Kaiyuan Yang
This article presents a reconfigurable physically unclonable function (PUF) design fabricated using 65-nm CMOS technology. A subthreshold-inverter-based static PUF cell achieves 0.3% native bit error rate (BER) at 0.062-fJ per bit core energy efficiency. A flexible, native transistor-based voltage regulation scheme achieves low-overhead supply regulation with 6-mV/V line sensitivity, making the PUF resistant against voltage variations. Additionally, the PUF cell is designed to be reconfigurable with no area overhead, which enables stabilization without redundancy on chip. Thanks to the highly stable and self-regulated PUF cell and the zero-overhead stabilization scheme, a 0.00182% native BER is obtained after reconfiguration. The proposed design shows 0.12%/10 °C and 0.057%/0.1-V bit error across the military-grade temperature range from -55 °C to 125 °C and supply voltage variation from 0.7 to 1.4 V. The total energy per bit is 15.3 fJ. Furthermore, the unstable bits can be detected by sweeping the body bias instead of temperature during enrollment, thereby significantly reducing the testing costs. Last but not least, the prototype exhibits almost ideal uniqueness and randomness, with a mean inter-die Hamming distance (HD) of 0.4998 and a 1020x inter-/intra-die HD separation. It also passes both NIST 800-22 and 800-90B randomness tests.
ARJul 6, 2021
CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN InferenceZhiyu Chen, Zhanghao Yu, Qing Jin et al.
A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) getting rid of sample and hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single 512x128 macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.
CVMar 20, 2018
Dynamic Filtering with Large Sampling Field for ConvNetsJialin Wu, Dai Li, Yu Yang et al.
We propose a dynamic filtering strategy with large sampling field for ConvNets (LS-DFN), where the position-specific kernels learn from not only the identical position but also multiple sampled neighbor regions. During sampling, residual learning is introduced to ease training and an attention mechanism is applied to fuse features from different samples. Such multiple samples enlarge the kernels' receptive fields significantly without requiring more parameters. While LS-DFN inherits the advantages of DFN, namely avoiding feature map blurring by position-wise kernels while keeping translation invariance, it also efficiently alleviates the overfitting issue caused by much more parameters than normal CNNs. Our model is efficient and can be trained end-to-end via standard back-propagation. We demonstrate the merits of our LS-DFN on both sparse and dense prediction tasks involving object detection, semantic segmentation, and flow estimation. Our results show LS-DFN enjoys stronger recognition abilities in object detection and semantic segmentation tasks on VOC benchmark and sharper responses in flow estimation on FlyingChairs dataset compared to strong baselines.