IRJul 22, 2024Code
Deep Uncertainty-Based Explore for Index Construction and Retrieval in Recommendation SystemXin Jiang, Kaiqiang Wang, Yinlong Wang et al.
In recommendation systems, the relevance and novelty of the final results are selected through a cascade system of Matching -> Ranking -> Strategy. The matching model serves as the starting point of the pipeline and determines the upper bound of the subsequent stages. Balancing the relevance and novelty of matching results is a crucial step in the design and optimization of recommendation systems, contributing significantly to improving recommendation quality. However, the typical matching algorithms have not simultaneously addressed the relevance and novelty perfectly. One main reason is that deep matching algorithms exhibit significant uncertainty when estimating items in the long tail (e.g., due to insufficient training samples) items.The uncertainty not only affects the training of the models but also influences the confidence in the index construction and beam search retrieval process of these models. This paper proposes the UICR (Uncertainty-based explore for Index Construction and Retrieval) algorithm, which introduces the concept of uncertainty modeling in the matching stage and achieves multi-task modeling of model uncertainty and index uncertainty. The final matching results are obtained by combining the relevance score and uncertainty score infered by the model. Experimental results demonstrate that the UICR improves novelty without sacrificing relevance on realworld industrial productive environments and multiple open-source datasets. Remarkably, online A/B test results of display advertising in Shopee demonstrates the effectiveness of the proposed algorithm.
LGMar 2, 2024Code
AutoDefense: Multi-Agent LLM Defense against Jailbreak AttacksYifan Zeng, Yiran Wu, Xiao Zhang et al.
Despite extensive pre-training in moral alignment to prevent generating harmful information, large language models (LLMs) remain vulnerable to jailbreak attacks. In this paper, we propose AutoDefense, a multi-agent defense framework that filters harmful responses from LLMs. With the response-filtering mechanism, our framework is robust against different jailbreak attack prompts, and can be used to defend different victim models. AutoDefense assigns different roles to LLM agents and employs them to complete the defense task collaboratively. The division in tasks enhances the overall instruction-following of LLMs and enables the integration of other defense components as tools. With AutoDefense, small open-source LMs can serve as agents and defend larger models against jailbreak attacks. Our experiments show that AutoDefense can effectively defense against different jailbreak attacks, while maintaining the performance at normal user request. For example, we reduce the attack success rate on GPT-3.5 from 55.74% to 7.95% using LLaMA-2-13b with a 3-agent system. Our code and data are publicly available at https://github.com/XHMY/AutoDefense.
AIMay 22
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing TradeoffsYifan Zeng, Yiran Wu, Yaolun Zhang et al.
Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.
IROct 30, 2023
LightSAGE: Graph Neural Networks for Large Scale Item Retrieval in Shopee's Advertisement RecommendationDang Minh Nguyen, Chenfei Wang, Yan Shen et al.
Graph Neural Network (GNN) is the trending solution for item retrieval in recommendation problems. Most recent reports, however, focus heavily on new model architectures. This may bring some gaps when applying GNN in the industrial setup, where, besides the model, constructing the graph and handling data sparsity also play critical roles in the overall success of the project. In this work, we report how GNN is applied for large-scale e-commerce item retrieval at Shopee. We introduce our simple yet novel and impactful techniques in graph construction, modeling, and handling data skewness. Specifically, we construct high-quality item graphs by combining strong-signal user behaviors with high-precision collaborative filtering (CF) algorithm. We then develop a new GNN architecture named LightSAGE to produce high-quality items' embeddings for vector search. Finally, we design multiple strategies to handle cold-start and long-tail items, which are critical in an advertisement (ads) system. Our models bring improvement in offline evaluations, online A/B tests, and are deployed to the main traffic of Shopee's Recommendation Advertisement system.
CVJun 16, 2025Code
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative RefinementChelsi Jain, Yiran Wu, Yifan Zeng et al.
Document Visual Question Answering (DocVQA) is a practical yet challenging task, which is to ask questions based on documents while referring to multiple pages and different modalities of information, e.g, images and tables. To handle multi-modality, recent methods follow a similar Retrieval Augmented Generation (RAG) pipeline, but utilize Visual Language Models (VLMs) based embedding model to embed and retrieve relevant pages as images, and generate answers with VLMs that can accept an image as input. In this paper, we introduce SimpleDoc, a lightweight yet powerful retrieval - augmented framework for DocVQA. It boosts evidence page gathering by first retrieving candidates through embedding similarity and then filtering and re-ranking these candidates based on page summaries. A single VLM-based reasoner agent repeatedly invokes this dual-cue retriever, iteratively pulling fresh pages into a working memory until the question is confidently answered. SimpleDoc outperforms previous baselines by 3.2% on average on 4 DocVQA datasets with much fewer pages retrieved. Our code is available at https://github.com/ag2ai/SimpleDoc.
LGJan 17, 2024Code
Deep Ensemble Shape Calibration: Multi-Field Post-hoc Calibration in Online AdvertisingShuai Yang, Hao Yang, Zhuang Zou et al.
In the e-commerce advertising scenario, estimating the true probabilities (known as a calibrated estimate) on Click-Through Rate (CTR) and Conversion Rate (CVR) is critical. Previous research has introduced numerous solutions for addressing the calibration problem. These methods typically involve the training of calibrators using a validation set and subsequently applying these calibrators to correct the original estimated values during online inference. However, what sets e-commerce advertising scenarios apart is the challenge of multi-field calibration. Multi-field calibration requires achieving calibration in each field. In order to achieve multi-field calibration, it is necessary to have a strong data utilization ability. Because the quantity of pCTR specified range for a single field-value (such as user ID and item ID) sample is relatively small, this makes the calibrator more difficult to train. However, existing methods have difficulty effectively addressing these issues. To solve these problems, we propose a new method named Deep Ensemble Shape Calibration (DESC). In terms of business understanding and interpretability, we decompose multi-field calibration into value calibration and shape calibration. We introduce innovative basis calibration functions, which enhance both function expression capabilities and data utilization by combining these basis calibration functions. A significant advancement lies in the development of an allocator capable of allocating the most suitable calibrators to different estimation error distributions within diverse fields and values. We achieve significant improvements in both public and industrial datasets. In online experiments, we observe a +2.5% increase in CVR and +4.0% in GMV (Gross Merchandise Volume). Our code is now available at: https://github.com/HaoYang0123/DESC.
IRSep 23, 2025Code
The Ranking Blind Spot: Decision Hijacking in LLM-based Text RankingYaoyao Qian, Yifan Zeng, Yuchao Jiang et al.
Large Language Models (LLMs) have demonstrated strong performance in information retrieval tasks like passage ranking. Our research examines how instruction-following capabilities in LLMs interact with multi-document comparison tasks, identifying what we term the "Ranking Blind Spot", a characteristic of LLM decision processes during comparative evaluation. We analyze how this ranking blind spot affects LLM evaluation systems through two approaches: Decision Objective Hijacking, which alters the evaluation goal in pairwise ranking systems, and Decision Criteria Hijacking, which modifies relevance standards across ranking schemes. These approaches demonstrate how content providers could potentially influence LLM-based ranking systems to affect document positioning. These attacks aim to force the LLM ranker to prefer a specific passage and rank it at the top. Malicious content providers can exploit this weakness, which helps them gain additional exposure by attacking the ranker. In our experiment, We empirically show that the proposed attacks are effective in various LLMs and can be generalized to multiple ranking schemes. We apply these attack to realistic examples to show their effectiveness. We also found stronger LLMs are more vulnerable to these attacks. Our code is available at: https://github.com/blindspotorg/RankingBlindSpot
CLOct 18, 2024
TreeBoN: Enhancing Inference-Time Alignment with Speculative Tree-Search and Best-of-N SamplingJiahao Qiu, Yifu Lu, Yifan Zeng et al.
Inference-time alignment enhances the performance of large language models without requiring additional training or fine-tuning but presents challenges due to balancing computational efficiency with high-quality output. Best-of-N (BoN) sampling, as a simple yet powerful approach, generates multiple responses and selects the best one, achieving improved performance but with a high computational cost. We propose TreeBoN, a novel framework that integrates a speculative tree-search strategy into Best-of-N (BoN) Sampling. TreeBoN maintains a set of parent nodes, iteratively branching and pruning low-quality responses, thereby reducing computational overhead while maintaining high output quality. Our approach also leverages token-level rewards from Direct Preference Optimization (DPO) to guide tree expansion and prune low-quality paths. We evaluate TreeBoN using AlpacaFarm, HH-RLHF, UltraFeedback, GSM8K, and TutorEval datasets, demonstrating consistent improvements. Specifically, TreeBoN achieves the highest win rate of 65% on TutorEval and around 60% win rates across other different datasets, outperforming standard BoN with the same computational cost and showcasing its scalability and alignment efficacy.
IRJan 17, 2024
A New Creative Generation Pipeline for Click-Through Rate with Stable Diffusion ModelHao Yang, Jianxin Yuan, Shuai Yang et al.
In online advertising scenario, sellers often create multiple creatives to provide comprehensive demonstrations, making it essential to present the most appealing design to maximize the Click-Through Rate (CTR). However, sellers generally struggle to consider users preferences for creative design, leading to the relatively lower aesthetics and quantities compared to Artificial Intelligence (AI)-based approaches. Traditional AI-based approaches still face the same problem of not considering user information while having limited aesthetic knowledge from designers. In fact that fusing the user information, the generated creatives can be more attractive because different users may have different preferences. To optimize the results, the generated creatives in traditional methods are then ranked by another module named creative ranking model. The ranking model can predict the CTR score for each creative considering user features. However, the two above stages are regarded as two different tasks and are optimized separately. In this paper, we proposed a new automated Creative Generation pipeline for Click-Through Rate (CG4CTR) with the goal of improving CTR during the creative generation stage. Our contributions have 4 parts: 1) The inpainting mode in stable diffusion is firstly applied to creative generation task in online advertising scene. A self-cyclic generation pipeline is proposed to ensure the convergence of training. 2) Prompt model is designed to generate individualized creatives for different user groups, which can further improve the diversity and quality. 3) Reward model comprehensively considers the multimodal features of image and text to improve the effectiveness of creative ranking task, and it is also critical in self-cyclic pipeline. 4) The significant benefits obtained in online and offline experiments verify the significance of our proposed method.
CLNov 15, 2024
HistoLens: An LLM-Powered Framework for Multi-Layered Analysis of Historical Texts -- A Case Application of Yantie LunYifan Zeng
This paper proposes HistoLens, a multi-layered analysis framework for historical texts based on Large Language Models (LLMs). Using the important Western Han dynasty text "Yantie Lun" as a case study, we demonstrate the framework's potential applications in historical research and education. HistoLens integrates NLP technology (especially LLMs), including named entity recognition, knowledge graph construction, and geographic information visualization. The paper showcases how HistoLens explores Western Han culture in "Yantie Lun" through multi-dimensional, visual, and quantitative methods, focusing particularly on the influence of Confucian and Legalist thoughts on political, economic, military, and ethnic. We also demonstrate how HistoLens constructs a machine teaching scenario using LLMs for explainable analysis, based on a dataset of Confucian and Legalist ideas extracted with LLM assistance. This approach offers novel and diverse perspectives for studying historical texts like "Yantie Lun" and provides new auxiliary tools for history education. The framework aims to equip historians and learners with LLM-assisted tools to facilitate in-depth, multi-layered analysis of historical texts and foster innovation in historical education.
LGOct 17, 2024
A Common Pitfall of Margin-based Language Model Alignment: Gradient EntanglementHui Yuan, Yifan Zeng, Yue Wu et al.
Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for language model (LM) alignment. At its core, RLHF uses a margin-based loss for preference optimization, specifying ideal LM behavior only by the difference between preferred and dispreferred responses. In this paper, we identify a common pitfall of margin-based methods -- the under-specification of ideal LM behavior on preferred and dispreferred responses individually, which leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) The probability of preferred responses may decrease, even when those responses are ideal. We demystify the reasons behind these problematic behaviors: margin-based losses couple the change in the preferred probability to the gradient of the dispreferred one, and vice versa, often preventing the preferred probability from increasing while the dispreferred one decreases, and thus causing a synchronized increase or decrease in both probabilities. We term this effect, inherent in margin-based objectives, gradient entanglement. Formally, we derive conditions for general margin-based alignment objectives under which gradient entanglement becomes concerning: the inner product of the gradients of preferred and dispreferred log-probabilities is large relative to the individual gradient norms. We theoretically investigate why such inner products can be large when aligning language models and empirically validate our findings. Empirical implications of our framework extend to explaining important differences in the training dynamics of various preference optimization algorithms, and suggesting potential algorithm designs to mitigate the under-specification issue of margin-based methods and thereby improving language model alignment.
CYOct 26, 2024
Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-PlayYifan Zeng, Liang Kairong, Fangzhou Dong et al.
As Large Language Models (LLMs) become more prevalent, concerns about their safety, ethics, and potential biases have risen. Systematically evaluating LLMs' risk decision-making tendencies and attitudes, particularly in the ethical domain, has become crucial. This study innovatively applies the Domain-Specific Risk-Taking (DOSPERT) scale from cognitive science to LLMs and proposes a novel Ethical Decision-Making Risk Attitude Scale (EDRAS) to assess LLMs' ethical risk attitudes in depth. We further propose a novel approach integrating risk scales and role-playing to quantitatively evaluate systematic biases in LLMs. Through systematic evaluation and analysis of multiple mainstream LLMs, we assessed the "risk personalities" of LLMs across multiple domains, with a particular focus on the ethical domain, and revealed and quantified LLMs' systematic biases towards different groups. This research helps understand LLMs' risk decision-making and ensure their safe and reliable application. Our approach provides a tool for identifying and mitigating biases, contributing to fairer and more trustworthy AI systems. The code and data are available.
CLMay 6, 2025
Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at ScaleJiale Liu, Yifan Zeng, Shaokun Zhang et al.
LLM-based optimization has shown remarkable potential in enhancing agentic systems. However, the conventional approach of prompting LLM optimizer with the whole training trajectories on training dataset in a single pass becomes untenable as datasets grow, leading to context window overflow and degraded pattern recognition. To address these challenges, we propose Fine-Grained Optimization (FGO), a scalable framework that divides large optimization tasks into manageable subsets, performs targeted optimizations, and systematically combines optimized components through progressive merging. Evaluation across ALFWorld, LogisticsQA, and GAIA benchmarks demonstrate that FGO outperforms existing approaches by 1.6-8.6% while reducing average prompt token consumption by 56.3%. Our framework provides a practical solution for scaling up LLM-based optimization of increasingly sophisticated agent systems. Further analysis demonstrates that FGO achieves the most consistent performance gain in all training dataset sizes, showcasing its scalability and efficiency.
CRMay 20, 2025
CSAGC-IDS: A Dual-Module Deep Learning Network Intrusion Detection Model for Complex and Imbalanced DataYifan Zeng
As computer networks proliferate, the gravity of network intrusions has escalated, emphasizing the criticality of network intrusion detection systems for safeguarding security. While deep learning models have exhibited promising results in intrusion detection, they face challenges in managing high-dimensional, complex traffic patterns and imbalanced data categories. This paper presents CSAGC-IDS, a network intrusion detection model based on deep learning techniques. CSAGC-IDS integrates SC-CGAN, a self-attention-enhanced convolutional conditional generative adversarial network that generates high-quality data to mitigate class imbalance. Furthermore, CSAGC-IDS integrates CSCA-CNN, a convolutional neural network enhanced through cost sensitive learning and channel attention mechanism, to extract features from complex traffic data for precise detection. Experiments conducted on the NSL-KDD dataset. CSAGC-IDS achieves an accuracy of 84.55% and an F1-score of 84.52% in five-class classification task, and an accuracy of 91.09% and an F1 score of 92.04% in binary classification task.Furthermore, this paper provides an interpretability analysis of the proposed model, using SHAP and LIME to explain the decision-making mechanisms of the model.
CLDec 17, 2024
Memory-Augmented Agent Training for Business Document UnderstandingJiale Liu, Yifan Zeng, Malte Højmark-Bertelsen et al.
Traditional enterprises face significant challenges in processing business documents, where tasks like extracting transport references from invoices remain largely manual despite their crucial role in logistics operations. While Large Language Models offer potential automation, their direct application to specialized business domains often yields unsatisfactory results. We introduce Matrix (Memory-Augmented agent Training through Reasoning and Iterative eXploration), a novel paradigm that enables LLM agents to progressively build domain expertise through experience-driven memory refinement and iterative learning. To validate this approach, we collaborate with one of the world's largest logistics companies to create a dataset of Universal Business Language format invoice documents, focusing on the task of transport reference extraction. Experiments demonstrate that Matrix outperforms prompting a single LLM by 30.3%, vanilla LLM agent by 35.2%. We further analyze the metrics of the optimized systems and observe that the agent system requires less API calls, fewer costs and can analyze longer documents on average. Our methods establish a new approach to transform general-purpose LLMs into specialized business tools through systematic memory enhancement in document processing tasks.
AIOct 22, 2025
WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph RepresentationYaoyao Qian, Yuanli Wang, Jinda Zhang et al.
Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.
CLJun 26, 2025
Structuralist Approach to AI Literary Criticism: Leveraging Greimas Semiotic Square for Large Language ModelsFangzhou Dong, Yifan Zeng, Yingpeng Sang et al.
Large Language Models (LLMs) excel in understanding and generating text but struggle with providing professional literary criticism for works with profound thoughts and complex narratives. This paper proposes GLASS (Greimas Literary Analysis via Semiotic Square), a structured analytical framework based on Greimas Semiotic Square (GSS), to enhance LLMs' ability to conduct in-depth literary analysis. GLASS facilitates the rapid dissection of narrative structures and deep meanings in narrative works. We propose the first dataset for GSS-based literary criticism, featuring detailed analyses of 48 works. Then we propose quantitative metrics for GSS-based literary criticism using the LLM-as-a-judge paradigm. Our framework's results, compared with expert criticism across multiple works and LLMs, show high performance. Finally, we applied GLASS to 39 classic works, producing original and high-quality analyses that address existing research gaps. This research provides an AI-based tool for literary research and education, offering insights into the cognitive mechanisms underlying literary engagement.
IRNov 25, 2020
Exploration in Online Advertising Systems with Deep Uncertainty-Aware LearningChao Du, Zhifeng Gao, Shuo Yuan et al.
Modern online advertising systems inevitably rely on personalization methods, such as click-through rate (CTR) prediction. Recent progress in CTR prediction enjoys the rich representation capabilities of deep learning and achieves great success in large-scale industrial applications. However, these methods can suffer from lack of exploration. Another line of prior work addresses the exploration-exploitation trade-off problem with contextual bandit methods, which are recently less studied in the industry due to the difficulty in extending their flexibility with deep models. In this paper, we propose a novel Deep Uncertainty-Aware Learning (DUAL) method to learn CTR models based on Gaussian processes, which can provide predictive uncertainty estimations while maintaining the flexibility of deep neural networks. DUAL can be easily implemented on existing models and deployed in real-time systems with minimal extra computational overhead. By linking the predictive uncertainty estimation ability of DUAL to well-known bandit algorithms, we further present DUAL-based Ad-ranking strategies to boost up long-term utilities such as the social welfare in advertising systems. Experimental results on several public datasets demonstrate the effectiveness of our methods. Remarkably, an online A/B test deployed in the Alibaba display advertising platform shows an 8.2% social welfare improvement and an 8.0% revenue lift.
LGMar 14, 2019
High-Throughput CNN Inference on Embedded ARM big.LITTLE Multi-Core ProcessorsSiqi Wang, Gayathri Ananthanarayanan, Yifan Zeng et al.
IoT Edge intelligence requires Convolutional Neural Network (CNN) inference to take place in the edge devices itself. ARM big.LITTLE architecture is at the heart of prevalent commercial edge devices. It comprises of single-ISA heterogeneous cores grouped into multiple homogeneous clusters that enable power and performance trade-offs. All cores are expected to be simultaneously employed in inference to attain maximal throughput. However, high communication overhead involved in parallelization of computations from convolution kernels across clusters is detrimental to throughput. We present an alternative framework called Pipe-it that employs pipelined design to split convolutional layers across clusters while limiting parallelization of their respective kernels to the assigned cluster. We develop a performance-prediction model that utilizes only the convolutional layer descriptors to predict the execution time of each layer individually on all permitted core configurations (type and count). Pipe-it then exploits the predictions to create a balanced pipeline using an efficient design space exploration algorithm. Pipe-it on average results in a 39% higher throughput than the highest antecedent throughput.
GTFeb 27, 2017
Optimized Cost per Click in Taobao Display AdvertisingHan Zhu, Junqi Jin, Chang Tan et al.
Taobao, as the largest online retail platform in the world, provides billions of online display advertising impressions for millions of advertisers every day. For commercial purposes, the advertisers bid for specific spots and target crowds to compete for business traffic. The platform chooses the most suitable ads to display in tens of milliseconds. Common pricing methods include cost per mille (CPM) and cost per click (CPC). Traditional advertising systems target certain traits of users and ad placements with fixed bids, essentially regarded as coarse-grained matching of bid and traffic quality. However, the fixed bids set by the advertisers competing for different quality requests cannot fully optimize the advertisers' key requirements. Moreover, the platform has to be responsible for the business revenue and user experience. Thus, we proposed a bid optimizing strategy called optimized cost per click (OCPC) which automatically adjusts the bid to achieve finer matching of bid and traffic quality of page view (PV) request granularity. Our approach optimizes advertisers' demands, platform business revenue and user experience and as a whole improves traffic allocation efficiency. We have validated our approach in Taobao display advertising system in production. The online A/B test shows our algorithm yields substantially better results than previous fixed bid manner.