Le Yan

IR
h-index 117
9 papers
3,578 citations
Novelty 51%
AI Score 40

9 Papers

IR · Jun 30, 2023 · Code
Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting

Zhen Qin, Rolf Jagerman, Kai Hui et al. · deepmind

Ranking documents using Large Language Models (LLMs) by directly feeding the query and candidate documents into the prompt is an interesting and practical problem. However, researchers have found it difficult to outperform fine-tuned baseline rankers on benchmark datasets. We analyze pointwise and listwise ranking prompts used by existing methods and argue that off-the-shelf LLMs do not fully understand these challenging ranking formulations. In this paper, we propose to significantly reduce the burden on LLMs by using a new technique called Pairwise Ranking Prompting (PRP). Our results are the first in the literature to achieve state-of-the-art ranking performance on standard benchmarks using moderate-sized open-sourced LLMs. On TREC-DL 2019&2020, PRP based on the Flan-UL2 model with 20B parameters performs favorably with the previous best approach in the literature, which is based on the blackbox commercial GPT-4 that has 50x (estimated) model size, while outperforming other LLM-based solutions, such as InstructGPT which has 175B parameters, by over 10% for all ranking metrics. By using the same prompt template on seven BEIR tasks, PRP outperforms supervised baselines and outperforms the blackbox commercial ChatGPT solution by 4.2% and pointwise LLM-based solutions by more than 10% on average NDCG@10. Furthermore, we propose several variants of PRP to improve efficiency and show that it is possible to achieve competitive results even with linear complexity.
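The pairwise idea in this abstract can be illustrated with a minimal sketch. It assumes an all-pairs variant in which each pair is compared in both presentation orders and the resulting wins are summed; the scoring rule (1 point for a consistent win, 0.5 each otherwise) and the names `prp_allpair_rank` and `toy_prefers_first` are illustrative assumptions, not the paper's code. The comparator is a toy word-overlap stand-in for an LLM call.

```python
from itertools import combinations
from typing import Callable, List


def prp_allpair_rank(
    query: str,
    docs: List[str],
    prefers_first: Callable[[str, str, str], bool],
) -> List[str]:
    """Rank docs by aggregating pairwise preferences over all pairs.

    prefers_first(query, a, b) returns True when passage a is judged
    more relevant than passage b. In a real system this would wrap an
    LLM prompt; here it is an arbitrary callable (an assumption).
    """
    scores = [0.0] * len(docs)
    for i, j in combinations(range(len(docs)), 2):
        # Ask in both orders to reduce sensitivity to input position.
        forward = prefers_first(query, docs[i], docs[j])
        backward = prefers_first(query, docs[j], docs[i])
        if forward and not backward:
            scores[i] += 1.0          # both orders agree: docs[i] wins
        elif backward and not forward:
            scores[j] += 1.0          # both orders agree: docs[j] wins
        else:
            scores[i] += 0.5          # inconsistent or tied: split the point
            scores[j] += 0.5
    # Sort by score, breaking ties by original position.
    order = sorted(range(len(docs)), key=lambda k: (-scores[k], k))
    return [docs[k] for k in order]


def toy_prefers_first(query: str, a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM: prefer more query-word overlap."""
    q = set(query.lower().split())
    return len(q & set(a.lower().split())) > len(q & set(b.lower().split()))


ranked = prp_allpair_rank(
    "large language models ranking",
    ["cats sleep", "ranking with large language models", "language models"],
    toy_prefers_first,
)
```

The all-pairs loop needs a quadratic number of comparisons in the number of documents, which is why the abstract mentions cheaper variants.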

IR · Dec 28, 2022
Towards Disentangling Relevance and Bias in Unbiased Learning to Rank

Yunan Zhang, Le Yan, Zhen Qin et al. · deepmind

Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods have ignored: the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on the relevance tower due to such a correlation. We then propose three methods to mitigate the negative confounding effects by better disentangling relevance and bias. Empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches.
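A minimal sketch of the two-tower factorization described in this abstract, assuming the common additive form in which the click logit is the sum of a relevance logit and a position-bias logit. The linear towers, the function names, and the parameter values are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from typing import Sequence


def relevance_tower(features: Sequence[float], weights: Sequence[float]) -> float:
    """Relevance logit from regular input features (a linear toy model)."""
    return sum(f * w for f, w in zip(features, weights))


def bias_tower(position: int, position_logits: Sequence[float]) -> float:
    """Bias logit from a bias-relevant input, here the display position."""
    return position_logits[position]


def click_probability(
    features: Sequence[float],
    position: int,
    weights: Sequence[float],
    position_logits: Sequence[float],
) -> float:
    """Click model assumed to factorize additively in logit space."""
    logit = relevance_tower(features, weights) + bias_tower(position, position_logits)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters: lower positions receive a smaller bias logit.
weights = [0.8, -0.2]
position_logits = [0.0, -0.5, -1.0, -1.5]
doc_features = [1.0, 0.5]

p_top = click_probability(doc_features, 0, weights, position_logits)
p_low = click_probability(doc_features, 3, weights, position_logits)
# At serving time only the relevance tower would be used to score documents.
serving_score = relevance_tower(doc_features, weights)
```

The confounding issue raised in the abstract arises because, in logged data, `position` is itself correlated with relevance, so the bias tower can absorb part of the relevance signal.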

IR · Jun 14, 2023
Learning to Rank when Grades Matter

Le Yan, Zhen Qin, Gil Shamir et al.

Graded labels are ubiquitous in real-world learning-to-rank applications, especially in human rated relevance data. Traditional learning-to-rank techniques aim to optimize the ranked order of documents. They typically, however, ignore predicting actual grades. This prevents them from being adopted in applications where grades matter, such as filtering out "poor" documents. Achieving both good ranking performance and good grade prediction performance is still an under-explored problem. Existing research either focuses only on ranking performance by not calibrating model outputs, or treats grades as numerical values, assuming labels are on a linear scale and failing to leverage the ordinal grade information. In this paper, we conduct a rigorous study of learning to rank with grades, where both ranking performance and grade prediction performance are important. We provide a formal discussion on how to perform ranking with non-scalar predictions for grades, and propose a multiobjective formulation to jointly optimize both ranking and grade predictions. In experiments, we verify on several public datasets that our methods are able to push the Pareto frontier of the tradeoff between ranking and grade prediction performance, showing the benefit of leveraging ordinal grade information.

CL · Jul 7, 2025
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann et al. · amazon-science, baidu

In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.

IR · Apr 18, 2024
MCRanker: Generating Diverse Criteria On-the-Fly to Improve Point-wise LLM Rankers

Fang Guo, Wenyu Li, Honglei Zhuang et al. · deepmind

The most recent pointwise Large Language Model (LLM) rankers have achieved remarkable ranking results. However, these rankers are hindered by two major drawbacks: (1) they fail to follow a standardized comparison guidance during the ranking process, and (2) they struggle with comprehensive considerations when dealing with complicated passages. To address these shortcomings, we propose to build a ranker that generates ranking scores based on a set of criteria from various perspectives. These criteria are intended to direct each perspective in providing a distinct yet synergistic evaluation. Our research, which examines eight datasets from the BEIR benchmark, demonstrates that incorporating this multi-perspective criteria ensemble approach markedly enhances the performance of pointwise LLM rankers.

CL · Mar 4, 2025
Adapting Decoder-Based Language Models for Diverse Encoder Downstream Tasks

Paul Suganthan, Fedor Moiseev, Le Yan et al.

Decoder-based transformers, while revolutionizing language modeling and scaling to immense sizes, have not completely overtaken encoder-heavy architectures in natural language processing. Specifically, encoder-only models remain dominant in tasks like classification, regression, and ranking. This is primarily due to the inherent structure of decoder-based models, which limits their direct applicability to these tasks. In this paper, we introduce Gemma Encoder, adapting the powerful Gemma decoder model to an encoder architecture, thereby unlocking its potential for a wider range of non-generative applications. To optimize the adaptation from decoder to encoder, we systematically analyze various pooling strategies, attention mechanisms, and hyperparameters (e.g., dropout rate). Furthermore, we benchmark Gemma Encoder against established approaches on the GLUE benchmarks and the MS MARCO ranking benchmark, demonstrating its effectiveness and versatility.

LG · Dec 17, 2021
Rank4Class: A Ranking Formulation for Multiclass Classification

Nan Wang, Zhen Qin, Le Yan et al.

Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. In the deep learning era, extensive efforts have been spent on developing more powerful neural embedding models to better represent the instance for improving MCC performance. In this paper, we do not aim to propose new neural models for instance representation learning, but to show that it is promising to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as ranking classes for an instance, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain, can be more informative than the commonly used Top-K metrics. We further demonstrate that the dominant neural MCC recipe can be transformed to a neural ranking framework. Based on such generalization, we show that it is intuitive to leverage advanced techniques from the learning to rank literature to improve the MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone neural models show the value of our proposed framework.
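The claim that a ranking metric can be more informative than a Top-K metric can be seen in a small worked sketch. It assumes a single true class per instance with binary gain, so the ideal DCG is 1 and NDCG reduces to 1 / log2(1 + rank of the true class). The function names are hypothetical and this is not the paper's evaluation code.

```python
import math
from typing import Sequence


def true_class_rank(scores: Sequence[float], label: int) -> int:
    """1-based rank of the true class when classes are sorted by score."""
    order = sorted(range(len(scores)), key=lambda c: -scores[c])
    return order.index(label) + 1


def top_k_hit(scores: Sequence[float], label: int, k: int) -> float:
    """Top-K metric: 1 if the true class is among the k highest scores."""
    return 1.0 if true_class_rank(scores, label) <= k else 0.0


def ndcg_single_label(scores: Sequence[float], label: int) -> float:
    """NDCG with one relevant class: DCG = 1/log2(1+rank), ideal DCG = 1."""
    return 1.0 / math.log2(1 + true_class_rank(scores, label))


scores = [0.1, 0.5, 0.4]
# True class 2 is ranked 2nd; true class 0 is ranked 3rd.
hit_second = top_k_hit(scores, label=2, k=3)
hit_third = top_k_hit(scores, label=0, k=3)
ndcg_second = ndcg_single_label(scores, label=2)
ndcg_third = ndcg_single_label(scores, label=0)
```

Both predictions count as a Top-3 hit, while NDCG assigns a higher value to the prediction that places the true class second rather than third.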

IR · Sep 30, 2021
Improving Neural Ranking via Lossless Knowledge Distillation

Zhen Qin, Le Yan, Yi Tay et al.

We explore a novel perspective of knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers. Unlike the existing ranking distillation work which pursues a good trade-off between performance and efficiency, SDR is able to significantly improve ranking performance of students over the teacher rankers without increasing model capacity. The key success factors of SDR, which differ from common distillation techniques for classification, are: (1) an appropriate teacher score transformation function, and (2) a novel listwise distillation framework. Both techniques are specifically designed for ranking problems and are rarely studied in the existing knowledge distillation literature. Building upon the state-of-the-art neural ranking structure, SDR is able to push the limits of neural ranking performance above a recent rigorous benchmark study and significantly outperforms traditionally strong gradient boosted decision tree based models on 7 out of 9 key metrics, for the first time in the literature. In addition to the strong empirical results, we give theoretical explanations on why listwise distillation is effective for neural rankers, and provide ablation studies to verify the necessity of the key factors in the SDR framework.
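The abstract names two ingredients, a teacher score transformation and a listwise distillation loss, without specifying them. The sketch below shows one generic way such a loss could look: a softmax cross-entropy between the distribution induced by transformed teacher scores and the one induced by student scores over the same list. The affine transform and all function names are assumptions for illustration, not the SDR formulation.

```python
import math
from typing import List, Sequence


def log_softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one ranked list."""
    m = max(xs)
    log_total = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_total for x in xs]


def transform_teacher_scores(scores: Sequence[float], scale: float = 1.0) -> List[float]:
    """Hypothetical transformation: a simple rescaling of teacher scores."""
    return [scale * s for s in scores]


def listwise_distillation_loss(
    student_scores: Sequence[float],
    teacher_scores: Sequence[float],
    scale: float = 1.0,
) -> float:
    """Cross-entropy between teacher and student softmax distributions."""
    teacher_log_probs = log_softmax(transform_teacher_scores(teacher_scores, scale))
    student_log_probs = log_softmax(student_scores)
    return -sum(math.exp(t) * s for t, s in zip(teacher_log_probs, student_log_probs))


teacher = [2.0, 1.0, 0.0]
loss_aligned = listwise_distillation_loss([2.0, 1.0, 0.0], teacher)
loss_reversed = listwise_distillation_loss([0.0, 1.0, 2.0], teacher)
```

With these toy scores the loss is lower when the student orders the list the same way as the teacher, which is the behavior a listwise distillation objective is meant to reward.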

LG · Jan 16, 2020
Stream-Flow Forecasting of Small Rivers Based on LSTM

Youchuan Hu, Le Yan, Tingting Hang et al.

Stream-flow forecasting for small rivers has always been of great importance, yet comparatively challenging due to the special features of rivers with smaller volume. Artificial Intelligence (AI) methods have long been employed in this area, but forecast quality still leaves room for improvement. In this paper, we propose a new forecasting method based on the Long Short-Term Memory (LSTM) deep learning model, which is designed for time-series data. We collected stream-flow data from one hydrologic station in Tunxi, China, and precipitation data from 11 surrounding rainfall stations, and used LSTM to forecast the stream flow at that hydrologic station 6 hours in the future. We evaluated the prediction results using three criteria: root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R^2). By comparing LSTM's predictions with those of Support Vector Regression (SVR) and Multilayer Perceptron (MLP) models, we showed that LSTM has better performance, achieving RMSE of 82.007, MAE of 27.752, and R^2 of 0.970. We also conducted extended experiments on the LSTM model, discussing the factors that influence its performance.
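For reference, the three evaluation criteria named in this abstract can be computed as in the sketch below, using their standard definitions. The function names and the toy values are illustrative only and are unrelated to the paper's data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Toy observations and two toy forecasts (not from the paper).
observed = [1.0, 2.0, 3.0]
perfect = [1.0, 2.0, 3.0]
mean_only = [2.0, 2.0, 2.0]
```

A forecast that always predicts the mean of the observations gets an R^2 of 0, while a perfect forecast gets an R^2 of 1 with zero RMSE and MAE.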