CLMar 3, 2025Code
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAsAbdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson et al. · microsoft-research
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary size of 200K tokens to better support multilingual applications, as well as group query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it now ranks first in the OpenASR leaderboard to date, although the LoRA component of the speech/audio modality has just 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we experiment to further train Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
IRDec 28, 2022
Towards Disentangling Relevance and Bias in Unbiased Learning to RankYunan Zhang, Le Yan, Zhen Qin et al. · deepmind
Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods ignored - the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on relevance tower due to such a correlation. We then propose three methods to mitigate the negative confounding effects by better disentangling relevance and bias. Empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches.
CLFeb 23, 2023Code
A Neural Span-Based Continual Named Entity Recognition ModelYunan Zhang, Qingcai Chen
Named Entity Recognition (NER) models capable of Continual Learning (CL) are realistically valuable in areas where entity types continuously increase (e.g., personal assistants). Meanwhile the learning paradigm of NER advances to new patterns such as the span-based methods. However, its potential to CL has not been fully explored. In this paper, we propose SpanKL, a simple yet effective Span-based model with Knowledge distillation (KD) to preserve memories and multi-Label prediction to prevent conflicts in CL-NER. Unlike prior sequence labeling approaches, the inherently independent modeling in span and entity level with the designed coherent optimization on SpanKL promotes its learning at each incremental step and mitigates the forgetting. Experiments on synthetic CL datasets derived from OntoNotes and Few-NERD show that SpanKL significantly outperforms previous SoTA in many aspects, and obtains the smallest gap from CL to the upper bound revealing its high practiced value. The code is available at https://github.com/Qznan/SpanKL.
CLOct 3, 2023
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMsSuyu Ge, Yunan Zhang, Liyuan Liu et al.
In this study, we introduce adaptive KV cache compression, a plug-and-play method that reduces the memory footprint of generative inference for Large Language Models (LLMs). Different from the conventional KV cache that retains key and value vectors for all context tokens, we conduct targeted profiling to discern the intrinsic structure of attention modules. Based on the recognized structure, we then construct the KV cache in an adaptive manner: evicting long-range contexts on attention heads emphasizing local contexts, discarding non-special tokens on attention heads centered on special tokens, and only employing the standard KV cache for attention heads that broadly attend to all tokens. Moreover, with the lightweight attention profiling used to guide the construction of the adaptive KV cache, FastGen can be deployed without resource-intensive fine-tuning or re-training. In our experiments across various asks, FastGen demonstrates substantial reduction on GPU memory consumption with negligible generation quality loss. We will release our code and the compatible CUDA kernel for reproducibility.
CLApr 22, 2024Code
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your PhoneMarah Abdin, Jyoti Aneja, Hany Awadalla et al. · microsoft-research, stanford
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. Our training dataset is a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide parameter-scaling results with a 7B, 14B models trained for 4.8T tokens, called phi-3-small, phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75%, 78% on MMLU, and 8.7, 8.9 on MT-bench). To enhance multilingual, multimodal, and long-context capabilities, we introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision. The phi-3.5-MoE, a 16 x 3.8B MoE model with 6.6 billion active parameters, achieves superior performance in language reasoning, math, and code tasks compared to other open-source models of similar scale, such as Llama 3.1 and the Mixtral series, and on par with Gemini-1.5-Flash and GPT-4o-mini. Meanwhile, phi-3.5-Vision, a 4.2 billion parameter model derived from phi-3.5-mini, excels in reasoning tasks and is adept at handling both single-image and text prompts, as well as multi-image and text prompts.
CLJul 25, 2024
S2-Attention: Hardware-Aware Context Sharding Among Attention HeadsXihui Lin, Yunan Zhang, Suyu Ge et al.
Sparse attention, which selectively attends to a subset of tokens in the context was supposed to be efficient. However, its theoretical reduction in FLOPs has rarely translated into wall-clock speed-up over its dense attention counterparts due to the lack of hardware-aware optimizations like FlashAttention. Meanwhile, it remains unclear whether sparse attention can maintain the model's quality at a scale of today's large language models (LLMs) and how. This paper presents Sparsely-Sharded(S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels. S2-Attention enables the exploration of novel and high-performance sparse attention techniques, which we demonstrate through extensive ablations across a wide range of sparse attention designs at various model scales. From these insights, we present several basic guidelines to design sparse attention that can achieve not only practical efficiency improvements, but also strong downstream performance. To achieve high parallelization and optimized memory IO, sparse attention should shard the context heterogeneously across attention heads, where each head attends to a different subset of tokens while collectively covering the full context. Meanwhile, we find hybrid architectures combining sparse and dense attention particularly beneficial in practice. S2-Attention achieves wall-clock speedup of 8.79X, 15.87X, 25.3X compared to the strong FlashAttention-2 baseline with strong downstream performance on-par with full attention and perfect retrieval performance at a 128k context length. At inference, for 7B models, our model, with the help of our S2-Attention kernel, achieves 4.5x speed-up compared to dense counterparts. S2-Attention is released with easy-to-customize APIs for direct usage in Megatron and vLLM.
CLAug 6, 2025Code
GeRe: Towards Efficient Anti-Forgetting in Continual Learning of LLM via General Samples ReplayYunan Zhang, Shuoran Jiang, Mengchen Zhao et al.
The continual learning capability of large language models (LLMs) is crucial for advancing artificial general intelligence. However, continual fine-tuning LLMs across various domains often suffers from catastrophic forgetting, characterized by: 1) significant forgetting of their general capabilities, and 2) sharp performance declines in previously learned tasks. To simultaneously address both issues in a simple yet stable manner, we propose General Sample Replay (GeRe), a framework that use usual pretraining texts for efficient anti-forgetting. Beyond revisiting the most prevalent replay-based practices under GeRe, we further leverage neural states to introduce a enhanced activation states constrained optimization method using threshold-based margin (TM) loss, which maintains activation state consistency during replay learning. We are the first to validate that a small, fixed set of pre-collected general replay samples is sufficient to resolve both concerns--retaining general capabilities while promoting overall performance across sequential tasks. Indeed, the former can inherently facilitate the latter. Through controlled experiments, we systematically compare TM with different replay strategies under the GeRe framework, including vanilla label fitting, logit imitation via KL divergence and feature imitation via L1/L2 losses. Results demonstrate that TM consistently improves performance and exhibits better robustness. Our work paves the way for efficient replay of LLMs for the future. Our code and data are available at https://github.com/Qznan/GeRe.
IRFeb 22, 2024
GenSERP: Large Language Models for Whole Page PresentationZhenning Zhang, Yunan Zhang, Suyu Ge et al.
The advent of large language models (LLMs) brings an opportunity to minimize the effort in search engine result page (SERP) organization. In this paper, we propose GenSERP, a framework that leverages LLMs with vision in a few-shot setting to dynamically organize intermediate search results, including generated chat answers, website snippets, multimedia data, knowledge panels into a coherent SERP layout based on a user's query. Our approach has three main stages: (1) An information gathering phase where the LLM continuously orchestrates API tools to retrieve different types of items, and proposes candidate layouts based on the retrieved items, until it's confident enough to generate the final result. (2) An answer generation phase where the LLM populates the layouts with the retrieved content. In this phase, the LLM adaptively optimize the ranking of items and UX configurations of the SERP. Consequently, it assigns a location on the page to each item, along with the UX display details. (3) A scoring phase where an LLM with vision scores all the generated SERPs based on how likely it can satisfy the user. It then send the one with highest score to rendering. GenSERP features two generation paradigms. First, coarse-to-fine, which allow it to approach optimal layout in a more manageable way, (2) beam search, which give it a better chance to hit the optimal solution compared to greedy decoding. Offline experimental results on real-world data demonstrate how LLMs can contextually organize heterogeneous search results on-the-fly and provide a promising user experience.
CLDec 20, 2021
Unifying Model Explainability and Robustness for Joint Text Classification and Rationale ExtractionDongfang Li, Baotian Hu, Qingcai Chen et al.
Recent works have shown explainability and robustness are two crucial ingredients of trustworthy and reliable text classification. However, previous works usually address one of two aspects: i) how to extract accurate rationales for explainability while being beneficial to prediction; ii) how to make the predictive model robust to different types of adversarial attacks. Intuitively, a model that produces helpful explanations should be more robust against adversarial attacks, because we cannot trust the model that outputs explanations but changes its prediction under small perturbations. To this end, we propose a joint classification and rationale extraction model named AT-BMC. It includes two key mechanisms: mixed Adversarial Training (AT) is designed to use various perturbations in discrete and embedding space to improve the model's robustness, and Boundary Match Constraint (BMC) helps to locate rationales more precisely with the guidance of boundary information. Performances on benchmark datasets demonstrate that the proposed AT-BMC outperforms baselines on both classification and rationale extraction by a large margin. Robustness analysis shows that the proposed AT-BMC decreases the attack success rate effectively by up to 69%. The empirical results indicate that there are connections between robust models and better explanations.
CLDec 4, 2019
Cooperative Reasoning on Knowledge Graph and Corpus: A Multi-agentReinforcement Learning ApproachYunan Zhang, Xiang Cheng, Heting Gao et al.
Knowledge-graph-based reasoning has drawn a lot of attention due to its interpretability. However, previous methods suffer from the incompleteness of the knowledge graph, namely the interested link or entity that can be missing in the knowledge graph(explicit missing). Also, most previous models assume the distance between the target and source entity is short, which is not true on a real-world KG like Freebase(implicit missing). The sensitivity to the incompleteness of KG and the incapability to capture the long-distance link between entities have limited the performance of these models on large KG. In this paper, we propose a model that leverages the text corpus to cure such limitations, either the explicit or implicit missing links. We model the question answering on KG as a cooperative task between two agents, a knowledge graph reasoning agent and an information extraction agent. Each agent learns its skill to complete its own task, hopping on KG or select knowledge from the corpus, via maximizing the reward for correctly answering the question. The reasoning agent decides how to find an equivalent path for the given entity and relation. The extraction agent provide shortcut for long-distance target entity or provide missing relations for explicit missing links with messages from the reasoning agent. Through such cooperative reward design, our model can augment the incomplete KG strategically while not introduce much unnecessary noise that could enlarge the search space and lower the performance.
IRNov 28, 2019
Macross: Urban Dynamics Modeling based on Metapath Guided Cross-Modal EmbeddingYunan Zhang, Heting Gao, Tarek Abdelzaher
As the ongoing rapid urbanization takes place with an ever-increasing speed, fully modeling urban dynamics becomes more and more challenging, but also a necessity for socioeconomic development. It is challenging because human activities and constructions are ubiquitous; urban landscape and life content change anywhere and anytime. It's crucial due to the fact that only up-to-date urban dynamics can enable governors to optimize their city planning strategy and help individuals organize their daily lives in a more efficient way. Previous geographic topic model based methods attempt to solve this problem but suffer from high computational cost and memory consumption, limiting their scalability to city level applications. Also, strong prior assumptions make such models fail to capture certain patterns by nature. To bridge the gap, we propose Macross, a metapath guided embedding approach to jointly model location, time and text information. Given a dataset of geo-tagged social media posts, we extract and aggregate location and time and construct a heterogeneous information network using the aggregated space and time. Metapath2vec based approach is used to construct vector representations for times, locations and frequent words such that co-occurrence pairs of nodes are closer in latent space. The vector representations will be used to infer related time, locations or keywords for a user query. Experiments done on enormous datasets show our model can generate comparable if not better quality query results compared to state of the art models and outperform some cutting-edge models for activity recovery and classification.
CLNov 11, 2019
Learning to Order Sub-questions for Complex Question AnsweringYunan Zhang, Xiang Cheng, Yufeng Zhang et al.
Answering complex questions involving multiple entities and relations is a challenging task. Logically, the answer to a complex question should be derived by decomposing the complex question into multiple simple sub-questions and then answering those sub-questions. Existing work has followed this strategy but has not attempted to optimize the order of how those sub-questions are answered. As a result, the sub-questions are answered in an arbitrary order, leading to larger search space and a higher risk of missing an answer. In this paper, we propose a novel reinforcement learning(RL) approach to answering complex questions that can learn a policy to dynamically decide which sub-question should be answered at each stage of reasoning. We lever-age the expected value-variance criterion to enable the learned policy to balance between the risk and utility of answering a sub-question. Experiment results show that the RL approach can substantially improve the optimality of ordering the sub-questions, leading to improved accuracy of question answering. The proposed method for learning to order sub-questions is general and can thus be potentially combined with many existing ideas for answering complex questions to enhance their performance.