16.6IRJun 3
SAGE: Scalable AI Governance & EvaluationBenjamin Le, Xueying Lu, Nick Stern et al.
Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present \textbf{SAGE} (Scalable AI Governance \& Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language \emph{Policy}, curated \emph{Precedent}, and an \emph{LLM Surrogate Judge} co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at \textbf{92$\times$} lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a \textbf{0.25\%} lift in LinkedIn daily active users.
CVJan 14Code
STEP3-VL-10B Technical ReportAilin Huang, Chengyuan Yao, Chunrui Han et al.
We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.
36.6IRMay 22
A Unified Structured Query Understanding Framework for Industrial Semantic SearchPing Liu, Qianqi Shen, Jianqiang Shen et al.
Query understanding in large-scale industrial search systems is typically implemented as a cascade of disparate, task-specific components. While individually optimizable, this fragmented architecture incurs high maintenance overhead and results in inconsistent behaviors, particularly for long-tail queries. In this work, we propose and deploy a unified structured query understanding system that consolidates these heterogeneous functions into a single Small Language Model (SLM) that performs schema-constrained generation. To address the data bottlenecks inherent in unified modeling, we introduce Query Illuminator, a dual-purpose framework serving as: (i) a teacher model for high-quality auto-annotation and distillation, and (ii) a surrogate judge for scalable evaluation where human labels are scarce. We validate this approach through extensive offline and online tests within LinkedIn's Job Search system. Furthermore, we demonstrate the framework's horizontal extensibility through a cross-domain case study on People Search. The results show improved user engagement and reduced operational costs, achieved while satisfying strict low-latency serving constraints on limited GPU resources.
AIFeb 9Code
GEBench: Benchmarking Image Generation Models as GUI EnvironmentsHaodong Li, Jingwei Wu, Quan Sun et al.
Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
CLFeb 11
Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active ParametersAilin Huang, Ang Li, Aobo Kong et al.
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency. We focus on what matters most when building agents: sharp reasoning and fast, reliable execution. Step 3.5 Flash pairs a 196B-parameter foundation with 11B active parameters for efficient inference. It is optimized with interleaved 3:1 sliding-window/full attention and Multi-Token Prediction (MTP-3) to reduce the latency and cost of multi-round agentic interactions. To reach frontier-level intelligence, we design a scalable reinforcement learning framework that combines verifiable signals with preference feedback, while remaining stable under large-scale off-policy training, enabling consistent self-improvement across mathematics, code, and tool use. Step 3.5 Flash demonstrates strong performance across agent, coding, and math tasks, achieving 85.4% on IMO-AnswerBench, 86.4% on LiveCodeBench-v6 (2024.08-2025.05), 88.2% on tau2-Bench, 69.0% on BrowseComp (with context management), and 51.0% on Terminal-Bench 2.0, comparable to frontier models such as GPT-5.2 xHigh and Gemini 3.0 Pro. By redefining the efficiency frontier, Step 3.5 Flash provides a high-density foundation for deploying sophisticated agents in real-world industrial environments.
CVApr 24, 2025Code
Step1X-Edit: A Practical Framework for General Image EditingShiyu Liu, Yucheng Han, Peng Xing et al. · tsinghua
In recent years, image editing models have witnessed remarkable and rapid development. The recent unveiling of cutting-edge multimodal models such as GPT-4o and Gemini2 Flash has introduced highly promising image editing capabilities. These models demonstrate an impressive aptitude for fulfilling a vast majority of user-driven editing requirements, marking a significant advancement in the field of image manipulation. However, there is still a large gap between the open-source algorithm with these closed-source models. Thus, in this paper, we aim to release a state-of-the-art image editing model, called Step1X-Edit, which can provide comparable performance against the closed-source models like GPT-4o and Gemini2 Flash. More specifically, we adopt the Multimodal LLM to process the reference image and the user's editing instruction. A latent embedding has been extracted and integrated with a diffusion image decoder to obtain the target image. To train the model, we build a data generation pipeline to produce a high-quality dataset. For evaluation, we develop the GEdit-Bench, a novel benchmark rooted in real-world user instructions. Experimental results on GEdit-Bench demonstrate that Step1X-Edit outperforms existing open-source baselines by a substantial margin and approaches the performance of leading proprietary models, thereby making significant contributions to the field of image editing.
CVAug 14, 2025Code
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at ScaleNextStep Team, Chunrui Han, Guopeng Li et al. · tsinghua
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally-intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens with quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, training on discrete text tokens and continuous image tokens with next-token prediction objectives. NextStep-1 achieves state-of-the-art performance for autoregressive models in text-to-image generation tasks, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method shows strong performance in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
CVMay 30, 2025Code
ViStoryBench: Comprehensive Benchmark Suite for Story VisualizationCailin Zhuang, Ailin Huang, Wei Cheng et al.
Story visualization aims to generate coherent image sequences that faithfully depict a narrative and align with character references. Despite progress in generative models, existing benchmarks are narrow in scope, often limited to short prompts, no character reference, or single-image cases, and fall short of real-world storytelling complexity. This hinders a nuanced understanding of model capabilities and limitations. We present ViStoryBench, a comprehensive benchmark designed to evaluate story visualization models across diverse narrative structures, visual styles, and character settings. The benchmark features richly annotated multi-shot scripts derived from curated stories spanning literature, film, and folklore. Large language models assist in story summarization and script generation, with all outputs verified by humans to ensure coherence and fidelity. Character references are carefully curated to maintain intra-story consistency across varying artistic styles. To enable thorough evaluation, ViStoryBench introduces a set of automated metrics that assess character consistency, style similarity, prompt adherence, aesthetic quality, and generation artifacts such as copy-paste behavior. These metrics are validated through human studies, and used to benchmark a broad range of open-source and commercial models. ViStoryBench offers a high-fidelity, multi-dimensional evaluation suite that facilitates systematic analysis and fosters future progress in visual storytelling.
12.9IRMay 15
Policy-Grounded Dynamic Facet Suggestions for Job SearchDan Xu, Baofen Zheng, Qianqi Shen et al.
Job seekers often initiate search with short, underspecified queries. At LinkedIn, over 80% of job-related queries contain three or fewer keywords, making accurate user intent inference and relevant job retrieval particularly challenging. We present dynamic facet suggestion (DFS), an interactive query refinement mechanism that facilitates intent disambiguation by surfacing personalized semantic attributes conditioned on the joint user-query context in real time. We propose a policy-grounded, retrieval-augmented ranking framework for facet suggestion, comprising offline taxonomy curation, embedding-based retrieval of top-K candidates, and distilled small language model (SLM) based candidate scoring. The system is optimized for real-time serving via pointwise single-token scoring with batching and prefix caching. Offline evaluation demonstrates high precision for generated suggestions, and online A/B tests show significant improvements in suggestion engagement and job search outcomes.
CVMay 25, 2025Code
Advancing Video Self-Supervised Learning via Image Foundation ModelsJingwei Wu, Zhewei Huang, Chang Liu
In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by $3.4\times$ and GPU memory usage by $8.2\times$. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at https://github.com/JingwWu/advise-video-ssl.
CVDec 4, 2024
ARCON: Advancing Auto-Regressive Continuation for Driving VideosRuibo Ming, Jingwei Wu, Zhewei Huang et al.
Recent advancements in auto-regressive large language models (LLMs) have led to their application in video generation. This paper explores the use of Large Vision Models (LVMs) for video continuation, a task essential for building world models and predicting future frames. We introduce ARCON, a scheme that alternates between generating semantic and RGB tokens, allowing the LVM to explicitly learn high-level structural video information. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance visual quality. Experiments in autonomous driving scenarios show that our model can consistently generate long videos.
24.3CLApr 2
Grounded Token Initialization for New Vocabulary in LMs for Generative RecommendationDaiwei Chen, Zhoutong Fu, Chengming Jiang et al.
Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
IRNov 25, 2025
MixLM: High-Throughput and Effective LLM Ranking via Text-Embedding Mix-InteractionGuoyao Li, Ran He, Shusen Jing et al.
Large language models (LLMs) excel at capturing semantic nuances and therefore show impressive relevance ranking performance in modern recommendation and search systems. However, they suffer from high computational overhead under industrial latency and throughput requirements. In particular, cross-encoder ranking systems often create long context prefill-heavy workloads, as the model has to be presented with the user, query and item information. To this end, we propose MixLM, a novel LLM-based ranking framework, which significantly improves the system throughput via reducing the input context length, while preserving the semantic strength of cross-encoder rankers. In contrast to a standard ranking system where the context is presented to the model as pure text, we propose to use mix-interaction, a mixture of text and embedding tokens to represent the input. Specifically, MixLM encodes all items in the catalog into a few embedding tokens and stores in a nearline cache. The encoded item descriptions are used during online inference, effectively reducing the item length from a few thousand text tokens to a few embedding tokens. We share insights from deploying our MixLM framework to a real-world search application at LinkedIn, including a detailed discussion of our training pipelines, as well as a thorough analysis of our online serving infrastructure optimization. With the same latency budget and on-par relevance metrics, MixLM increased throughput by 10.0x comparing with strong baselines, 75.9x over full-text LLM rerankers. The efficiency gains delivered by MixLM enabled full-traffic deployment of LLM-powered search, which resulted in a significant 0.47\% increase in Daily Active Users (DAU) in online A/B tests.
IRAug 19, 2025
Powering Job Search at Scale: LLM-Enhanced Query Understanding in Job Matching SystemsPing Liu, Jianqiang Shen, Qianqi Shen et al.
Query understanding is essential in modern relevance systems, where user queries are often short, ambiguous, and highly context-dependent. Traditional approaches often rely on multiple task-specific Named Entity Recognition models to extract structured facets as seen in job search applications. However, this fragmented architecture is brittle, expensive to maintain, and slow to adapt to evolving taxonomies and language patterns. In this paper, we introduce a unified query understanding framework powered by a Large Language Model (LLM), designed to address these limitations. Our approach jointly models the user query and contextual signals such as profile attributes to generate structured interpretations that drive more accurate and personalized recommendations. The framework improves relevance quality in online A/B testing while significantly reducing system complexity and operational overhead. The results demonstrate that our solution provides a scalable and adaptable foundation for query understanding in dynamic web applications.
CVJan 26, 2024
A Survey on Future Frame Synthesis: Bridging Deterministic and Generative ApproachesRuibo Ming, Zhewei Huang, Jingwei Wu et al.
Future Frame Synthesis (FFS), the task of generating subsequent video frames from context, represents a core challenge in machine intelligence and a cornerstone for developing predictive world models. This survey provides a comprehensive analysis of the FFS landscape, charting its critical evolution from deterministic algorithms focused on pixel-level accuracy to modern generative paradigms that prioritize semantic coherence and dynamic plausibility. We introduce a novel taxonomy organized by algorithmic stochasticity, which not only categorizes existing methods but also reveals the fundamental drivers--advances in architectures, datasets, and computational scale--behind this paradigm shift. Critically, our analysis identifies a bifurcation in the field's trajectory: one path toward efficient, real-time prediction, and another toward large-scale, generative world simulation. By pinpointing key challenges and proposing concrete research questions for both frontiers, this survey serves as an essential guide for researchers aiming to advance the frontiers of visual dynamic modeling.