Junzhe Li

CV
h-index12
13papers
254citations
Novelty61%
AI Score61

13 Papers

CVMay 10Code
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Jiaping Lin, Fei Shen, Junzhe Li et al.

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.

LGMay 22
Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Jade Zou, Tao Huang, Weijie Kong et al.

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

LGJan 30Code
ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents

Xiaoce Wang, Guibin Zhang, Junzhe Li et al.

Existing GUI agent models relying on coordinate-based one-step visual grounding struggle with generalizing to varying input resolutions and aspect ratios. Alternatives introduce coordinate-free strategies yet suffer from learning under severe data scarcity. To address the limitations, we propose ToolTok, a novel paradigm of multi-step pathfinding for GUI agents, where operations are modeled as a sequence of progressive tool usage. Specifically, we devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. To enable efficient embedding learning under limited supervision, ToolTok introduces a semantic anchoring mechanism that grounds each tool with semantically related concepts as natural inductive bias. To further enable a pre-trained large language model to progressively acquire tool semantics, we construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding. Extensive experiments on multiple benchmarks show that ToolTok achieves superior performance among models of comparable scale (4B) and remains competitive with a substantially larger model (235B). Notably, these results are obtained using less than 1% of the training data required by other post-training approaches. In addition, ToolTok demonstrates strong generalization across unseen scenarios. Our training & inference code is open-source at https://github.com/ZephinueCode/ToolTok.

AIJul 29, 2025Code
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

Junzhe Li, Yutao Cui, Tao Huang et al.

Although GRPO substantially enhances flow matching models in human preference alignment of image generation, methods such as FlowGRPO still exhibit inefficiency due to the necessity of sampling and optimizing over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies through the integration of stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP to improve efficiency and boost performance. Specifically, MixGRPO introduces a sliding window mechanism, using SDE sampling and GRPO-guided optimization only within the window, while applying ODE sampling outside. This design confines sampling randomness to the time-steps within the window, thereby reducing the optimization overhead, and allowing for more focused gradient updates to accelerate convergence. Additionally, as time-steps beyond the sliding window are not involved in optimization, higher-order solvers are supported for sampling. So we present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash further reduces training time by 71%. Codes and models are available at $\href{https://github.com/Tencent-Hunyuan/MixGRPO}{MixGRPO}$.

CVMar 16
HYDRA: Unifying Multi-modal Generation and Understanding via Representation-Harmonized Tokenization

Xuerui Qiu, Yutao Cui, Guozhen Zhang et al.

Unified Multimodal Models struggle to bridge the fundamental gap between the abstract representations needed for visual understanding and the detailed primitives required for generation. Existing approaches typically compromise by employing decoupled encoders, stacking representation encoder atop VAEs, or utilizing discrete quantization. However, these methods often disrupt information coherence and lead to optimization conflicts. To this end, we introduce HYDRA-TOK, a representation-harmonized pure ViT in the insight that visual modeling should evolve from generation to understanding. HYDRA-TOK reformulates the standard backbone into a progressive learner that transitions from a Gen-ViT, which captures structure-preserving primitives, to a Sem-ViT for semantic encoding. Crucially, this transition is mediated by a Generation-Semantic Bottleneck (GSB), which compresses features into a low-dimensional space to filter noise for robust synthesis, then restores dimensionality to empower complex semantic comprehension. Built upon this foundation, we present HYDRA, a native unified framework integrating perception and generation within a single parameter space. Extensive experiments establish HYDRA as a new state-of-the-art. It sets a benchmark in visual reconstruction (rFID 0.08) and achieves top-tier generation performance on GenEval (0.86), DPG-Bench (86.4), and WISE (0.53), while simultaneously outperforming previous native UMMs by an average of 10.0 points across eight challenging understanding benchmarks.

SEMay 17
Rover: Context-aware Conflict Resolution with LLM

Qingyu Zhang, Junzhe Li, Jiayi Lin et al.

Code merging is a significant challenge, particularly in large-scale projects. Existing solutions, including program analysis and machine learning, show promise but face critical limitations. Program analysis lacks the ability to infer developers' intentions, relying on conservative strategies that offload unresolved conflicts for manual handling. Meanwhile, model-based approaches struggle with conflicts involving complex code dependencies due to insufficient contextual awareness. To address these gaps, we introduce Rover, a novel conflict resolution system that integrates program analysis with large language models (LLMs). To obtain context-aware prompts, we propose Multi-layer Code Property Graph (MtCPG), a new representation capturing inter-file dependencies and enabling contextual analysis for a given conflict. Using graph connectivity algorithms, Rover further clusters conflicting code and associated changes into meaningful "contexts" that guide the LLM in generating accurate resolutions. We compared Rover with standalone LLMs, machine learning baseline MergeGen, and suggestion provider tool WizardMerge with adjacent code as the contexts. Evaluation results show that Rover surpasses all of these approaches in terms of conflict resolution, achieving higher similarity to ground-truth resolutions at character, lexical, and semantic levels.

CVSep 28, 2025Code
HunyuanImage 3.0 Technical Report

Siyu Cao, Hangting Chen, Peng Chen et al.

We present HunyuanImage 3.0, a native multimodal model that unifies multimodal understanding and generation within an autoregressive framework, with its image generation module publicly available. The achievement of HunyuanImage 3.0 relies on several key components, including meticulous data curation, advanced architecture design, a native Chain-of-Thoughts schema, progressive model pre-training, aggressive model post-training, and an efficient infrastructure that enables large-scale training and inference. With these advancements, we successfully trained a Mixture-of-Experts (MoE) model comprising over 80 billion parameters in total, with 13 billion parameters activated per token during inference, making it the largest and most powerful open-source image generative model to date. We conducted extensive experiments and the results of automatic and human evaluation of text-image alignment and visual quality demonstrate that HunyuanImage 3.0 rivals previous state-of-the-art models. By releasing the code and weights of HunyuanImage 3.0, we aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant multimodal ecosystem. All open source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanImage-3.0

LGFeb 11
dnaHNet: A Scalable and Hierarchical Foundation Model for Genomic Sequence Learning

Arnav Shah, Junzhe Li, Parsa Idehpour et al.

Genomic foundation models have the potential to decode DNA syntax, yet face a fundamental tradeoff in their input representation. Standard fixed-vocabulary tokenizers fragment biologically meaningful motifs such as codons and regulatory elements, while nucleotide-level models preserve biological coherence but incur prohibitive computational costs for long contexts. We introduce dnaHNet, a state-of-the-art tokenizer-free autoregressive model that segments and models genomic sequences end-to-end. Using a differentiable dynamic chunking mechanism, dnaHNet compresses raw nucleotides into latent tokens adaptively, balancing compression with predictive accuracy. Pretrained on prokaryotic genomes, dnaHNet outperforms leading architectures including StripedHyena2 in scaling and efficiency. This recursive chunking yields quadratic FLOP reductions, enabling $>3 \times$ inference speedup over Transformers. On zero-shot tasks, dnaHNet achieves superior performance in predicting protein variant fitness and gene essentiality, while automatically discovering hierarchical biological structures without supervision. These results establish dnaHNet as a scalable, interpretable framework for next-generation genomic modeling.

AIJan 11, 2024
Chain of History: Learning and Forecasting with LLMs for Temporal Knowledge Graph Completion

Ruilin Luo, Tianle Gu, Haoling Li et al.

Temporal Knowledge Graph Completion (TKGC) is a complex task involving the prediction of missing event links at future timestamps by leveraging established temporal structural knowledge. This paper aims to provide a comprehensive perspective on harnessing the advantages of Large Language Models (LLMs) for reasoning in temporal knowledge graphs, presenting an easily transferable pipeline. In terms of graph modality, we underscore the LLMs' prowess in discerning the structural information of pivotal nodes within the historical chain. As for the generation mode of the LLMs utilized for inference, we conduct an exhaustive exploration into the variances induced by a range of inherent factors in LLMs, with particular attention to the challenges in comprehending reverse logic. We adopt a parameter-efficient fine-tuning strategy to harmonize the LLMs with the task requirements, facilitating the learning of the key knowledge highlighted earlier. Comprehensive experiments are undertaken on several widely recognized datasets, revealing that our framework exceeds or parallels existing methods across numerous popular metrics. Additionally, we execute a substantial range of ablation experiments and draw comparisons with several advanced commercial LLMs, to investigate the crucial factors influencing LLMs' performance in structured temporal knowledge inference tasks.

CVMar 11, 2025
UniF$^2$ace: A Unified Fine-grained Face Understanding and Generation Model

Junzhe Li, Sifan Zhou, Liya Guo et al.

Unified multimodal models (UMMs) have emerged as a powerful paradigm in fundamental cross-modality research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily faces two challenges: $\textbf{(1)}$ $\textbf{fragmentation development}$, with existing methods failing to unify understanding and generation into a single one, hindering the way to artificial general intelligence. $\textbf{(2) lack of fine-grained facial attributes}$, which are crucial for high-fidelity applications. To handle those issues, we propose $\textbf{UniF$^2$ace}$, $\textit{the first UMM specifically tailored for fine-grained face understanding and generation}$. $\textbf{First}$, we introduce a novel theoretical framework with a Dual Discrete Diffusion (D3Diff) loss, unifying masked generative models with discrete score matching diffusion and leading to a more precise approximation of the negative log-likelihood. Moreover, this D3Diff significantly enhances the model's ability to synthesize high-fidelity facial details aligned with text input. $\textbf{Second}$, we propose a multi-level grouped Mixture-of-Experts architecture, adaptively incorporating the semantic and identity facial embeddings to complement the attribute forgotten phenomenon in representation evolvement. $\textbf{Finally}$, to this end, we construct UniF$^2$aceD-1M, a large-scale dataset comprising 130K fine-grained image-caption pairs and 1M visual question-answering pairs, spanning a much wider range of facial attributes than existing datasets. Extensive experiments demonstrate that UniF$^2$ace outperforms existing models with a similar scale in both understanding and generation tasks, with 7.1\% higher Desc-GPT and 6.6\% higher VQA-score, respectively.

CRSep 21, 2025
R1-Fuzz: Specializing Language Models for Textual Fuzzing via Reinforcement Learning

Jiayi Lin, Liangcai Su, Junzhe Li et al.

Fuzzing is effective for vulnerability discovery but struggles with complex targets such as compilers, interpreters, and database engines, which accept textual input that must satisfy intricate syntactic and semantic constraints. Although language models (LMs) have attracted interest for this task due to their vast latent knowledge and reasoning potential, their practical adoption has been limited. The major challenges stem from insufficient exploration of deep program logic among real-world codebases, and the high cost of leveraging larger models. To overcome these challenges, we propose R1-Fuzz, the first framework that leverages reinforcement learning (RL) to specialize cost-efficient LMs and integrate them for complex textual fuzzing input generation. R1-Fuzz introduces two key designs: coverage-slicing-based question construction and a distance-based reward calculation. Through RL-based post-training of a model with our constructed dataset, R1-Fuzz designs a fuzzing workflow that tightly integrates LMs to reason deep program semantics during fuzzing. Evaluations on diverse real-world targets show that our design enables a small model, named R1-Fuzz-7B, to rival or even outperform much larger models in real-world fuzzing. Notably, R1-Fuzz achieves up to 75\% higher coverage than state-of-the-art fuzzers and discovers 29 previously unknown vulnerabilities, demonstrating its practicality.

LGMar 18, 2025
Potential Score Matching: Debiasing Molecular Structure Sampling with Potential Energy Guidance

Liya Guo, Zun Wang, Chang Liu et al.

The ensemble average of physical properties of molecules is closely related to the distribution of molecular conformations, and sampling such distributions is a fundamental challenge in physics and chemistry. Traditional methods like molecular dynamics (MD) simulations and Markov chain Monte Carlo (MCMC) sampling are commonly used but can be time-consuming and costly. Recently, diffusion models have emerged as efficient alternatives by learning the distribution of training data. Obtaining an unbiased target distribution is still an expensive task, primarily because it requires satisfying ergodicity. To tackle these challenges, we propose Potential Score Matching (PSM), an approach that utilizes the potential energy gradient to guide generative models. PSM does not require exact energy functions and can debias sample distributions even when trained on limited and biased data. Our method outperforms existing state-of-the-art (SOTA) models on the Lennard-Jones (LJ) potential, a commonly used toy model. Furthermore, we extend the evaluation of PSM to high-dimensional problems using the MD17 and MD22 datasets. The results demonstrate that molecular distributions generated by PSM more closely approximate the Boltzmann distribution compared to traditional diffusion models.

CVJun 23, 2024
LiveScene: Language Embedding Interactive Radiance Fields for Physical Scene Rendering and Control

Delin Qu, Qizhi Chen, Pingrui Zhang et al.

This paper scales object-level reconstruction to complex scenes, advancing interactive scene reconstruction. We introduce two datasets, OmniSim and InterReal, featuring 28 scenes with multiple interactive objects. To tackle the challenge of inaccurate interactive motion recovery in complex scenes, we propose LiveScene, a scene-level language-embedded interactive radiance field that efficiently reconstructs and controls multiple objects. By decomposing the interactive scene into local deformable fields, LiveScene enables separate reconstruction of individual object motions, reducing memory consumption. Additionally, our interaction-aware language embedding localizes individual interactive objects, allowing for arbitrary control using natural language. Our approach demonstrates significant superiority in novel view synthesis, interactive scene control, and language grounding performance through extensive experiments. Project page: https://livescenes.github.io.