Yuxing Liu

LG
h-index9
14papers
107citations
Novelty57%
AI Score59

14 Papers

99.4CVMar 20Code
PEARL: Personalized Streaming Video Understanding Model

Yuanhong Zheng, Ruichuan An, Xiaopeng Lin et al.

Human cognition of new concepts is inherently a streaming process: we continuously recognize new objects or identities and update our memories over time. However, current multimodal personalization methods are largely limited to static images or offline videos. This disconnects continuous visual input from instant real-world feedback, limiting their ability to provide the real-time, interactive personalized responses essential for future AI assistants. To bridge this gap, we first propose and formally define the novel task of Personalized Streaming Video Understanding (PSVU). To facilitate research in this new direction, we introduce PEARL-Bench, the first comprehensive benchmark designed specifically to evaluate this challenging setting. It evaluates a model's ability to respond to personalized concepts at exact timestamps under two modes: (1) Frame-level, focusing on a specific person or object in discrete frames, and (2) a novel Video-level, focusing on personalized actions unfolding across continuous frames. PEARL-Bench comprises 132 unique videos and 2,173 fine-grained annotations with precise timestamps. Concept diversity and annotation quality are strictly ensured through a combined pipeline of automated generation and human verification. To tackle this challenging new setting, we further propose PEARL, a plug-and-play, training-free strategy that serves as a strong baseline. Extensive evaluations across 8 offline and online models demonstrate that PEARL achieves state-of-the-art performance. Notably, it brings consistent PSVU improvements when applied to 3 distinct architectures, proving to be a highly effective and robust strategy. We hope this work advances vision-language model (VLM) personalization and inspires further research into streaming personalized AI assistants. Code is available at https://github.com/Yuanhong-Zheng/PEARL.

OCOct 25, 2022
On the Complexity of Decentralized Smooth Nonconvex Finite-Sum Optimization

Luo Luo, Yunyan Bai, Lesi Chen et al.

We study the decentralized optimization problem $\min_{{\bf x}\in{\mathbb R}^d} f({\bf x})\triangleq \frac{1}{m}\sum_{i=1}^m f_i({\bf x})$, where the local function on the $i$-th agent has the form of $f_i({\bf x})\triangleq \frac{1}{n}\sum_{j=1}^n f_{i,j}({\bf x})$ and every individual $f_{i,j}$ is smooth but possibly nonconvex. We propose a stochastic algorithm called DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST) method, which achieves an $ε$-stationary point at each agent with the communication rounds of $\tilde{\mathcal O}(Lε^{-2}/\sqrtγ\,)$, the computation rounds of $\tilde{\mathcal O}(n+(L+\min\{nL, \sqrt{n/m}\bar L\})ε^{-2})$, and the local incremental first-oracle calls of ${\mathcal O}(mn + {\min\{mnL, \sqrt{mn}\bar L\}}{ε^{-2}})$, where $L$ is the smoothness parameter of the objective function, $\bar L$ is the mean-squared smoothness parameter of all individual functions, and $γ$ is the spectral gap of the mixing matrix associated with the network. We then establish the lower bounds to show that the proposed method is near-optimal. Notice that the smoothness parameters $L$ and $\bar L$ used in our algorithm design and analysis are global, leading to sharper complexity bounds than existing results that depend on the local smoothness. We further extend DEAREST to solve the decentralized finite-sum optimization problem under the Polyak-Łojasiewicz condition, also achieving the near-optimal complexity bounds.

92.4LGApr 16
StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models

Dingzhi Yu, Rui Pan, Yuxing Liu et al.

Sign-based optimization algorithms, such as SignSGD, have garnered significant attention for their remarkable performance in distributed learning and training large foundation models. Despite their empirical superiority, SignSGD is known to diverge on non-smooth objectives, which are ubiquitous in modern machine learning due to ReLUs, max-pools, and mixture-of-experts. To overcome this fundamental limitation, we propose \textbf{StoSignSGD}, an algorithm that injects structural stochasticity into the sign operator while maintaining an unbiased update step. In the regime of (online) convex optimization, our theoretical analysis shows that StoSignSGD rigorously resolves the non-convergence issues of SignSGD, achieving a sharp convergence rate matching the lower bound. For the more challenging non-convex non-smooth optimization, we introduce generalized stationary measures that encompass prior definitions, proving that StoSignSGD improves upon the best-known complexity bounds by dimensional factors. Empirically, StoSignSGD exhibits robust stability and superior efficiency across diverse large language model (LLM) training regimes. Notably, in low-precision FP8 pretraining -- a setting where AdamW fails catastrophically -- StoSignSGD remains highly stable and yields a remarkable 1.44$\times$ to 2.14$\times$ speedup relative to established baselines. Furthermore, when fine-tuning 7B LLMs on mathematical reasoning tasks, StoSignSGD delivers substantial performance gains over both AdamW and SignSGD. Finally, to dissect the mechanisms driving its success, we develop a sign conversion framework capable of transforming any general optimizer into its unbiased, sign-based counterpart. Utilizing this framework, we deconstruct the core components of StoSignSGD and present a comprehensive ablation study to empirically validate our algorithmic design choices.

LGMar 26, 2025Code
ASGO: Adaptive Structured Gradient Optimization

Kang An, Yuxing Liu, Rui Pan et al.

Training deep neural networks is a structured optimization problem, because the parameters are naturally represented by matrices and tensors rather than by vectors. Under this structural representation, it has been widely observed that gradients are low-rank and Hessians are approximately block diagonal. These structured properties are crucial for designing efficient optimization algorithms, but are not utilized by many current popular optimizers like Adam. In this paper, we present a novel optimization algorithm ASGO that capitalizes on these properties by employing a preconditioner that is adaptively updated using structured gradients. By a fine-grained theoretical analysis, ASGO is proven to achieve superior convergence rates compared to existing structured gradient methods. Based on this convergence theory, we further demonstrate that ASGO can benefit from low-rank gradients and block diagonal Hessians. We also discuss practical modifications of ASGO and empirically verify ASGO's effectiveness on language model tasks. Code is available at https://github.com/infinity-stars/ASGO.

AIDec 30, 2025
Thinking on Maps: How Foundation Model Agents Explore, Remember, and Reason Map Environments

Zhiwei Wei, Yuxing Liu, Hua Liao et al.

Map environments provide a fundamental medium for representing spatial structure. Understanding how foundation model (FM) agents understand and act in such environments is therefore critical for enabling reliable map-based reasoning and applications. However, most existing evaluations of spatial ability in FMs rely on static map inputs or text-based queries, overlooking the interactive and experience-driven nature of spatial understanding.In this paper, we propose an interactive evaluation framework to analyze how FM agents explore, remember, and reason in symbolic map environments. Agents incrementally explore partially observable grid-based maps consisting of roads, intersections, and points of interest (POIs), receiving only local observations at each step. Spatial understanding is then evaluated using six kinds of spatial tasks. By systematically varying exploration strategies, memory representations, and reasoning schemes across multiple foundation models, we reveal distinct functional roles of these components. Exploration primarily affects experience acquisition but has a limited impact on final reasoning accuracy. In contrast, memory representation plays a central role in consolidating spatial experience, with structured memories particularly sequential and graph-based representations, substantially improving performance on structure-intensive tasks such as path planning. Reasoning schemes further shape how stored spatial knowledge is used, with advanced prompts supporting more effective multi-step inference. We further observe that spatial reasoning performance saturates across model versions and scales beyond a certain capability threshold, indicating that improvements in map-based spatial understanding require mechanisms tailored to spatial representation and reasoning rather than scaling alone.

LGJul 9, 2025Code
GUIDE: Towards Scalable Advising for Research Ideas

Yaowenqi Liu, Bingxu Meng, Rui Pan et al.

The field of AI research is advancing at an unprecedented pace, enabling automated hypothesis generation and experimental design across diverse domains such as biology, mathematics, and artificial intelligence. Despite these advancements, there remains a significant gap in the availability of scalable advising systems capable of providing high-quality, well-reasoned feedback to refine proposed hypotheses and experimental designs. To address this challenge, we explore key factors that underlie the development of robust advising systems, including model size, context length, confidence estimation, and structured reasoning processes. Our findings reveal that a relatively small model, when equipped with a well-compressed literature database and a structured reasoning framework, can outperform powerful general-purpose language models such as Deepseek-R1 in terms of acceptance rates for self-ranked top-30% submissions to ICLR 2025. Moreover, when limited to high-confidence predictions, our system achieves an acceptance rate exceeding 90% on the ICLR 2025 test set, underscoring its potential to significantly enhance the quality and efficiency of hypothesis generation and experimental design. The code is released at https://github.com/HowardLiu0830/GUIDE-Research-Idea-Evaluation.

61.8LGMay 7
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Yuxing Liu, Jianyu Wang, Tong Zhang

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.

LGDec 22, 2023
Accelerated Convergence of Stochastic Heavy Ball Method under Anisotropic Gradient Noise

Rui Pan, Yuxing Liu, Xiaoyu Wang et al.

Heavy-ball momentum with decaying learning rates is widely used with SGD for optimizing deep learning models. In contrast to its empirical popularity, the understanding of its theoretical property is still quite limited, especially under the standard anisotropic gradient noise condition for quadratic regression problems. Although it is widely conjectured that heavy-ball momentum method can provide accelerated convergence and should work well in large batch settings, there is no rigorous theoretical analysis. In this paper, we fill this theoretical gap by establishing a non-asymptotic convergence bound for stochastic heavy-ball methods with step decay scheduler on quadratic objectives, under the anisotropic gradient noise condition. As a direct implication, we show that heavy-ball momentum can provide $\tilde{\mathcal{O}}(\sqrtκ)$ accelerated convergence of the bias term of SGD while still achieving near-optimal convergence rate with respect to the stochastic variance term. The combined effect implies an overall convergence rate within log factors from the statistical minimax rate. This means SGD with heavy-ball momentum is useful in the large-batch settings such as distributed machine learning or federated learning, where a smaller number of iterations can significantly reduce the number of communication rounds, leading to acceleration in practice.

LGSep 9, 2025
Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence

Yuxing Liu, Yuze Ge, Rui Pan et al.

Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite the huge success in practice, the theoretical advantages of this strategy of gradually increasing the learning rate at the beginning of the training process have not been fully understood. To resolve this gap between theory and practice, we first propose a novel family of generalized smoothness assumptions, and validate its applicability both theoretically and empirically. Under the novel smoothness assumption, we study the convergence properties of gradient descent (GD) in both deterministic and stochastic settings. It is shown that learning rate warmup consistently accelerates GD, and GD with warmup can converge at most $Θ(T)$ times faster than with a non-increasing learning rate schedule in some specific cases, providing insights into the benefits of this strategy from an optimization theory perspective.

CVJul 2, 2025
OoDDINO:A Multi-level Framework for Anomaly Segmentation on Complex Road Scenes

Yuxing Liu, Ji Zhang, Zhou Xuchuan et al.

Anomaly segmentation aims to identify Out-of-Distribution (OoD) anomalous objects within images. Existing pixel-wise methods typically assign anomaly scores individually and employ a global thresholding strategy to segment anomalies. Despite their effectiveness, these approaches encounter significant challenges in real-world applications: (1) neglecting spatial correlations among pixels within the same object, resulting in fragmented segmentation; (2) variabil ity in anomaly score distributions across image regions, causing global thresholds to either generate false positives in background areas or miss segments of anomalous objects. In this work, we introduce OoDDINO, a novel multi-level anomaly segmentation framework designed to address these limitations through a coarse-to-fine anomaly detection strategy. OoDDINO combines an uncertainty-guided anomaly detection model with a pixel-level segmentation model within a two-stage cascade architecture. Initially, we propose an Orthogonal Uncertainty-Aware Fusion Strategy (OUAFS) that sequentially integrates multiple uncertainty metrics with visual representations, employing orthogonal constraints to strengthen the detection model's capacity for localizing anomalous regions accurately. Subsequently, we develop an Adaptive Dual-Threshold Network (ADT-Net), which dynamically generates region-specific thresholds based on object-level detection outputs and pixel-wise anomaly scores. This approach allows for distinct thresholding strategies within foreground and background areas, achieving fine-grained anomaly segmentation. The proposed framework is compatible with other pixel-wise anomaly detection models, which acts as a plug-in to boost the performance. Extensive experiments on two benchmark datasets validate our framework's superiority and compatibility over state-of-the-art methods.

OCFeb 4, 2024
On the Complexity of Finite-Sum Smooth Optimization under the Polyak-Łojasiewicz Condition

Yunyan Bai, Yuxing Liu, Luo Luo

This paper considers the optimization problem of the form $\min_{{\bf x}\in{\mathbb R}^d} f({\bf x})\triangleq \frac{1}{n}\sum_{i=1}^n f_i({\bf x})$, where $f(\cdot)$ satisfies the Polyak--Łojasiewicz (PL) condition with parameter $μ$ and $\{f_i(\cdot)\}_{i=1}^n$ is $L$-mean-squared smooth. We show that any gradient method requires at least $Ω(n+κ\sqrt{n}\log(1/ε))$ incremental first-order oracle (IFO) calls to find an $ε$-suboptimal solution, where $κ\triangleq L/μ$ is the condition number of the problem. This result nearly matches upper bounds of IFO complexity for best-known first-order methods. We also study the problem of minimizing the PL function in the distributed setting such that the individuals $f_1(\cdot),\dots,f_n(\cdot)$ are located on a connected network of $n$ agents. We provide lower bounds of $Ω(κ/\sqrtγ\,\log(1/ε))$, $Ω((κ+τκ/\sqrtγ\,)\log(1/ε))$ and $Ω\big(n+κ\sqrt{n}\log(1/ε)\big)$ for communication rounds, time cost and local first-order oracle calls respectively, where $γ\in(0,1]$ is the spectral gap of the mixing matrix associated with the network and~$τ>0$ is the time cost of per communication round. Furthermore, we propose a decentralized first-order method that nearly matches above lower bounds in expectation.

CVDec 2, 2025
ClimaOoD: Improving Anomaly Segmentation via Physically Realistic Synthetic Data

Yuxing Liu, Yong Liu

Anomaly segmentation seeks to detect and localize unknown or out-of-distribution (OoD) objects that fall outside predefined semantic classes a capability essential for safe autonomous driving. However, the scarcity and limited diversity of anomaly data severely constrain model generalization in open-world environments. Existing approaches mitigate this issue through synthetic data generation, either by copy-pasting external objects into driving scenes or by leveraging text-to-image diffusion models to inpaint anomalous regions. While these methods improve anomaly diversity, they often lack contextual coherence and physical realism, resulting in domain gaps between synthetic and real data. In this paper, we present ClimaDrive, a semantics-guided image-to-image framework for synthesizing semantically coherent, weather-diverse, and physically plausible OoD driving data. ClimaDrive unifies structure-guided multi-weather generation with prompt-driven anomaly inpainting, enabling the creation of visually realistic training data. Based on this framework, we construct ClimaOoD, a large-scale benchmark spanning six representative driving scenarios under both clear and adverse weather conditions. Extensive experiments on four state-of-the-art methods show that training with ClimaOoD leads to robust improvements in anomaly segmentation. Across all methods, AUROC, AP, and FPR95 show notable gains, with FPR95 dropping from 3.97 to 3.52 for RbA on Fishyscapes LAF. These results demonstrate that ClimaOoD enhances model robustness, offering valuable training data for better generalization in open-world anomaly detection.

LGOct 20, 2025
Unbiased Gradient Low-Rank Projection

Rui Pan, Yang Luo, Yuxing Liu et al.

Memory-efficient optimization is critical for training increasingly large language models (LLMs). A popular strategy involves gradient low-rank projection, storing only the projected optimizer states, with GaLore being a representative example. However, a significant drawback of many such methods is their lack of convergence guarantees, as various low-rank projection approaches introduce inherent biases relative to the original optimization algorithms, which contribute to performance gaps compared to full-parameter training. Aiming to tackle this problem, this paper investigates the layerwise sampling technique for debiasing low-rank projection mechanisms. In particular, an instantiation of the paradigm gives rise to a novel and unbiased low-rank optimization method built upon GaLore's mechanism and the Muon algorithm, named GaLore Unbiased with Muon (GUM). We theoretically prove our method matches the convergence guarantees of the base Muon algorithm while preserving the memory efficiency of low-rank techniques. Empirical experiments on LLM fine-tuning and pretraining also demonstrate non-trivial improvements over GaLore and even better performance than full-parameter training. Further investigation shows that the improvement of this technique comes from a more uniform distribution of knowledge inside layers, leading to more efficient utilization of the model parameter space and better memorization.

LGJun 21, 2024
AdaGrad under Anisotropic Smoothness

Yuxing Liu, Rui Pan, Tong Zhang

Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite the huge success in practice, their theoretical advantages over classical gradient methods with uniform step sizes across all coordinates (e.g. SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate this benefit was obtained in the original paper of Adagrad for convex nonsmooth objective functions, which is insufficient for large batch algorithms. In this work, we attempt to resolve this gap between theory and practice by proposing a novel anisotropic generalized smoothness assumption and providing corresponding analyses of Adagrad. It is shown that under anisotropic smoothness and noise conditions, AdaGrad can achieve faster convergence guarantees in terms of better dimensional dependence than algorithms with uniform step sizes across all coordinates. Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our novel assumption and theoretical analysis.