Zhongxuan Li

h-index5

3papers

43citations

Novelty28%

AI Score35

Ranked #124,387 of 205,806 authors (top 60%)#22,022 in CL (top 68%)

3 Papers

69.7ROMay 27

World Models for Robotic Manipulation: A Survey

Fangyuan Wang, Ziyuan Wang, Guorui Pei et al.

Robotic manipulation depends on the ability to anticipate how actions reshape objects, contacts, and scene geometry before execution. Learned world models provide this capability by predicting task-relevant future evolution under robot intervention, yet the term now spans latent dynamics models, action-conditioned video generators, three- and four-dimensional scene predictors, physics-informed simulators, and predictive modules inside vision-language-action systems. This breadth has fragmented the literature and obscured the design choices that matter for manipulation. We survey world models for robotic manipulation through three questions: what future representation is predicted, how prediction is connected to action, and when prediction is used in the robot-learning pipeline. We operationally define a world model as an action-conditioned predictive system and distinguish it from perception modules, inverse models, policies, rewards, and value functions. We then organize existing work into five representation families, develop a functional taxonomy that separates integrated prediction-action models from explicit predictive planners, and characterize infrastructure roles including synthetic experience generation, candidate filtering, search-based evaluation, learned environments, and outcome verification. We further map these roles across pretraining, post-training, and inference adaptation, review 34 manipulation datasets, and synthesize evaluation protocols for predictive fidelity, task performance, and simulator reliability. This survey shows that world models are evolving from task-specific dynamics predictors into predictive infrastructure for robot learning, while exposing open challenges in contact modeling, hallucination control, action alignment, and benchmarking under closed-loop use.

CLApr 22, 2025

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song et al.

Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.

CVApr 27, 2025

Blind Source Separation Based on Sparsity

Zhongxuan Li

Blind source separation (BSS) is a key technique in array processing and data analysis, aiming to recover unknown sources from observed mixtures without knowledge of the mixing matrix. Classical independent component analysis (ICA) methods rely on the assumption that sources are mutually independent. To address limitations of ICA, sparsity-based methods have been introduced, which decompose source signals sparsely in a predefined dictionary. Morphological Component Analysis (MCA), based on sparse representation theory, assumes that a signal is a linear combination of components with distinct geometries, each sparsely representable in one dictionary and not in others. This approach has recently been applied to BSS with promising results. This report reviews key approaches derived from classical ICA and explores sparsity-based methods for BSS. It introduces the theory of sparse representation and decomposition, followed by a block coordinate relaxation MCA algorithm, whose variants are used in Multichannel MCA (MMCA) and Generalized MCA (GMCA). A local dictionary learning method using K-SVD is then presented. Finally, we propose an improved algorithm, SAC+BK-SVD, which enhances K-SVD by learning a block-sparsifying dictionary that clusters and updates similar atoms in blocks. The implementation includes experiments on image segmentation and blind image source separation using the discussed techniques. We also compare the proposed block-sparse dictionary learning algorithm with K-SVD. Simulation results demonstrate that our method yields improved blind image separation quality.