Kaiyue Zhou

CV
h-index: 28
7 papers
156 citations
Novelty: 53%
AI Score: 55

7 Papers

LG · Feb 17 · Code
GLM-5: from Vibe Coding to Agentic Engineering

GLM-5 Team, Aohan Zeng, Xin Lv et al. · Tsinghua

We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.

CV · Apr 15
PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Xuan Wang, Kai Ruan, Jiayi Han et al.

Audio-driven bimanual piano motion generation requires precise modeling of complex musical structures and dynamic cross-hand coordination. However, existing methods often rely on acoustic-only representations lacking symbolic priors, employ inflexible interaction mechanisms, and are limited to computationally expensive short-sequence generation. To address these limitations, we propose PianoFlow, a flow-matching framework for precise and coordinated bimanual piano motion synthesis. Our approach strategically leverages MIDI as a privileged modality during training, distilling these structured musical priors to achieve deep semantic understanding while maintaining audio-only inference. Furthermore, we introduce an asymmetric role-gated interaction module to explicitly capture dynamic cross-hand coordination through role-aware attention and temporal gating. To enable real-time streaming generation for arbitrarily long sequences, we design an autoregressive flow continuation scheme that ensures seamless cross-chunk temporal coherence. Extensive experiments on the PianoMotion10M dataset demonstrate that PianoFlow achieves superior quantitative and qualitative performance, while accelerating inference by over 9× compared to previous methods.

CV · Feb 12
DynaHOI: Benchmarking Hand-Object Interaction for Dynamic Target

BoCheng Hu, Zhonghan Zhao, Kaiyue Zhou et al.

Most existing hand motion generation benchmarks for hand-object interaction (HOI) focus on static objects, leaving dynamic scenarios with moving targets and time-critical coordination largely untested. To address this gap, we introduce the DynaHOI-Gym, a unified online closed-loop platform with parameterized motion generators and rollout-based metrics for dynamic capture evaluation. Built on DynaHOI-Gym, we release DynaHOI-10M, a large-scale benchmark with 10M frames and 180K hand capture trajectories, whose target motions are organized into 8 major categories and 22 fine-grained subcategories. We also provide a simple observe-before-act baseline (ObAct) that integrates short-term observations with the current frame via spatiotemporal attention to predict actions, achieving an 8.1% improvement in location success rate.

CL · Jan 19
Multimodal Multi-Agent Empowered Legal Judgment Prediction

Zhaolu Kang, Junhao Gong, Qingxi Chen et al.

Legal Judgment Prediction (LJP) aims to predict the outcomes of legal cases based on factual descriptions, serving as a fundamental task to advance the development of legal systems. Traditional methods often rely on statistical analyses or role-based simulations but struggle with multiple allegations and diverse evidence, and lack adaptability. In this paper, we introduce JurisMMA, a novel framework for LJP that effectively decomposes trial tasks, standardizes processes, and organizes them into distinct stages. Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation. Experiments on JurisMM and the benchmark LawBench validate our framework's effectiveness. These results indicate that our framework is effective not only for LJP but also for a broader range of legal applications, offering new perspectives for the development of future legal methods and datasets.

CV · Jul 25, 2025
Preserving Topological and Geometric Embeddings for Point Cloud Recovery

Kaiyue Zhou, Zelong Tan, Hongxiao Wang et al.

Recovering point clouds involves the sequential process of sampling and restoration, yet existing methods struggle to effectively leverage both topological and geometric attributes. To address this, we propose an end-to-end architecture named TopGeoFormer, which maintains these critical properties throughout the sampling and restoration phases. First, we revisit traditional feature extraction techniques to yield topological embedding using a continuous mapping of relative relationships between neighboring points, and integrate it in both phases for preserving the structure of the original space. Second, we propose the InterTwining Attention to fully merge topological and geometric embeddings, which queries shape with local awareness in both phases to form a learnable 3D shape context facilitated with point-wise, point-shape-wise, and intra-shape features. Third, we introduce a full geometry loss and a topological constraint loss to optimize the embeddings in both Euclidean and topological spaces. The geometry loss uses inconsistent matching between coarse-to-fine generations and targets for reconstructing better geometric details, and the constraint loss limits embedding variances for better approximation of the topological space. In experiments, we comprehensively analyze the circumstances using the conventional and learning-based sampling/upsampling/recovery algorithms. The quantitative and qualitative results demonstrate that our method significantly outperforms existing sampling and recovery methods.

CV · Dec 28, 2023
Joint Learning for Scattered Point Cloud Understanding with Hierarchical Self-Distillation

Kaiyue Zhou, Ming Dong, Peiyuan Zhi et al.

Numerous point-cloud understanding techniques focus on whole entities and have succeeded in obtaining satisfactory results, though with limited sparsity tolerance. However, these methods are generally sensitive to incomplete point clouds that are scanned with flaws or large gaps. In this paper, we propose an end-to-end architecture that compensates for and identifies partial point clouds on the fly. First, we propose a cascaded solution that integrates both the upstream masked autoencoder (MAE) and downstream understanding networks simultaneously, allowing the task-oriented downstream to identify the points generated by the completion-oriented upstream. These two streams complement each other, resulting in improved performance for both completion and downstream-dependent tasks. Second, to explicitly understand the predicted points' pattern, we introduce hierarchical self-distillation (HSD), which can be applied to any hierarchy-based point cloud method. HSD ensures that the deepest classifier, with a larger perceptual field of local kernels and longer code length, provides additional regularization to intermediate ones rather than simply aggregating the multi-scale features, thereby maximizing the mutual information (MI) between a teacher and students. The proposed HSD strategy is particularly well-suited for tasks involving scattered point clouds, wherein a singular prediction may yield imprecise outcomes due to the inherently irregular and sparse nature of the geometric shape being reconstructed. We show the advantage of the self-distillation process in the hyperspaces based on the information bottleneck principle. Our method achieves state-of-the-art results on both classification and part segmentation tasks.
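As a rough illustration of the self-distillation idea in this abstract, the sketch below treats the deepest classifier's softened prediction as the teacher and averages a KL-divergence penalty over the intermediate classifiers. The function names, the temperature parameter, and the choice of KL divergence are assumptions made for illustration; the abstract does not specify the loss used in the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def hierarchical_distillation_loss(stage_logits, temperature=2.0):
    """Average KL from the deepest stage (teacher) to each earlier stage.

    stage_logits: list of per-stage logit lists, shallowest first.
    The temperature and the use of KL are illustrative assumptions.
    """
    teacher = softmax(stage_logits[-1], temperature)
    students = stage_logits[:-1]
    if not students:
        return 0.0
    losses = [kl_divergence(teacher, softmax(s, temperature)) for s in students]
    return sum(losses) / len(losses)
```

In a training loop, a term like this would typically be added to the usual per-stage classification loss, so that intermediate classifiers are pulled toward the deepest one instead of only having their features aggregated.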

CV · Jun 25, 2021
"Zero-Shot" Point Cloud Upsampling

Kaiyue Zhou, Ming Dong, Suzan Arslanturk

Recent supervised point cloud upsampling methods are restricted by the size of training data and are limited in terms of covering all object shapes. Besides the challenges faced due to data acquisition, the networks also struggle to generalize to unseen records. In this paper, we present an internal point cloud upsampling approach at a holistic level referred to as "Zero-Shot" Point Cloud Upsampling (ZSPU). Our approach is data agnostic and relies solely on the internal information provided by a particular point cloud without patching in both self-training and testing phases. This single-stream design significantly reduces the training time by learning the relation between low resolution (LR) point clouds and their high (original) resolution (HR) counterparts. This association will then provide super resolution (SR) outputs when original point clouds are loaded as input. ZSPU achieves competitive/superior quantitative and qualitative performance on benchmark datasets when compared with other upsampling methods.
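The internal-learning setup described in this abstract can be pictured as building a training pair from a single point cloud: a downsampled copy serves as the LR input and the original serves as the HR target. The sketch below is only a guess at that data preparation step. The function name, the `ratio` and `seed` parameters, and the use of uniform random subsampling are assumptions, since the abstract does not say how the LR cloud is produced.

```python
import random


def make_self_training_pair(points, ratio=4, seed=0):
    """Build an (LR, HR) pair from one point cloud.

    points: list of (x, y, z) tuples, treated as the HR target.
    ratio: hypothetical downsampling factor.
    Random subsampling is an illustrative choice, not the paper's method.
    """
    rng = random.Random(seed)
    n_low = max(1, len(points) // ratio)
    low_res = rng.sample(points, n_low)
    return low_res, points
```

A model fitted to map `low_res` to `points` could then be applied to the original cloud itself to produce the SR output, which matches the self-training and testing phases described above.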