Zhiyong Xu

CV
h-index16
11papers
1,412citations
Novelty50%
AI Score55

11 Papers

CVSep 17, 2024Code
STCMOT: Spatio-Temporal Cohesion Learning for UAV-Based Multiple Object Tracking

Jianbo Ma, Chuanming Tang, Fei Wu et al.

Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target reidentification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially for challenging tracking conditions such as object deformation and blurring, etc. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embedding based on adjacent frame cooperation. While the trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate our STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source codes are released at https://github.com/ydhcg-BoBo/STCMOT.

CVDec 3, 2024Code
HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang et al. · tencent-ai, tsinghua

Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.

IVMar 25, 2024Code
RSTAR4D: Rotational Streak Artifact Reduction in 4D CBCT using a Separable 4D CNN

Ziheng Deng, Hua Chen, Yongzheng Zhou et al.

Four-dimensional cone-beam computed tomography (4D CBCT) provides respiration-resolved images and can be used for image-guided radiation therapy. However, the ability to reveal respiratory motion comes at the cost of image artifacts. As raw projection data are sorted into multiple respiratory phases, the cone-beam projections become much sparser and the reconstructed 4D CBCT images will be covered by severe streak artifacts. Although several deep learning-based methods have been proposed to address this issue, most algorithms employ 2D network models as backbones, neglecting the intrinsic structural priors within 4D CBCT images. In this paper, we first explore the origin and appearance of streak artifacts in 4D CBCT images. We find that streak artifacts exhibit a unique rotational motion along with the patient's respiration, distinguishable from diaphragm-driven respiratory motion in the spatiotemporal domain. Therefore, we propose a novel 4D neural network model, RSTAR4D-Net, designed to address Rotational STreak Artifact Reduction by integrating the spatial and temporal information within 4D CBCT images. Specifically, we overcome the computational and training difficulties of a 4D neural network. The specially designed model adopts an efficient implementation of 4D convolutions to reduce computational costs and thus can process the whole 4D image in one pass. Additionally, a Tetris training strategy pertinent to the separable 4D convolutions is proposed to effectively train the model using limited 4D training samples. Extensive experiments substantiate the effectiveness of our proposed method, and the RSTAR4D-Net shows superior performance compared to other methods. The source code and dynamic demos are available at https://github.com/ivy9092111111/RSTAR.

CVJun 20, 2025
Hunyuan-GameCraft: High-dynamic Interactive Game Video Generation with Hybrid History Condition

Jiaqi Li, Junshu Tang, Zhiyong Xu et al.

Recent advances in diffusion-based and controllable video generation have enabled high-quality and temporally coherent video synthesis, laying the groundwork for immersive interactive gaming experiences. However, current methods face limitations in dynamics, generality, long-term consistency, and efficiency, which limit the ability to create various gameplay videos. To address these gaps, we introduce Hunyuan-GameCraft, a novel framework for high-dynamic interactive video generation in game environments. To achieve fine-grained action control, we unify standard keyboard and mouse inputs into a shared camera representation space, facilitating smooth interpolation between various camera and movement operations. Then we propose a hybrid history-conditioned training strategy that extends video sequences autoregressively while preserving game scene information. Additionally, to enhance inference efficiency and playability, we achieve model distillation to reduce computational overhead while maintaining consistency across long temporal sequences, making it suitable for real-time deployment in complex interactive environments. The model is trained on a large-scale dataset comprising over one million gameplay recordings across over 100 AAA games, ensuring broad coverage and diversity, then fine-tuned on a carefully annotated synthetic dataset to enhance precision and control. The curated game scene data significantly improves the visual fidelity, realism and action controllability. Extensive experiments demonstrate that Hunyuan-GameCraft significantly outperforms existing models, advancing the realism and playability of interactive game video generation.

CVMay 20, 2025
Hunyuan-Game: Industrial-grade Intelligent Game Creation Model

Ruihuang Li, Caijin Zhou, Shoujian Zheng et al. · tencent-ai

Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.

CVJun 10, 2025
HunyuanVideo-HOMA: Generic Human-Object Interaction in Multimodal Driven Human Animation

Ziyao Huang, Zixiang Zhou, Juan Cao et al.

To address key limitations in human-object interaction (HOI) video generation -- specifically the reliance on curated motion data, limited generalization to novel objects/scenarios, and restricted accessibility -- we introduce HunyuanVideo-HOMA, a weakly conditioned multimodal-driven framework. HunyuanVideo-HOMA enhances controllability and reduces dependency on precise inputs through sparse, decoupled motion guidance. It encodes appearance and motion signals into the dual input space of a multimodal diffusion transformer (MMDiT), fusing them within a shared context space to synthesize temporally consistent and physically plausible interactions. To optimize training, we integrate a parameter-space HOI adapter initialized from pretrained MMDiT weights, preserving prior knowledge while enabling efficient adaptation, and a facial cross-attention adapter for anatomically accurate audio-driven lip synchronization. Extensive experiments confirm state-of-the-art performance in interaction naturalness and generalization under weak supervision. Finally, HunyuanVideo-HOMA demonstrates versatility in text-conditioned generation and interactive object manipulation, supported by a user-friendly demo interface. The project page is at https://anonymous.4open.science/w/homa-page-0FBE/.

CVNov 26, 2025
Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi, Wenqing Yu, Jiesong Lian et al.

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning~(PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

CVOct 2, 2025
Pack and Force Your Memory: Long-form and Consistent Video Generation

Xiaofei Wu, Guozhen Zhang, Zhiyong Xu et al.

Long-form video generation presents a dual challenge: models must capture long-range dependencies while preventing the error accumulation inherent in autoregressive decoding. To address these challenges, we make two contributions. First, for dynamic context modeling, we propose MemoryPack, a learnable context-retrieval mechanism that leverages both textual and image information as global guidance to jointly model short- and long-term dependencies, achieving minute-level temporal consistency. This design scales gracefully with video length, preserves computational efficiency, and maintains linear complexity. Second, to mitigate error accumulation, we introduce Direct Forcing, an efficient single-step approximating strategy that improves training-inference alignment and thereby curtails error propagation during inference. Together, MemoryPack and Direct Forcing substantially enhance the context consistency and reliability of long-form video generation, advancing the practical usability of autoregressive video models.

CVAug 2, 2025
Enhancing Multi-view Open-set Learning via Ambiguity Uncertainty Calibration and View-wise Debiasing

Zihan Fang, Zhiyong Xu, Lan Du et al.

Existing multi-view learning models struggle in open-set scenarios due to their implicit assumption of class completeness. Moreover, static view-induced biases, which arise from spurious view-label associations formed during training, further degrade their ability to recognize unknown categories. In this paper, we propose a multi-view open-set learning framework via ambiguity uncertainty calibration and view-wise debiasing. To simulate ambiguous samples, we design O-Mix, a novel synthesis strategy to generate virtual samples with calibrated open-set ambiguity uncertainty. These samples are further processed by an auxiliary ambiguity perception network that captures atypical patterns for improved open-set adaptation. Furthermore, we incorporate an HSIC-based contrastive debiasing module that enforces independence between view-specific ambiguous and view-consistent representations, encouraging the model to learn generalizable features. Extensive experiments on diverse multi-view benchmarks demonstrate that the proposed framework consistently enhances unknown-class recognition while preserving strong closed-set performance.

CVSep 6, 2021
Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing

Xingjian He, Weining Wang, Zhiyong Xu et al.

Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction. In this paper, we propose a Spatial-Temporal Semantic Consistency method to capture class-exclusive context information. Specifically, we design a spatial-temporal consistency loss to constrain the semantic consistency in spatial and temporal dimensions. In addition, we adopt an pseudo-labeling strategy to enrich the training dataset. We obtain the scores of 59.84% and 58.85% mIoU on development (test part 1) and testing set of VSPW, respectively. And our method wins the 1st place on VSPW challenge at ICCV2021.

IVSep 10, 2019
U-net super-neural segmentation and similarity calculation to realize vegetation change assessment in satellite imagery

Chunxue Wu, Bobo Ju, Naixue Xiong et al.

Vegetation is the natural linkage connecting soil, atmosphere and water. It can represent the change of land cover to a certain extent and serve as an indicator for global change research. Methods for measuring coverage can be divided into two types: surface measurement and remote sensing. Because vegetation cover has significant spatial and temporal differentiation characteristics, remote sensing has become an important technical means to estimate vegetation coverage. This paper firstly uses U-net to perform remote sensing image semantic segmentation training, then uses the result of semantic segmentation, and then uses the integral progressive method to calculate the forestland change rate, and finally realizes automated valuation of woodland change rate.