Mingtao Chen

CV
h-index7
8papers
423citations
Novelty55%
AI Score60

8 Papers

CVFeb 15, 2023Code
Efficient Teacher: Semi-Supervised Object Detection for YOLOv5

Bowen Xu, Mingtao Chen, Wenlong Guan et al.

Semi-Supervised Object Detection (SSOD) has been successful in improving the performance of both R-CNN series and anchor-free detectors. However, one-stage anchor-based detectors lack the structure to generate high-quality or flexible pseudo labels, leading to serious inconsistency problems in SSOD. In this paper, we propose the Efficient Teacher framework for scalable and effective one-stage anchor-based SSOD training, consisting of Dense Detector, Pseudo Label Assigner, and Epoch Adaptor. Dense Detector is a baseline model that extends RetinaNet with dense sampling techniques inspired by YOLOv5. The Efficient Teacher framework introduces a novel pseudo label assignment mechanism, named Pseudo Label Assigner, which makes more refined use of pseudo labels from Dense Detector. Epoch Adaptor is a method that enables a stable and efficient end-to-end semi-supervised training schedule for Dense Detector. The Pseudo Label Assigner prevents the occurrence of bias caused by a large number of low-quality pseudo labels that may interfere with the Dense Detector during the student-teacher mutual learning mechanism, and the Epoch Adaptor utilizes domain and distribution adaptation to allow Dense Detector to learn globally distributed consistent features, making the training independent of the proportion of labeled data. Our experiments show that the Efficient Teacher framework achieves state-of-the-art results on VOC, COCO-standard, and COCO-additional using fewer FLOPs than previous methods. To the best of our knowledge, this is the first attempt to apply Semi-Supervised Object Detection to YOLOv5.Code is available: https://github.com/AlibabaResearch/efficientteacher

CVMay 25
VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

Jingru Chen, Yiming Liu, Mingtao Chen et al.

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some ``think-with-images'' benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 promninent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20\%, and the best tool-enabled model reaches only 56.01\%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.

CVMay 14, 2024Code
Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding

Zhimin Li, Jianwei Zhang, Qin Lin et al.

We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context. Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models. Code and pretrained models are publicly available at github.com/Tencent/HunyuanDiT

CVJan 9
TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang, Jianxiang Lu, Guangzheng Xu et al.

Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.

CVNov 24, 2025Code
HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li et al.

We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we developed a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.

LGOct 2, 2025
Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization

Kezhao Liu, Jason Klein Liu, Mingtao Chen et al.

Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. However, in methods such as GRPO, its implementation may be guided by principles from numerical value estimation-a practice that overlooks the term's functional role as an optimization loss. To analyze this issue, we establish a unified framework that connects two seemingly distinct implementation styles: using the mathematical term $k_n$ as a detached coefficient for the policy's score function ('$k_n$ in reward') or as a direct loss function through which gradients are propagated ('$k_n$ as loss'). We show that the latter can always be analyzed via an equivalent gradient coefficient in the former, unifying the two perspectives. Through this framework, we prove that the conventional '$k_1$ in reward' (like in PPO) is the principled loss for Reverse KL (RKL) regularization. We further establish a key finding: under on-policy conditions, the '$k_2$ as loss' formulation is, in fact, gradient-equivalent to '$k_1$ in reward'. This equivalence, first proven in our work, identifies both as the theoretically sound implementations of the RKL objective. In contrast, we show that the recently adopted '$k_3$ as loss' (like in GRPO) is merely a first-order, biased approximation of the principled loss. Furthermore, we argue that common off-policy implementations of '$k_n$ as loss' methods are biased due to neglected importance sampling, and we propose a principled correction. Our findings provide a comprehensive, gradient-based rationale for choosing and correctly implementing KL regularization, paving the way for more robust and effective RLHF systems.

CVNov 24, 2025
Q-Save: Towards Scoring and Attribution for Generated Video Evaluation

Xiele Wu, Zicheng Zhang, Mingtao Chen et al.

Evaluating AI-generated video (AIGV) quality hinges on three crucial dimensions: visual quality, dynamic quality, and text-video alignment. While numerous evaluation datasets and algorithms have been proposed, existing approaches are constrained by two limitations: the absence of systematic definitions for evaluation dimensions, and the isolated treatment of the three dimensions in separate models. Therefore, we introduce Q-Save, a holistic benchmark dataset and unified evaluation model for AIGV quality assessment. The Q-Save dataset contains nearly 10,000 video samples, each annotated with Mean Opinion Scores (MOS) and fine-grained attribution explanations across the three core dimensions. Leveraging this attribution-annotated dataset, we train the proposed Q-Save model, which adopts the SlowFast framework to balance accuracy and efficiency, and employs a three-stage training strategy with Chain-of-Thought (COT) formatted data: Supervised Fine-Tuning (SFT), Grouped Relative Policy Optimization (GRPO), and a final SFT round for stability, to jointly perform quality scoring and attribution generation. Experimental results demonstrate that Q-Save achieves superior performance in AIGV quality prediction while providing interpretable justifications. Code and dataset will be released upon publication.

AIJun 23, 2025
Airalogy: AI-empowered universal data digitization for research automation

Zijie Yang, Qiji Zhou, Fang Guo et al.

Research data are the foundation of Artificial Intelligence (AI)-driven science, yet current AI applications remain limited to a few fields with readily available, well-structured, digitized datasets. Achieving comprehensive AI empowerment across multiple disciplines is still out of reach. Present-day research data collection is often fragmented, lacking unified standards, inefficiently managed, and difficult to share. Creating a single platform for standardized data digitization needs to overcome the inherent challenge of balancing between universality (supporting the diverse, ever-evolving needs of various disciplines) and standardization (enforcing consistent formats to fully enable AI). No existing platform accommodates both facets. Building a truly multidisciplinary platform requires integrating scientific domain knowledge with sophisticated computing skills. Researchers often lack the computational expertise to design customized and standardized data recording methods, whereas platform developers rarely grasp the intricate needs of multiple scientific domains. These gaps impede research data standardization and hamper AI-driven progress. In this study, we address these challenges by developing Airalogy (https://airalogy.com), the world's first AI- and community-driven platform that balances universality and standardization for digitizing research data across multiple disciplines. Airalogy represents entire research workflows using customizable, standardized data records and offers an advanced AI research copilot for intelligent Q&A, automated data entry, analysis, and research automation. Already deployed in laboratories across all four schools of Westlake University, Airalogy has the potential to accelerate and automate scientific innovation in universities, industry, and the global research community-ultimately benefiting humanity as a whole.