Minghao Han

CV
h-index27
25papers
352citations
Novelty57%
AI Score59

25 Papers

CVAug 21, 2024Code
MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Minghao Han, Linhao Qu, Dingkang Yang et al.

Multiple instance learning (MIL) has become a standard paradigm for the weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on using a large number of labeled WSIs for training. The lack of training data and the presence of rare diseases pose significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) These methods fail to fully leverage the prior knowledge from the VLM's text modality; 2) They overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) They lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for FSWC task. Specifically, MSCPT employs the frozen large language model to generate pathological visual language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within WSI, and finally, a non-parametric cross-guided instance aggregation module has been introduced to derive the WSI-level features. Extensive experiments, visualizations, and interpretability analyses were conducted on five datasets and three downstream tasks using three VLMs, demonstrating the strong performance of our MSCPT. All codes have been made publicly accessible at https://github.com/Hanminghao/MSCPT.

CVDec 24, 2025Code
Beyond Pixel Simulation: Pathology Image Generation via Diagnostic Semantic Tokens and Prototype Control

Minghao Han, Yichen Liu, Yizhou Liu et al.

In computational pathology, understanding and generation have evolved along disparate paths: advanced understanding models already exhibit diagnostic-level competence, whereas generative models largely simulate pixels. Progress remains hindered by three coupled factors: the scarcity of large, high-quality image-text corpora; the lack of precise, fine-grained semantic control, which forces reliance on non-semantic cues; and terminological heterogeneity, where diverse phrasings for the same diagnostic concept impede reliable text conditioning. We introduce UniPath, a semantics-driven pathology image generation framework that leverages mature diagnostic understanding to enable controllable generation. UniPath implements Multi-Stream Control: a Raw-Text stream; a High-Level Semantics stream that uses learnable queries to a frozen pathology MLLM to distill paraphrase-robust Diagnostic Semantic Tokens and to expand prompts into diagnosis-aware attribute bundles; and a Prototype stream that affords component-level morphological control via a prototype bank. On the data front, we curate a 2.65M image-text corpus and a finely annotated, high-quality 68K subset to alleviate data scarcity. For a comprehensive assessment, we establish a four-tier evaluation hierarchy tailored to pathology. Extensive experiments demonstrate UniPath's SOTA performance, including a Patho-FID of 80.9 (51% better than the second-best) and fine-grained semantic control achieving 98.7% of the real-image. The dataset and code can be obtained from https://github.com/Hanminghao/UniPath.

SEDec 28, 2025Code
FasterPy: An LLM-based Code Execution Efficiency Optimization Framework

Yue Wu, Minghao Han, Ruiyin Li et al.

Code often suffers from performance bugs. These bugs necessitate the research and practice of code optimization. Traditional rule-based methods rely on manually designing and maintaining rules for specific performance bugs (e.g., redundant loops, repeated computations), making them labor-intensive and limited in applicability. In recent years, machine learning and deep learning-based methods have emerged as promising alternatives by learning optimization heuristics from annotated code corpora and performance measurements. However, these approaches usually depend on specific program representations and meticulously crafted training datasets, making them costly to develop and difficult to scale. With the booming of Large Language Models (LLMs), their remarkable capabilities in code generation have opened new avenues for automated code optimization. In this work, we proposed FasterPy, a low-cost and efficient framework that adapts LLMs to optimize the execution efficiency of Python code. FasterPy combines Retrieval-Augmented Generation (RAG), supported by a knowledge base constructed from existing performance-improving code pairs and corresponding performance measurements, with Low-Rank Adaptation (LoRA) to enhance code optimization performance. Our experimental results on the Performance Improving Code Edits (PIE) benchmark demonstrate that our method outperforms existing models on multiple metrics. The FasterPy tool and the experimental results are available at https://github.com/WuYue22/fasterpy.

CVApr 12Code
Differentiable Vector Quantization for Rate-Distortion Optimization of Generative Image Compression

Shiyin Jiang, Wei Long, Minghao Han et al.

The rapid growth of visual data under stringent storage and bandwidth constraints makes extremely low-bitrate image compression increasingly important. While Vector Quantization (VQ) offers strong structural fidelity, existing methods lack a principled mechanism for joint rate-distortion (RD) optimization due to the disconnect between representation learning and entropy modeling. We propose RDVQ, a unified framework that enables end-to-end RD optimization for VQ-based compression via a differentiable relaxation of the codebook distribution, allowing the entropy loss to directly shape the latent prior. We further develop an autoregressive entropy model that supports accurate entropy modeling and test-time rate control. Extensive experiments demonstrate that RDVQ achieves strong performance at extremely low bitrates with a lightweight architecture, attaining competitive or superior perceptual quality with significantly fewer parameters. Compared with RDEIC, RDVQ reduces bitrate by up to 75.71% on DISTS and 37.63% on LPIPS on DIV2K-val. Beyond empirical gains, RDVQ introduces an entropy-constrained formulation of VQ, highlighting the potential for a more unified view of image tokenization and compression. The code will be available at https://github.com/CVL-UESTC/RDVQ.

SEFeb 11Code
FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Qixing Zhou, Jiacheng Zhang, Haiyang Wang et al.

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current boundaries of their coding abilities. Existing agentic coding benchmarks, however, cover a limited task scope, e.g., bug fixing within a single pull request (PR), and often rely on non-executable evaluations or lack an automated approach for continually updating the evaluation coverage. To address such issues, we propose FeatureBench, a benchmark designed to evaluate agentic coding performance in end-to-end, feature-oriented software development. FeatureBench incorporates an execution-based evaluation protocol and a scalable test-driven method that automatically derives tasks from code repositories with minimal human effort. By tracing from unit tests along a dependency graph, our approach can identify feature-level coding tasks spanning multiple commits and PRs scattered across the development timeline, while ensuring the proper functioning of other features after the separation. Using this framework, we curated 200 challenging evaluation tasks and 3825 executable environments from 24 open-source repositories in the first version of our benchmark. Empirical evaluation reveals that the state-of-the-art agentic model, such as Claude 4.5 Opus, which achieves a 74.4% resolved rate on SWE-bench, succeeds on only 11.0% of tasks, opening new opportunities for advancing agentic coding. Moreover, benefiting from our automated task collection toolkit, FeatureBench can be easily scaled and updated over time to mitigate data leakage. The inherent verifiability of constructed environments also makes our method potentially valuable for agent training.

CVDec 15, 2025
Forging a Dynamic Memory: Retrieval-Guided Continual Learning for Generalist Medical Foundation Models

Zizhi Chen, Yizhen Gao, Minghao Han et al.

Multimodal biomedical Vision-Language Models (VLMs) exhibit immense potential in the field of Continual Learning (CL). However, they confront a core dilemma: how to preserve fine-grained intra-modality features while bridging the significant domain gap across different modalities. To address this challenge, we propose a comprehensive framework. Leveraging our 18-million multimodal and comprehensive medical retrieval database derived from PubMed scientific papers, we pioneer the integration of Retrieval-Augmented Generation (RAG) into CL. Specifically, we employ a multi-modal, multi-layer RAG system that provides real-time guidance for model fine-tuning through dynamic, on-demand knowledge retrieval. Building upon this, we introduce a dynamic knowledge distillation framework. This framework precisely resolves the aforementioned core dilemma by dynamically modulating the importance of the parameter space, the granularity of the distilled knowledge, and the data distribution of the reference dataset in accordance with the required level of detail. To thoroughly validate the clinical value of our strategy, we have designed a more rigorous \textbf{M}edical Generalist Task Incremental Learning (MGTIL) benchmark. This benchmark is engineered to simultaneously evaluate the model's capacity for adaptation to significant domain shifts, retention of subtle intra-domain features, and real-time learning of novel and complex medical tasks. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) performance across all metrics. The code is provided in the supplementary materials.

CVFeb 5
Exploring Physical Intelligence Emergence via Omni-Modal Architecture and Physical Data Engine

Minghao Han, Dingkang Yang, Yue Jiang et al.

Physical understanding remains brittle in omni-modal models because key physical attributes are visually ambiguous and sparsely represented in web-scale data. We present OmniFysics, a compact omni-modal model that unifies understanding across images, audio, video, and text, with integrated speech and image generation. To inject explicit physical knowledge, we build a physical data engine with two components. FysicsAny produces physics-grounded instruction--image supervision by mapping salient objects to verified physical attributes through hierarchical retrieval over a curated prototype database, followed by physics-law--constrained verification and caption rewriting. FysicsOmniCap distills web videos via audio--visual consistency filtering to generate high-fidelity video--instruction pairs emphasizing cross-modal physical cues. We train OmniFysics with staged multimodal alignment and instruction tuning, adopt latent-space flow matching for text-to-image generation, and use an intent router to activate generation only when needed. Experiments show competitive performance on standard multimodal benchmarks and improved results on physics-oriented evaluations.

CVDec 1, 2024Code
Towards Unified Molecule-Enhanced Pathology Image Representation Learning via Integrating Spatial Transcriptomics

Minghao Han, Dingkang Yang, Jiabei Cheng et al.

Recent advancements in multimodal pre-training models have significantly advanced computational pathology. However, current approaches predominantly rely on visual-language models, which may impose limitations from a molecular perspective and lead to performance bottlenecks. Here, we introduce a Unified Molecule-enhanced Pathology Image REpresentationn Learning framework (UMPIRE). UMPIRE aims to leverage complementary information from gene expression profiles to guide the multimodal pre-training, enhancing the molecular awareness of pathology image representation learning. We demonstrate that this molecular perspective provides a robust, task-agnostic training signal for learning pathology image embeddings. Due to the scarcity of paired data, approximately 4 million entries of spatial transcriptomics gene expression were collected to train the gene encoder. By leveraging powerful pre-trained encoders, UMPIRE aligns the encoders across over 697K pathology image-gene expression pairs. The performance of UMPIRE is demonstrated across various molecular-related downstream tasks, including gene expression prediction, spot classification, and mutation state prediction in whole slide images. Our findings highlight the effectiveness of multimodal data integration and open new avenues for exploring computational pathology enhanced by molecular perspectives. The code and pre-trained weights are available at https://github.com/Hanminghao/UMPIRE.

CVApr 30, 2024Code
Multi-Scale Heterogeneity-Aware Hypergraph Representation for Histopathology Whole Slide Images

Minghao Han, Xukun Zhang, Dingkang Yang et al.

Survival prediction is a complex ordinal regression task that aims to predict the survival coefficient ranking among a cohort of patients, typically achieved by analyzing patients' whole slide images. Existing deep learning approaches mainly adopt multiple instance learning or graph neural networks under weak supervision. Most of them are unable to uncover the diverse interactions between different types of biological entities(\textit{e.g.}, cell cluster and tissue block) across multiple scales, while such interactions are crucial for patient survival prediction. In light of this, we propose a novel multi-scale heterogeneity-aware hypergraph representation framework. Specifically, our framework first constructs a multi-scale heterogeneity-aware hypergraph and assigns each node with its biological entity type. It then mines diverse interactions between nodes on the graph structure to obtain a global representation. Experimental results demonstrate that our method outperforms state-of-the-art approaches on three benchmark datasets. Code is publicly available at \href{https://github.com/Hanminghao/H2GT}{https://github.com/Hanminghao/H2GT}.

CVFeb 15Code
Fusing Pixels and Genes: Spatially-Aware Learning in Computational Pathology

Minghao Han, Dingkang Yang, Linhao Qu et al.

Recent years have witnessed remarkable progress in multimodal learning within computational pathology. Existing models primarily rely on vision and language modalities; however, language alone lacks molecular specificity and offers limited pathological supervision, leading to representational bottlenecks. In this paper, we propose STAMP, a Spatial Transcriptomics-Augmented Multimodal Pathology representation learning framework that integrates spatially-resolved gene expression profiles to enable molecule-guided joint embedding of pathology images and transcriptomic data. Our study shows that self-supervised, gene-guided training provides a robust and task-agnostic signal for learning pathology image representations. Incorporating spatial context and multi-scale information further enhances model performance and generalizability. To support this, we constructed SpaVis-6M, the largest Visium-based spatial transcriptomics dataset to date, and trained a spatially-aware gene encoder on this resource. Leveraging hierarchical multi-scale contrastive alignment and cross-scale patch localization mechanisms, STAMP effectively aligns spatial transcriptomics with pathology images, capturing spatial structure and molecular variation. We validate STAMP across six datasets and four downstream tasks, where it consistently achieves strong performance. These results highlight the value and necessity of integrating spatially resolved molecular supervision for advancing multimodal learning in computational pathology. The code is included in the supplementary materials. The pretrained weights and SpaVis-6M are available at: https://github.com/Hanminghao/STAMP.

IVApr 22, 2025Code
VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining

Zizhi Chen, Xinyu Zhang, Minghao Han et al.

In histopathology, tissue sections are typically stained using common H&E staining or special stains (MAS, PAS, PASM, etc.) to clearly visualize specific tissue structures. The rapid advancement of deep learning offers an effective solution for generating virtually stained images, significantly reducing the time and labor costs associated with traditional histochemical staining. However, a new challenge arises in separating the fundamental visual characteristics of tissue sections from the visual differences induced by staining agents. Additionally, virtual staining often overlooks essential pathological knowledge and the physical properties of staining, resulting in only style-level transfer. To address these issues, we introduce, for the first time in virtual staining tasks, a pathological vision-language large model (VLM) as an auxiliary tool. We integrate contrastive learnable prompts, foundational concept anchors for tissue sections, and staining-specific concept anchors to leverage the extensive knowledge of the pathological VLM. This approach is designed to describe, frame, and enhance the direction of virtual staining. Furthermore, we have developed a data augmentation method based on the constraints of the VLM. This method utilizes the VLM's powerful image interpretation capabilities to further integrate image style and structural information, proving beneficial in high-precision pathological diagnostics. Extensive evaluations on publicly available multi-domain unpaired staining datasets demonstrate that our method can generate highly realistic images and enhance the accuracy of downstream tasks, such as glomerular detection and segmentation. Our code is available at: https://github.com/CZZZZZZZZZZZZZZZZZ/VPGAN-HARBOR

CVMar 25, 2025Code
VGAT: A Cancer Survival Analysis Framework Transitioning from Generative Visual Question Answering to Genomic Reconstruction

Zizhi Chen, Minghao Han, Xukun Zhang et al.

Multimodal learning combining pathology images and genomic sequences enhances cancer survival analysis but faces clinical implementation barriers due to limited access to genomic sequencing in under-resourced regions. To enable survival prediction using only whole-slide images (WSI), we propose the Visual-Genomic Answering-Guided Transformer (VGAT), a framework integrating Visual Question Answering (VQA) techniques for genomic modality reconstruction. By adapting VQA's text feature extraction approach, we derive stable genomic representations that circumvent dimensionality challenges in raw genomic data. Simultaneously, a cluster-based visual prompt module selectively enhances discriminative WSI patches, addressing noise from unfiltered image regions. Evaluated across five TCGA datasets, VGAT outperforms existing WSI-only methods, demonstrating the viability of genomic-informed inference without sequencing. This approach bridges multimodal research and clinical feasibility in resource-constrained settings. The code link is https://github.com/CZZZZZZZZZZZZZZZZZ/VGAT.

LGNov 7, 2019Code
$H_\infty$ Model-free Reinforcement Learning with Robust Stability Guarantee

Minghao Han, Yuan Tian, Lixian Zhang et al.

Reinforcement learning is showing great potentials in robotics applications, including autonomous driving, robot manipulation and locomotion. However, with complex uncertainties in the real-world environment, it is difficult to guarantee the successful generalization and sim-to-real transfer of learned policies theoretically. In this paper, we introduce and extend the idea of robust stability and $H_\infty$ control to design policies with both stability and robustness guarantee. Specifically, a sample-based approach for analyzing the Lyapunov stability and performance robustness of a learning-based control system is proposed. Based on the theoretical results, a maximum entropy algorithm is developed for searching Lyapunov function and designing a policy with provable robust stability guarantee. Without any specific domain knowledge, our method can find a policy that is robust to various uncertainties and generalizes well to different test environments. In our experiments, we show that our method achieves better robustness to both large impulsive disturbances and parametric variations in the environment than the state-of-art results in both robust and generic RL, as well as classic control. Anonymous code is available to reproduce the experimental results at https://github.com/RobustStabilityGuaranteeRL/RobustStabilityGuaranteeRL.

SYApr 9, 2025
Deep Neural Koopman Operator-based Economic Model Predictive Control of Shipboard Carbon Capture System

Minghao Han, Xunyuan Yin

Shipboard carbon capture is a promising solution to help reduce carbon emissions in international shipping. In this work, we propose a data-driven dynamic modeling and economic predictive control approach within the Koopman framework. This integrated modeling and control approach is used to achieve safe and energy-efficient process operation of shipboard post-combustion carbon capture plants. Specifically, we propose a deep neural Koopman operator modeling approach, based on which a Koopman model with time-varying model parameters is established. This Koopman model predicts the overall economic operational cost and key system outputs, based on accessible partial state measurements. By leveraging this learned model, a constrained economic predictive control scheme is developed. Despite time-varying parameters involved in the formulated model, the formulated optimization problem associated with the economic predictive control design is convex, and it can be solved efficiently during online control implementations. Extensive tests are conducted on a high-fidelity simulation environment for shipboard post-combustion carbon capture processes. Four ship operational conditions are taken into account. The results show that the proposed method significantly improves the overall economic operational performance and carbon capture rate. Additionally, the proposed method guarantees safe operation by ensuring that hard constraints on the system outputs are satisfied.

CVAug 18, 2025
Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination

Yizhou Liu, Jingwei Wei, Zizhi Chen et al.

Reinforcement learning (RL) with rule-based rewards has demonstrated strong potential in enhancing the reasoning and generalization capabilities of vision-language models (VLMs) and large language models (LLMs), while reducing computational overhead. However, its application in medical imaging remains underexplored. Existing reinforcement fine-tuning (RFT) approaches in this domain primarily target closed-ended visual question answering (VQA), limiting their applicability to real-world clinical reasoning. In contrast, open-ended medical VQA better reflects clinical practice but has received limited attention. While some efforts have sought to unify both formats via semantically guided RL, we observe that model-based semantic rewards often suffer from reward collapse, where responses with significant semantic differences receive similar scores. To address this, we propose ARMed (Adaptive Reinforcement for Medical Reasoning), a novel RL framework for open-ended medical VQA. ARMed first incorporates domain knowledge through supervised fine-tuning (SFT) on chain-of-thought data, then applies reinforcement learning with textual correctness and adaptive semantic rewards to enhance reasoning quality. We evaluate ARMed on six challenging medical VQA benchmarks. Results show that ARMed consistently boosts both accuracy and generalization, achieving a 32.64% improvement on in-domain tasks and an 11.65% gain on out-of-domain benchmarks. These results highlight the critical role of reward discriminability in medical RL and the promise of semantically guided rewards for enabling robust and clinically meaningful multimodal reasoning.

CVApr 29, 2025
Neural Stereo Video Compression with Hybrid Disparity Compensation

Shiyin Jiang, Zhenghao Chen, Minghao Han et al.

Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an "explicit pixel-wise attention score" to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.

CVDec 14, 2025
FysicsWorld: A Unified Full-Modality Benchmark for Any-to-Any Understanding, Generation, and Reasoning

Yue Jiang, Dingkang Yang, Minghao Han et al.

Despite rapid progress in multimodal large language models (MLLMs) and emerging omni-modal architectures, current benchmarks remain limited in scope and integration, suffering from incomplete modality coverage, restricted interaction to text-centric outputs, and weak interdependence and complementarity among modalities. To bridge these gaps, we introduce FysicsWorld, the first unified full-modality benchmark that supports bidirectional input-output across image, video, audio, and text, enabling comprehensive any-to-any evaluation across understanding, generation, and reasoning. FysicsWorld encompasses 16 primary tasks and 3,268 curated samples, aggregated from over 40 high-quality sources and covering a rich set of open-domain categories with diverse question types. We also propose the Cross-Modal Complementarity Screening (CMCS) strategy integrated in a systematic data construction framework that produces omni-modal data for spoken interaction and fusion-dependent cross-modal reasoning. Through a comprehensive evaluation of over 30 state-of-the-art baselines, spanning MLLMs, modality-specific models, unified understanding-generation models, and omni-modal language models, FysicsWorld exposes the performance disparities and limitations across models in understanding, generation, and reasoning. Our benchmark establishes a unified foundation and strong baselines for evaluating and advancing next-generation full-modality architectures.

SYOct 10, 2025
MAKO: Meta-Adaptive Koopman Operators for Learning-based Model Predictive Control of Parametrically Uncertain Nonlinear Systems

Minghao Han, Kiwan Wong, Adrian Wing-Keung Law et al.

In this work, we propose a meta-learning-based Koopman modeling and predictive control approach for nonlinear systems with parametric uncertainties. An adaptive deep meta-learning-based modeling approach, called Meta Adaptive Koopman Operator (MAKO), is proposed. Without knowledge of the parametric uncertainty, the proposed MAKO approach can learn a meta-model from a multi-modal dataset and efficiently adapt to new systems with previously unseen parameter settings by using online data. Based on the learned meta Koopman model, a predictive control scheme is developed, and the stability of the closed-loop system is ensured even in the presence of previously unseen parameter settings. Through extensive simulations, our proposed approach demonstrates superior performance in both modeling accuracy and control efficacy as compared to competitive baselines.

CVAug 27, 2025
PersonaAnimator: Personalized Motion Transfer from Unconstrained Videos

Ziyun Qian, Runyu Xiao, Shuyuan Tu et al.

Recent advances in motion generation show remarkable progress. However, several limitations remain: (1) Existing pose-guided character motion transfer methods merely replicate motion without learning its style characteristics, resulting in inexpressive characters. (2) Motion style transfer methods rely heavily on motion capture data, which is difficult to obtain. (3) Generated motions sometimes violate physical laws. To address these challenges, this paper pioneers a new task: Video-to-Video Motion Personalization. We propose a novel framework, PersonaAnimator, which learns personalized motion patterns directly from unconstrained videos. This enables personalized motion transfer. To support this task, we introduce PersonaVid, the first video-based personalized motion dataset. It contains 20 motion content categories and 120 motion style categories. We further propose a Physics-aware Motion Style Regularization mechanism to enforce physical plausibility in the generated motions. Extensive experiments show that PersonaAnimator outperforms state-of-the-art motion transfer methods and sets a new benchmark for the Video-to-Video Motion Personalization task.

IVMay 27, 2025
Generative Image Compression by Estimating Gradients of the Rate-variable Feature Distribution

Minghao Han, Weiyi You, Jinhua Zhang et al.

While learned image compression (LIC) focuses on efficient data transmission, generative image compression (GIC) extends this framework by integrating generative modeling to produce photo-realistic reconstructed images. In this paper, we propose a novel diffusion-based generative modeling framework tailored for generative image compression. Unlike prior diffusion-based approaches that indirectly exploit diffusion modeling, we reinterpret the compression process itself as a forward diffusion path governed by stochastic differential equations (SDEs). A reverse neural network is trained to reconstruct images by reversing the compression process directly, without requiring Gaussian noise initialization. This approach achieves smooth rate adjustment and photo-realistic reconstructions with only a minimal number of sampling steps. Extensive experiments on benchmark datasets demonstrate that our method outperforms existing generative image compression approaches across a range of metrics, including perceptual distortion, statistical fidelity, and no-reference quality assessments.

CVMay 19, 2025
MVAR: Visual Autoregressive Modeling with Scale and Spatial Markovian Conditioning

Jinhua Zhang, Wei Long, Minghao Han et al.

Essential to visual generation is efficient modeling of visual data priors. Conventional next-token prediction methods define the process as learning the conditional probability distribution of successive tokens. Recently, next-scale prediction methods redefine the process to learn the distribution over multi-scale representations, significantly reducing generation latency. However, these methods condition each scale on all previous scales and require each token to consider all preceding tokens, exhibiting scale and spatial redundancy. To better model the distribution by mitigating redundancy, we propose Markovian Visual AutoRegressive modeling (MVAR), a novel autoregressive framework that introduces scale and spatial Markov assumptions to reduce the complexity of conditional probability modeling. Specifically, we introduce a scale-Markov trajectory that only takes as input the features of adjacent preceding scale for next-scale prediction, enabling the adoption of a parallel training strategy that significantly reduces GPU memory consumption. Furthermore, we propose spatial-Markov attention, which restricts the attention of each token to a localized neighborhood of size k at corresponding positions on adjacent scales, rather than attending to every token across these scales, for the pursuit of reduced modeling complexity. Building on these improvements, we reduce the computational complexity of attention calculation from O(N^2) to O(Nk), enabling training with just eight NVIDIA RTX 4090 GPUs and eliminating the need for KV cache during inference. Extensive experiments on ImageNet demonstrate that MVAR achieves comparable or superior performance with both small model trained from scratch and large fine-tuned models, while reducing the average GPU memory footprint by 3.0x.

LGJan 20, 2022
A Prescriptive Dirichlet Power Allocation Policy with Deep Reinforcement Learning

Yuan Tian, Minghao Han, Chetan Kulkarni et al.

Prescribing optimal operation based on the condition of the system and, thereby, potentially prolonging the remaining useful lifetime has a large potential for actively managing the availability, maintenance and costs of complex systems. Reinforcement learning (RL) algorithms are particularly suitable for this type of problems given their learning capabilities. A special case of a prescriptive operation is the power allocation task, which can be considered as a sequential allocation problem, where the action space is bounded by a simplex constraint. A general continuous action-space solution of such sequential allocation problems has still remained an open research question for RL algorithms. In continuous action-space, the standard Gaussian policy applied in reinforcement learning does not support simplex constraints, while the Gaussian-softmax policy introduces a bias during training. In this work, we propose the Dirichlet policy for continuous allocation tasks and analyze the bias and variance of its policy gradients. We demonstrate that the Dirichlet policy is bias-free and provides significantly faster convergence, better performance and better hyperparameters robustness over the Gaussian-softmax policy. Moreover, we demonstrate the applicability of the proposed algorithm on a prescriptive operation case, where we propose the Dirichlet power allocation policy and evaluate the performance on a case study of a set of multiple lithium-ion (Li-I) battery systems. The experimental results show the potential to prescribe optimal operation, improve the efficiency and sustainability of multi-power source systems.

RONov 4, 2021
Modeling and Control of an Omnidirectional Micro Aerial Vehicle Equipped with a Soft Robotic Arm

Róbert Szász, Mike Allenspach, Minghao Han et al.

Flying manipulators are aerial drones with attached rigid-bodied robotic arms and belong to the latest and most actively developed research areas in robotics. The rigid nature of these arms often lack compliance, flexibility, and smoothness in movement. This work proposes to use a soft-bodied robotic arm attached to an omnidirectional micro aerial vehicle (OMAV) to leverage the compliant and flexible behavior of the arm, while remaining maneuverable and dynamic thanks to the omnidirectional drone as the floating base. The unification of the arm with the drone poses challenges in the modeling and control of such a combined platform; these challenges are addressed with this work. We propose a unified model for the flying manipulator based on three modeling principles: the Piecewise Constant Curvature (PCC) and Augmented Rigid Body Model (ARBM) hypotheses for modeling soft continuum robots and a floating-base approach borrowed from the traditional rigid-body robotics literature. To demonstrate the validity and usefulness of this parametrisation, a hierarchical model-based feedback controller is implemented. The controller is verified and evaluated in simulation on various dynamical tasks, where the nullspace motions, disturbance recovery, and trajectory tracking capabilities of the platform are examined and validated. The soft flying manipulator platform could open new application fields in aerial construction, goods delivery, human assistance, maintenance, and warehouse automation.

SYNov 13, 2020
Reinforcement Learning Control of Constrained Dynamic Systems with Uniformly Ultimate Boundedness Stability Guarantee

Minghao Han, Yuan Tian, Lixian Zhang et al.

Reinforcement learning (RL) is promising for complicated stochastic nonlinear control problems. Without using a mathematical model, an optimal controller can be learned from data evaluated by certain performance criteria through trial-and-error. However, the data-based learning approach is notorious for not guaranteeing stability, which is the most fundamental property for any control system. In this paper, the classic Lyapunov's method is explored to analyze the uniformly ultimate boundedness stability (UUB) solely based on data without using a mathematical model. It is further shown how RL with UUB guarantee can be applied to control dynamic systems with safety constraints. Based on the theoretical results, both off-policy and on-policy learning algorithms are proposed respectively. As a result, optimal controllers can be learned to guarantee UUB of the closed-loop system both at convergence and during learning. The proposed algorithms are evaluated on a series of robotic continuous control tasks with safety constraints. In comparison with the existing RL algorithms, the proposed method can achieve superior performance in terms of maintaining safety. As a qualitative evaluation of stability, our method shows impressive resilience even in the presence of external disturbances.

ROApr 29, 2020
Actor-Critic Reinforcement Learning for Control with Stability Guarantee

Minghao Han, Lixian Zhang, Jun Wang et al.

Reinforcement Learning (RL) and its integration with deep learning have achieved impressive performance in various robotic control tasks, ranging from motion planning and navigation to end-to-end visual manipulation. However, stability is not guaranteed in model-free RL by solely using data. From a control-theoretic perspective, stability is the most important property for any control system, since it is closely related to safety, robustness, and reliability of robotic systems. In this paper, we propose an actor-critic RL framework for control which can guarantee closed-loop stability by employing the classic Lyapunov's method in control theory. First of all, a data-based stability theorem is proposed for stochastic nonlinear systems modeled by Markov decision process. Then we show that the stability condition could be exploited as the critic in the actor-critic RL to learn a controller/policy. At last, the effectiveness of our approach is evaluated on several well-known 3-dimensional robot control tasks and a synthetic biology gene network tracking task in three different popular physics simulation platforms. As an empirical evaluation on the advantage of stability, we show that the learned policies can enable the systems to recover to the equilibrium or way-points when interfered by uncertainties such as system parametric variations and external disturbances to a certain extent.