Shuo Han

IV
h-index47
57papers
2,407citations
Novelty50%
AI Score59

57 Papers

CLFeb 16, 2023
TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

Yunyang Zeng, Joseph Konan, Shuo Han et al. · cmu

Speech enhancement models have greatly progressed in recent years, but still show limits in perceptual quality of their speech outputs. We propose an objective for perceptual quality based on temporal acoustic parameters. These are fundamental speech features that play an essential role in various applications, including speaker recognition and paralinguistic analysis. We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features. Unlike prior work that looks at aggregated acoustic parameters or a few categories of acoustic parameters, our temporal acoustic parameter (TAP) loss enables auxiliary optimization and improvement of many fine-grain speech characteristics in enhancement workflows. We show that adding TAPLoss as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility. We use data from the Deep Noise Suppression 2020 Challenge to demonstrate that both time-domain models and time-frequency domain models can benefit from our method.

SDFeb 16, 2023
PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

Muqiao Yang, Joseph Konan, David Bick et al. · cmu

Despite rapid advancement in recent years, current speech enhancement models often produce speech that differs in perceptual quality from real clean speech. We propose a learning objective that formalizes differences in perceptual quality, by using domain knowledge of acoustic-phonetics. We identify temporal acoustic parameters -- such as spectral tilt, spectral flux, shimmer, etc. -- that are non-differentiable, and we develop a neural network estimator that can accurately predict their time-series values across an utterance. We also model phoneme-specific weights for each feature, as the acoustic parameters are known to show different behavior in different phonemes. We can add this criterion as an auxiliary loss to any model that produces speech, to optimize speech outputs to match the values of clean speech in these features. Experimentally we show that it improves speech enhancement workflows in both time-domain and time-frequency domain, as measured by standard evaluation metrics. We also provide an analysis of phoneme-dependent improvement on acoustic parameters, demonstrating the additional interpretability that our method provides. This analysis can suggest which features are currently the bottleneck for improvement.

SYMar 14, 2016
Taxi Dispatch with Real-Time Sensing Data in Metropolitan Areas: A Receding Horizon Control Approach

Fei Miao, Shuo Han, Shan Lin et al.

Traditional taxi systems in metropolitan areas often suffer from inefficiencies due to uncoordinated actions as system capacity and customer demand change. With the pervasive deployment of networked sensors in modern vehicles, large amounts of information regarding customer demand and system status can be collected in real time. This information provides opportunities to perform various types of control and coordination for large-scale intelligent transportation systems. In this paper, we present a receding horizon control (RHC) framework to dispatch taxis, which incorporates highly spatiotemporally correlated demand/supply models and real-time GPS location and occupancy information. The objectives include matching spatiotemporal ratio between demand and supply for service quality with minimum current and anticipated future taxi idle driving distance. Extensive trace-driven analysis with a data set containing taxi operational records in San Francisco shows that our solution reduces the average total idle distance by 52%, and reduces the supply demand ratio error across the city during one experimental time slot by 45%. Moreover, our RHC framework is compatible with a wide variety of predictive models and optimization problem formulations. This compatibility property allows us to solve robust optimization problems with corresponding demand uncertainty models that provide disruptive event information.

LGJul 30, 2023Code
Robust Multi-Agent Reinforcement Learning with State Uncertainty

Sihong He, Songyang Han, Sanbao Su et al.

In real-world multi-agent reinforcement learning (MARL) applications, agents may not have perfect state information (e.g., due to inaccurate measurement or malicious attacks), which challenges the robustness of agents' policies. Though robustness is getting important in MARL deployment, little prior work has studied state uncertainties in MARL, neither in problem formulation nor algorithm design. Motivated by this robustness issue and the lack of corresponding studies, we study the problem of MARL with state uncertainty in this work. We provide the first attempt to the theoretical and empirical analysis of this challenging problem. We first model the problem as a Markov Game with state perturbation adversaries (MG-SPA) by introducing a set of state perturbation adversaries into a Markov Game. We then introduce robust equilibrium (RE) as the solution concept of an MG-SPA. We conduct a fundamental analysis regarding MG-SPA such as giving conditions under which such a robust equilibrium exists. Then we propose a robust multi-agent Q-learning (RMAQ) algorithm to find such an equilibrium, with convergence guarantees. To handle high-dimensional state-action space, we design a robust multi-agent actor-critic (RMAAC) algorithm based on an analytical expression of the policy gradient derived in the paper. Our experiments show that the proposed RMAQ algorithm converges to the optimal value function; our RMAAC algorithm outperforms several MARL and robust MARL methods in multiple multi-agent environments when state uncertainty is present. The source code is public on \url{https://github.com/sihongho/robust_marl_with_state_uncertainty}.

SDMar 16, 2023
Improving Perceptual Quality, Intelligibility, and Acoustics on VoIP Platforms

Joseph Konan, Ojas Bhargave, Shikhar Agnihotri et al. · cmu

In this paper, we present a method for fine-tuning models trained on the Deep Noise Suppression (DNS) 2020 Challenge to improve their performance on Voice over Internet Protocol (VoIP) applications. Our approach involves adapting the DNS 2020 models to the specific acoustic characteristics of VoIP communications, which includes distortion and artifacts caused by compression, transmission, and platform-specific processing. To this end, we propose a multi-task learning framework for VoIP-DNS that jointly optimizes noise suppression and VoIP-specific acoustics for speech enhancement. We evaluate our approach on a diverse VoIP scenarios and show that it outperforms both industry performance and state-of-the-art methods for speech enhancement on VoIP applications. Our results demonstrate the potential of models trained on DNS-2020 to be improved and tailored to different VoIP platforms using VoIP-DNS, whose findings have important applications in areas such as speech recognition, voice assistants, and telecommunication.

SYOct 20, 2017
Data-Driven Robust Taxi Dispatch under Demand Uncertainties

Fei Miao, Shuo Han, Shan Lin et al.

In modern taxi networks, large amounts of taxi occupancy status and location data are collected from networked in-vehicle sensors in real-time. They provide knowledge of system models on passenger demand and mobility patterns for efficient taxi dispatch and coordination strategies. Such approaches face new challenges: how to deal with uncertainties of predicted customer demand while fulfilling the system's performance requirements, including minimizing taxis' total idle mileage and maintaining service fairness across the whole city; how to formulate a computationally tractable problem. To address this problem, we develop a data-driven robust taxi dispatch framework to consider spatial-temporally correlated demand uncertainties. The robust vehicle dispatch problem we formulate is concave in the uncertain demand and convex in the decision variables. Uncertainty sets of random demand vectors are constructed from data based on theories in hypothesis testing, and provide a desired probabilistic guarantee level for the performance of robust taxi dispatch solutions. We prove equivalent computationally tractable forms of the robust dispatch problem using the minimax theorem and strong duality. Evaluations on four years of taxi trip data for New York City show that by selecting a probabilistic guarantee level at 75%, the average demand-supply ratio error is reduced by 31.7%, and the average total idle driving distance is reduced by 10.13% or about 20 million miles annually, compared with non-robust dispatch solutions.

IVMar 5, 2022
Coordinate Translator for Learning Deformable Medical Image Registration

Yihao Liu, Lianrui Zuo, Shuo Han et al.

The majority of deep learning (DL) based deformable image registration methods use convolutional neural networks (CNNs) to estimate displacement fields from pairs of moving and fixed images. This, however, requires the convolutional kernels in the CNN to not only extract intensity features from the inputs but also understand image coordinate systems. We argue that the latter task is challenging for traditional CNNs, limiting their performance in registration tasks. To tackle this problem, we first introduce Coordinate Translator, a differentiable module that identifies matched features between the fixed and moving image and outputs their coordinate correspondences without the need for training. It unloads the burden of understanding image coordinate systems for CNNs, allowing them to focus on feature extraction. We then propose a novel deformable registration network, im2grid, that uses multiple Coordinate Translator's with the hierarchical features extracted from a CNN encoder and outputs a deformation field in a coarse-to-fine fashion. We compared im2grid with the state-of-the-art DL and non-DL methods for unsupervised 3D magnetic resonance image registration. Our experiments show that im2grid outperforms these methods both qualitatively and quantitatively.

IVMay 10, 2022
Disentangling A Single MR Modality

Lianrui Zuo, Yihao Liu, Yuan Xue et al.

Disentangling anatomical and contrast information from medical images has gained attention recently, demonstrating benefits for various image analysis tasks. Current methods learn disentangled representations using either paired multi-modal images with the same underlying anatomy or auxiliary labels (e.g., manual delineations) to provide inductive bias for disentanglement. However, these requirements could significantly increase the time and cost in data collection and limit the applicability of these methods when such data are not available. Moreover, these methods generally do not guarantee disentanglement. In this paper, we present a novel framework that learns theoretically and practically superior disentanglement from single modality magnetic resonance images. Moreover, we propose a new information-based metric to quantitatively evaluate disentanglement. Comparisons over existing disentangling methods demonstrate that the proposed method achieves superior performance in both disentanglement and cross-domain image-to-image translation tasks.

LGJun 20, 2023
Transforming Graphs for Enhanced Attribute Clustering: An Innovative Graph Transformer-Based Method

Shuo Han, Jiacheng Liu, Jiayun Wu et al. · pku

Graph Representation Learning (GRL) is an influential methodology, enabling a more profound understanding of graph-structured data and aiding graph clustering, a critical task across various domains. The recent incursion of attention mechanisms, originally an artifact of Natural Language Processing (NLP), into the realm of graph learning has spearheaded a notable shift in research trends. Consequently, Graph Attention Networks (GATs) and Graph Attention Auto-Encoders have emerged as preferred tools for graph clustering tasks. Yet, these methods primarily employ a local attention mechanism, thereby curbing their capacity to apprehend the intricate global dependencies between nodes within graphs. Addressing these impediments, this study introduces an innovative method known as the Graph Transformer Auto-Encoder for Graph Clustering (GTAGC). By melding the Graph Auto-Encoder with the Graph Transformer, GTAGC is adept at capturing global dependencies between nodes. This integration amplifies the graph representation and surmounts the constraints posed by the local attention mechanism. The architecture of GTAGC encompasses graph embedding, integration of the Graph Transformer within the autoencoder structure, and a clustering component. It strategically alternates between graph embedding and clustering, thereby tailoring the Graph Transformer for clustering tasks, whilst preserving the graph's global structural information. Through extensive experimentation on diverse benchmark datasets, GTAGC has exhibited superior performance against existing state-of-the-art graph clustering methodologies.

IVSep 6, 2022
Deep filter bank regression for super-resolution of anisotropic MR brain images

Samuel W. Remedios, Shuo Han, Yuan Xue et al.

In 2D multi-slice magnetic resonance (MR) acquisition, the through-plane signals are typically of lower resolution than the in-plane signals. While contemporary super-resolution (SR) methods aim to recover the underlying high-resolution volume, the estimated high-frequency information is implicit via end-to-end data-driven training rather than being explicitly stated and sought. To address this, we reframe the SR problem statement in terms of perfect reconstruction filter banks, enabling us to identify and directly estimate the missing information. In this work, we propose a two-stage approach to approximate the completion of a perfect reconstruction filter bank corresponding to the anisotropic acquisition of a particular scan. In stage 1, we estimate the missing filters using gradient descent and in stage 2, we use deep networks to learn the mapping from coarse coefficients to detail coefficients. In addition, the proposed formulation does not rely on external training data, circumventing the need for domain shift correction. Under our approach, SR performance is improved particularly in "slice gap" scenarios, likely due to the constrained solution space imposed by the framework.

99.1CVApr 14Code
Medical thinking with multiple images

Zonghai Yao, Benlu Wang, Yifan Zhang et al.

Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.

SDOct 11, 2023
Psychoacoustic Challenges Of Speech Enhancement On VoIP Platforms

Joseph Konan, Shikhar Agnihotri, Ojas Bhargave et al. · cmu

Within the ambit of VoIP (Voice over Internet Protocol) telecommunications, the complexities introduced by acoustic transformations merit rigorous analysis. This research, rooted in the exploration of proprietary sender-side denoising effects, meticulously evaluates platforms such as Google Meets and Zoom. The study draws upon the Deep Noise Suppression (DNS) 2020 dataset, ensuring a structured examination tailored to various denoising settings and receiver interfaces. A methodological novelty is introduced via Blinder-Oaxaca decomposition, traditionally an econometric tool, repurposed herein to analyze acoustic-phonetic perturbations within VoIP systems. To further ground the implications of these transformations, psychoacoustic metrics, specifically PESQ and STOI, were used to explain of perceptual quality and intelligibility. Cumulatively, the insights garnered underscore the intricate landscape of VoIP-influenced acoustic dynamics. In addition to the primary findings, a multitude of metrics are reported, extending the research purview. Moreover, out-of-domain benchmarking for both time and time-frequency domain speech enhancement models is included, thereby enhancing the depth and applicability of this inquiry.

AIDec 6, 2022
What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning?

Songyang Han, Sanbao Su, Sihong He et al.

Various methods for Multi-Agent Reinforcement Learning (MARL) have been developed with the assumption that agents' policies are based on accurate state information. However, policies learned through Deep Reinforcement Learning (DRL) are susceptible to adversarial state perturbation attacks. In this work, we propose a State-Adversarial Markov Game (SAMG) and make the first attempt to investigate different solution concepts of MARL under state uncertainties. Our analysis shows that the commonly used solution concepts of optimal agent policy and robust Nash equilibrium do not always exist in SAMGs. To circumvent this difficulty, we consider a new solution concept called robust agent policy, where agents aim to maximize the worst-case expected state value. We prove the existence of robust agent policy for finite state and finite action SAMGs. Additionally, we propose a Robust Multi-Agent Adversarial Actor-Critic (RMA3C) algorithm to learn robust policies for MARL agents under state uncertainties. Our experiments demonstrate that our algorithm outperforms existing methods when faced with state perturbations and greatly improves the robustness of MARL policies. Our code is public on https://songyanghan.github.io/what_is_solution/.

SISep 28, 2016
Bio-Inspired Framework for Allocation of Protection Resources in Cyber-Physical Networks

Victor M. Preciado, Michael Zargham, Chinwendu Enyioha et al.

In this chapter, we consider the problem of designing protection strategies to contain spreading processes in complex cyber-physical networks. We illustrate our ideas using a family of bio-motivated spreading models originally proposed in the epidemiological literature, e.g., the Susceptible-Infected-Susceptible (SIS) model. We first introduce a framework in which we are allowed to distribute two types of resources in order to contain the spread, namely, (i) preventive resources able to reduce the spreading rate, and (ii) corrective resources able to increase the recovery rate of nodes in which the resources are allocated. In practice, these resources have an associated cost that depends on either the resiliency level achieved by the preventive resource, or the restoration efficiency of the corrective resource. We present a mathematical framework, based on dynamic systems theory and convex optimization, to find the cost-optimal distribution of protection resources in a network to contain the spread. We also present two extensions to this framework in which (i) we consider generalized epidemic models, beyond the simple SIS model, and (ii) we assume uncertainties in the contact network in which the spreading is taking place. We compare these protection strategies with common heuristics previously proposed in the literature and illustrate our results with numerical simulations using the air traffic network.

68.0LGMay 26
Ratio-Variance Regularized Policy Optimization

Yu Luo, Shuo Han, Yihan Hu et al.

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional ``soft brake'', this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce ${\bf R}^2{\bf VPO}$ (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across $7$ LLM scales, spanning both fast and slow reasoning paradigms, and $10$ robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.

MAJul 30, 2023
Robust Electric Vehicle Balancing of Autonomous Mobility-On-Demand System: A Multi-Agent Reinforcement Learning Approach

Sihong He, Shuo Han, Fei Miao

Electric autonomous vehicles (EAVs) are getting attention in future autonomous mobility-on-demand (AMoD) systems due to their economic and societal benefits. However, EAVs' unique charging patterns (long charging time, high charging frequency, unpredictable charging behaviors, etc.) make it challenging to accurately predict the EAVs supply in E-AMoD systems. Furthermore, the mobility demand's prediction uncertainty makes it an urgent and challenging task to design an integrated vehicle balancing solution under supply and demand uncertainties. Despite the success of reinforcement learning-based E-AMoD balancing algorithms, state uncertainties under the EV supply or mobility demand remain unexplored. In this work, we design a multi-agent reinforcement learning (MARL)-based framework for EAVs balancing in E-AMoD systems, with adversarial agents to model both the EAVs supply and mobility demand uncertainties that may undermine the vehicle balancing solutions. We then propose a robust E-AMoD Balancing MARL (REBAMA) algorithm to train a robust EAVs balancing policy to balance both the supply-demand ratio and charging utilization rate across the whole city. Experiments show that our proposed robust method performs better compared with a non-robust MARL method that does not consider state uncertainties; it improves the reward, charging utilization fairness, and supply-demand fairness by 19.28%, 28.18%, and 3.97%, respectively. Compared with a robust optimization-based method, the proposed MARL algorithm can improve the reward, charging utilization fairness, and supply-demand fairness by 8.21%, 8.29%, and 9.42%, respectively.

IVOct 10, 2022
Masked Autoencoders for Low dose CT denoising

Dayang Wang, Yongshun Xu, Shuo Han et al.

Low-dose computed tomography (LDCT) reduces the X-ray radiation but compromises image quality with more noises and artifacts. A plethora of transformer models have been developed recently to improve LDCT image quality. However, the success of a transformer model relies on a large amount of paired noisy and clean data, which is often unavailable in clinical applications. In computer vision and natural language processing fields, masked autoencoders (MAE) have been proposed as an effective label-free self-pretraining method for transformers, due to its excellent feature representation ability. Here, we redesign the classical encoder-decoder learning model to match the denoising task and apply it to LDCT denoising problem. The MAE can leverage the unlabeled data and facilitate structural preservation for the LDCT denoising model when ground truth data are missing. Experiments on the Mayo dataset validate that the MAE can boost the transformer's denoising performance and relieve the dependence on the ground truth data.

OCMar 4, 2019
Systematic Design of Decentralized Algorithms for Consensus Optimization

Shuo Han

We propose a separation principle that enables a systematic way of designing decentralized algorithms used in consensus optimization. Specifically, we show that a decentralized optimization algorithm can be constructed by combining a non-decentralized base optimization algorithm and decentralized consensus tracking. The separation principle provides modularity in both the design and analysis of algorithms under an automated convergence analysis framework using integral quadratic constraints (IQCs). We show that consensus tracking can be incorporated into the IQC-based analysis. The workflow is illustrated through the design and analysis of a decentralized algorithm based on the alternating direction method of multipliers.

OCMar 25, 2019
Computational Convergence Analysis of Distributed Gradient Tracking for Smooth Convex Optimization Using Dissipativity Theory

Shuo Han

We present a computational analysis that establishes the $O(1/K)$ convergence of the distributed gradient tracking method when the objective function is smooth and convex but not strongly convex. The analysis is inspired by recent work on applying dissipativity theory to the analysis of centralized optimization algorithms, in which convergence is proved by searching for a numerical certificate consisting of a storage function and a supply rate. We derive a base supply rate that can be used to analyze distributed optimization with non-strongly convex objective functions. The base supply rate is then used to create a class of supply rates by combining with integral quadratic constraints. Provided that the class of supply rates is rich enough, a numerical certificate of convergence can be automatically generated following a standard procedure that involves solving a linear matrix inequality. Our computational analysis is found capable of certifying convergence under a broader range of step sizes than what is given by the original analytic result.

MASep 17, 2022
A Robust and Constrained Multi-Agent Reinforcement Learning Electric Vehicle Rebalancing Method in AMoD Systems

Sihong He, Yue Wang, Shuo Han et al.

Electric vehicles (EVs) play critical roles in autonomous mobility-on-demand (AMoD) systems, but their unique charging patterns increase the model uncertainties in AMoD systems (e.g. state transition probability). Since there usually exists a mismatch between the training and test/true environments, incorporating model uncertainty into system design is of critical importance in real-world applications. However, model uncertainties have not been considered explicitly in EV AMoD system rebalancing by existing literature yet, and the coexistence of model uncertainties and constraints that the decision should satisfy makes the problem even more challenging. In this work, we design a robust and constrained multi-agent reinforcement learning (MARL) framework with state transition kernel uncertainty for EV AMoD systems. We then propose a robust and constrained MARL algorithm (ROCOMA) with robust natural policy gradients (RNPG) that trains a robust EV rebalancing policy to balance the supply-demand ratio and the charging utilization rate across the city under model uncertainty. Experiments show that the ROCOMA can learn an effective and robust rebalancing policy. It outperforms non-robust MARL methods in the presence of model uncertainties. It increases the system fairness by 19.6% and decreases the rebalancing costs by 75.8%.

CLSep 5, 2024
GraphInsight: Unlocking Insights in Large Language Models for Graph Structure Understanding

Yukun Cao, Shuo Han, Zengyi Gao et al.

Although Large Language Models (LLMs) have demonstrated potential in processing graphs, they struggle with comprehending graphical structure information through prompts of graph description sequences, especially as the graph size increases. We attribute this challenge to the uneven memory performance of LLMs across different positions in graph description sequences, known as ''positional biases''. To address this, we propose GraphInsight, a novel framework aimed at improving LLMs' comprehension of both macro- and micro-level graphical information. GraphInsight is grounded in two key strategies: 1) placing critical graphical information in positions where LLMs exhibit stronger memory performance, and 2) investigating a lightweight external knowledge base for regions with weaker memory performance, inspired by retrieval-augmented generation (RAG). Moreover, GraphInsight explores integrating these two strategies into LLM agent processes for composite graph tasks that require multi-step reasoning. Extensive empirical studies on benchmarks with a wide range of evaluation tasks show that GraphInsight significantly outperforms all other graph description methods (e.g., prompting techniques and reordering strategies) in understanding graph structures of varying sizes.

IVOct 19, 2023
LoMAE: Low-level Vision Masked Autoencoders for Low-dose CT Denoising

Dayang Wang, Yongshun Xu, Shuo Han et al.

Low-dose computed tomography (LDCT) offers reduced X-ray radiation exposure but at the cost of compromised image quality, characterized by increased noise and artifacts. Recently, transformer models emerged as a promising avenue to enhance LDCT image quality. However, the success of such models relies on a large amount of paired noisy and clean images, which are often scarce in clinical settings. In the fields of computer vision and natural language processing, masked autoencoders (MAE) have been recognized as an effective label-free self-pretraining method for transformers, due to their exceptional feature representation ability. However, the original pretraining and fine-tuning design fails to work in low-level vision tasks like denoising. In response to this challenge, we redesign the classical encoder-decoder learning model and facilitate a simple yet effective low-level vision MAE, referred to as LoMAE, tailored to address the LDCT denoising problem. Moreover, we introduce an MAE-GradCAM method to shed light on the latent learning mechanisms of the MAE/LoMAE. Additionally, we explore the LoMAE's robustness and generability across a variety of noise levels. Experiments results show that the proposed LoMAE can enhance the transformer's denoising performance and greatly relieve the dependence on the ground truth clean data. It also demonstrates remarkable robustness and generalizability over a spectrum of noise levels.

32.4SYApr 23
Planning Stealthy Backdoor Attacks in MDPs with Observation-Based Triggers

Xinyi Wei, Shuo Han, Ahmed H. Hemida et al.

This paper investigates backdoor attack planning in stochastic control systems modeled as Markov Decision Processes (MDPs). A backdoor attack involves an adversary deploying a policy that performs well in the original MDP to pass testing, but behaves maliciously at runtime when combined with a trigger that perturbs system dynamics. We consider a sophisticated attacker capable of jointly optimizing the backdoor policy and its trigger using only a blackbox simulator. During execution, the attacker has access only to partial observations of the system state and is restricted to introduce small perturbations to the system's transition dynamics. We formulate the attack planning problem as a constrained Markov game with an augmented state space and two players: Player 0 learns a backdoor policy that maximizes attack rewards when the trigger is active. However, when the trigger is inactive, the backdoor policy behaves near-optimally in the original MDP; Player 1 designs a finite-memory, observation-based trigger to activate the attack. We propose a switching gradient-based optimization algorithm to jointly solve for the backdoor policy and trigger. Experiments on a case study demonstrate the effectiveness of our method in achieving stealthy and successful backdoor attacks, and how the attack performance varies under different parameters related to the stealthiness of the backdoor attack.

MAOct 25, 2023
Covert Planning against Imperfect Observers

Haoxiang Ma, Chongyang Shi, Shuo Han et al.

Covert planning refers to a class of constrained planning problems where an agent aims to accomplish a task with minimal information leaked to a passive observer to avoid detection. However, existing methods of covert planning often consider deterministic environments or do not exploit the observer's imperfect information. This paper studies how covert planning can leverage the coupling of stochastic dynamics and the observer's imperfect observation to achieve optimal task performance without being detected. Specifically, we employ a Markov decision process to model the interaction between the agent and its stochastic environment, and a partial observation function to capture the leaked information to a passive observer. Assuming the observer employs hypothesis testing to detect if the observation deviates from a nominal policy, the covert planning agent aims to maximize the total discounted reward while keeping the probability of being detected as an adversary below a given threshold. We prove that finite-memory policies are more powerful than Markovian policies in covert planning. Then, we develop a primal-dual proximal policy gradient method with a two-time-scale update to compute a (locally) optimal covert policy. We demonstrate the effectiveness of our methods using a stochastic gridworld example. Our experimental results illustrate that the proposed method computes a policy that maximizes the adversary's expected reward without violating the detection constraint, and empirically demonstrates how the environmental noises can influence the performance of the covert policies.

AISep 26, 2025Code
PRIME: Planning and Retrieval-Integrated Memory for Enhanced Reasoning

Hieu Tran, Zonghai Yao, Nguyen Luong Tran et al.

Inspired by the dual-process theory of human cognition from \textit{Thinking, Fast and Slow}, we introduce \textbf{PRIME} (Planning and Retrieval-Integrated Memory for Enhanced Reasoning), a multi-agent reasoning framework that dynamically integrates \textbf{System 1} (fast, intuitive thinking) and \textbf{System 2} (slow, deliberate thinking). PRIME first employs a Quick Thinking Agent (System 1) to generate a rapid answer; if uncertainty is detected, it then triggers a structured System 2 reasoning pipeline composed of specialized agents for \textit{planning}, \textit{hypothesis generation}, \textit{retrieval}, \textit{information integration}, and \textit{decision-making}. This multi-agent design faithfully mimics human cognitive processes and enhances both efficiency and accuracy. Experimental results with LLaMA 3 models demonstrate that PRIME enables open-source LLMs to perform competitively with state-of-the-art closed-source models like GPT-4 and GPT-4o on benchmarks requiring multi-hop and knowledge-grounded reasoning. This research establishes PRIME as a scalable solution for improving LLMs in domains requiring complex, knowledge-intensive reasoning.

CVMar 14, 2025Code
ECLARE: Efficient cross-planar learning for anisotropic resolution enhancement

Samuel W. Remedios, Shuwen Wei, Shuo Han et al.

In clinical imaging, magnetic resonance (MR) image volumes are often acquired as stacks of 2D slices with decreased scan times, improved signal-to-noise ratio, and image contrasts unique to 2D MR pulse sequences. While this is sufficient for clinical evaluation, automated algorithms designed for 3D analysis perform poorly on multi-slice 2D MR volumes, especially those with thick slices and gaps between slices. Super-resolution (SR) methods aim to address this problem, but previous methods do not address all of the following: slice profile shape estimation, slice gap, domain shift, and non-integer or arbitrary upsampling factors. In this paper, we propose ECLARE (Efficient Cross-planar Learning for Anisotropic Resolution Enhancement), a self-SR method that addresses each of these factors. ECLARE uses a slice profile estimated from the multi-slice 2D MR volume, trains a network to learn the mapping from low-resolution to high-resolution in-plane patches from the same volume, and performs SR with anti-aliasing. We compared ECLARE to cubic B-spline interpolation, SMORE, and other contemporary SR methods. We used realistic and representative simulations so that quantitative performance against ground truth can be computed, and ECLARE outperformed all other methods in both signal recovery and downstream tasks. Importantly, as ECLARE does not use external training data it cannot suffer from domain shift between training and testing. Our code is open-source and available at https://www.github.com/sremedios/eclare.

CVMar 31, 2021Code
MR Slice Profile Estimation by Learning to Match Internal Patch Distributions

Shuo Han, Samuel Remedios, Aaron Carass et al.

To super-resolve the through-plane direction of a multi-slice 2D magnetic resonance (MR) image, its slice selection profile can be used as the degeneration model from high resolution (HR) to low resolution (LR) to create paired data when training a supervised algorithm. Existing super-resolution algorithms make assumptions about the slice selection profile since it is not readily known for a given image. In this work, we estimate a slice selection profile given a specific image by learning to match its internal patch distributions. Specifically, we assume that after applying the correct slice selection profile, the image patch distribution along HR in-plane directions should match the distribution along the LR through-plane direction. Therefore, we incorporate the estimation of a slice selection profile as part of learning a generator in a generative adversarial network (GAN). In this way, the slice selection profile can be learned without any external data. Our algorithm was tested using simulations from isotropic MR images, incorporated in a through-plane super-resolution algorithm to demonstrate its benefits, and also used as a tool to measure image resolution. Our code is at https://github.com/shuohan/espreso2.

78.5SEApr 10
SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei et al.

Agent skills provide modular, task-specific guidance for LLM- based coding agents, but manually tuning skill bundles to balance success rate, cost, and runtime is expensive and fragile. We present SkillMOO, a multi-objective optimization framework that automatically evolves skill bundles using LLM-proposed edits and NSGA-II survivor selection: a solver agent evaluates candidate skill bundles on coding tasks and an optimizer agent proposes bundle edits based on failure analysis. On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost up to 32% relative to the best baseline per task at low optimization overhead. Pattern analysis reveals pruning and substitution as primary drivers of improvement, suggesting effective bundles favor minimal, focused content over accumulated instructions.

SDFeb 6, 2025
ADIFF: Explaining audio difference using natural language

Soham Deshmukh, Shuo Han, Rita Singh et al.

Understanding and explaining differences between audio recordings is crucial for fields like audio forensics, quality assessment, and audio generation. This involves identifying and describing audio events, acoustic scenes, signal characteristics, and their emotional impact on listeners. This paper stands out as the first work to comprehensively study the task of explaining audio differences and then propose benchmark, baselines for the task. First, we present two new datasets for audio difference explanation derived from the AudioCaps and Clotho audio captioning datasets. Using Large Language Models (LLMs), we generate three levels of difference explanations: (1) concise descriptions of audio events and objects, (2) brief sentences about audio events, acoustic scenes, and signal properties, and (3) comprehensive explanations that include semantics and listener emotions. For the baseline, we use prefix tuning where audio embeddings from two audio files are used to prompt a frozen language model. Our empirical analysis and ablation studies reveal that the naive baseline struggles to distinguish perceptually similar sounds and generate detailed tier 3 explanations. To address these limitations, we propose ADIFF, which introduces a cross-projection module, position captioning, and a three-step training process to enhance the model's ability to produce detailed explanations. We evaluate our model using objective metrics and human evaluation and show our model enhancements lead to significant improvements in performance over naive baseline and SoTA Audio-Language Model (ALM) Qwen Audio. Lastly, we conduct multiple ablation studies to study the effects of cross-projection, language model parameters, position captioning, third stage fine-tuning, and present our findings. Our benchmarks, findings, and strong baseline pave the way for nuanced and human-like explanations of audio differences.

HEP-PHDec 14, 2024
Pretrained Event Classification Model for High Energy Physics Analysis

Joshua Ho, Benjamin Ryan Roberts, Shuo Han et al.

We introduce a foundation model for event classification in high-energy physics, built on a Graph Neural Network architecture and trained on 120 million simulated proton-proton collision events spanning 12 distinct physics processes. The model is pretrained to learn a general and robust representation of collision data using challenging multiclass and multilabel classification tasks. Its performance is evaluated across five event classification tasks, which include both physics processes used during pretraining and new processes not encountered during pretraining. Fine-tuning the pretrained model significantly improves classification performance, particularly in scenarios with limited training data, demonstrating gains in both accuracy and computational efficiency. To investigate the underlying mechanisms behind these performance improvements, we employ a representational similarity evaluation framework based on Centered Kernel Alignment. This analysis reveals notable differences in the learned representations of fine-tuned pretrained models compared to baseline models trained from scratch.

AIOct 18, 2024
Soft-Label Integration for Robust Toxicity Classification

Zelei Cheng, Xian Wu, Jiahao Yu et al.

Toxicity classification in textual content remains a significant problem. Data with labels from a single annotator fall short of capturing the diversity of human perspectives. Therefore, there is a growing need to incorporate crowdsourced annotations for training an effective toxicity classifier. Additionally, the standard approach to training a classifier using empirical risk minimization (ERM) may fail to address the potential shifts between the training set and testing set due to exploiting spurious correlations. This work introduces a novel bi-level optimization framework that integrates crowdsourced annotations with the soft-labeling technique and optimizes the soft-label weights by Group Distributionally Robust Optimization (GroupDRO) to enhance the robustness against out-of-distribution (OOD) risk. We theoretically prove the convergence of our bi-level optimization algorithm. Experimental results demonstrate that our approach outperforms existing baseline methods in terms of both average and worst-group accuracy, confirming its effectiveness in leveraging crowdsourced annotations to achieve more effective and robust toxicity classification.

CLOct 17, 2024
RiTeK: A Dataset for Large Language Models Complex Reasoning over Textual Knowledge Graphs

Jiatan Huang, Mingchen Li, Zonghai Yao et al.

Answering complex real-world questions often requires accurate retrieval from textual knowledge graphs (TKGs). The scarcity of annotated data, along with intricate topological structures, makes this task particularly challenging. As the nature of relational path information could enhance the inference ability of Large Language Models (LLMs), efficiently retrieving more complex relational path information from TKGs presents another key challenge. To tackle these challenges, we first develop a Dataset for LLMs Complex Reasoning over Textual Knowledge Graphs (RiTeK) with a broad topological structure coverage.We synthesize realistic user queries that integrate diverse topological structures, relational information, and complex textual descriptions. We conduct rigorous expert evaluation to validate the quality of our synthesized queries. And then, we introduce an enhanced Monte Carlo Tree Search (MCTS) method, Relational MCTS, to automatically extract relational path information from textual graphs for specific queries. Our dataset mainly covers the medical domain as the relation types and entity are complex and publicly available. Experimental results indicate that RiTeK poses significant challenges for current retrieval and LLM systems, while the proposed Relational MCTS method enhances LLM inference ability and achieves state-of-the-art performance on RiTeK.

LGOct 16, 2024
When to Trust Your Data: Enhancing Dyna-Style Model-Based Reinforcement Learning With Data Filter

Yansong Li, Zeyu Dong, Ertai Luo et al.

Reinforcement learning (RL) algorithms can be divided into two classes: model-free algorithms, which are sample-inefficient, and model-based algorithms, which suffer from model bias. Dyna-style algorithms combine these two approaches by using simulated data from an estimated environmental model to accelerate model-free training. However, their efficiency is compromised when the estimated model is inaccurate. Previous works address this issue by using model ensembles or pretraining the estimated model with data collected from the real environment, increasing computational and sample complexity. To tackle this issue, we introduce an out-of-distribution (OOD) data filter that removes simulated data from the estimated model that significantly diverges from data collected in the real environment. We show theoretically that this technique enhances the quality of simulated data. With the help of the OOD data filter, the data simulated from the estimated model better mimics the data collected by interacting with the real model. This improvement is evident in the critic updates compared to using the simulated data without the OOD data filter. Our experiment integrates the data filter into the model-based policy optimization (MBPO) algorithm. The results demonstrate that our method requires fewer interactions with the real environment to achieve a higher level of optimality than MBPO, even without a model ensemble.

LGOct 3, 2025
D2 Actor Critic: Diffusion Actor Meets Distributional Critic

Lunjun Zhang, Shuo Han, Hanrui Lyu et al.

We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach.

CLSep 20, 2025
From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations

Benlu Wang, Iris Xia, Yifan Zhang et al.

Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.

CLJun 11, 2025
CoLMbo: Speaker Language Model for Descriptive Profiling

Massa Baali, Shuo Han, Syed Abdul Hannan et al.

Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.

LGMay 18, 2025
Of Mice and Machines: A Comparison of Learning Between Real World Mice and RL Agents

Shuo Han, German Espinosa, Junda Huang et al.

Recent advances in reinforcement learning (RL) have demonstrated impressive capabilities in complex decision-making tasks. This progress raises a natural question: how do these artificial systems compare to biological agents, which have been shaped by millions of years of evolution? To help answer this question, we undertake a comparative study of biological mice and RL agents in a predator-avoidance maze environment. Through this analysis, we identify a striking disparity: RL agents consistently demonstrate a lack of self-preservation instinct, readily risking ``death'' for marginal efficiency gains. These risk-taking strategies are in contrast to biological agents, which exhibit sophisticated risk-assessment and avoidance behaviors. Towards bridging this gap between the biological and artificial, we propose two novel mechanisms that encourage more naturalistic risk-avoidance behaviors in RL agents. Our approach leads to the emergence of naturalistic behaviors, including strategic environment assessment, cautious path planning, and predator avoidance patterns that closely mirror those observed in biological systems.

IVNov 17, 2025
PoCGM: Poisson-Conditioned Generative Model for Sparse-View CT Reconstruction

Changsheng Fang, Yongtong Liu, Bahareh Morovati et al.

In computed tomography (CT), reducing the number of projection views is an effective strategy to lower radiation exposure and/or improve temporal resolution. However, this often results in severe aliasing artifacts and loss of structural details in reconstructed images, posing significant challenges for clinical applications. Inspired by the success of the Poisson Flow Generative Model (PFGM++) in natural image generation, we propose a PoCGM (Poisson-Conditioned Generative Model) to address the challenges of sparse-view CT reconstruction. Since PFGM++ was originally designed for unconditional generation, it lacks direct applicability to medical imaging tasks that require integrating conditional inputs. To overcome this limitation, the PoCGM reformulates PFGM++ into a conditional generative framework by incorporating sparse-view data as guidance during both training and sampling phases. By modeling the posterior distribution of full-view reconstructions conditioned on sparse observations, PoCGM effectively suppresses artifacts while preserving fine structural details. Qualitative and quantitative evaluations demonstrate that PoCGM outperforms the baselines, achieving improved artifact suppression, enhanced detail preservation, and reliable performance in dose-sensitive and time-critical imaging scenarios.

AIOct 19, 2025
See or Say Graphs: Agent-Driven Scalable Graph Understanding with Vision-Language Models

Shuo Han, Yukun Cao, Zezhong Ding et al.

Vision-language models (VLMs) have shown promise in graph understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that routes tasks to the most suitable modality-using the text modality for simple property reasoning and the visual modality for local and structurally complex reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to $200\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to $4.4\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.

IVSep 16, 2025
Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction for Sparse-View CT

Haodong Li, Shuo Han, Haiyang Mao et al.

Sparse-View CT (SVCT) reconstruction enhances temporal resolution and reduces radiation dose, yet its clinical use is hindered by artifacts due to view reduction and domain shifts from scanner, protocol, or anatomical variations, leading to performance degradation in out-of-distribution (OOD) scenarios. In this work, we propose a Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction (CDPIR) framework to tackle the OOD problem in SVCT. CDPIR integrates cross-distribution diffusion priors, derived from a Scalable Interpolant Transformer (SiT), with model-based iterative reconstruction methods. Specifically, we train a SiT backbone, an extension of the Diffusion Transformer (DiT) architecture, to establish a unified stochastic interpolant framework, leveraging Classifier-Free Guidance (CFG) across multiple datasets. By randomly dropping the conditioning with a null embedding during training, the model learns both domain-specific and domain-invariant priors, enhancing generalizability. During sampling, the globally sensitive transformer-based diffusion model exploits the cross-distribution prior within the unified stochastic interpolant framework, enabling flexible and stable control over multi-distribution-to-noise interpolation paths and decoupled sampling strategies, thereby improving adaptation to OOD reconstruction. By alternating between data fidelity and sampling updates, our model achieves state-of-the-art performance with superior detail preservation in SVCT reconstructions. Extensive experiments demonstrate that CDPIR significantly outperforms existing approaches, particularly under OOD conditions, highlighting its robustness and potential clinical value in challenging imaging scenarios.

CVSep 1, 2025
Clinical Metadata Guided Limited-Angle CT Image Reconstruction

Yu Shi, Shuyi Fan, Changsheng Fang et al.

Limited-angle computed tomography (LACT) offers improved temporal resolution and reduced radiation dose for cardiac imaging, but suffers from severe artifacts due to truncated projections. To address the ill-posedness of LACT reconstruction, we propose a two-stage diffusion framework guided by structured clinical metadata. In the first stage, a transformer-based diffusion model conditioned exclusively on metadata, including acquisition parameters, patient demographics, and diagnostic impressions, generates coarse anatomical priors from noise. The second stage further refines the images by integrating both the coarse prior and metadata to produce high-fidelity results. Physics-based data consistency is enforced at each sampling step in both stages using an Alternating Direction Method of Multipliers module, ensuring alignment with the measured projections. Extensive experiments on both synthetic and real cardiac CT datasets demonstrate that incorporating metadata significantly improves reconstruction fidelity, particularly under severe angular truncation. Compared to existing metadata-free baselines, our method achieves superior performance in SSIM, PSNR, nMI, and PCC. Ablation studies confirm that different types of metadata contribute complementary benefits, particularly diagnostic and demographic priors under limited-angle conditions. These findings highlight the dual role of clinical metadata in improving both reconstruction quality and efficiency, supporting their integration into future metadata-guided medical imaging frameworks.

AIAug 31, 2025
ChatCLIDS: Simulating Persuasive AI Dialogues to Promote Closed-Loop Insulin Adoption in Type 1 Diabetes Care

Zonghai Yao, Talha Chafekar, Junda Wang et al.

Real-world adoption of closed-loop insulin delivery systems (CLIDS) in type 1 diabetes remains low, driven not by technical failure, but by diverse behavioral, psychosocial, and social barriers. We introduce ChatCLIDS, the first benchmark to rigorously evaluate LLM-driven persuasive dialogue for health behavior change. Our framework features a library of expert-validated virtual patients, each with clinically grounded, heterogeneous profiles and realistic adoption barriers, and simulates multi-turn interactions with nurse agents equipped with a diverse set of evidence-based persuasive strategies. ChatCLIDS uniquely supports longitudinal counseling and adversarial social influence scenarios, enabling robust, multi-dimensional evaluation. Our findings reveal that while larger and more reflective LLMs adapt strategies over time, all models struggle to overcome resistance, especially under realistic social pressure. These results highlight critical limitations of current LLMs for behavior change, and offer a high-fidelity, scalable testbed for advancing trustworthy persuasive AI in healthcare and beyond.

IVAug 11, 2025
Towards Human-AI Collaboration System for the Detection of Invasive Ductal Carcinoma in Histopathology Images

Shuo Han, Ahmed Karam Eldaly, Solomon Sunday Oyelere

Invasive ductal carcinoma (IDC) is the most prevalent form of breast cancer, and early, accurate diagnosis is critical to improving patient survival rates by guiding treatment decisions. Combining medical expertise with artificial intelligence (AI) holds significant promise for enhancing the precision and efficiency of IDC detection. In this work, we propose a human-in-the-loop (HITL) deep learning system designed to detect IDC in histopathology images. The system begins with an initial diagnosis provided by a high-performance EfficientNetV2S model, offering feedback from AI to the human expert. Medical professionals then review the AI-generated results, correct any misclassified images, and integrate the revised labels into the training dataset, forming a feedback loop from the human back to the AI. This iterative process refines the model's performance over time. The EfficientNetV2S model itself achieves state-of-the-art performance compared to existing methods in the literature, with an overall accuracy of 93.65\%. Incorporating the human-in-the-loop system further improves the model's accuracy using four experimental groups with misclassified images. These results demonstrate the potential of this collaborative approach to enhance AI performance in diagnostic systems. This work contributes to advancing automated, efficient, and highly accurate methods for IDC detection through human-AI collaboration, offering a promising direction for future AI-assisted medical diagnostics.

IVDec 3, 2024
$ρ$-NeRF: Leveraging Attenuation Priors in Neural Radiance Field for 3D Computed Tomography Reconstruction

Li Zhou, Changsheng Fang, Bahareh Morovati et al.

This paper introduces $ρ$-NeRF, a self-supervised approach that sets a new standard in novel view synthesis (NVS) and computed tomography (CT) reconstruction by modeling a continuous volumetric radiance field enriched with physics-based attenuation priors. The $ρ$-NeRF represents a three-dimensional (3D) volume through a fully-connected neural network that takes a single continuous four-dimensional (4D) coordinate, spatial location $(x, y, z)$ and an initialized attenuation value ($ρ$), and outputs the attenuation coefficient at that position. By querying these 4D coordinates along X-ray paths, the classic forward projection technique is applied to integrate attenuation data across the 3D space. By matching and refining pre-initialized attenuation values derived from traditional reconstruction algorithms like Feldkamp-Davis-Kress algorithm (FDK) or conjugate gradient least squares (CGLS), the enriched schema delivers superior fidelity in both projection synthesis and image recognition.

OCJun 7, 2024
Robust Reward Design for Markov Decision Processes

Shuo Wu, Haoxiang Ma, Jie Fu et al.

The problem of reward design examines the interaction between a leader and a follower, where the leader aims to shape the follower's behavior to maximize the leader's payoff by modifying the follower's reward function. Current approaches to reward design rely on an accurate model of how the follower responds to reward modifications, which can be sensitive to modeling inaccuracies. To address this issue of sensitivity, we present a solution that offers robustness against uncertainties in modeling the follower, including 1) how the follower breaks ties in the presence of nonunique best responses, 2) inexact knowledge of how the follower perceives reward modifications, and 3) bounded rationality of the follower. Our robust solution is guaranteed to exist under mild conditions and can be obtained numerically by solving a mixed-integer linear program. Numerical experiments on multiple test cases demonstrate that our solution improves robustness compared to the standard approach without incurring significant additional computing costs.

CRJan 28, 2022
A Secure and Efficient Federated Learning Framework for NLP

Jieren Deng, Chenghong Wang, Xianrui Meng et al.

In this work, we consider the problem of designing secure and efficient federated learning (FL) frameworks. Existing solutions either involve a trusted aggregator or require heavyweight cryptographic primitives, which degrades performance significantly. Moreover, many existing secure FL designs work only under the restrictive assumption that none of the clients can be dropped out from the training protocol. To tackle these problems, we propose SEFL, a secure and efficient FL framework that (1) eliminates the need for the trusted entities; (2) achieves similar and even better model accuracy compared with existing FL designs; (3) is resilient to client dropouts. Through extensive experimental studies on natural language processing (NLP) tasks, we demonstrate that the SEFL achieves comparable accuracy compared to existing FL solutions, and the proposed pruning technique can improve runtime performance up to 13.7x.

IVJun 29, 2021
RCNN-SliceNet: A Slice and Cluster Approach for Nuclei Centroid Detection in Three-Dimensional Fluorescence Microscopy Images

Liming Wu, Shuo Han, Alain Chen et al.

Robust and accurate nuclei centroid detection is important for the understanding of biological structures in fluorescence microscopy images. Existing automated nuclei localization methods face three main challenges: (1) Most of object detection methods work only on 2D images and are difficult to extend to 3D volumes; (2) Segmentation-based models can be used on 3D volumes but it is computational expensive for large microscopy volumes and they have difficulty distinguishing different instances of objects; (3) Hand annotated ground truth is limited for 3D microscopy volumes. To address these issues, we present a scalable approach for nuclei centroid detection of 3D microscopy volumes. We describe the RCNN-SliceNet to detect 2D nuclei centroids for each slice of the volume from different directions and 3D agglomerative hierarchical clustering (AHC) is used to estimate the 3D centroids of nuclei in a volume. The model was trained with the synthetic microscopy data generated using Spatially Constrained Cycle-Consistent Adversarial Networks (SpCycleGAN) and tested on different types of real 3D microscopy data. Extensive experimental results demonstrate that our proposed method can accurately count and detect the nuclei centroids in a 3D microscopy volume.

CRApr 1, 2021
Qualitative Planning in Imperfect Information Games with Active Sensing and Reactive Sensor Attacks: Cost of Unawareness

Abhishek N. Kulkarni, Shuo Han, Nandi O. Leslie et al.

We consider the probabilistic planning problem where the agent (called Player 1, or P1) can jointly plan the control actions and sensor queries in a sensor network and an attacker (called player 2, or P2) can carry out attacks on the sensors. We model such an adversarial interaction using a formal model -- a reachability game with partially controllable observation functions. The main contribution of this paper is to assess the cost of P1's unawareness: Suppose P1 misinterprets the sensor failures as probabilistic node failures due to unreliable network communication, and P2 is aware of P1's misinterpretation in addition to her partial observability. Then, from which states can P2 carry out sensor attacks to ensure, with probability one, that P1 will not be able to complete her reachability task even though, due to misinterpretation, P1 believes that she can almost-surely achieve her task. We develop an algorithm to solve the almost-sure winning sensor-attack strategy given P1's observation-based strategy. Our attack analysis could be used for attack detection in wireless communication networks and the design of provably secured attack-aware sensor allocation in decision-theoretic models for cyber-physical systems.

IVJan 17, 2020
CHAOS Challenge -- Combined (CT-MR) Healthy Abdominal Organ Segmentation

A. Emre Kavur, N. Sinem Gezer, Mustafa Barış et al.

Segmentation of abdominal organs has been a comprehensive, yet unresolved, research field for many years. In the last decade, intensive developments in deep learning (DL) have introduced new state-of-the-art segmentation systems. In order to expand the knowledge on these topics, the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation challenge has been organized in conjunction with IEEE International Symposium on Biomedical Imaging (ISBI), 2019, in Venice, Italy. CHAOS provides both abdominal CT and MR data from healthy subjects for single and multiple abdominal organ segmentation. Five different but complementary tasks have been designed to analyze the capabilities of current approaches from multiple perspectives. The results are investigated thoroughly, compared with manual annotations and interactive methods. The analysis shows that the performance of DL models for single modality (CT / MR) can show reliable volumetric analysis performance (DICE: 0.98 $\pm$ 0.00 / 0.95 $\pm$ 0.01) but the best MSSD performance remain limited (21.89 $\pm$ 13.94 / 20.85 $\pm$ 10.63 mm). The performances of participating models decrease significantly for cross-modality tasks for the liver (DICE: 0.88 $\pm$ 0.15 MSSD: 36.33 $\pm$ 21.97 mm) and all organs (DICE: 0.85 $\pm$ 0.21 MSSD: 33.17 $\pm$ 38.93 mm). Despite contrary examples on different applications, multi-tasking DL models designed to segment all organs seem to perform worse compared to organ-specific ones (performance drop around 5\%). Besides, such directions of further research for cross-modality segmentation would significantly support real-world clinical applications. Moreover, having more than 1500 participants, another important contribution of the paper is the analysis on shortcomings of challenge organizations such as the effects of multiple submissions and peeking phenomena.

IVSep 13, 2019
Center-Extraction-Based Three Dimensional Nuclei Instance Segmentation of Fluorescence Microscopy Images

David Joon Ho, Shuo Han, Chichen Fu et al.

Fluorescence microscopy is an essential tool for the analysis of 3D subcellular structures in tissue. An important step in the characterization of tissue involves nuclei segmentation. In this paper, a two-stage method for segmentation of nuclei using convolutional neural networks (CNNs) is described. In particular, since creating labeled volumes manually for training purposes is not practical due to the size and complexity of the 3D data sets, the paper describes a method for generating synthetic microscopy volumes based on a spatially constrained cycle-consistent adversarial network. The proposed method is tested on multiple real microscopy data sets and outperforms other commonly used segmentation techniques.