Lin Zhao

CV
h-index35
124papers
6,227citations
Novelty48%
AI Score60

124 Papers

CLMar 20, 2023Code
DeID-GPT: Zero-shot Medical Text De-Identification by GPT-4

Zhengliang Liu, Yue Huang, Xiaowei Yu et al.

The digitization of healthcare has facilitated the sharing and re-using of medical data but has also raised concerns about confidentiality and privacy. HIPAA (Health Insurance Portability and Accountability Act) mandates removing re-identifying information before the dissemination of medical records. Thus, effective and efficient solutions for de-identifying medical data, especially those in free-text forms, are highly needed. While various computer-assisted de-identification methods, including both rule-based and learning-based, have been developed and used in prior practice, such solutions still lack generalizability or need to be fine-tuned according to different scenarios, significantly imposing restrictions in wider use. The advancement of large language models (LLM), such as ChatGPT and GPT-4, have shown great potential in processing text data in the medical domain with zero-shot in-context learning, especially in the task of privacy protection, as these models can identify confidential information by their powerful named entity recognition (NER) capability. In this work, we developed a novel GPT4-enabled de-identification framework (``DeID-GPT") to automatically identify and remove the identifying information. Compared to existing commonly used medical text data de-identification methods, our developed DeID-GPT showed the highest accuracy and remarkable reliability in masking private information from the unstructured medical text while preserving the original structure and meaning of the text. This study is one of the earliest to utilize ChatGPT and GPT-4 for medical text data processing and de-identification, which provides insights for further research and solution development on the use of LLMs such as ChatGPT/GPT-4 in healthcare. Codes and benchmarking data information are available at https://github.com/yhydhx/ChatGPT-API.

CLJun 14, 2023Code
Radiology-GPT: A Large Language Model for Radiology

Zhengliang Liu, Aoxiao Zhong, Yiwei Li et al.

We introduce Radiology-GPT, a large language model for radiology. Using an instruction tuning approach on an extensive dataset of radiology domain knowledge, Radiology-GPT demonstrates superior performance compared to general language models such as StableLM, Dolly and LLaMA. It exhibits significant versatility in radiological diagnosis, research, and communication. This work serves as a catalyst for future developments in clinical NLP. The successful implementation of Radiology-GPT is indicative of the potential of localizing generative large language models, specifically tailored for distinctive medical specialties, while ensuring adherence to privacy standards such as HIPAA. The prospect of developing individualized, large-scale language models that cater to specific needs of various hospitals presents a promising direction. The fusion of conversational competence and domain-specific knowledge in these models is set to foster future development in healthcare AI. A demo of Radiology-GPT is available at https://huggingface.co/spaces/allen-eric/radiology-gpt.

CVJul 3, 2023Code
SAMAug: Point Prompt Augmentation for Segment Anything Model

Haixing Dai, Chong Ma, Zhiling Yan et al.

This paper introduces SAMAug, a novel visual point augmentation method for the Segment Anything Model (SAM) that enhances interactive image segmentation performance. SAMAug generates augmented point prompts to provide more information about the user's intention to SAM. Starting with an initial point prompt, SAM produces an initial mask, which is then fed into our proposed SAMAug to generate augmented point prompts. By incorporating these extra points, SAM can generate augmented segmentation masks based on both the augmented point prompts and the initial prompt, resulting in improved segmentation performance. We conducted evaluations using four different point augmentation strategies: random sampling, sampling based on maximum difference entropy, maximum distance, and saliency. Experiment results on the COCO, Fundus, COVID QUEx, and ISIC2018 datasets show that SAMAug can boost SAM's segmentation results, especially using the maximum distance and saliency. SAMAug demonstrates the potential of visual prompt augmentation for computer vision. Codes of SAMAug are available at github.com/yhydhx/SAMAug

CLJun 16, 2023Code
AD-AutoGPT: An Autonomous GPT for Alzheimer's Disease Infodemiology

Haixing Dai, Yiwei Li, Zhengliang Liu et al.

In this pioneering study, inspired by AutoGPT, the state-of-the-art open-source application based on the GPT-4 large language model, we develop a novel tool called AD-AutoGPT which can conduct data collection, processing, and analysis about complex health narratives of Alzheimer's Disease in an autonomous manner via users' textual prompts. We collated comprehensive data from a variety of news sources, including the Alzheimer's Association, BBC, Mayo Clinic, and the National Institute on Aging since June 2022, leading to the autonomous execution of robust trend analyses, intertopic distance maps visualization, and identification of salient terms pertinent to Alzheimer's Disease. This approach has yielded not only a quantifiable metric of relevant discourse but also valuable insights into public focus on Alzheimer's Disease. This application of AD-AutoGPT in public health signifies the transformative potential of AI in facilitating a data-rich understanding of complex health narratives like Alzheimer's Disease in an autonomous manner, setting the groundwork for future AI-driven investigations in global health landscapes.

LGMay 28
World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

Arif Hassan Zidan, Yi Pan, Hanqi Jiang et al.

World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within learned representations. Despite rapid progress across reinforcement learning, robotics, autonomous driving, and video generation, the field lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey addresses that gap with a multi-axis taxonomy organized along four dimensions: (i) architecture, encompassing representation format, dynamics formulation, input modality, learning paradigm, and downstream application; (ii) methodological family, including state-space and recurrent approaches, transformer-based models, diffusion-based generators, physics-informed networks, and language-augmented multimodal systems; (iii) reasoning strategy, covering imagination-based planning, latent policy learning, counterfactual reasoning, and planning under uncertainty; and (iv) application domain, spanning robotics, autonomous driving, video prediction, multimodal agents, reinforcement learning, scientific modeling, medical imaging, educational measurement, and business and finance. Tracing the field from early cognitive-science foundations to milestone systems such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie, we examine how these dimensions interact and highlight the recent convergence of chain-of-thought reasoning with world-model imagination. We review evaluation protocols and benchmarks, identify persistent challenges such as compounding prediction errors, sim-to-real transfer, and fragmented evaluation, and outline future directions toward unified multimodal world models, foundation-scale interactive simulators, and safe deployment in safety-critical domains.

CLFeb 25, 2023
AugGPT: Leveraging ChatGPT for Text Data Augmentation

Haixing Dai, Zhengliang Liu, Wenxiong Liao et al.

Text data augmentation is an effective strategy for overcoming the challenge of limited sample sizes in many natural language processing (NLP) tasks. This challenge is especially prominent in the few-shot learning scenario, where the data in the target domain is generally much scarcer and of lowered quality. A natural and widely-used strategy to mitigate such challenges is to perform data augmentation to better capture the data invariance and increase the sample size. However, current text data augmentation methods either can't ensure the correct labeling of the generated data (lacking faithfulness) or can't ensure sufficient diversity in the generated data (lacking compactness), or both. Inspired by the recent success of large language models, especially the development of ChatGPT, which demonstrated improved language comprehension abilities, in this work, we propose a text data augmentation approach based on ChatGPT (named AugGPT). AugGPT rephrases each sentence in the training samples into multiple conceptually similar but semantically different samples. The augmented samples can then be used in downstream model training. Experiment results on few-shot learning text classification tasks show the superior performance of the proposed AugGPT approach over state-of-the-art text data augmentation methods in terms of testing accuracy and distribution of the augmented samples.

CLApr 4, 2023
Summary of ChatGPT-Related Research and Perspective Towards the Future of Large Language Models

Yiheng Liu, Tianle Han, Siyuan Ma et al.

This paper presents a comprehensive survey of ChatGPT-related (GPT-3.5 and GPT-4) research, state-of-the-art large language models (LLM) from the GPT series, and their prospective applications across diverse domains. Indeed, key innovations such as large-scale pre-training that captures knowledge across the entire world wide web, instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) have played significant roles in enhancing LLMs' adaptability and performance. We performed an in-depth analysis of 194 relevant papers on arXiv, encompassing trend analysis, word cloud representation, and distribution analysis across various application domains. The findings reveal a significant and increasing interest in ChatGPT-related research, predominantly centered on direct natural language processing applications, while also demonstrating considerable potential in areas ranging from education and history to mathematics, medicine, and physics. This study endeavors to furnish insights into ChatGPT's capabilities, potential implications, ethical concerns, and offer direction for future advancements in this field.

CVJul 3, 2023
Review of Large Vision Models and Visual Prompt Engineering

Jiaqi Wang, Zhengliang Liu, Lin Zhao et al.

Visual prompt engineering is a fundamental technology in the field of visual and image Artificial General Intelligence, serving as a key component for achieving zero-shot capabilities. As the development of large vision models progresses, the importance of prompt engineering becomes increasingly evident. Designing suitable prompts for specific visual tasks has emerged as a meaningful research direction. This review aims to summarize the methods employed in the computer vision domain for large vision models and visual prompt engineering, exploring the latest advancements in visual prompt engineering. We present influential large models in the visual domain and a range of prompt engineering methods employed on these models. It is our hope that this review provides a comprehensive and systematic description of prompt engineering methods based on large visual models, offering valuable insights for future researchers in their exploration of this field.

CLAug 29, 2023
Radiology-Llama2: Best-in-Class Large Language Model for Radiology

Zhengliang Liu, Yiwei Li, Peng Shu et al.

This paper introduces Radiology-Llama2, a large language model specialized for radiology through a process known as instruction tuning. Radiology-Llama2 is based on the Llama2 architecture and further trained on a large dataset of radiology reports to generate coherent and clinically useful impressions from radiological findings. Quantitative evaluations using ROUGE metrics on the MIMIC-CXR and OpenI datasets demonstrate that Radiology-Llama2 achieves state-of-the-art performance compared to other generative language models, with a Rouge-1 score of 0.4834 on MIMIC-CXR and 0.4185 on OpenI. Additional assessments by radiology experts highlight the model's strengths in understandability, coherence, relevance, conciseness, and clinical utility. The work illustrates the potential of localized language models designed and tuned for specialized domains like radiology. When properly evaluated and deployed, such models can transform fields like radiology by automating rote tasks and enhancing human expertise.

AIJun 8, 2023
Artificial General Intelligence for Medical Imaging Analysis

Xiang Li, Lin Zhao, Lu Zhang et al.

Large-scale Artificial General Intelligence (AGI) models, including Large Language Models (LLMs) such as ChatGPT/GPT-4, have achieved unprecedented success in a variety of general domain tasks. Yet, when applied directly to specialized domains like medical imaging, which require in-depth expertise, these models face notable challenges arising from the medical field's inherent complexities and unique characteristics. In this review, we delve into the potential applications of AGI models in medical imaging and healthcare, with a primary focus on LLMs, Large Vision Models, and Large Multimodal Models. We provide a thorough overview of the key features and enabling techniques of LLMs and AGI, and further examine the roadmaps guiding the evolution and implementation of AGI models in the medical sector, summarizing their present applications, potentialities, and associated challenges. In addition, we highlight potential future research directions, offering a holistic view on upcoming ventures. This comprehensive review aims to offer insights into the future implications of AGI in medical imaging, healthcare, and beyond.

LGJun 3
Flash-WAM: Modality-Aware Distillation for World Action Models

Arman Akbari, Ci Zhang, Arash Akbari et al.

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce \textbf{Flash-WAM}, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime, grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from $8.1$ seconds to $348$ ms on NVIDIA L40S, a $23{\times}$ speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks ($85.5\%$ RoboTwin 2.0, $95.7\%$ LIBERO) and substantially recovers real-world performance ($60\%$ average on a Unitree G1 humanoid robot), while naive consistency distillation drops to $24\%$ at the same step budget.

AIMar 28, 2023
When Brain-inspired AI Meets AGI

Lin Zhao, Lu Zhang, Zihao Wu et al.

Artificial General Intelligence (AGI) has been a long-standing goal of humanity, with the aim of creating machines capable of performing any intellectual task that humans can do. To achieve this, AGI researchers draw inspiration from the human brain and seek to replicate its principles in intelligent machines. Brain-inspired artificial intelligence is a field that has emerged from this endeavor, combining insights from neuroscience, psychology, and computer science to develop more efficient and powerful AI systems. In this article, we provide a comprehensive overview of brain-inspired AI from the perspective of AGI. We begin with the current progress in brain-inspired AI and its extensive connection with AGI. We then cover the important characteristics for both human intelligence and AGI (e.g., scaling, multimodality, and reasoning). We discuss important technologies toward achieving AGI in current AI systems, such as in-context learning and prompt tuning. We also investigate the evolution of AGI systems from both algorithmic and infrastructural perspectives. Finally, we explore the limitations and future of AGI.

CLJul 25, 2023
Evaluating Large Language Models for Radiology Natural Language Processing

Zhengliang Liu, Tianyang Zhong, Yiwei Li et al.

The rise of large language models (LLMs) has marked a pivotal shift in the field of natural language processing (NLP). LLMs have revolutionized a multitude of domains, and they have made a significant impact in the medical field. Large language models are now more abundant than ever, and many of these models exhibit bilingual capabilities, proficient in both English and Chinese. However, a comprehensive evaluation of these models remains to be conducted. This lack of assessment is especially apparent within the context of radiology NLP. This study seeks to bridge this gap by critically evaluating thirty two LLMs in interpreting radiology reports, a crucial component of radiology NLP. Specifically, the ability to derive impressions from radiologic findings is assessed. The outcomes of this evaluation provide key insights into the performance, strengths, and weaknesses of these LLMs, informing their practical applications within the medical domain.

CLApr 18, 2023
Exploring the Trade-Offs: Unified Large Language Models vs Local Fine-Tuned Models for Highly-Specific Radiology NLI Task

Zihao Wu, Lu Zhang, Chao Cao et al.

Recently, ChatGPT and GPT-4 have emerged and gained immense global attention due to their unparalleled performance in language processing. Despite demonstrating impressive capability in various open-domain tasks, their adequacy in highly specific fields like radiology remains untested. Radiology presents unique linguistic phenomena distinct from open-domain data due to its specificity and complexity. Assessing the performance of large language models (LLMs) in such specific domains is crucial not only for a thorough evaluation of their overall performance but also for providing valuable insights into future model design directions: whether model design should be generic or domain-specific. To this end, in this study, we evaluate the performance of ChatGPT/GPT-4 on a radiology NLI task and compare it to other models fine-tuned specifically on task-related data samples. We also conduct a comprehensive investigation on ChatGPT/GPT-4's reasoning ability by introducing varying levels of inference difficulty. Our results show that 1) GPT-4 outperforms ChatGPT in the radiology NLI task; 2) other specifically fine-tuned models require significant amounts of data samples to achieve comparable performance to ChatGPT/GPT-4. These findings demonstrate that constructing a generic model that is capable of solving various tasks across different domains is feasible.

NANov 24, 2016
A unified integral equation scheme for doubly-periodic Laplace and Stokes boundary value problems in two dimensions

Alex H. Barnett, Gary Marple, Shravan Veerapaneni et al.

We present a spectrally-accurate scheme to turn a boundary integral formulation for an elliptic PDE on a single unit cell geometry into one for the fully periodic problem. Applications include computing the effective permeability of composite media (homogenization), and microfluidic chip design. Our basic idea is to exploit a small least squares solve to apply periodicity without ever handling periodic Green's functions. We exhibit fast solvers for the two-dimensional (2D) doubly-periodic Neumann Laplace problem (flow around insulators), and Stokes non-slip fluid flow problem, that for inclusions with smooth boundaries achieve 12-digit accuracy, and can handle thousands of inclusions per unit cell. We split the infinite sum over the lattice of images into a directly-summed "near" part plus a small number of auxiliary sources which represent the (smooth) remaining "far" contribution. Applying physical boundary conditions on the unit cell walls gives an expanded linear system, which, after a rank-1 or rank-3 correction and a Schur complement, leaves a well-conditioned square system which can be solved iteratively using fast multipole acceleration plus a low-rank term. We are rather explicit about the consistency and nullspaces of both the continuous and discretized problems. The scheme is simple (no lattice sums, Ewald methods, nor particle meshes are required), allows adaptivity, and is essentially dimension- and PDE-independent, so would generalize without fuss to 3D and to other non-oscillatory elliptic problems such as elastostatics. We incorporate recently developed spectral quadratures that accurately handle close-to-touching geometries. We include many numerical examples, and provide a software implementation.

CVApr 29, 2023
Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT

Zhenxiang Xiao, Yuzhong Chen, Lu Zhang et al.

Prompts have been proven to play a crucial role in large language models, and in recent years, vision models have also been using prompts to improve scalability for multiple downstream tasks. In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification which we called Instruction-ViT. The key idea is to implement multi-modal prompts (text or image prompt) related to category information to guide the fine-tuning of the model. Based on the experiments of several image captionining tasks, the performance and domain adaptability were improved. Our work provided an innovative strategy to fuse multi-modal prompts with better performance and faster adaptability for visual classification models.

CVMay 25, 2022
Eye-gaze-guided Vision Transformer for Rectifying Shortcut Learning

Chong Ma, Lin Zhao, Yuzhong Chen et al.

Learning harmful shortcuts such as spurious correlations and biases prevents deep neural networks from learning the meaningful and useful representations, thus jeopardizing the generalizability and interpretability of the learned representation. The situation becomes even more serious in medical imaging, where the clinical data (e.g., MR images with pathology) are limited and scarce while the reliability, generalizability and transparency of the learned model are highly required. To address this problem, we propose to infuse human experts' intelligence and domain knowledge into the training of deep neural networks. The core idea is that we infuse the visual attention information from expert radiologists to proactively guide the deep model to focus on regions with potential pathology and avoid being trapped in learning harmful shortcuts. To do so, we propose a novel eye-gaze-guided vision transformer (EG-ViT) for diagnosis with limited medical image data. We mask the input image patches that are out of the radiologists' interest and add an additional residual connection in the last encoder layer of EG-ViT to maintain the correlations of all patches. The experiments on two public datasets of INbreast and SIIM-ACR demonstrate our EG-ViT model can effectively learn/transfer experts' domain knowledge and achieve much better performance than baselines. Meanwhile, it successfully rectifies the harmful shortcut learning and significantly improves the EG-ViT model's interpretability. In general, EG-ViT takes the advantages of both human expert's prior knowledge and the power of deep neural networks. This work opens new avenues for advancing current artificial intelligence paradigms by infusing human intelligence.

CLSep 27, 2024
Evaluation of OpenAI o1: Opportunities and Challenges of AGI

Tianyang Zhong, Zhengliang Liu, Yi Pan et al.

This comprehensive study evaluates the performance of OpenAI's o1-preview large language model across a diverse array of complex reasoning tasks, spanning multiple domains, including computer science, mathematics, natural sciences, medicine, linguistics, and social sciences. Through rigorous testing, o1-preview demonstrated remarkable capabilities, often achieving human-level or superior performance in areas ranging from coding challenges to scientific reasoning and from language processing to creative problem-solving. Key findings include: -83.3% success rate in solving complex competitive programming problems, surpassing many human experts. -Superior ability in generating coherent and accurate radiology reports, outperforming other evaluated models. -100% accuracy in high school-level mathematical reasoning tasks, providing detailed step-by-step solutions. -Advanced natural language inference capabilities across general and specialized domains like medicine. -Impressive performance in chip design tasks, outperforming specialized models in areas such as EDA script generation and bug analysis. -Remarkable proficiency in anthropology and geology, demonstrating deep understanding and reasoning in these specialized fields. -Strong capabilities in quantitative investing. O1 has comprehensive financial knowledge and statistical modeling skills. -Effective performance in social media analysis, including sentiment analysis and emotion recognition. The model excelled particularly in tasks requiring intricate reasoning and knowledge integration across various fields. While some limitations were observed, including occasional errors on simpler problems and challenges with certain highly specialized concepts, the overall results indicate significant progress towards artificial general intelligence.

CVJun 17, 2022
Rectify ViT Shortcut Learning by Visual Saliency

Chong Ma, Lin Zhao, Yuzhong Chen et al.

Shortcut learning is common but harmful to deep learning models, leading to degenerated feature representations and consequently jeopardizing the model's generalizability and interpretability. However, shortcut learning in the widely used Vision Transformer framework is largely unknown. Meanwhile, introducing domain-specific knowledge is a major approach to rectifying the shortcuts, which are predominated by background related factors. For example, in the medical imaging field, eye-gaze data from radiologists is an effective human visual prior knowledge that has the great potential to guide the deep learning models to focus on meaningful foreground regions of interest. However, obtaining eye-gaze data is time-consuming, labor-intensive and sometimes even not practical. In this work, we propose a novel and effective saliency-guided vision transformer (SGT) model to rectify shortcut learning in ViT with the absence of eye-gaze data. Specifically, a computational visual saliency model is adopted to predict saliency maps for input image samples. Then, the saliency maps are used to distil the most informative image patches. In the proposed SGT, the self-attention among image patches focus only on the distilled informative ones. Considering this distill operation may lead to global information lost, we further introduce, in the last encoder layer, a residual connection that captures the self-attention across all the image patches. The experiment results on four independent public datasets show that our SGT framework can effectively learn and leverage human prior knowledge without eye gaze data and achieves much better performance than baselines. Meanwhile, it successfully rectifies the harmful shortcut learning and significantly improves the interpretability of the ViT model, demonstrating the promise of transferring human prior knowledge derived visual saliency in rectifying shortcut learning

CVMay 20, 2022
Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning

Yuzhong Chen, Zhenxiang Xiao, Lin Zhao et al.

Learning with little data is challenging but often inevitable in various application scenarios where the labeled data is limited and costly. Recently, few-shot learning (FSL) gained increasing attention because of its generalizability of prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as vision transformer (ViT), current fine-tuning based FSL approaches are inefficient in knowledge generalization and thus degenerate the downstream task performances. In this paper, we propose a novel mask-guided vision transformer (MG-ViT) to achieve an effective and efficient FSL on ViT model. The key idea is to apply a mask on image patches to screen out the task-irrelevant ones and to guide the ViT to focus on task-relevant and discriminative patches during FSL. Particularly, MG-ViT only introduces an additional mask operation and a residual connection, enabling the inheritance of parameters from pre-trained ViT without any other cost. To optimally select representative few-shot samples, we also include an active learning based sample selection method to further improve the generalizability of MG-ViT based FSL. We evaluate the proposed MG-ViT on both Agri-ImageNet classification task and ACFR apple detection task with gradient-weighted class activation mapping (Grad-CAM) as the mask. The experimental results show that the MG-ViT model significantly improves the performance when compared with general fine-tuning based ViT models, providing novel insights and a concrete approach towards generalizing data-intensive and large-scale deep learning models for FSL.

CLMar 27, 2023
Coupling Artificial Neurons in BERT and Biological Neurons in the Human Brain

Xu Liu, Mengyue Zhou, Gaosheng Shi et al.

Linking computational natural language processing (NLP) models and neural responses to language in the human brain on the one hand facilitates the effort towards disentangling the neural representations underpinning language perception, on the other hand provides neurolinguistics evidence to evaluate and improve NLP models. Mappings of an NLP model's representations of and the brain activities evoked by linguistic input are typically deployed to reveal this symbiosis. However, two critical problems limit its advancement: 1) The model's representations (artificial neurons, ANs) rely on layer-level embeddings and thus lack fine-granularity; 2) The brain activities (biological neurons, BNs) are limited to neural recordings of isolated cortical unit (i.e., voxel/region) and thus lack integrations and interactions among brain functions. To address those problems, in this study, we 1) define ANs with fine-granularity in transformer-based NLP models (BERT in this study) and measure their temporal activations to input text sequences; 2) define BNs as functional brain networks (FBNs) extracted from functional magnetic resonance imaging (fMRI) data to capture functional interactions in the brain; 3) couple ANs and BNs by maximizing the synchronization of their temporal activations. Our experimental results demonstrate 1) The activations of ANs and BNs are significantly synchronized; 2) the ANs carry meaningful linguistic/semantic information and anchor to their BN signatures; 3) the anchored BNs are interpretable in a neurolinguistic context. Overall, our study introduces a novel, general, and effective framework to link transformer-based NLP models and neural activities in response to language and may provide novel insights for future studies such as brain-inspired evaluation and development of NLP models.

LGMar 27, 2023
Core-Periphery Principle Guided Redesign of Self-Attention in Transformers

Xiaowei Yu, Lu Zhang, Haixing Dai et al.

Designing more efficient, reliable, and explainable neural network architectures is critical to studies that are based on artificial intelligence (AI) techniques. Previous studies, by post-hoc analysis, have found that the best-performing ANNs surprisingly resemble biological neural networks (BNN), which indicates that ANNs and BNNs may share some common principles to achieve optimal performance in either machine learning or cognitive/behavior tasks. Inspired by this phenomenon, we proactively instill organizational principles of BNNs to guide the redesign of ANNs. We leverage the Core-Periphery (CP) organization, which is widely found in human brain networks, to guide the information communication mechanism in the self-attention of vision transformer (ViT) and name this novel framework as CP-ViT. In CP-ViT, the attention operation between nodes is defined by a sparse graph with a Core-Periphery structure (CP graph), where the core nodes are redesigned and reorganized to play an integrative role and serve as a center for other periphery nodes to exchange information. We evaluated the proposed CP-ViT on multiple public datasets, including medical image datasets (INbreast) and natural image datasets. Interestingly, by incorporating the BNN-derived principle (CP structure) into the redesign of ViT, our CP-ViT outperforms other state-of-the-art ANNs. In general, our work advances the state of the art in three aspects: 1) This work provides novel insights for brain-inspired AI: we can utilize the principles found in BNNs to guide and improve our ANN architecture design; 2) We show that there exist sweet spots of CP graphs that lead to CP-ViTs with significantly improved performance; and 3) The core nodes in CP-ViT correspond to task-related meaningful and important image patches, which can significantly enhance the interpretability of the trained deep model.

CLSep 10, 2023
Chat2Brain: A Method for Mapping Open-Ended Semantic Queries to Brain Activation Maps

Yaonai Wei, Tuo Zhang, Han Zhang et al.

Over decades, neuroscience has accumulated a wealth of research results in the text modality that can be used to explore cognitive processes. Meta-analysis is a typical method that successfully establishes a link from text queries to brain activation maps using these research results, but it still relies on an ideal query environment. In practical applications, text queries used for meta-analyses may encounter issues such as semantic redundancy and ambiguity, resulting in an inaccurate mapping to brain images. On the other hand, large language models (LLMs) like ChatGPT have shown great potential in tasks such as context understanding and reasoning, displaying a high degree of consistency with human natural language. Hence, LLMs could improve the connection between text modality and neuroscience, resolving existing challenges of meta-analyses. In this study, we propose a method called Chat2Brain that combines LLMs to basic text-2-image model, known as Text2Brain, to map open-ended semantic queries to brain activation maps in data-scarce and complex query environments. By utilizing the understanding and reasoning capabilities of LLMs, the performance of the mapping model is optimized by transferring text queries to semantic queries. We demonstrate that Chat2Brain can synthesize anatomically plausible neural activation patterns for more complex tasks of text queries.

NCApr 20, 2022
Disentangling Spatial-Temporal Functional Brain Networks via Twin-Transformers

Xiaowei Yu, Lu Zhang, Lin Zhao et al.

How to identify and characterize functional brain networks (BN) is fundamental to gain system-level insights into the mechanisms of brain organizational architecture. Current functional magnetic resonance (fMRI) analysis highly relies on prior knowledge of specific patterns in either spatial (e.g., resting-state network) or temporal (e.g., task stimulus) domain. In addition, most approaches aim to find group-wise common functional networks, individual-specific functional networks have been rarely studied. In this work, we propose a novel Twin-Transformers framework to simultaneously infer common and individual functional networks in both spatial and temporal space, in a self-supervised manner. The first transformer takes space-divided information as input and generates spatial features, while the second transformer takes time-related information as input and outputs temporal features. The spatial and temporal features are further separated into common and individual ones via interactions (weights sharing) and constraints between the two transformers. We applied our TwinTransformers to Human Connectome Project (HCP) motor task-fMRI dataset and identified multiple common brain networks, including both task-related and resting-state networks (e.g., default mode network). Interestingly, we also successfully recovered a set of individual-specific networks that are not related to task stimulus and only exist at the individual level.

IVNov 10, 2023
Holistic Evaluation of GPT-4V for Biomedical Imaging

Zhengliang Liu, Hanqi Jiang, Tianyang Zhong et al.

In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.

CVJun 22, 2022
Coupling Visual Semantics of Artificial Neural Networks and Human Brain Function via Synchronized Activations

Lin Zhao, Haixing Dai, Zihao Wu et al.

Artificial neural networks (ANNs), originally inspired by biological neural networks (BNNs), have achieved remarkable successes in many tasks such as visual representation learning. However, whether there exists semantic correlations/connections between the visual representations in ANNs and those in BNNs remains largely unexplored due to both the lack of an effective tool to link and couple two different domains, and the lack of a general and effective framework of representing the visual semantics in BNNs such as human functional brain networks (FBNs). To answer this question, we propose a novel computational framework, Synchronized Activations (Sync-ACT), to couple the visual representation spaces and semantics between ANNs and BNNs in human brain based on naturalistic functional magnetic resonance imaging (nfMRI) data. With this approach, we are able to semantically annotate the neurons in ANNs with biologically meaningful description derived from human brain imaging for the first time. We evaluated the Sync-ACT framework on two publicly available movie-watching nfMRI datasets. The experiments demonstrate a) the significant correlation and similarity of the semantics between the visual representations in FBNs and those in a variety of convolutional neural networks (CNNs) models; b) the close relationship between CNN's visual representation similarity to BNNs and its performance in image classification tasks. Overall, our study introduces a general and effective paradigm to couple the ANNs and BNNs and provides novel insights for future studies such as brain-inspired artificial intelligence.

CVJul 10, 2023
Hierarchical Semantic Tree Concept Whitening for Interpretable Image Classification

Haixing Dai, Lu Zhang, Lin Zhao et al.

With the popularity of deep neural networks (DNNs), model interpretability is becoming a critical concern. Many approaches have been developed to tackle the problem through post-hoc analysis, such as explaining how predictions are made or understanding the meaning of neurons in middle layers. Nevertheless, these methods can only discover the patterns or rules that naturally exist in models. In this work, rather than relying on post-hoc schemes, we proactively instill knowledge to alter the representation of human-understandable concepts in hidden layers. Specifically, we use a hierarchical tree of semantic concepts to store the knowledge, which is leveraged to regularize the representations of image data instances while training deep models. The axes of the latent space are aligned with the semantic concepts, where the hierarchical relations between concepts are also preserved. Experiments on real-world image datasets show that our method improves model interpretability, showing better disentanglement of semantic concepts, without negatively affecting model classification performance.

NEMay 20, 2022
A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers

Yuzhong Chen, Yu Du, Zhenxiang Xiao et al.

Vision transformer (ViT) and its variants have achieved remarkable successes in various visual tasks. The key characteristic of these ViT models is to adopt different aggregation strategies of spatial patch information within the artificial neural networks (ANNs). However, there is still a key lack of unified representation of different ViT architectures for systematic understanding and assessment of model representation performance. Moreover, how those well-performing ViT ANNs are similar to real biological neural networks (BNNs) is largely unexplored. To answer these fundamental questions, we, for the first time, propose a unified and biologically-plausible relational graph representation of ViT models. Specifically, the proposed relational graph representation consists of two key sub-graphs: aggregation graph and affine graph. The former one considers ViT tokens as nodes and describes their spatial interaction, while the latter one regards network channels as nodes and reflects the information communication between channels. Using this unified relational graph representation, we found that: a) a sweet spot of the aggregation graph leads to ViTs with significantly improved predictive performance; b) the graph measures of clustering coefficient and average path length are two effective indicators of model prediction performance, especially when applying on the datasets with small samples; c) our findings are consistent across various ViT architectures and multiple datasets; d) the proposed relational graph representation of ViT has high similarity with real BNNs derived from brain science data. Overall, our work provides a novel unified and biologically-plausible paradigm for more interpretable and effective representation of ViT ANNs.

CVJan 12, 2023Code
Towards High Performance One-Stage Human Pose Estimation

Ling Li, Lin Zhao, Linhao Xu et al.

Making top-down human pose estimation method present both good performance and high efficiency is appealing. Mask RCNN can largely improve the efficiency by conducting person detection and pose estimation in a single framework, as the features provided by the backbone are able to be shared by the two tasks. However, the performance is not as good as traditional two-stage methods. In this paper, we aim to largely advance the human pose estimation results of Mask-RCNN and still keep the efficiency. Specifically, we make improvements on the whole process of pose estimation, which contains feature extraction and keypoint detection. The part of feature extraction is ensured to get enough and valuable information of pose. Then, we introduce a Global Context Module into the keypoints detection branch to enlarge the receptive field, as it is crucial to successful human pose estimation. On the COCO val2017 set, our model using the ResNet-50 backbone achieves an AP of 68.1, which is 2.6 higher than Mask RCNN (AP of 65.5). Compared to the classic two-stage top-down method SimpleBaseline, our model largely narrows the performance gap (68.1 AP vs. 68.9 AP) with a much faster inference speed (77 ms vs. 168 ms), demonstrating the effectiveness of the proposed method. Code is available at: https://github.com/lingl_space/maskrcnn_keypoint_refined.

CVMay 28
OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

Lin Zhao, Yushu Wu, Yifan Gong et al.

Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% and preserves strong consistency over strong baselines, while maintaining comparable memory usage.

LGJul 3, 2023
Identification of Causal Relationship between Amyloid-beta Accumulation and Alzheimer's Disease Progression via Counterfactual Inference

Haixing Dai, Mengxuan Hu, Qing Li et al.

Alzheimer's disease (AD) is a neurodegenerative disorder that is beginning with amyloidosis, followed by neuronal loss and deterioration in structure, function, and cognition. The accumulation of amyloid-beta in the brain, measured through 18F-florbetapir (AV45) positron emission tomography (PET) imaging, has been widely used for early diagnosis of AD. However, the relationship between amyloid-beta accumulation and AD pathophysiology remains unclear, and causal inference approaches are needed to uncover how amyloid-beta levels can impact AD development. In this paper, we propose a graph varying coefficient neural network (GVCNet) for estimating the individual treatment effect with continuous treatment levels using a graph convolutional neural network. We highlight the potential of causal inference approaches, including GVCNet, for measuring the regional causal connections between amyloid-beta accumulation and AD pathophysiology, which may serve as a robust tool for early diagnosis and tailored care.

CVOct 27, 2022
BI AVAN: Brain inspired Adversarial Visual Attention Network

Heng Huang, Lin Zhao, Xintao Hu et al.

Visual attention is a fundamental mechanism in the human brain, and it inspires the design of attention mechanisms in deep neural networks. However, most of the visual attention studies adopted eye-tracking data rather than the direct measurement of brain activity to characterize human visual attention. In addition, the adversarial relationship between the attention-related objects and attention-neglected background in the human visual system was not fully exploited. To bridge these gaps, we propose a novel brain-inspired adversarial visual attention network (BI-AVAN) to characterize human visual attention directly from functional brain activity. Our BI-AVAN model imitates the biased competition process between attention-related/neglected objects to identify and locate the visual objects in a movie frame the human brain focuses on in an unsupervised manner. We use independent eye-tracking data as ground truth for validation and experimental results show that our model achieves robust and promising results when inferring meaningful human visual attention and mapping the relationship between brain activities and visual stimuli. Our BI-AVAN model contributes to the emerging field of leveraging the brain's functional architecture to inspire and guide the model design in artificial intelligence (AI), e.g., deep neural networks.

LGAug 18, 2022
Global Convergence of Two-timescale Actor-Critic for Solving Linear Quadratic Regulator

Xuyang Chen, Jingliang Duan, Yingbin Liang et al.

The actor-critic (AC) reinforcement learning algorithms have been the powerhouse behind many challenging applications. Nevertheless, its convergence is fragile in general. To study its instability, existing works mostly consider the uncommon double-loop variant or basic models with finite state and action space. We investigate the more practical single-sample two-timescale AC for solving the canonical linear quadratic regulator (LQR) problem, where the actor and the critic update only once with a single sample in each iteration on an unbounded continuous state and action space. Existing analysis cannot conclude the convergence for such a challenging case. We develop a new analysis framework that allows establishing the global convergence to an $ε$-optimal solution with at most an $\mathcal{O}(ε^{-2.5})$ sample complexity. To our knowledge, this is the first finite-time convergence analysis for the single sample two-timescale AC for solving LQR with global optimality. The sample complexity improves those of other variants by orders, which sheds light on the practical wisdom of single sample algorithms. We also further validate our theoretical findings via comprehensive simulation comparisons.

CVAug 16, 2024
Retrieval-augmented Few-shot Medical Image Segmentation with Foundation Models

Lin Zhao, Xiao Chen, Eric Z. Chen et al.

Medical image segmentation is crucial for clinical decision-making, but the scarcity of annotated data presents significant challenges. Few-shot segmentation (FSS) methods show promise but often require training on the target domain and struggle to generalize across different modalities. Similarly, adapting foundation models like the Segment Anything Model (SAM) for medical imaging has limitations, including the need for finetuning and domain-specific adaptation. To address these issues, we propose a novel method that adapts DINOv2 and Segment Anything Model 2 (SAM 2) for retrieval-augmented few-shot medical image segmentation. Our approach uses DINOv2's feature as query to retrieve similar samples from limited annotated data, which are then encoded as memories and stored in memory bank. With the memory attention mechanism of SAM 2, the model leverages these memories as conditions to generate accurate segmentation of the target image. We evaluated our framework on three medical image segmentation tasks, demonstrating superior performance and generalizability across various modalities without the need for any retraining or finetuning. Overall, this method offers a practical and effective solution for few-shot medical image segmentation and holds significant potential as a valuable annotation tool in clinical applications.

NEMar 27, 2023
CP-CNN: Core-Periphery Principle Guided Convolutional Neural Network

Lin Zhao, Haixing Dai, Zihao Wu et al.

The evolution of convolutional neural networks (CNNs) can be largely attributed to the design of its architecture, i.e., the network wiring pattern. Neural architecture search (NAS) advances this by automating the search for the optimal network architecture, but the resulting network instance may not generalize well in different tasks. To overcome this, exploring network design principles that are generalizable across tasks is a more practical solution. In this study, We explore a novel brain-inspired design principle based on the core-periphery property of the human brain network to guide the design of CNNs. Our work draws inspiration from recent studies suggesting that artificial and biological neural networks may have common principles in optimizing network architecture. We implement the core-periphery principle in the design of network wiring patterns and the sparsification of the convolution operation. The resulting core-periphery principle guided CNNs (CP-CNNs) are evaluated on three different datasets. The experiments demonstrate the effectiveness and superiority compared to CNNs and ViT-based methods. Overall, our work contributes to the growing field of brain-inspired AI by incorporating insights from the human brain into the design of neural networks.

ROJun 21, 2022
Neural Moving Horizon Estimation for Robust Flight Control

Bingheng Wang, Zhengtian Ma, Shupeng Lai et al.

Estimating and reacting to disturbances is crucial for robust flight control of quadrotors. Existing estimators typically require significant tuning for a specific flight scenario or training with extensive ground-truth disturbance data to achieve satisfactory performance. In this paper, we propose a neural moving horizon estimator (NeuroMHE) that can automatically tune its key parameters modeled by a neural network and adapt to different flight scenarios. We achieve this by deriving the analytical gradients of the MHE estimates with respect to the MHE weighting matrices, which enables a seamless embedding of the MHE as a learnable layer into the neural network for highly effective learning. Interestingly, we show that the gradients can be computed efficiently using a Kalman filter in a recursive form. Moreover, we develop a model-based policy gradient algorithm to train NeuroMHE directly from the quadrotor trajectory tracking error without needing the ground-truth disturbance data. The effectiveness of NeuroMHE is verified extensively via both simulations and physical experiments on quadrotors in various challenging flights. Notably, NeuroMHE outperforms a state-of-the-art neural network-based estimator, reducing force estimation errors by up to 76.7%, while using a portable neural network that has only 7.7% of the learnable parameters of the latter. The proposed method is general and can be applied to robust adaptive control of other robotic systems.

CVSep 5, 2022
Forensicability Assessment of Questioned Images in Recapturing Detection

Changsheng Chen, Lin Zhao, Rizhao Cai et al.

Recapture detection of face and document images is an important forensic task. With deep learning, the performances of face anti-spoofing (FAS) and recaptured document detection have been improved significantly. However, the performances are not yet satisfactory on samples with weak forensic cues. The amount of forensic cues can be quantified to allow a reliable forensic result. In this work, we propose a forensicability assessment network to quantify the forensicability of the questioned samples. The low-forensicability samples are rejected before the actual recapturing detection process to improve the efficiency of recapturing detection systems. We first extract forensicability features related to both image quality assessment and forensic tasks. By exploiting domain knowledge of the forensic application in image quality and forensic features, we define three task-specific forensicability classes and the initialized locations in the feature space. Based on the extracted features and the defined centers, we train the proposed forensic assessment network (FANet) with cross-entropy loss and update the centers with a momentum-based update method. We integrate the trained FANet with practical recapturing detection schemes in face anti-spoofing and recaptured document detection tasks. Experimental results show that, for a generic CNN-based FAS scheme, FANet reduces the EERs from 33.75% to 19.23% under ROSE to IDIAP protocol by rejecting samples with the lowest 30% forensicability scores. The performance of FAS schemes is poor in the rejected samples, with EER as high as 56.48%. Similar performances in rejecting low-forensicability samples have been observed for the state-of-the-art approaches in FAS and recaptured document detection tasks. To the best of our knowledge, this is the first work that assesses the forensicability of recaptured document images and improves the system efficiency.

IVOct 22, 2022
JoJoNet: Joint-contrast and Joint-sampling-and-reconstruction Network for Multi-contrast MRI

Lin Zhao, Xiao Chen, Eric Z. Chen et al.

Multi-contrast Magnetic Resonance Imaging (MRI) generates multiple medical images with rich and complementary information for routine clinical use; however, it suffers from a long acquisition time. Recent works for accelerating MRI, mainly designed for single contrast, may not be optimal for multi-contrast scenario since the inherent correlations among the multi-contrast images are not exploited. In addition, independent reconstruction of each contrast usually does not translate to optimal performance of downstream tasks. Motivated by these aspects, in this paper we design an end-to-end framework for accelerating multi-contrast MRI which simultaneously optimizes the entire MR imaging workflow including sampling, reconstruction and downstream tasks to achieve the best overall outcomes. The proposed framework consists of a sampling mask generator for each image contrast and a reconstructor exploiting the inter-contrast correlations with a recurrent structure which enables the information sharing in a holistic way. The sampling mask generator and the reconstructor are trained jointly across the multiple image contrasts. The acceleration ratio of each image contrast is also learnable and can be driven by a downstream task performance. We validate our approach on a multi-contrast brain dataset and a multi-contrast knee dataset. Experiments show that (1) our framework consistently outperforms the baselines designed for single contrast on both datasets; (2) our newly designed recurrent reconstruction network effectively improves the reconstruction quality for multi-contrast images; (3) the learnable acceleration ratio improves the downstream task performance significantly. Overall, this work has potentials to open up new avenues for optimizing the entire multi-contrast MR imaging workflow.

ROMar 29, 2023
Tightly-coupled Visual-DVL-Inertial Odometry for Robot-based Ice-water Boundary Exploration

Lin Zhao, Mingxi Zhou, Brice Loose

Robotic underwater systems, e.g., Autonomous Underwater Vehicles (AUVs) and Remotely Operated Vehicles (ROVs), are promising tools for collecting biogeochemical data at the ice-water interface for scientific advancements. However, state estimation, i.e., localization, is a well-known problem for robotic systems, especially, for the ones that travel underwater. In this paper, we present a tightly-coupled multi-sensors fusion framework to increase localization accuracy that is robust to sensor failure. Visual images, Doppler Velocity Log (DVL), Inertial Measurement Unit (IMU) and Pressure sensor are integrated into the state-of-art Multi-State Constraint Kalman Filter (MSCKF) for state estimation. Besides that a new keyframe-based state clone mechanism and a new DVL-aided feature enhancement are presented to further improve the localization performance. The proposed method is validated with a data set collected in the field under frozen ice, and the result is compared with 6 other different sensor fusion setups. Overall, the result with the keyframe enabled and DVL-aided feature enhancement yields the best performance with a Root-mean-square error of less than 2 m compared to the ground truth path with a total traveling distance of about 200 m.

IVJun 6, 2022
Invertible Sharpening Network for MRI Reconstruction Enhancement

Siyuan Dong, Eric Z. Chen, Lin Zhao et al.

High-quality MRI reconstruction plays a critical role in clinical applications. Deep learning-based methods have achieved promising results on MRI reconstruction. However, most state-of-the-art methods were designed to optimize the evaluation metrics commonly used for natural images, such as PSNR and SSIM, whereas the visual quality is not primarily pursued. Compared to the fully-sampled images, the reconstructed images are often blurry, where high-frequency features might not be sharp enough for confident clinical diagnosis. To this end, we propose an invertible sharpening network (InvSharpNet) to improve the visual quality of MRI reconstructions. During training, unlike the traditional methods that learn to map the input data to the ground truth, InvSharpNet adapts a backward training strategy that learns a blurring transform from the ground truth (fully-sampled image) to the input data (blurry reconstruction). During inference, the learned blurring transform can be inverted to a sharpening transform leveraging the network's invertibility. The experiments on various MRI datasets demonstrate that InvSharpNet can improve reconstruction sharpness with few artifacts. The results were also evaluated by radiologists, indicating better visual quality and diagnostic confidence of our proposed method.

OCOct 29, 2023
Optimization Landscape of Policy Gradient Methods for Discrete-time Static Output Feedback

Jingliang Duan, Jie Li, Xuyang Chen et al.

In recent times, significant advancements have been made in delving into the optimization landscape of policy gradient methods for achieving optimal control in linear time-invariant (LTI) systems. Compared with state-feedback control, output-feedback control is more prevalent since the underlying state of the system may not be fully observed in many practical settings. This paper analyzes the optimization landscape inherent to policy gradient methods when applied to static output feedback (SOF) control in discrete-time LTI systems subject to quadratic cost. We begin by establishing crucial properties of the SOF cost, encompassing coercivity, L-smoothness, and M-Lipschitz continuous Hessian. Despite the absence of convexity, we leverage these properties to derive novel findings regarding convergence (and nearly dimension-free rate) to stationary points for three policy gradient methods, including the vanilla policy gradient method, the natural policy gradient method, and the Gauss-Newton method. Moreover, we provide proof that the vanilla policy gradient method exhibits linear convergence towards local minima when initialized near such minima. The paper concludes by presenting numerical examples that validate our theoretical findings. These results not only characterize the performance of gradient descent for optimizing the SOF problem but also provide insights into the effectiveness of general policy gradient methods within the realm of reinforcement learning.

NCApr 10
Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems

Sohan Shankar, Yi Pan, Hanqi Jiang et al.

This position and survey paper identifies the emerging convergence of neuroscience, artificial general intelligence (AGI), and neuromorphic computing toward a unified research paradigm. Using a framework grounded in brain physiology, we highlight how synaptic plasticity, sparse spike-based communication, and multimodal association provide design principles for next-generation AGI systems that potentially combine both human and machine intelligences. The review traces this evolution from early connectionist models to state-of-the-art large language models, demonstrating how key innovations like transformer attention, foundation-model pre-training, and multi-agent architectures mirror neurobiological processes like cortical mechanisms, working memory, and episodic consolidation. We then discuss emerging physical substrates capable of breaking the von Neumann bottleneck to achieve brain-scale efficiency in silicon: memristive crossbars, in-memory compute arrays, and emerging quantum and photonic devices. There are four critical challenges at this intersection: 1) integrating spiking dynamics with foundation models, 2) maintaining lifelong plasticity without catastrophic forgetting, 3) unifying language with sensorimotor learning in embodied agents, and 4) enforcing ethical safeguards in advanced neuromorphic autonomous systems. This combined perspective across neuroscience, computation, and hardware offers an integrative agenda for in each of these fields.

CVApr 18
IMA-MoE: An Interpretable Modality-Aware Mixture-of-Experts Framework for Characterizing the Neurobiological Signatures of Binge Eating Disorder

Lin Zhao, Qiaohui Gao, Elizabeth Martin et al.

Binge eating disorder (BED) is the most prevalent eating disorder. However, current diagnostic frameworks remain largely grounded in symptom-based criteria rather than underlying biological mechanisms, thereby limiting early detection and the development of biologically-informed interventions. Emerging studies have begun to investigate the neurobiological signatures of BED, yet their findings are often difficult to generalize due to the reliance on hypothesis-driven parametric models, single-modality analyses, and limited data diversity. Therefore, there is a critical need for advanced data-driven frameworks capable of modeling multimodal data to uncover generalizable and biologically meaningful signatures of BED. In this study, we propose the Interpretable Modality-Aware Mixture-of-Experts (IMA-MoE), a novel architecture designed to integrate heterogeneous neuroimaging, behavioral, hormonal, and demographic measures within a unified predictive framework. By encoding each measure as a distinct token, IMA-MoE enables flexible modeling of cross-modal dependencies while preserving modality-specific characteristics. We further introduce a token-importance mechanism to enhance interpretability by quantifying the contribution of each measure to model predictions. Evaluated on the large-scale Adolescent Brain Cognitive Development (ABCD) dataset, IMA-MoE demonstrates superior performance in differentiating BED from healthy controls compared with baseline methods, while revealing sex-specific predictive patterns, with hormonal measures contributing more prominently to prediction in females. Collectively, these findings highlight the promise of interpretable, data-driven multimodal modeling in advancing biologically-informed characterization of BED and facilitating more precise and personalized interventions in neuropsychiatric disorders.

CVNov 28, 2023
THInImg: Cross-modal Steganography for Presenting Talking Heads in Images

Lin Zhao, Hongxuan Li, Xuefei Ning et al.

Cross-modal Steganography is the practice of concealing secret signals in publicly available cover signals (distinct from the modality of the secret signals) unobtrusively. While previous approaches primarily concentrated on concealing a relatively small amount of information, we propose THInImg, which manages to hide lengthy audio data (and subsequently decode talking head video) inside an identity image by leveraging the properties of human face, which can be effectively utilized for covert communication, transmission and copyright protection. THInImg consists of two parts: the encoder and decoder. Inside the encoder-decoder pipeline, we introduce a novel architecture that substantially increase the capacity of hiding audio in images. Moreover, our framework can be extended to iteratively hide multiple audio clips into an identity image, offering multiple levels of control over permissions. We conduct extensive experiments to prove the effectiveness of our method, demonstrating that THInImg can present up to 80 seconds of high quality talking-head video (including audio) in an identity image with 160x160 resolution.

LGSep 12, 2022
On the Optimization Landscape of Dynamic Output Feedback: A Case Study for Linear Quadratic Regulator

Jingliang Duan, Wenhan Cao, Yang Zheng et al.

The convergence of policy gradient algorithms in reinforcement learning hinges on the optimization landscape of the underlying optimal control problem. Theoretical insights into these algorithms can often be acquired from analyzing those of linear quadratic control. However, most of the existing literature only considers the optimization landscape for static full-state or output feedback policies (controllers). We investigate the more challenging case of dynamic output-feedback policies for linear quadratic regulation (abbreviated as dLQR), which is prevalent in practice but has a rather complicated optimization landscape. We first show how the dLQR cost varies with the coordinate transformation of the dynamic controller and then derive the optimal transformation for a given observable stabilizing controller. At the core of our results is the uniqueness of the stationary point of dLQR when it is observable, which is in a concise form of an observer-based controller with the optimal similarity transformation. These results shed light on designing efficient algorithms for general decision-making problems with partially observed information.

ROApr 22
JoyAI-RA 0.1: A Foundation Model for Robotic Autonomy

Tianle Zhang, Zhihao Yuan, Dafeng Chi et al.

Robotic autonomy in open-world environments is fundamentally limited by insufficient data diversity and poor cross-embodiment generalization. Existing robotic datasets are often limited in scale and task coverage, while relatively large differences across robot embodiments impede effective behavior knowledge transfer. To address these challenges, we propose JoyAI-RA, a vision-language-action (VLA) embodied foundation model tailored for generalizable robotic manipulation. JoyAI-RA presents a multi-source multi-level pretraining framework that integrates web data, large-scale egocentric human manipulation videos, simulation-generated trajectories, and real-robot data. Through training on heterogeneous multi-source data with explicit action-space unification, JoyAI-RA effectively bridges embodiment gaps, particularly between human manipulation and robotic control, thereby enhancing cross-embodiment behavior learning. JoyAI-RA outperforms state-of-the-art methods in both simulation and real-world benchmarks, especially on diverse tasks with generalization demands.

CLJul 19, 2023
PharmacyGPT: The AI Pharmacist

Zhengliang Liu, Zihao Wu, Mengxuan Hu et al.

In this study, we introduce PharmacyGPT, a novel framework to assess the capabilities of large language models (LLMs) such as ChatGPT and GPT-4 in emulating the role of clinical pharmacists. Our methodology encompasses the utilization of LLMs to generate comprehensible patient clusters, formulate medication plans, and forecast patient outcomes. We conduct our investigation using real data acquired from the intensive care unit (ICU) at the University of North Carolina Chapel Hill (UNC) Hospital. Our analysis offers valuable insights into the potential applications and limitations of LLMs in the field of clinical pharmacy, with implications for both patient care and the development of future AI-driven healthcare solutions. By evaluating the performance of PharmacyGPT, we aim to contribute to the ongoing discourse surrounding the integration of artificial intelligence in healthcare settings, ultimately promoting the responsible and efficacious use of such technologies.

SYApr 12
On the Optimization Landscape of Observer-based Dynamic Linear Quadratic Control

Jingliang Duan, Jie Li, Yinsong Ma et al.

Understanding the optimization landscape of linear quadratic regulation (LQR) problems is fundamental to the design of efficient reinforcement learning solutions. Recent work has made significant progress in characterizing the landscape of static output-feedback control and linear quadratic Gaussian (LQG) control. For LQG, much of the analysis leverages the separation principle, which allows the controller and estimator to be designed independently. However, this simplification breaks down when the gradients with respect to the estimator and controller parameters are inherently coupled, leading to a more intricate analysis. This paper investigates the optimization landscape of observer-based dynamic output-feedback control of LQR problems. We derive the optimal observer-controller pair in settings where transient quadratic performance cannot be neglected. Our analysis reveals that, in general, the combination of the standard LQR controller and the observer that minimizes the trace of the accumulated estimation error covariance does not correspond to a stationary point of the overall closed-loop performance objective. Moreover, we derive a pair of discrete-time Sylvester equations with symmetric structure, both involving the same set of matrix elements, that characterize the stationary point of the observer-based dynamic LQR problem. These equations offer analytical insight into the structure of the optimality conditions and provide a foundation for developing numerical policy gradient methods aimed at learning complex controllers that rely on reconstructed state information.

LGOct 18, 2022
Finite-time analysis of single-timescale actor-critic

Xuyang Chen, Lin Zhao

Actor-critic methods have achieved significant success in many challenging applications. However, its finite-time convergence is still poorly understood in the most practical single-timescale form. Existing works on analyzing single-timescale actor-critic have been limited to i.i.d. sampling or tabular setting for simplicity. We investigate the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic assumes linear function approximation and updates with a single Markovian sample per actor step. Previous analysis has been unable to establish the convergence for such a challenging scenario. We demonstrate that the online single-timescale actor-critic method provably finds an $ε$-approximate stationary point with $\widetilde{\mathcal{O}}(ε^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(ε^{-2})$ under the i.i.d. sampling. Our novel framework systematically evaluates and controls the error propagation between the actor and critic. It offers a promising approach for analyzing other single-timescale reinforcement learning algorithms as well.

ROMar 21Code
AERO-MPPI: Anchor-Guided Ensemble Trajectory Optimization for Agile Mapless Drone Navigation

Xin Chen, Rui Huang, Longbin Tang et al.

Agile mapless navigation in cluttered 3D environments poses significant challenges for autonomous drones. Conventional mapping-planning-control pipelines incur high computational cost and propagate estimation errors. We present AERO-MPPI, a fully GPU-accelerated framework that unifies perception and planning through an anchor-guided ensemble of Model Predictive Path Integral (MPPI) optimizers. Specifically, we design a multi-resolution LiDAR point-cloud representation that rapidly extracts spatially distributed "anchors" as look-ahead intermediate endpoints, from which we construct polynomial trajectory guides to explore distinct homotopy path classes. At each planning step, we run multiple MPPI instances in parallel and evaluate them with a two-stage multi-objective cost that balances collision avoidance and goal reaching. Implemented entirely with NVIDIA Warp GPU kernels, AERO-MPPI achieves real-time onboard operation and mitigates the local-minima failures of single-MPPI approaches. Extensive simulations in forests, verticals, and inclines demonstrate sustained reliable flight above 7 m/s, with success rates above 80% and smoother trajectories compared to state-of-the-art baselines. Real-world experiments on a LiDAR-equipped quadrotor with NVIDIA Jetson Orin NX 16G confirm that AERO-MPPI runs in real time onboard and consistently achieves safe, agile, and robust flight in complex cluttered environments. Code is available at https://github.com/XinChen-stars/AERO_MPPI.