Zhihong Chen

CL
h-index29
46papers
6,968citations
Novelty47%
AI Score55

46 Papers

CVJul 21, 2023Code
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation

Zunnan Xu, Zhihong Chen, Yong Zhang et al. · tsinghua

Parameter Efficient Tuning (PET) has gained attention for reducing the number of parameters while maintaining performance and providing better hardware resource savings, but few studies investigate dense prediction tasks and interaction between modalities. In this paper, we do an investigation of efficient tuning problems on referring image segmentation. We propose a novel adapter called Bridger to facilitate cross-modal information exchange and inject task-specific information into the pre-trained model. We also design a lightweight decoder for image segmentation. Our approach achieves comparable or superior performance with only 1.61\% to 3.38\% backbone parameter updates, evaluated on challenging benchmarks. The code is available at \url{https://github.com/kkakkkka/ETRIS}.

CVSep 15, 2022Code
Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training

Zhihong Chen, Yuhao Du, Jinpeng Hu et al.

Medical vision-and-language pre-training provides a feasible solution to extract effective vision-and-language representations from medical images and texts. However, few studies have been dedicated to this field to facilitate medical vision-and-language understanding. In this paper, we propose a self-supervised learning paradigm with multi-modal masked autoencoders (M$^3$AE), which learn cross-modal domain knowledge by reconstructing missing pixels and tokens from randomly masked images and texts. There are three key designs to make this simple approach work. First, considering the different information densities of vision and language, we adopt different masking ratios for the input image and text, where a considerably larger masking ratio is used for images. Second, we use visual and textual features from different layers to perform the reconstruction to deal with different levels of abstraction in visual and language. Third, we develop different designs for vision and language decoders (i.e., a Transformer for vision and a multi-layer perceptron for language). To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results demonstrate the effectiveness of our approach, where state-of-the-art results are achieved on all downstream tasks. Besides, we conduct further analysis to better verify the effectiveness of different components of our approach and various settings of pre-training. The source code is available at~\url{https://github.com/zhjohnchan/M3AE}.

CLSep 21, 2023Code
AceGPT, Localizing Large Language Models in Arabic

Huang Huang, Fei Yu, Jianqing Zhu et al.

This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed `AceGPT', sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.

CLApr 20, 2023Code
Phoenix: Democratizing ChatGPT across Languages

Zhihong Chen, Feng Jiang, Junying Chen et al.

This paper presents our efforts to democratize ChatGPT across language. We release a large language model "Phoenix", achieving competitive performance among open-source English and Chinese models while excelling in languages with limited resources (covering both Latin and non-Latin languages). We believe this work will be beneficial to make ChatGPT more accessible, especially in countries where people cannot use ChatGPT due to restrictions from OpenAI or local goverments. Our data, code, and models are available at https://github.com/FreedomIntelligence/LLMZoo.

CLAug 17, 2023Code
CMB: A Comprehensive Medical Benchmark in Chinese

Xidong Wang, Guiming Hardy Chen, Dingjie Song et al.

Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in \textit{contextual incongruities} to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provide first-hand experience in existing LLMs for medicine and also facilitate the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.

CVJul 21, 2023Code
Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Zhihong Chen, Ruifei Zhang, Yibing Song et al.

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study~\cite{luo2022goes}, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of \underline{S}cene \underline{K}nowledge-guided \underline{V}isual \underline{G}rounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at \url{https://github.com/zhjohnchan/SK-VG}.

CLSep 25, 2024
Overview of the First Shared Task on Clinical Text Generation: RRG24 and "Discharge Me!"

Justin Xu, Zhihong Chen, Andrew Johnston et al. · oxford, stanford

Recent developments in natural language generation have tremendous implications for healthcare. For instance, state-of-the-art systems could automate the generation of sections in clinical reports to alleviate physician workload and streamline hospital documentation. To explore these applications, we present a shared task consisting of two subtasks: (1) Radiology Report Generation (RRG24) and (2) Discharge Summary Generation ("Discharge Me!"). RRG24 involves generating the 'Findings' and 'Impression' sections of radiology reports given chest X-rays. "Discharge Me!" involves generating the 'Brief Hospital Course' and 'Discharge Instructions' sections of discharge summaries for patients admitted through the emergency department. "Discharge Me!" submissions were subsequently reviewed by a team of clinicians. Both tasks emphasize the goal of reducing clinician burnout and repetitive workloads by generating documentation. We received 201 submissions from across 8 teams for RRG24, and 211 submissions from across 16 teams for "Discharge Me!".

97.3CVApr 24Code
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging

Ashwin Kumar, Robbie Holland, Corey Barrett et al.

Recent medical multimodal foundation models are built as multimodal LLMs (MLLMs) by connecting a CLIP-pretrained vision encoder to an LLM using LLaVA-style finetuning. This two-stage, decoupled approach introduces a projection layer that can distort visual features. This is especially concerning in medical imaging where subtle cues are essential for accurate diagnoses. In contrast, early-fusion generative approaches such as Chameleon eliminate the projection bottleneck by processing image and text tokens within a single unified sequence, enabling joint representation learning that leverages the inductive priors of language models. We present CheXmix, a unified early-fusion generative model trained on a large corpus of chest X-rays paired with radiology reports. We expand on Chameleon's autoregressive framework by introducing a two-stage multimodal generative pretraining strategy that combines the representational strengths of masked autoencoders with MLLMs. The resulting models are highly flexible, supporting both discriminative and generative tasks at both coarse and fine-grained scales. Our approach outperforms well-established generative models across all masking ratios by 6.0% and surpasses CheXagent by 8.6% on AUROC at high image masking ratios on the CheXpert classification task. We further inpaint images over 51.0% better than text-only generative models and outperform CheXagent by 45% on the GREEN metric for radiology report generation. These results demonstrate that CheXmix captures fine-grained information across a broad spectrum of chest X-ray tasks. Our code is at: https://github.com/StanfordMIMI/CheXmix.

CLApr 28, 2022
Cross-modal Memory Networks for Radiology Report Generation

Zhihong Chen, Yaling Shen, Yan Song et al.

Medical imaging plays a significant role in clinical practice of medical diagnosis, where the text reports of the images are essential in understanding them and facilitating later treatments. By generating the reports automatically, it is beneficial to help lighten the burden of radiologists and significantly promote clinical automation, which already attracts much attention in applying artificial intelligence to medical domain. Previous studies mainly follow the encoder-decoder paradigm and focus on the aspect of text generation, with few studies considering the importance of cross-modal mappings and explicitly exploit such mappings to facilitate radiology report generation. In this paper, we propose a cross-modal memory networks (CMN) to enhance the encoder-decoder framework for radiology report generation, where a shared memory is designed to record the alignment between images and texts so as to facilitate the interaction and generation across modalities. Experimental results illustrate the effectiveness of our proposed model, where state-of-the-art performance is achieved on two widely used benchmark datasets, i.e., IU X-Ray and MIMIC-CXR. Further analyses also prove that our model is able to better align information from radiology images and texts so as to help generating more accurate reports in terms of clinical indicators.

CLApr 1, 2022
Graph Enhanced Contrastive Learning for Radiology Findings Summarization

Jinpeng Hu, Zhuo Li, Zhihong Chen et al.

The impression section of a radiology report summarizes the most prominent observation from the findings section and is the most important section for radiologists to communicate to physicians. Summarizing findings is time-consuming and can be prone to error for inexperienced radiologists, and thus automatic impression generation has attracted substantial attention. With the encoder-decoder framework, most previous studies explore incorporating extra knowledge (e.g., static pre-defined clinical ontologies or extra background information). Yet, they encode such knowledge by a separate encoder to treat it as an extra input to their models, which is limited in leveraging their relations with the original findings. To address the limitation, we propose a unified framework for exploiting both extra knowledge and the original findings in an integrated way so that the critical information (i.e., key words and their relations) can be extracted in an appropriate way to facilitate impression generation. In detail, for each input findings, it is encoded by a text encoder, and a graph is constructed through its entities and dependency tree. Then, a graph encoder (e.g., graph neural networks (GNNs)) is adopted to model relation information in the constructed graph. Finally, to emphasize the key words in the findings, contrastive learning is introduced to map positive samples (constructed by masking non-key words) closer and push apart negative ones (constructed by masking key words). The experimental results on OpenI and MIMIC-CXR confirm the effectiveness of our proposed method.

CLSep 15, 2022
Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge

Zhihong Chen, Guanbin Li, Xiang Wan

Medical vision-and-language pre-training (Med-VLP) has received considerable attention owing to its applicability to extracting generic vision-and-language representations from medical images and texts. Most existing methods mainly contain three elements: uni-modal encoders (i.e., a vision encoder and a language encoder), a multi-modal fusion module, and pretext tasks, with few studies considering the importance of medical domain expert knowledge and explicitly exploiting such knowledge to facilitate Med-VLP. Although there exist knowledge-enhanced vision-and-language pre-training (VLP) methods in the general domain, most require off-the-shelf toolkits (e.g., object detectors and scene graph parsers), which are unavailable in the medical domain. In this paper, we propose a systematic and effective approach to enhance Med-VLP by structured medical knowledge from three perspectives. First, considering knowledge can be regarded as the intermediate medium between vision and language, we align the representations of the vision encoder and the language encoder through knowledge. Second, we inject knowledge into the multi-modal fusion model to enable the model to perform reasoning using knowledge as the supplementation of the input image and text. Third, we guide the model to put emphasis on the most critical information in images and texts by designing knowledge-induced pretext tasks. To perform a comprehensive evaluation and facilitate further research, we construct a medical vision-and-language benchmark including three tasks. Experimental results illustrate the effectiveness of our approach, where state-of-the-art performance is achieved on all downstream tasks. Further analyses explore the effects of different components of our approach and various settings of pre-training.

CVFeb 17, 2023
Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Zhihong Chen, Shizhe Diao, Benyou Wang et al.

Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its applicability to extracting generic representations from medical images and texts. Practically, there exist two typical types, \textit{i.e.}, the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks due to the single-modality encoding ability. To take advantage of these two types, we propose an effective yet straightforward scheme named PTUnifier to unify the two types. We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts. By doing so, a single model could serve as a \textit{foundation model} that processes various tasks adopting different input formats (\textit{i.e.}, image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (instead of static ones) to improve diversity and scalability. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (\textit{i.e.}, image/text classification and text summarization), cross-modal tasks (\textit{i.e.}, image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (\textit{i.e.}, visual question answering), demonstrating the effectiveness of our approach. Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to these approaches.

CVOct 15, 2022
Improving Radiology Summarization with Radiograph and Anatomy Prompts

Jinpeng Hu, Zhihong Chen, Yang Liu et al.

The impression is crucial for the referring physicians to grasp key information since it is concluded from the findings and reasoning of radiologists. To alleviate the workload of radiologists and reduce repetitive human labor in impression writing, many researchers have focused on automatic impression generation. However, recent works on this task mainly summarize the corresponding findings and pay less attention to the radiology images. In clinical, radiographs can provide more detailed valuable observations to enhance radiologists' impression writing, especially for complicated cases. Besides, each sentence in findings usually focuses on single anatomy, so they only need to be matched to corresponding anatomical regions instead of the whole image, which is beneficial for textual and visual features alignment. Therefore, we propose a novel anatomy-enhanced multimodal model to promote impression generation. In detail, we first construct a set of rules to extract anatomies and put these prompts into each sentence to highlight anatomy characteristics. Then, two separate encoders are applied to extract features from the radiograph and findings. Afterward, we utilize a contrastive learning module to align these two representations at the overall level and use a co-attention to fuse them at the sentence level with the help of anatomy-enhanced sentence representation. Finally, the decoder takes the fused information as the input to generate impressions. The experimental results on two benchmark datasets confirm the effectiveness of the proposed method, which achieves state-of-the-art results.

CLJun 6, 2023
On the Difference of BERT-style and CLIP-style Text Encoders

Zhihong Chen, Guiming Hardy Chen, Shizhe Diao et al.

Masked language modeling (MLM) has been one of the most popular pretraining recipes in natural language processing, e.g., BERT, one of the representative models. Recently, contrastive language-image pretraining (CLIP) has also attracted attention, especially its vision models that achieve excellent performance on a broad range of vision tasks. However, few studies are dedicated to studying the text encoders learned by CLIP. In this paper, we analyze the difference between BERT-style and CLIP-style text encoders from three experiments: (i) general text understanding, (ii) vision-centric text understanding, and (iii) text-to-image generation. Experimental analyses show that although CLIP-style text encoders underperform BERT-style ones for general text understanding tasks, they are equipped with a unique ability, i.e., synesthesia, for the cross-modal association, which is more similar to the senses of humans.

CLNov 15, 2022
Toward expanding the scope of radiology report summarization to multiple anatomies and modalities

Zhihong Chen, Maya Varma, Xiang Wan et al.

Radiology report summarization (RRS) is a growing area of research. Given the Findings section of a radiology report, the goal is to generate a summary (called an Impression section) that highlights the key observations and conclusions of the radiology study. However, RRS currently faces essential limitations.First, many prior studies conduct experiments on private datasets, preventing reproduction of results and fair comparisons across different systems and solutions. Second, most prior approaches are evaluated solely on chest X-rays. To address these limitations, we propose a dataset (MIMIC-RRS) involving three new modalities and seven new anatomies based on the MIMIC-III and MIMIC-CXR datasets. We then conduct extensive experiments to evaluate the performance of models both within and across modality-anatomy pairs in MIMIC-RRS. In addition, we evaluate their clinical efficacy via RadGraph, a factual correctness metric.

CLNov 23, 2023
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Wentao Ge, Shunian Chen, Guiming Hardy Chen et al.

Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating queries without considering user experiences, inadequately addressing the nuances of creative and associative multimodal tasks. However, the open-ended and subjective nature of such tasks poses a significant challenge to the evaluation methodology, where it is difficult to define the ground-truth answers for them. To this end, in our paper, we propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge. To validate the feasibility and effectiveness of this paradigm, we design a benchmark, dubbed MLLM-Bench, by curating the evaluation samples across six comprehensive cognitive levels. We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models. Moreover, the validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation. We contend that the proposed paradigm explores the potential of MLLMs as effective evaluation tools with the help of per-sample criteria. See online leaderboard at \url{https://mllm-bench.llmzoo.com}.

CLFeb 18, 2024Code
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang et al.

Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset \textit{ALLaVA}, and open-source it to research community for developing better resource-efficient LVLMs for wider usage.

CVOct 19, 2023
Exploiting Low-confidence Pseudo-labels for Source-free Object Detection

Zhihong Chen, Zilei Wang, Yixin Zhang

Source-free object detection (SFOD) aims to adapt a source-trained detector to an unlabeled target domain without access to the labeled source data. Current SFOD methods utilize a threshold-based pseudo-label approach in the adaptation phase, which is typically limited to high-confidence pseudo-labels and results in a loss of information. To address this issue, we propose a new approach to take full advantage of pseudo-labels by introducing high and low confidence thresholds. Specifically, the pseudo-labels with confidence scores above the high threshold are used conventionally, while those between the low and high thresholds are exploited using the Low-confidence Pseudo-labels Utilization (LPU) module. The LPU module consists of Proposal Soft Training (PST) and Local Spatial Contrastive Learning (LSCL). PST generates soft labels of proposals for soft training, which can mitigate the label mismatch problem. LSCL exploits the local spatial relationship of proposals to improve the model's ability to differentiate between spatially adjacent proposals, thereby optimizing representational features further. Combining the two components overcomes the challenges faced by traditional methods in utilizing low-confidence pseudo-labels. Extensive experiments on five cross-domain object detection benchmarks demonstrate that our proposed method outperforms the previous SFOD methods, achieving state-of-the-art performance.

CVFeb 23, 2024Code
Large Multimodal Agents: A Survey

Junlin Xie, Zhihong Chen, Ruifei Zhang et al.

Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities akin to humans. Concurrently, there is an emerging research trend focused on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, thereby handling more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents ( LMAs for short). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks integrating multiple LMAs , enhancing collective efficacy. One of the critical challenges in this field is the diverse evaluation methods used across existing studies, hindering effective comparison among different LMAs . Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations, facilitating more meaningful comparisons. Concluding our review, we highlight the extensive applications of LMAs and propose possible future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field. An up-to-date resource list is available at https://github.com/jun0wanan/awesome-large-multimodal-agents.

LGJan 19, 2023
GIPA: A General Information Propagation Algorithm for Graph Learning

Houyi Li, Zhihong Chen, Zhao Li et al.

Graph neural networks (GNNs) have been widely used in graph-structured data computation, showing promising performance in various applications such as node classification, link prediction, and network recommendation. Existing works mainly focus on node-wise correlation when doing weighted aggregation of neighboring nodes based on attention, such as dot product by the dense vectors of two nodes. This may cause conflicting noise in nodes to be propagated when doing information propagation. To solve this problem, we propose a General Information Propagation Algorithm (GIPA in short), which exploits more fine-grained information fusion including bit-wise and feature-wise correlations based on edge features in their propagation. Specifically, the bit-wise correlation calculates the element-wise attention weight through a multi-layer perceptron (MLP) based on the dense representations of two nodes and their edge; The feature-wise correlation is based on the one-hot representations of node attribute features for feature selection. We evaluate the performance of GIPA on the Open Graph Benchmark proteins (OGBN-proteins for short) dataset and the Alipay dataset of Alibaba. Experimental results reveal that GIPA outperforms the state-of-the-art models in terms of prediction accuracy, e.g., GIPA achieves an average ROC-AUC of $0.8901\pm 0.0011$, which is better than that of all the existing methods listed in the OGBN-proteins leaderboard.

96.9ETApr 6
Experimental Demonstration of an On-Chip CMOS-Integrated 3T-1MTJ Probabilistic Bit - A P-Bit

Xuejian Zhang, John Arnesh Divakaruni Daniel, Neil Dilley et al.

Ongoing semiconductor scaling challenges and the rise of neuromorphic computing have sparked interest in exploring novel computing schemes to achieve higher power efficiency and computational capabilities. Probabilistic computing is one candidate that endows low power consumption, capability of solving probability-encoded computational problems, and the ease of integration with existing CMOS technology. A basic building block of this scheme is the probabilistic bit (P-Bit), which utilizes a novel device such as a stochastic magnetic tunnel junction (sMTJ) to generate tunable randomness by nature. This work presents the first experimental demonstration of a fully CMOS-integrated sMTJ-based P-Bit, capable of generating rail-to-rail stochastic output with a mere collection of 3 transistors + 1 sMTJ. Furthermore, simulations also confirm this P-Bit's functionality in probabilistic logic circuits. The demonstration of such P-Bit paves the way towards realizing monolithic large-scale probabilistic computing architecture on CMOS chips.

AIDec 19, 2022
Generalizing Multimodal Variational Methods to Sets

Jinzhao Zhou, Yiqun Duan, Zhihong Chen et al.

Making sense of multiple modalities can yield a more comprehensive description of real-world phenomena. However, learning the co-representation of diverse modalities is still a long-standing endeavor in emerging machine learning applications and research. Previous generative approaches for multimodal input approximate a joint-modality posterior by uni-modality posteriors as product-of-experts (PoE) or mixture-of-experts (MoE). We argue that these approximations lead to a defective bound for the optimization process and loss of semantic connection among modalities. This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space while handling the missing modality problem. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization. In public datasets of various domains, the experimental results demonstrate that the proposed method is applicable to order-agnostic cross-modal generation while achieving outstanding performance compared to the state-of-the-art multimodal methods. The source code for our method is available online https://anonymous.4open.science/r/SMVAE-9B3C/.

CVJun 10, 2024Code
Merlin: A Computed Tomography Vision-Language Foundation Model and Dataset

Louis Blankemeier, Ashwin Kumar, Joseph Paul Cohen et al.

The large volume of abdominal computed tomography (CT) scans coupled with the shortage of radiologists have intensified the need for automated medical image analysis tools. Previous state-of-the-art approaches for automated analysis leverage vision-language models (VLMs) that jointly model images and radiology reports. However, current medical VLMs are generally limited to 2D images and short reports. Here to overcome these shortcomings for abdominal CT interpretation, we introduce Merlin, a 3D VLM that learns from volumetric CT scans, electronic health record data and radiology reports. This approach is enabled by a multistage pretraining framework that does not require additional manual annotations. We trained Merlin using a high-quality clinical dataset of paired CT scans (>6 million images from 15,331 CT scans), diagnosis codes (>1.8 million codes) and radiology reports (>6 million tokens). We comprehensively evaluated Merlin on 6 task types and 752 individual tasks that covered diagnostic, prognostic and quality-related tasks. The non-adapted (off-the-shelf) tasks included zero-shot classification of findings (30 findings), phenotype classification (692 phenotypes) and zero-shot cross-modal retrieval (image-to-findings and image-to-impression). The model-adapted tasks included 5-year chronic disease prediction (6 diseases), radiology report generation and 3D semantic segmentation (20 organs). We validated Merlin at scale, with internal testing on 5,137 CT scans and external testing on 44,098 CT scans from 3 independent sites and 2 public datasets. The results demonstrated high generalization across institutions and anatomies. Merlin outperformed 2D VLMs, CT foundation models and off-the-shelf radiology models. We also release our trained models, code, and dataset, available at: https://github.com/StanfordMIMI/Merlin.

CLMay 6, 2024Code
GREEN: Generative Radiology Report Evaluation and Error Notation

Sophie Ostmeier, Justin Xu, Zhihong Chen et al.

Evaluating radiology reports is a challenging problem as factual correctness is extremely important due to the need for accurate medical communication about medical images. Existing automatic evaluation metrics either suffer from failing to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts, but simultaneously higher alignment with expert preferences when compared to previous approaches.

CLMay 24, 2023Code
HuatuoGPT, towards Taming Language Model to Be a Doctor

Hongbo Zhang, Junying Chen, Feng Jiang et al.

In this paper, we present HuatuoGPT, a large language model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both \textit{distilled data from ChatGPT} and \textit{real-world data from doctors} in the supervised fine-tuned stage. The responses of ChatGPT are usually detailed, well-presented and informative while it cannot perform like a doctor in many aspects, e.g. for integrative diagnosis. We argue that real-world data from doctors would be complementary to distilled data in the sense the former could tame a distilled language model to perform like doctors. To better leverage the strengths of both data, we train a reward model to align the language model with the merits that both data bring, following an RLAIF (reinforced learning from AI feedback) fashion. To evaluate and benchmark the models, we propose a comprehensive evaluation scheme (including automatic and manual metrics). Experimental results demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs in GPT-4 evaluation, human evaluation, and medical benchmark datasets. It is worth noting that by using additional real-world data and RLAIF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model ChatGPT in most cases. Our code, data, and models are publicly available at \url{https://github.com/FreedomIntelligence/HuatuoGPT}. The online demo is available at \url{https://www.HuatuoGPT.cn/}.

CVMay 30, 2020Code
Attention-Guided Discriminative Region Localization and Label Distribution Learning for Bone Age Assessment

Chao Chen, Zhihong Chen, Xinyu Jin et al.

Bone age assessment (BAA) is clinically important as it can be used to diagnose endocrine and metabolic disorders during child development. Existing deep learning based methods for classifying bone age use the global image as input, or exploit local information by annotating extra bounding boxes or key points. However, training with the global image underutilizes discriminative local information, while providing extra annotations is expensive and subjective. In this paper, we propose an attention-guided approach to automatically localize the discriminative regions for BAA without any extra annotations. Specifically, we first train a classification model to learn the attention maps of the discriminative regions, finding the hand region, the most discriminative region (the carpal bones), and the next most discriminative region (the metacarpal bones). Guided by those attention maps, we then crop the informative local regions from the original image and aggregate different regions for BAA. Instead of taking BAA as a general regression task, which is suboptimal due to the label ambiguity problem in the age label space, we propose using joint age distribution learning and expectation regression, which makes use of the ordinal relationship among hand images with different individual ages and leads to more robust age estimation. Extensive experiments are conducted on the RSNA pediatric bone age data set. Using no training annotations, our method achieves competitive results compared with existing state-of-the-art semi-automatic deep learning-based methods that require manual annotation. Code is available at https: //github.com/chenchao666/Bone-Age-Assessment.

IRMay 21, 2020Code
ESAM: Discriminative Domain Adaptation with Non-Displayed Items to Improve Long-Tail Performance

Zhihong Chen, Rong Xiao, Chenliang Li et al.

Most of ranking models are trained only with displayed items (most are hot items), but they are utilized to retrieve items in the entire space which consists of both displayed and non-displayed items (most are long-tail items). Due to the sample selection bias, the long-tail items lack sufficient records to learn good feature representations, i.e. data sparsity and cold start problems. The resultant distribution discrepancy between displayed and non-displayed items would cause poor long-tail performance. To this end, we propose an entire space adaptation model (ESAM) to address this problem from the perspective of domain adaptation (DA). ESAM regards displayed and non-displayed items as source and target domains respectively. Specifically, we design the attribute correlation alignment that considers the correlation between high-level attributes of the item to achieve distribution alignment. Furthermore, we introduce two effective regularization strategies, i.e. \textit{center-wise clustering} and \textit{self-training} to improve DA process. Without requiring any auxiliary information and auxiliary domains, ESAM transfers the knowledge from displayed items to non-displayed items for alleviating the distribution inconsistency. Experiments on two public datasets and a large-scale industrial dataset collected from Taobao demonstrate that ESAM achieves state-of-the-art performance, especially in the long-tail space. Besides, we deploy ESAM to the Taobao search engine, leading to significant improvement on online performance. The code is available at \url{https://github.com/A-bone1/ESAM.git}

CVDec 27, 2019Code
HoMM: Higher-order Moment Matching for Unsupervised Domain Adaptation

Chao Chen, Zhihang Fu, Zhihong Chen et al.

Minimizing the discrepancy of feature distributions between different domains is one of the most promising directions in unsupervised domain adaptation. From the perspective of distribution matching, most existing discrepancy-based methods are designed to match the second-order or lower statistics, which however, have limited expression of statistical characteristic for non-Gaussian distributions. In this work, we explore the benefits of using higher-order statistics (mainly refer to third-order and fourth-order statistics) for domain matching. We propose a Higher-order Moment Matching (HoMM) method, and further extend the HoMM into reproducing kernel Hilbert spaces (RKHS). In particular, our proposed HoMM can perform arbitrary-order moment tensor matching, we show that the first-order HoMM is equivalent to Maximum Mean Discrepancy (MMD) and the second-order HoMM is equivalent to Correlation Alignment (CORAL). Moreover, the third-order and the fourth-order moment tensor matching are expected to perform comprehensive domain alignment as higher-order statistics can approximate more complex, non-Gaussian distributions. Besides, we also exploit the pseudo-labeled target samples to learn discriminative representations in the target domain, which further improves the transfer performance. Extensive experiments are conducted, showing that our proposed HoMM consistently outperforms the existing moment matching methods by a large margin. Codes are available at \url{https://github.com/chenchao666/HoMM-Master}

CVJan 22, 2024
A Vision-Language Foundation Model to Enhance Efficiency of Chest X-ray Interpretation

Zhihong Chen, Maya Varma, Justin Xu et al. · mila, oxford

Over 1.4 billion chest X-rays (CXRs) are performed annually due to their cost-effectiveness as an initial diagnostic test. This scale of radiological studies provides a significant opportunity to streamline CXR interpretation and documentation. While foundation models are a promising solution, the lack of publicly available large-scale datasets and benchmarks inhibits their iterative development and real-world evaluation. To overcome these challenges, we constructed a large-scale dataset (CheXinstruct), which we utilized to train a vision-language foundation model (CheXagent). We systematically demonstrated competitive performance across eight distinct task types on our novel evaluation benchmark (CheXbench). Beyond technical validation, we assessed the real-world utility of CheXagent in directly drafting radiology reports. Our clinical assessment with eight radiologists revealed a 36% time saving for residents using CheXagent-drafted reports, while attending radiologists showed no significant time difference editing resident-drafted or CheXagent-drafted reports. The CheXagent-drafted reports improved the writing efficiency of both radiology residents and attending radiologists in 81% and 61% of cases, respectively, without loss of quality. Overall, we demonstrate that CheXagent can effectively perform a variety of CXR interpretation tasks and holds potential to assist radiologists in routine clinical workflows.

LGNov 27, 2024
Foundation Models in Radiology: What, How, When, Why and Why Not

Magdalini Paschali, Zhihong Chen, Louis Blankemeier et al. · stanford

Recent advances in artificial intelligence have witnessed the emergence of large-scale deep learning models capable of interpreting and generating both textual and imaging data. Such models, typically referred to as foundation models, are trained on extensive corpora of unlabeled data and demonstrate high performance across various tasks. Foundation models have recently received extensive attention from academic, industry, and regulatory bodies. Given the potentially transformative impact that foundation models can have on the field of radiology, this review aims to establish a standardized terminology concerning foundation models, with a specific focus on the requirements of training data, model training paradigms, model capabilities, and evaluation strategies. We further outline potential pathways to facilitate the training of radiology-specific foundation models, with a critical emphasis on elucidating both the benefits and challenges associated with such models. Overall, we envision that this review can unify technical advances and clinical needs in the training of foundation models for radiology in a safe and responsible manner, for ultimately benefiting patients, providers, and radiologists.

CVNov 6, 2024
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen et al.

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.

LGFeb 16, 2025
Simplify RLHF as Reward-Weighted SFT: A Variational Method

Yuhao Du, Zhuo Li, Pengyu Cheng et al.

Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning Large Language Models (LLMs) with human values. However, RLHF has been continuously challenged by its high complexity in implementation and computation consumption. Even with recent simplifications, such as Direct Preference Optimization (DPO) and Advantage Leftover Lunch (A-LoL), the problems of over-fitting and training instability remain hindering the alignment process from the expected optimal performance. To address the existing challenges, we propose a novel simplification of RLHF from the perspective of variational inference, called $\textbf{V}$ariational $\textbf{A}$lignment with $\textbf{R}$e-weighting ($\textbf{VAR}$). More specifically, by directly minimizing the distribution gap between the learning LLM policy and the optimal solution of RLHF, we transform the alignment objective into a reward-driven re-weighted supervised fine-tuning (SFT) form, which only requires minor adjustment on the SFT loss to obtain noticeable improvement on training stability and effectiveness. On comprehensive alignment and generation benchmarks, our VAR method has numerically achieved competitive performance in LLM alignment helpfulness and harmlessness.

CLMay 30, 2025
Automated Structured Radiology Report Generation

Jean-Benoit Delbrouck, Justin Xu, Johannes Moll et al. · oxford, stanford

Automated radiology report generation from chest X-ray (CXR) images has the potential to improve clinical efficiency and reduce radiologists' workload. However, most datasets, including the publicly available MIMIC-CXR and CheXpert Plus, consist entirely of free-form reports, which are inherently variable and unstructured. This variability poses challenges for both generation and evaluation: existing models struggle to produce consistent, clinically meaningful reports, and standard evaluation metrics fail to capture the nuances of radiological interpretation. To address this, we introduce Structured Radiology Report Generation (SRRG), a new task that reformulates free-text radiology reports into a standardized format, ensuring clarity, consistency, and structured clinical reporting. We create a novel dataset by restructuring reports using large language models (LLMs) following strict structured reporting desiderata. Additionally, we introduce SRR-BERT, a fine-grained disease classification model trained on 55 labels, enabling more precise and clinically informed evaluation of structured reports. To assess report quality, we propose F1-SRR-BERT, a metric that leverages SRR-BERT's hierarchical disease taxonomy to bridge the gap between free-text variability and structured clinical reporting. We validate our dataset through a reader study conducted by five board-certified radiologists and extensive benchmarking experiments.

CVSep 29, 2025
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

Zhihong Chen, Xuehai Bai, Yang Shi et al.

The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and simple object manipulation, they often lack the systematic structure and challenging scenarios required for real-world applications. To address this bottleneck, we introduce OpenGPT-4o-Image, a large-scale dataset constructed using a novel methodology that combines hierarchical task taxonomy with automated data generation. Our taxonomy not only includes fundamental capabilities such as text rendering and style control but also introduces highly practical yet challenging categories like scientific imagery for chemistry illustrations and complex instruction editing requiring simultaneous execution of multiple operations. Through an automated pipeline leveraging structured resource pools and GPT-4o, we generate 80k high-quality instruction-image pairs with controlled diversity, covering 11 major domains and 51 subtasks. Extensive experiments show that fine-tuning leading models on our dataset achieves significant performance gains across multiple benchmarks, with improvements of up to 18\% on editing tasks (UniWorld-V1 on ImgEdit-Bench) and 13% on generation tasks (Harmon on GenEval). Our work demonstrates that systematic data construction is key to advancing multimodal AI capabilities.

IVNov 27, 2024
Evaluating and Improving the Effectiveness of Synthetic Chest X-Rays for Medical Image Analysis

Eva Prakash, Jeya Maria Jose Valanarasu, Zhihong Chen et al.

Purpose: To explore best-practice approaches for generating synthetic chest X-ray images and augmenting medical imaging datasets to optimize the performance of deep learning models in downstream tasks like classification and segmentation. Materials and Methods: We utilized a latent diffusion model to condition the generation of synthetic chest X-rays on text prompts and/or segmentation masks. We explored methods like using a proxy model and using radiologist feedback to improve the quality of synthetic data. These synthetic images were then generated from relevant disease information or geometrically transformed segmentation masks and added to ground truth training set images from the CheXpert, CANDID-PTX, SIIM, and RSNA Pneumonia datasets to measure improvements in classification and segmentation model performance on the test sets. F1 and Dice scores were used to evaluate classification and segmentation respectively. One-tailed t-tests with Bonferroni correction assessed the statistical significance of performance improvements with synthetic data. Results: Across all experiments, the synthetic data we generated resulted in a maximum mean classification F1 score improvement of 0.150453 (CI: 0.099108-0.201798; P=0.0031) compared to using only real data. For segmentation, the maximum Dice score improvement was 0.14575 (CI: 0.108267-0.183233; P=0.0064). Conclusion: Best practices for generating synthetic chest X-ray images for downstream tasks include conditioning on single-disease labels or geometrically transformed segmentation masks, as well as potentially using proxy modeling for fine-tuning such generations.

LGDec 17, 2024
An Advantage-based Optimization Method for Reinforcement Learning in Large Action Space

Hai Lin, Cheng Huang, Zhihong Chen

Reinforcement learning tasks in real-world scenarios often involve large, high-dimensional action spaces, leading to challenges such as convergence difficulties, instability, and high computational complexity. It is widely acknowledged that traditional value-based reinforcement learning algorithms struggle to address these issues effectively. A prevalent approach involves generating independent sub-actions within each dimension of the action space. However, this method introduces bias, hindering the learning of optimal policies. In this paper, we propose an advantage-based optimization method and an algorithm named Advantage Branching Dueling Q-network (ABQ). ABQ incorporates a baseline mechanism to tune the action value of each dimension, leveraging the advantage relationship across different sub-actions. With this approach, the learned policy can be optimized for each dimension. Empirical results demonstrate that ABQ outperforms BDQ, achieving 3%, 171%, and 84% more cumulative rewards in HalfCheetah, Ant, and Humanoid environments, respectively. Furthermore, ABQ exhibits competitive performance when compared against two continuous action benchmark algorithms, DDPG and TD3.

CLDec 18, 2021
Word Graph Guided Summarization for Radiology Findings

Jinpeng Hu, Jianling Li, Zhihong Chen et al.

Radiology reports play a critical role in communicating medical findings to physicians. In each report, the impression section summarizes essential radiology findings. In clinical practice, writing impression is highly demanded yet time-consuming and prone to errors for radiologists. Therefore, automatic impression generation has emerged as an attractive research direction to facilitate such clinical practice. Existing studies mainly focused on introducing salient word information to the general text summarization framework to guide the selection of the key content in radiology findings. However, for this task, a model needs not only capture the important words in findings but also accurately describe their relations so as to generate high-quality impressions. In this paper, we propose a novel method for automatic impression generation, where a word graph is constructed from the findings to record the critical words and their relations, then a Word Graph guided Summarization model (WGSum) is designed to generate impressions with the help of the word graph. Experimental results on two datasets, OpenI and MIMIC-CXR, confirm the validity and effectiveness of our proposed approach, where the state-of-the-art results are achieved on both datasets. Further experiments are also conducted to analyze the impact of different graph designs to the performance of our method.

CLOct 11, 2021
Pre-trained Language Models in Biomedical Domain: A Systematic Survey

Benyou Wang, Qianqian Xie, Jiahuan Pei et al.

Pre-trained language models (PLMs) have been the de facto paradigm for most natural language processing (NLP) tasks. This also benefits biomedical domain: researchers from informatics, medicine, and computer science (CS) communities propose various PLMs trained on biomedical datasets, e.g., biomedical text, electronic health records, protein, and DNA sequences for various biomedical tasks. However, the cross-discipline characteristics of biomedical PLMs hinder their spreading among communities; some existing works are isolated from each other without comprehensive comparison and discussions. It expects a survey that not only systematically reviews recent advances of biomedical PLMs and their applications but also standardizes terminology and benchmarks. In this paper, we summarize the recent progress of pre-trained language models in the biomedical domain and their applications in biomedical downstream tasks. Particularly, we discuss the motivations and propose a taxonomy of existing biomedical PLMs. Their applications in biomedical downstream tasks are exhaustively discussed. At last, we illustrate various limitations and future trends, which we hope can provide inspiration for the future research of the research community.

IRMay 18, 2021
Path-based Deep Network for Candidate Item Matching in Recommenders

Houyi Li, Zhihong Chen, Chenliang Li et al.

The large-scale recommender system mainly consists of two stages: matching and ranking. The matching stage (also known as the retrieval step) identifies a small fraction of relevant items from billion-scale item corpus in low latency and computational cost. Item-to-item collaborative filter (item-based CF) and embedding-based retrieval (EBR) have been long used in the industrial matching stage owing to its efficiency. However, item-based CF is hard to meet personalization, while EBR has difficulty in satisfying diversity. In this paper, we propose a novel matching architecture, Path-based Deep Network (named PDN), which can incorporate both personalization and diversity to enhance matching performance. Specifically, PDN is comprised of two modules: Trigger Net and Similarity Net. PDN utilizes Trigger Net to capture the user's interest in each of his/her interacted item, and Similarity Net to evaluate the similarity between each interacted item and the target item based on these items' profile and CF information. The final relevance between the user and the target item is calculated by explicitly considering user's diverse interests, \ie aggregating the relevance weights of the related two-hop paths (one hop of a path corresponds to user-item interaction and the other to item-item relevance). Furthermore, we describe the architecture design of a matching system with the proposed PDN in a leading real-world E-Commerce service (Mobile Taobao App). Based on offline evaluations and online A/B test, we show that PDN outperforms the existing solutions for the same task. The online results also demonstrate that PDN can retrieve more personalized and more diverse relevant items to significantly improve user engagement. Currently, PDN system has been successfully deployed at Mobile Taobao App and handling major online traffic.

CVMay 6, 2021
Generalizable Representation Learning for Mixture Domain Face Anti-Spoofing

Zhihong Chen, Taiping Yao, Kekai Sheng et al.

Face anti-spoofing approach based on domain generalization(DG) has drawn growing attention due to its robustness forunseen scenarios. Existing DG methods assume that the do-main label is known.However, in real-world applications, thecollected dataset always contains mixture domains, where thedomain label is unknown. In this case, most of existing meth-ods may not work. Further, even if we can obtain the domainlabel as existing methods, we think this is just a sub-optimalpartition. To overcome the limitation, we propose domain dy-namic adjustment meta-learning (D2AM) without using do-main labels, which iteratively divides mixture domains viadiscriminative domain representation and trains a generaliz-able face anti-spoofing with meta-learning. Specifically, wedesign a domain feature based on Instance Normalization(IN) and propose a domain representation learning module(DRLM) to extract discriminative domain features for cluster-ing. Moreover, to reduce the side effect of outliers on cluster-ing performance, we additionally utilize maximum mean dis-crepancy (MMD) to align the distribution of sample featuresto a prior distribution, which improves the reliability of clus tering. Extensive experiments show that the proposed methodoutperforms conventional DG-based face anti-spoofing meth-ods, including those utilizing domain labels. Furthermore, weenhance the interpretability through visualizatio

CLOct 30, 2020
Generating Radiology Reports via Memory-driven Transformer

Zhihong Chen, Yan Song, Tsung-Hui Chang et al.

Medical imaging is frequently used in clinical practice and trials for diagnosis and treatment. Writing imaging reports is time-consuming and can be error-prone for inexperienced radiologists. Therefore, automatically generating radiology reports is highly desired to lighten the workload of radiologists and accordingly promote clinical automation, which is an essential task to apply artificial intelligence to the medical domain. In this paper, we propose to generate radiology reports with memory-driven Transformer, where a relational memory is designed to record key information of the generation process and a memory-driven conditional layer normalization is applied to incorporating the memory into the decoder of Transformer. Experimental results on two prevailing radiology report datasets, IU X-Ray and MIMIC-CXR, show that our proposed approach outperforms previous models with respect to both language generation metrics and clinical evaluations. Particularly, this is the first work reporting the generation results on MIMIC-CXR to the best of our knowledge. Further analyses also demonstrate that our approach is able to generate long reports with necessary medical terms as well as meaningful image-text attention mappings.

ETMay 2, 2020
Electrically-Tunable Stochasticity for Spin-based Neuromorphic Circuits: Self-Adjusting to Variation

Hossein Pourmeidani, Punyashloka Debashis, Zhihong Chen et al.

Energy-efficient methods are addressed for leveraging low energy barrier nanomagnetic devices within neuromorphic architectures. Using a Magnetoresistive Random Access Memory (MRAM) probabilistic device (p-bit) as the basis of neuronal structures in Deep Belief Networks (DBNs), the impact of reducing the Magnetic Tunnel Junction's (MTJ's) energy barrier is assessed and optimized for the resulting stochasticity present in the learning system. This can mitigate the process variation sensitivity of stochastic DBNs which encounter a sharp drop-off when energy barriers exceed near-zero kT. As evaluated for the MNIST dataset for energy barriers at near-zero kT to 2.0 kT in increments of 0.5 kT, it is shown that the stability factor changes by 5 orders of magnitude. The self-compensating circuit developed herein provides a compact, and low complexity approach to mitigating process variation impacts towards practical implementation and fabrication.

LGMay 26, 2019
Selective Transfer with Reinforced Transfer Network for Partial Domain Adaptation

Zhihong Chen, Chao Chen, Zhaowei Cheng et al.

One crucial aspect of partial domain adaptation (PDA) is how to select the relevant source samples in the shared classes for knowledge transfer. Previous PDA methods tackle this problem by re-weighting the source samples based on their high-level information (deep features). However, since the domain shift between source and target domains, only using the deep features for sample selection is defective. We argue that it is more reasonable to additionally exploit the pixel-level information for PDA problem, as the appearance difference between outlier source classes and target classes is significantly large. In this paper, we propose a reinforced transfer network (RTNet), which utilizes both high-level and pixel-level information for PDA problem. Our RTNet is composed of a reinforced data selector (RDS) based on reinforcement learning (RL), which filters out the outlier source samples, and a domain adaptation model which minimizes the domain discrepancy in the shared label space. Specifically, in the RDS, we design a novel reward based on the reconstruct errors of selected source samples on the target generator, which introduces the pixel-level information to guide the learning of RDS. Besides, we develope a state containing high-level information, which used by the RDS for sample selection. The proposed RDS is a general module, which can be easily integrated into existing DA models to make them fit the PDA situation. Extensive experiments indicate that RTNet can achieve state-of-the-art performance for PDA tasks on several benchmark datasets.

CVApr 13, 2019
Towards Self-similarity Consistency and Feature Discrimination for Unsupervised Domain Adaptation

Chao Chen, Zhihang Fu, Zhihong Chen et al.

Recent advances in unsupervised domain adaptation mainly focus on learning shared representations by global distribution alignment without considering class information across domains. The neglect of class information, however, may lead to partial alignment (or even misalignment) and poor generalization performance. For comprehensive alignment, we argue that the similarities across different features in the source domain should be consistent with that of in the target domain. Based on this assumption, we propose a new domain discrepancy metric, i.e., Self-similarity Consistency (SSC), to enforce the feature structure being consistent across domains. The renowned correlation alignment (CORAL) is proven to be a special case, and a sub-optimal measure of our proposed SSC. Furthermore, we also propose to mitigate the side effect of the partial alignment and misalignment by incorporating the discriminative information of the deep representations. Specifically, an embarrassingly simple and effective feature norm constraint is exploited to enlarge the discrepancy of inter-class samples. It relieves the requirements of strict alignment when performing adaptation, therefore improving the adaptation performance significantly. Extensive experiments on visual domain adaptation tasks demonstrate the effectiveness of our proposed SSC metric and feature discrimination approach.

LGAug 28, 2018
Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation

Chao Chen, Zhihong Chen, Boyuan Jiang et al.

Recently, considerable effort has been devoted to deep domain adaptation in computer vision and machine learning communities. However, most of existing work only concentrates on learning shared feature representation by minimizing the distribution discrepancy across different domains. Due to the fact that all the domain alignment approaches can only reduce, but not remove the domain shift. Target domain samples distributed near the edge of the clusters, or far from their corresponding class centers are easily to be misclassified by the hyperplane learned from the source domain. To alleviate this issue, we propose to joint domain alignment and discriminative feature learning, which could benefit both domain alignment and final classification. Specifically, an instance-based discriminative feature learning method and a center-based discriminative feature learning method are proposed, both of which guarantee the domain invariant features with better intra-class compactness and inter-class separability. Extensive experiments show that learning the discriminative features in the shared feature space can significantly boost the performance of deep domain adaptation methods.

CVMay 15, 2018
Ro-SOS: Metric Expression Network (MEnet) for Robust Salient Object Segmentation

Delu Zeng, Yixuan He, Li Liu et al.

Although deep CNNs have brought significant improvement to image saliency detection, most CNN based models are sensitive to distortion such as compression and noise. In this paper, we propose an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) to deal with saliency detection with the tolerance of distortion. Within MEnet, a new topological metric space is constructed, whose implicit metric is determined by the deep network. As a result, we manage to group all the pixels in the observed image semantically within this latent space into two regions: a salient region and a non-salient region. With this architecture, all feature extractions are carried out at the pixel level, enabling fine granularity of output boundaries of the salient objects. What's more, we try to give a general analysis for the noise robustness of the network in the sense of Lipschitz and Jacobian literature. Experiments demonstrate that robust salient maps facilitating object segmentation can be generated by the proposed metric. Tests on several public benchmarks show that MEnet has achieved desirable performance. Furthermore, by direct computation and measuring the robustness, the proposed method outperforms previous CNN-based methods on distorted inputs.