CVNov 21, 2022
Expectation-Maximization Contrastive Learning for Compact Video-and-Language RepresentationsPeng Jin, Jinfa Huang, Fenglin Liu et al. · oxford
Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are not often optimal, and the modality gap between visual and textual representation can not be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features could be concisely represented as the linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representing power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations than previous methods, and significantly outperform previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly training layer or an out-of-the-box inference module with no extra training, making it easy to be incorporated into any existing methods.
IVNov 12, 2022Code
DeltaNet:Conditional Medical Report Generation for COVID-19 DiagnosisXian Wu, Shuxin Yang, Zhaopeng Qiu et al.
Fast screening and diagnosis are critical in COVID-19 patient treatment. In addition to the gold standard RT-PCR, radiological imaging like X-ray and CT also works as an important means in patient screening and follow-up. However, due to the excessive number of patients, writing reports becomes a heavy burden for radiologists. To reduce the workload of radiologists, we propose DeltaNet to generate medical reports automatically. Different from typical image captioning approaches that generate reports with an encoder and a decoder, DeltaNet applies a conditional generation process. In particular, given a medical image, DeltaNet employs three steps to generate a report: 1) first retrieving related medical reports, i.e., the historical reports from the same or similar patients; 2) then comparing retrieved images and current image to find the differences; 3) finally generating a new report to accommodate identified differences based on the conditional report. We evaluate DeltaNet on a COVID-19 dataset, where DeltaNet outperforms state-of-the-art approaches. Besides COVID-19, the proposed DeltaNet can be applied to other diseases as well. We validate its generalization capabilities on the public IU-Xray and MIMIC-CXR datasets for chest-related diseases. Code is available at \url{https://github.com/LX-doctorAI1/DeltaNet}.
CLJun 24, 2022
Competence-based Multimodal Curriculum Learning for Medical Report GenerationFenglin Liu, Shen Ge, Yuexian Zou et al. · oxford
Medical report generation task, which targets to produce long and coherent descriptions of medical images, has attracted growing research interests recently. Different from the general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the serious data bias and 2) the limited medical data. To alleviate the data bias and make best use of available data, we propose a Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step by step manner. Firstly, CMCL estimates the difficulty of each training instance and evaluates the competence of current model; Secondly, CMCL selects the most suitable batch of training instances considering current model competence. By iterating above two steps, CMCL can gradually improve the model's performance. The experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.
IVMar 18, 2022
AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report GenerationDi You, Fenglin Liu, Shen Ge et al. · oxford
Recently, medical report generation, which aims to automatically generate a long and coherent descriptive paragraph of a given medical image, has received growing research interests. Different from the general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the serious data bias: the normal visual regions dominate the dataset over the abnormal visual regions, and 2) the very long sequence. To alleviate above two problems, we propose an AlignTransformer framework, which includes the Align Hierarchical Attention (AHA) and the Multi-Grained Transformer (MGT) modules: 1) AHA module first predicts the disease tags from the input image and then learns the multi-grained visual features by hierarchically aligning the visual regions and disease tags. The acquired disease-grounded visual features can better represent the abnormal regions of the input image, which could alleviate data bias problem; 2) MGT module effectively uses the multi-grained features and Transformer framework to generate the long medical report. The experiments on the public IU-Xray and MIMIC-CXR datasets show that the AlignTransformer can achieve results competitive with state-of-the-art methods on the two datasets. Moreover, the human evaluation conducted by professional radiologists further proves the effectiveness of our approach.
CLApr 29, 2022
End-to-end Spoken Conversational Question Answering: Task, Dataset and ModelChenyu You, Nuo Chen, Fenglin Liu et al. · oxford
In spoken question answering, the systems are designed to answer questions from contiguous text spans within the related speech transcripts. However, the most natural way that human seek or test their knowledge is via human conversations. Therefore, we propose a new Spoken Conversational Question Answering task (SCQA), aiming at enabling the systems to model complex dialogue flows given the speech documents. In this task, our main objective is to build the system to deal with conversational questions based on the audio recordings, and to explore the plausibility of providing more cues from different modalities with systems in information gathering. To this end, instead of directly adopting automatically generated speech transcripts with highly noisy data, we propose a novel unified data distillation approach, DDNet, which effectively ingests cross-modal information to achieve fine-grained representations of the speech and language modalities. Moreover, we propose a simple and novel mechanism, termed Dual Attention, by encouraging better alignments between audio and text to ease the process of knowledge transfer. To evaluate the capacity of SCQA systems in a dialogue-style interaction, we assemble a Spoken Conversational Question Answering (Spoken-CoQA) dataset with more than 40k question-answer pairs from 4k conversations. The performance of the existing state-of-the-art methods significantly degrade on our dataset, hence demonstrating the necessity of cross-modal information integration. Our experimental results demonstrate that our proposed method achieves superior performance in spoken conversational question answering tasks.
CVNov 22, 2022
Aligning Source Visual and Target Language Domains for Unpaired Video CaptioningFenglin Liu, Xian Wu, Chenyu You et al. · oxford
Training supervised video captioning model requires coupled video-caption pairs. However, for many targeted languages, sufficient paired data are not available. To this end, we introduce the unpaired video captioning task aiming to train models without coupled video-caption pairs in target language. To solve the task, a natural choice is to employ a two-step pipeline system: first utilizing video-to-pivot captioning model to generate captions in pivot language and then utilizing pivot-to-target translation model to translate the pivot captions to the target language. However, in such a pipeline system, 1) visual information cannot reach the translation model, generating visual irrelevant target captions; 2) the errors in the generated pivot captions will be propagated to the translation model, resulting in disfluent target captions. To address these problems, we propose the Unpaired Video Captioning with Visual Injection system (UVC-VI). UVC-VI first introduces the Visual Injection Module (VIM), which aligns source visual and target language domains to inject the source visual information into the target language domain. Meanwhile, VIM directly connects the encoder of the video-to-pivot model and the decoder of the pivot-to-target model, allowing end-to-end inference by completely skipping the generation of pivot captions. To enhance the cross-modality injection of the VIM, UVC-VI further introduces a pluggable video encoder, i.e., Multimodal Collaborative Encoder (MCE). The experiments show that UVC-VI outperforms pipeline systems and exceeds several supervised systems. Furthermore, equipping existing supervised systems with our MCE can achieve 4% and 7% relative margins on the CIDEr scores to current state-of-the-art models on the benchmark MSVD and MSR-VTT datasets, respectively.
CLOct 23, 2022
Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical MedicineFenglin Liu, Bang Yang, Chenyu You et al. · oxford
Language models (LMs), including large language models (such as ChatGPT), have the potential to assist clinicians in generating various clinical notes. However, LMs are prone to produce ``hallucinations'', i.e., generated content that is not aligned with facts and knowledge. In this paper, we propose the Re$^3$Writer method with retrieval-augmented generation and knowledge-grounded reasoning to enable LMs to generate faithful clinical texts. We demonstrate the effectiveness of our method in generating patient discharge instructions. It requires the LMs not to only understand the patients' long clinical documents, i.e., the health records during hospitalization, but also to generate critical instructional information provided both to carers and to the patient at the time of discharge. The proposed Re$^3$Writer imitates the working patterns of physicians to first \textbf{re}trieve related working experience from historical instructions written by physicians, then \textbf{re}ason related medical knowledge. Finally, it \textbf{re}fines the retrieved working experience and reasoned medical knowledge to extract useful information, which is used to generate the discharge instructions for previously-unseen patients. Our experiments show that, using our method, the performance of five representative LMs can be substantially boosted across all metrics. Meanwhile, we show results from human evaluations to measure the effectiveness in terms of fluency, faithfulness, and comprehensiveness.
AIJun 10, 2022
Graph-in-Graph Network for Automatic Gene Ontology Description GenerationFenglin Liu, Bang Yang, Chenyu You et al. · oxford
Gene Ontology (GO) is the primary gene function knowledge base that enables computational tasks in biomedicine. The basic element of GO is a term, which includes a set of genes with the same function. Existing research efforts of GO mainly focus on predicting gene term associations. Other tasks, such as generating descriptions of new terms, are rarely pursued. In this paper, we propose a novel task: GO term description generation. This task aims to automatically generate a sentence that describes the function of a GO term belonging to one of the three categories, i.e., molecular function, biological process, and cellular component. To address this task, we propose a Graph-in-Graph network that can efficiently leverage the structural information of GO. The proposed network introduces a two-layer graph: the first layer is a graph of GO terms where each node is also a graph (gene graph). Such a Graph-in-Graph network can derive the biological functions of GO terms and generate proper descriptions. To validate the effectiveness of the proposed network, we build three large-scale benchmark datasets. By incorporating the proposed Graph-in-Graph network, the performances of seven different sequence-to-sequence models can be substantially boosted across all evaluation metrics, with up to 34.7%, 14.5%, and 39.1% relative improvements in BLEU, ROUGE-L, and METEOR, respectively.
CVOct 28, 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-AttentionFenglin Liu, Xian Wu, Shen Ge et al. · oxford
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language, thus learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models are proposed to learn V-L representations and achieve improved results in many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices. As a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which is a novel framework that applies separated attention spaces for vision and language, and the representations of multi-modalities can thus be disentangled explicitly. To enhance the correlation between vision and language in disentangled spaces, we introduce the visual concepts to DiMBERT which represent visual information in textual format. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large amount of image-sentence pairs on two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-train, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM and the introduced visual concepts.
CVApr 1, 2022
Hazard Detection And Avoidance For The Nova-C LanderJoel Getchius, Devin Renshaw, Daniel Posada et al.
In early 2022, Intuitive Machines' NOVA-C Lander will touch down on the lunar surface becoming the first commercial endeavor to visit a celestial body. NOVA-C will deliver six payloads to the lunar surface with various scientific and engineering objectives, ushering in a new era of commercial space exploration and utilization. However, to safely accomplish the mission, the NOVA-C lander must ensure its landing site is free of hazards larger than 30 cm and the slope of local terrain at touchdown is less than 10 degrees off vertical. To accomplish this, NOVA-C utilizes Intuitive Machines' precision navigation system, coupled with machine vision algorithms for scene reduction and landing site characterization. A unique aspect to the NOVA-C approach is the real-time nature of the hazard detection and avoidance algorithms--which are performed 400 meters above and down range of the intended landing site and completed within 15 seconds. In this paper, we review the theoretical foundations for the hazard detection and avoidance algorithms, describe the practical challenges of implementation on the NOVA-C flight computer, and present test and analysis results.
IVDec 30, 2021
Radiology Report Generation with a Learned Knowledge Base and Multi-modal AlignmentShuxin Yang, Xian Wu, Shen Ge et al.
In clinics, a radiology report is crucial for guiding a patient's treatment. However, writing radiology reports is a heavy burden for radiologists. To this end, we present an automatic, multi-modal approach for report generation from a chest x-ray. Our approach, motivated by the observation that the descriptions in radiology reports are highly correlated with specific information of the x-ray images, features two distinct modules: (i) Learned knowledge base: To absorb the knowledge embedded in the radiology reports, we build a knowledge base that can automatically distil and restore medical knowledge from textual embedding without manual labour; (ii) Multi-modal alignment: to promote the semantic alignment among reports, disease labels, and images, we explicitly utilize textual embedding to guide the learning of the visual feature space. We evaluate the performance of the proposed model using metrics from both natural language generation and clinic efficacy on the public IU-Xray and MIMIC-CXR datasets. Our ablation study shows that each module contributes to improving the quality of generated reports. Furthermore, with the assistance of both modules, our approach outperforms state-of-the-art methods over almost all the metrics.
IVDec 30, 2021
Knowledge Matters: Radiology Report Generation with General and Specific KnowledgeShuxin Yang, Xian Wu, Shen Ge et al.
Automatic radiology report generation is critical in clinics which can relieve experienced radiologists from the heavy workload and remind inexperienced radiologists of misdiagnosis or missed diagnose. Existing approaches mainly formulate radiology report generation as an image captioning task and adopt the encoder-decoder framework. However, in the medical domain, such pure data-driven approaches suffer from the following problems: 1) visual and textual bias problem; 2) lack of expert knowledge. In this paper, we propose a knowledge-enhanced radiology report generation approach introduces two types of medical knowledge: 1) General knowledge, which is input independent and provides the broad knowledge for report generation; 2) Specific knowledge, which is input dependent and provides the fine-grained knowledge for report generation. To fully utilize both the general and specific knowledge, we also propose a knowledge-enhanced multi-head attention mechanism. By merging the visual features of the radiology image with general knowledge and specific knowledge, the proposed model can improve the quality of generated reports. Experimental results on two publicly available datasets IU-Xray and MIMIC-CXR show that the proposed knowledge enhanced approach outperforms state-of-the-art image captioning based methods. Ablation studies also demonstrate that both general and specific knowledge can help to improve the performance of radiology report generation.
LGNov 8, 2021
Auto-Encoding Knowledge Graph for Unsupervised Medical Report GenerationFenglin Liu, Chenyu You, Xian Wu et al.
Medical report generation, which aims to automatically generate a long and coherent report of a given medical image, has been receiving growing research interests. Existing approaches mainly adopt a supervised manner and heavily rely on coupled image-report pairs. However, in the medical domain, building a large-scale image-report paired dataset is both time-consuming and expensive. To relax the dependency on paired data, we propose an unsupervised model Knowledge Graph Auto-Encoder (KGAE) which accepts independent sets of images and reports in training. KGAE consists of a pre-constructed knowledge graph, a knowledge-driven encoder and a knowledge-driven decoder. The knowledge graph works as the shared latent space to bridge the visual and textual domains; The knowledge-driven encoder projects medical images and reports to the corresponding coordinates in this latent space and the knowledge-driven decoder generates a medical report given a coordinate in this space. Since the knowledge-driven encoder and decoder can be trained with independent sets of images and reports, KGAE is unsupervised. The experiments show that the unsupervised KGAE generates desirable medical reports without using any image-report training pairs. Moreover, KGAE can also work in both semi-supervised and supervised settings, and accept paired images and reports in training. By further fine-tuning with image-report pairs, KGAE consistently outperforms the current state-of-the-art models on two datasets.
CLSep 18, 2021
Towards Joint Intent Detection and Slot Filling via Higher-order AttentionDongsheng Chen, Zhiqi Huang, Xian Wu et al.
Intent detection (ID) and Slot filling (SF) are two major tasks in spoken language understanding (SLU). Recently, attention mechanism has been shown to be effective in jointly optimizing these two tasks in an interactive manner. However, latest attention-based works concentrated only on the first-order attention design, while ignoring the exploration of higher-order attention mechanisms. In this paper, we propose a BiLinear attention block, which leverages bilinear pooling to simultaneously exploit both the contextual and channel-wise bilinear attention distributions to capture the second-order interactions between the input intent or slot features. Higher and even infinity order interactions are built by stacking numerous blocks and assigning Exponential Linear Unit (ELU) to blocks. Before the decoding stage, we introduce the Dynamic Feature Fusion Layer to implicitly fuse intent and slot information in a more effective way. Technically, instead of simply concatenating intent and slot features, we first compute two correlation matrices to weight on two features. Furthermore, we present Higher-order Attention Network for the SLU tasks. Experiments on two benchmark datasets show that our approach yields improvements compared with the state-of-the-art approach. We also provide discussion to demonstrate the effectiveness of the proposed approach.
CLAug 5, 2021
O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video CaptioningFenglin Liu, Xuancheng Ren, Xian Wu et al.
Video captioning combines video understanding and language generation. Different from image captioning that describes a static image with details of almost every object, video captioning usually considers a sequence of frames and biases towards focused objects, e.g., the objects that stay in focus regardless of the changing background. Therefore, detecting and properly accommodating focused objects is critical in video captioning. To enforce the description of focused objects and achieve controllable video captioning, we propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption to a fluent final caption. Since the focused objects are generated and located ahead of other words, it is difficult to apply the word-by-word autoregressive generation process; instead, we adopt a non-autoregressive approach. The experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate the effectiveness of O2NA, which achieves results competitive with the state-of-the-arts but with both higher diversity and higher inference speed.
CLJul 4, 2021
Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and ModelZhiqi Huang, Fenglin Liu, Xian Wu et al.
While Machine Comprehension (MC) has attracted extensive research interests in recent years, existing approaches mainly belong to the category of Machine Reading Comprehension task which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, there are a lot of MC tasks that accept audio input in addition to the textual input, e.g. English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, and its goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks including: Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models. Experimental results and analysis prove the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in terms of accuracy; Second, under the unimodal scenarios, the MKD module allows our multimodal MC model to significantly outperform the unimodal models by up to 18.87%, which are trained and tested with only audio or textual data.
CVJun 13, 2021
Contrastive Attention for Automatic Chest X-ray Report GenerationFenglin Liu, Changchang Yin, Xian Wu et al.
Recently, chest X-ray report generation, which aims to automatically generate descriptions of given chest X-ray images, has received growing research interests. The key challenge of chest X-ray report generation is to accurately capture and describe the abnormal regions. In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report. Due to such data bias, learning-based models may fail to attend to abnormal regions. In this work, to effectively capture and describe abnormal regions, we propose the Contrastive Attention (CA) model. Instead of solely focusing on the current input image, the CA model compares the current input image with normal images to distill the contrastive information. The acquired contrastive information can better represent the visual features of abnormal regions. According to the experiments on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into several existing models can boost their performance across most metrics. In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis. Specifically, we achieve the state-of-the-art results on the two public datasets.
CVJun 13, 2021
Exploring and Distilling Posterior and Prior Knowledge for Radiology Report GenerationFenglin Liu, Xian Wu, Shen Ge et al.
Automatically generating radiology reports can improve current clinical practice in diagnostic radiology. On one hand, it can relieve radiologists from the heavy burden of report writing; On the other hand, it can remind radiologists of abnormalities and avoid the misdiagnosis and missed diagnosis. Yet, this task remains a challenging job for data-driven neural networks, due to the serious visual and textual data biases. To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who will first examine the abnormal regions and assign the disease topic tags to the abnormal regions, and then rely on the years of prior medical knowledge and prior working experience accumulations to write reports. Thus, the PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD). In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias. The explored knowledge is distilled by the MKD to generate the final reports. Evaluated on MIMIC-CXR and IU-Xray datasets, our method is able to outperform previous state-of-the-art models on these two datasets.
AINov 1, 2018
On the Generation of Medical Question-Answer PairsSheng Shen, Yaliang Li, Nan Du et al.
Question answering (QA) has achieved promising progress recently. However, answering a question in real-world scenarios like the medical domain is still challenging, due to the requirement of external knowledge and the insufficient quantity of high-quality training data. In the light of these challenges, we study the task of generating medical QA pairs in this paper. With the insight that each medical question can be considered as a sample from the latent distribution of questions given answers, we propose an automated medical QA pair generation framework, consisting of an unsupervised key phrase detector that explores unstructured material for validity, and a generator that involves a multi-pass decoder to integrate structural knowledge for diversity. A series of experiments have been conducted on a real-world dataset collected from the National Medical Licensing Examination of China. Both automatic evaluation and human annotation demonstrate the effectiveness of the proposed method. Further investigation shows that, by incorporating the generated QA pairs for training, significant improvement in terms of accuracy can be achieved for the examination QA system.