Zewei Wang

CL
h-index38
6papers
96citations
Novelty35%
AI Score41

6 Papers

CVJul 21, 2023
Robust Visual Question Answering: Datasets, Methods, and Future Challenges

Jie Ma, Pinghui Wang, Dechen Kong et al.

Visual question answering requires a system to provide an accurate natural language answer given an image and a natural language question. However, it is widely recognized that previous generic VQA methods often exhibit a tendency to memorize biases present in the training data rather than learning proper behaviors, such as grounding images before predicting answers. Therefore, these methods usually achieve high in-distribution but poor out-of-distribution performance. In recent years, various datasets and debiasing methods have been proposed to evaluate and enhance the VQA robustness, respectively. This paper provides the first comprehensive survey focused on this emerging fashion. Specifically, we first provide an overview of the development process of datasets from in-distribution and out-of-distribution perspectives. Then, we examine the evaluation metrics employed by these datasets. Thirdly, we propose a typology that presents the development process, similarities and differences, robustness comparison, and technical features of existing debiasing methods. Furthermore, we analyze and discuss the robustness of representative vision-and-language pre-training models on VQA. Finally, through a thorough review of the available literature and experimental analysis, we discuss the key areas for future research from various viewpoints.

IVNov 10, 2023
Holistic Evaluation of GPT-4V for Biomedical Imaging

Zhengliang Liu, Hanqi Jiang, Tianyang Zhong et al.

In this paper, we present a large-scale evaluation probing GPT-4V's capabilities and limitations for biomedical image analysis. GPT-4V represents a breakthrough in artificial general intelligence (AGI) for computer vision, with applications in the biomedical domain. We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more. Tasks include modality recognition, anatomy localization, disease diagnosis, report generation, and lesion detection. The extensive experiments provide insights into GPT-4V's strengths and weaknesses. Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization. GPT-4V excels at diagnostic report generation, indicating strong image captioning skills. While promising for biomedical imaging AI, GPT-4V requires further enhancement and validation before clinical deployment. We emphasize responsible development and testing for trustworthy integration of biomedical AGI. This rigorous evaluation of GPT-4V on diverse medical images advances understanding of multimodal large language models (LLMs) and guides future work toward impactful healthcare applications.

CLFeb 12
Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays

Yijie Zhong, Mengying Guo, Zewei Wang et al.

Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.

CLMay 6, 2023Code
Adaptive loose optimization for robust question answering

Jie Ma, Pinghui Wang, Zewei Wang et al.

Question answering methods are well-known for leveraging data bias, such as the language prior in visual question answering and the position bias in machine reading comprehension (extractive question answering). Current debiasing methods often come at the cost of significant in-distribution performance to achieve favorable out-of-distribution generalizability, while non-debiasing methods sacrifice a considerable amount of out-of-distribution performance in order to obtain high in-distribution performance. Therefore, it is challenging for them to deal with the complicated changing real-world situations. In this paper, we propose a simple yet effective novel loss function with adaptive loose optimization, which seeks to make the best of both worlds for question answering. Our main technical contribution is to reduce the loss adaptively according to the ratio between the previous and current optimization state on mini-batch training data. This loose optimization can be used to prevent non-debiasing methods from overlearning data bias while enabling debiasing methods to maintain slight bias learning. Experiments on the visual question answering datasets, including VQA v2, VQA-CP v1, VQA-CP v2, GQA-OOD, and the extractive question answering dataset SQuAD demonstrate that our approach enables QA methods to obtain state-of-the-art in- and out-of-distribution performance in most cases. The source code has been released publicly in \url{https://github.com/reml-group/ALO}.

CLOct 17, 2024
Are LLMs Models of Distributional Semantics? A Case Study on Quantifiers

Zhang Enyan, Zewei Wang, Michael A. Lepori et al.

Distributional semantics is the linguistic theory that a word's meaning can be derived from its distribution in natural language (i.e., its use). Language models are commonly viewed as an implementation of distributional semantics, as they are optimized to capture the statistical features of natural language. It is often argued that distributional semantics models should excel at capturing graded/vague meaning based on linguistic conventions, but struggle with truth-conditional reasoning and symbolic processing. We evaluate this claim with a case study on vague (e.g. "many") and exact (e.g. "more than half") quantifiers. Contrary to expectations, we find that, across a broad range of models of various types, LLMs align more closely with human judgements on exact quantifiers versus vague ones. These findings call for a re-evaluation of the assumptions underpinning what distributional semantics models are, as well as what they can capture.

CLApr 8
Curation and Extraction of Drug-Related Entities from Reddit Platform

Zewei Wang, Zihan Xu, Yishu Wei et al.

Physicians learn primarily about illicit drugs from clinical overdose cases, limiting their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.