Ryota Tanaka

h-index34

6papers

416citations

Novelty42%

AI Score40

Ranked #73,435 of 194,257 authors (top 38%)#14,099 in CL (top 46%)

6 Papers

22.8CLJan 12, 2023Code

SlideVQA: A Dataset for Document Visual Question Answering on Multiple Images

Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida et al.

Visual question answering on document images that contain textual, visual, and layout information, called document VQA, has received much attention recently. Although many datasets have been proposed for developing document VQA systems, most of the existing datasets focus on understanding the content relationships within a single image and not across multiple images. In this study, we propose a new multi-image document VQA dataset, SlideVQA, containing 2.6k+ slide decks composed of 52k+ slide images and 14.5k questions about a slide deck. SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning, and also provides annotated arithmetic expressions of numerical answers for enhancing the ability of numerical reasoning. Moreover, we developed a new end-to-end document VQA model that treats evidence selection and question answering in a unified sequence-to-sequence format. Experiments on SlideVQA show that our model outperformed existing state-of-the-art QA models, but that it still has a large gap behind human performance. We believe that our dataset will facilitate research on document VQA.

10.4CLAug 7, 2024

Empirical Analysis of Large Vision-Language Models against Goal Hijacking via Visual Prompt Injection

Subaru Kimura, Ryota Tanaka, Shumpei Miyawaki et al.

We explore visual prompt injection (VPI) that maliciously exploits the ability of large vision-language models (LVLMs) to follow instructions drawn onto the input image. We propose a new VPI method, "goal hijacking via visual prompt injection" (GHVPI), that swaps the execution task of LVLMs from an original task to an alternative task designated by an attacker. The quantitative analysis indicates that GPT-4V is vulnerable to the GHVPI and demonstrates a notable attack success rate of 15.8%, which is an unignorable security risk. Our analysis also shows that successful GHVPI requires high character recognition capability and instruction-following ability in LVLMs.

4.9CLDec 29, 2025

Instruction-Following Evaluation of Large Vision-Language Models

Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka et al.

Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning to visual instruction using commonly used training datasets, often fail to exhibit the instruction-following ability that was present in the LLM before integration, leading to results in which they do not follow task instructions as expected. This study quantitatively demonstrates that LVLMs' instruction-following ability declines after fine-tuning and analyzes its underlying causes. In particular, we constructed new training datasets highlighting whether the output format is specified. Then, we investigated how explicitly indicating the output format during fine-tuning affects LVLMs' instruction-following ability. Our quantitative evaluation confirmed that LVLMs' instruction-following ability declines after fine-tuning with commonly used datasets. Furthermore, we found that LVLMs trained with datasets, including instructions on output format, tend to follow instructions more accurately than models that do not. These findings suggest that including samples with instructions on output format during (visual) instruction tuning may help mitigate the decline in instruction-following abilities.

13.9CLJan 27, 2021Code

VisualMRC: Machine Reading Comprehension on Document Images

Ryota Tanaka, Kyosuke Nishida, Sen Yoshida

Recent studies on machine reading comprehension have focused on text-level understanding but have not yet reached the level of human understanding of the visual layout and content of real-world documents. In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, wherein given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. Compared with existing visual question answering (VQA) datasets that contain texts in images, VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages. We also introduce a new model that extends existing sequence-to-sequence models, pre-trained with large-scale text corpora, to take into account the visual layout and content of documents. Experiments with VisualMRC show that this model outperformed the base sequence-to-sequence models and a state-of-the-art VQA model. However, its performance is still below that of humans on most automatic evaluation metrics. The dataset will facilitate research aimed at connecting vision and language understanding.

0.2CLMay 6, 2020

Fact-based Dialogue Generation with Convergent and Divergent Decoding

Ryota Tanaka, Akinobu Lee

Fact-based dialogue generation is a task of generating a human-like response based on both dialogue context and factual texts. Various methods were proposed to focus on generating informative words that contain facts effectively. However, previous works implicitly assume a topic to be kept on a dialogue and usually converse passively, therefore the systems have a difficulty to generate diverse responses that provide meaningful information proactively. This paper proposes an end-to-end fact-based dialogue system augmented with the ability of convergent and divergent thinking over both context and facts, which can converse about the current topic or introduce a new topic. Specifically, our model incorporates a novel convergent and divergent decoding that can generate informative and diverse responses considering not only given inputs (context and facts) but also inputs-related topics. Both automatic and human evaluation results on DSTC7 dataset show that our model significantly outperforms state-of-the-art baselines, indicating that our model can generate more appropriate, informative, and diverse responses.

0.6CLFeb 5, 2019

An Ensemble Dialogue System for Facts-Based Sentence Generation

Ryota Tanaka, Akihide Ozeki, Shugo Kato et al.

This study aims to generate responses based on real-world facts by conditioning context and external facts extracted from information websites. Our system is an ensemble system that combines three modules: generated-based module, retrieval-based module, and reranking module. Therefore, this system can return diverse and meaningful responses from various perspectives. The experiments and evaluations are conducted with the sentence generation task in Dialog System Technology Challenges 7 (DSTC7-Task2). As a result, the proposed system performed significantly better than sole modules, and worked fine at the DSTC7-Task2, specifically on the objective evaluation.