Hirotaka Kameko

CL
h-index12
7papers
901citations
Novelty35%
AI Score40

7 Papers

CLSep 13, 2022
Visual Recipe Flow: A Dataset for Learning Visual State Changes of Objects with Recipe Flows

Keisuke Shirai, Atsushi Hashimoto, Taichi Nishimura et al.

We present a new multimodal dataset called Visual Recipe Flow, which enables us to learn each cooking action result in a recipe text. The dataset consists of object state changes and the workflow of the recipe text. The state change is represented as an image pair, while the workflow is represented as a recipe flow graph (r-FG). The image pairs are grounded in the r-FG, which provides the cross-modal relation. With our dataset, one can try a range of applications, from multimodal commonsense reasoning and procedural text generation.

MMSep 21, 2022
Recipe Generation from Unsegmented Cooking Videos

Taichi Nishimura, Atsushi Hashimoto, Yoshitaka Ushiku et al.

This paper tackles recipe generation from unsegmented cooking videos, a task that requires agents to (1) extract key events in completing the dish and (2) generate sentences for the extracted events. Our task is similar to dense video captioning (DVC), which aims at detecting events thoroughly and generating sentences for them. However, unlike DVC, in recipe generation, recipe story awareness is crucial, and a model should extract an appropriate number of events in the correct order and generate accurate sentences based on them. We analyze the output of the DVC model and confirm that although (1) several events are adoptable as a recipe story, (2) the generated sentences for such events are not grounded in the visual content. Based on this, we set our goal to obtain correct recipes by selecting oracle events from the output events and re-generating sentences for them. To achieve this, we propose a transformer-based multimodal recurrent approach of training an event selector and sentence generator for selecting oracle events from the DVC's events and generating sentences for them. In addition, we extend the model by including ingredients to generate more accurate recipes. The experimental results show that the proposed method outperforms state-of-the-art DVC models. We also confirm that, by modeling the recipe in a story-aware manner, the proposed model outputs the appropriate number of events in the correct order.

CVFeb 22
Restoration-Guided Kuzushiji Character Recognition Framework under Seal Interference

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko et al.

Kuzushiji was one of the most popular writing styles in pre-modern Japan and was widely used in both personal letters and official documents. However, due to its highly cursive forms and extensive glyph variations, most modern Japanese readers cannot directly interpret Kuzushiji characters. Therefore, recent research has focused on developing automated Kuzushiji character recognition methods, which have achieved satisfactory performance on relatively clean Kuzushiji document images. However, existing methods struggle to maintain recognition accuracy under seal interference (e.g., when seals overlap characters), despite the frequent occurrence of seals in pre-modern Japanese documents. To address this challenge, we propose a three-stage restoration-guided Kuzushiji character recognition (RG-KCR) framework specifically designed to mitigate seal interference. We construct datasets for evaluating Kuzushiji character detection (Stage 1) and classification (Stage 3). Experimental results show that the YOLOv12-medium model achieves a precision of 98.0% and a recall of 93.3% on the constructed test set. We quantitatively evaluate the restoration performance of Stage 2 using PSNR and SSIM. In addition, we conduct an ablation study to demonstrate that Stage 2 improves the Top-1 accuracy of Metom, a Vision Transformer (ViT)-based Kuzushiji classifier employed in Stage 3, from 93.45% to 95.33%. The implementation code of this work is available at https://ruiyangju.github.io/RG-KCR.

CVNov 12, 2025
DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko et al.

Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using multiple versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, and Generative Adversarial Network (GAN)-based methods. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.

CVApr 4, 2024
BioVL-QR: Egocentric Biochemical Vision-and-Language Dataset Using Micro QR Codes

Tomohiro Nishimoto, Taichi Nishimura, Koki Yamamoto et al.

This paper introduces BioVL-QR, a biochemical vision-and-language dataset comprising 23 egocentric experiment videos, corresponding protocols, and vision-and-language alignments. A major challenge in understanding biochemical videos is detecting equipment, reagents, and containers because of the cluttered environment and indistinguishable objects. Previous studies assumed manual object annotation, which is costly and time-consuming. To address the issue, we focus on Micro QR Codes. However, detecting objects using only Micro QR Codes is still difficult due to blur and occlusion caused by object manipulation. To overcome this, we propose an object labeling method combining a Micro QR Code detector with an off-the-shelf hand object detector. As an application of the method and BioVL-QR, we tackled the task of localizing the procedural steps in an instructional video. The experimental results show that using Micro QR Codes and our method improves biochemical video understanding. Data and code are available through https://nishi10mo.github.io/BioVL-QR/

CLMar 25, 2024
Automatic Construction of a Large-Scale Corpus for Geoparsing Using Wikipedia Hyperlinks

Keyaki Ohno, Hirotaka Kameko, Keisuke Shirai et al.

Geoparsing is the task of estimating the latitude and longitude (coordinates) of location expressions in texts. Geoparsing must deal with the ambiguity of the expressions that indicate multiple locations with the same notation. For evaluating geoparsing systems, several corpora have been proposed in previous work. However, these corpora are small-scale and suffer from the coverage of location expressions on general domains. In this paper, we propose Wikipedia Hyperlink-based Location Linking (WHLL), a novel method to construct a large-scale corpus for geoparsing from Wikipedia articles. WHLL leverages hyperlinks in Wikipedia to annotate multiple location expressions with coordinates. With this method, we constructed the WHLL corpus, a new large-scale corpus for geoparsing. The WHLL corpus consists of 1.3M articles, each containing about 7.8 unique location expressions. 45.6% of location expressions are ambiguous and refer to more than one location with the same notation. In each article, location expressions of the article title and those hyperlinks to other articles are assigned with coordinates. By utilizing hyperlinks, we can accurately assign location expressions with coordinates even with ambiguous location expressions in the texts. Experimental results show that there remains room for improvement by disambiguating location expressions.

CLMay 31, 2023
Towards Flow Graph Prediction of Open-Domain Procedural Texts

Keisuke Shirai, Hirotaka Kameko, Shinsuke Mori

Machine comprehension of procedural texts is essential for reasoning about the steps and automating the procedures. However, this requires identifying entities within a text and resolving the relationships between the entities. Previous work focused on the cooking domain and proposed a framework to convert a recipe text into a flow graph (FG) representation. In this work, we propose a framework based on the recipe FG for flow graph prediction of open-domain procedural texts. To investigate flow graph prediction performance in non-cooking domains, we introduce the wikiHow-FG corpus from articles on wikiHow, a website of how-to instruction articles. In experiments, we consider using the existing recipe corpus and performing domain adaptation from the cooking to the target domain. Experimental results show that the domain adaptation models achieve higher performance than those trained only on the cooking or target domain data.