Hiroyuki Shindo

CL
h-index14
21papers
7,973citations
Novelty42%
AI Score35

21 Papers

CLOct 2, 2020Code
LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Ikuya Yamada, Akari Asai, Hiroyuki Shindo et al.

Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.

CLSep 3, 2019Code
Neural Attentive Bag-of-Entities Model for Text Classification

Ikuya Yamada, Hiroyuki Shindo

This study proposes a Neural Attentive Bag-of-Entities model, which is a neural network model that performs text classification using entities in a knowledge base. Entities provide unambiguous and relevant semantic signals that are beneficial for capturing semantics in texts. We combine simple high-recall entity detection based on a dictionary, to detect entities in a document, with a novel neural attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities. We tested the effectiveness of our model using two standard text classification datasets (i.e., the 20 Newsgroups and R8 datasets) and a popular factoid question answering dataset based on a trivia quiz game. As a result, our model achieved state-of-the-art results on all datasets. The source code of the proposed model is available online at https://github.com/wikipedia2vec/wikipedia2vec.

CLSep 1, 2019Code
Global Entity Disambiguation with BERT

Ikuya Yamada, Koki Washio, Hiroyuki Shindo et al.

We propose a global entity disambiguation (ED) model based on BERT. To capture global contextual information for ED, our model treats not only words but also entities as input tokens, and solves the task by sequentially resolving mentions to their referent entities and using resolved entities as inputs at each step. We train the model using a large entity-annotated corpus obtained from Wikipedia. We achieve new state-of-the-art results on five standard ED datasets: AIDA-CoNLL, MSNBC, AQUAINT, ACE2004, and WNED-WIKI. The source code and model checkpoint are available at https://github.com/studio-ousia/luke.

CLDec 15, 2018Code
Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia

Ikuya Yamada, Akari Asai, Jin Sakuma et al.

The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io.

CLMar 26, 2024
Introducing Syllable Tokenization for Low-resource Languages: A Case Study with Swahili

Jesse Atuhurra, Hiroyuki Shindo, Hidetaka Kamigaito et al.

Many attempts have been made in multilingual NLP to ensure that pre-trained language models, such as mBERT or GPT2 get better and become applicable to low-resource languages. To achieve multilingualism for pre-trained language models (PLMs), we need techniques to create word embeddings that capture the linguistic characteristics of any language. Tokenization is one such technique because it allows for the words to be split based on characters or subwords, creating word embeddings that best represent the structure of the language. Creating such word embeddings is essential to applying PLMs to other languages where the model was not trained, enabling multilingual NLP. However, most PLMs use generic tokenization methods like BPE, wordpiece, or unigram which may not suit specific languages. We hypothesize that tokenization based on syllables within the input text, which we call syllable tokenization, should facilitate the development of syllable-aware language models. The syllable-aware language models make it possible to apply PLMs to languages that are rich in syllables, for instance, Swahili. Previous works introduced subword tokenization. Our work extends such efforts. Notably, we propose a syllable tokenizer and adopt an experiment-centric approach to validate the proposed tokenizer based on the Swahili language. We conducted text-generation experiments with GPT2 to evaluate the effectiveness of the syllable tokenizer. Our results show that the proposed syllable tokenizer generates syllable embeddings that effectively represent the Swahili language.

CLMar 13, 2024
Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito et al.

Natural language processing (NLP) practitioners are leveraging large language models (LLM) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and through in-context learning, we distilled knowledge from GPT-4. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 of four classes of endangered species, 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Eventually, our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. The constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT, because GPT-4 is resource intensive. Experiments show that our knowledge transfer approach is effective at creating a NER model suitable for detecting endangered species from texts.

CLOct 22, 2024
Graph-Structured Trajectory Extraction from Travelogues

Aitaro Yamamoto, Hiroyuki Otomo, Hiroki Ouchi et al.

Previous studies on sequence-based extraction of human movement trajectories have an issue of inadequate trajectory representation. Specifically, a pair of locations may not be lined up in a sequence especially when one location includes the other geographically. In this study, we propose a graph representation that retains information on the geographic hierarchy as well as the temporal order of visited locations, and have constructed a benchmark dataset for graph-structured trajectory extraction. The experiments with our baselines have demonstrated that it is possible to accurately predict visited locations and the order among them, but it remains a challenge to predict the hierarchical relations.

CLNov 28, 2024
NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER

Jesse Atuhurra, Hidetaka Kamigaito, Hiroki Ouchi et al.

Adapting named entity recognition (NER) methods to new domains poses significant challenges. We introduce RapidNER, a framework designed for the rapid deployment of NER systems through efficient dataset construction. RapidNER operates through three key steps: (1) extracting domain-specific sub-graphs and triples from a general knowledge graph, (2) collecting and leveraging texts from various sources to build the NERsocial dataset, which focuses on entities typical in human-robot interaction, and (3) implementing an annotation scheme using Elasticsearch (ES) to enhance efficiency. NERsocial, validated by human annotators, includes six entity types, 153K tokens, and 99.4K sentences, demonstrating RapidNER's capability to expedite dataset creation.

CLMay 23, 2023
Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation

Shohei Higashiyama, Hiroki Ouchi, Hiroki Teranishi et al.

Geoparsing is a fundamental technique for analyzing geo-entity information in text. We focus on document-level geoparsing, which considers geographic relatedness among geo-entity mentions, and presents a Japanese travelogue dataset designed for evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.

CLMay 19, 2023
NAIST Academic Travelogue Dataset

Hiroki Ouchi, Hiroyuki Shindo, Shoko Wakamiya et al.

We have constructed NAIST Academic Travelogue Dataset (ATD) and released it free of charge for academic research. This dataset is a Japanese text dataset with a total of over 31 million words, comprising 4,672 Japanese domestic travelogues and 9,607 overseas travelogues. Before providing our dataset, there was a scarcity of widely available travelogue data for research purposes, and each researcher had to prepare their own data. This hinders the replication of existing studies and fair comparative analysis of experimental results. Our dataset enables any researchers to conduct investigation on the same data and to ensure transparency and reproducibility in research. In this paper, we describe the academic significance, characteristics, and prospects of our dataset.

CLJan 21, 2020
Length-controllable Abstractive Summarization by Guiding with Summary Prototype

Itsumi Saito, Kyosuke Nishida, Kosuke Nishida et al.

We propose a new length-controllable abstractive summarization model. Recent state-of-the-art abstractive summarization models based on encoder-decoder models generate only one summary per source text. However, controllable summarization, especially of the length, is an important aspect for practical applications. Previous studies on length-controllable abstractive summarization incorporate length embeddings in the decoder module for controlling the summary length. Although the length embeddings can control where to stop decoding, they do not decide which information should be included in the summary within the length constraint. Unlike the previous models, our length-controllable abstractive summarization model incorporates a word-level extractive module in the encoder-decoder model instead of length embeddings. Our model generates a summary in two steps. First, our word-level extractor extracts a sequence of important words (we call it the "prototype text") from the source text according to the word-level importance scores and the length constraint. Second, the prototype text is used as additional input to the encoder-decoder model, which generates a summary by jointly encoding and copying words from both the prototype text and source text. Since the prototype text is a guide to both the content and length of the summary, our model can generate an informative and length-controlled summary. Experiments with the CNN/Daily Mail dataset and the NEWSROOM dataset show that our model outperformed previous models in length-controlled settings.

LGAug 31, 2019
Gated Graph Recursive Neural Networks for Molecular Property Prediction

Hiroyuki Shindo, Yuji Matsumoto

Molecule property prediction is a fundamental problem for computer-aided drug discovery and materials science. Quantum-chemical simulations such as density functional theory (DFT) have been widely used for calculating the molecule properties, however, because of the heavy computational cost, it is difficult to search a huge number of potential chemical compounds. Machine learning methods for molecular modeling are attractive alternatives, however, the development of expressive, accurate, and scalable graph neural networks for learning molecular representations is still challenging. In this work, we propose a simple and powerful graph neural networks for molecular property prediction. We model a molecular as a directed complete graph in which each atom has a spatial position, and introduce a recursive neural network with simple gating function. We also feed input embeddings for every layers as skip connections to accelerate the training. Experimental results show that our model achieves the state-of-the-art performance on the standard benchmark dataset for molecular property prediction.

CLAug 15, 2019
Improving Multi-Word Entity Recognition for Biomedical Texts

Hamada A. Nayel, H. L. Shashirekha, Hiroyuki Shindo et al.

Biomedical Named Entity Recognition (BioNER) is a crucial step for analyzing Biomedical texts, which aims at extracting biomedical named entities from a given text. Different supervised machine learning algorithms have been applied for BioNER by various researchers. The main requirement of these approaches is an annotated dataset used for learning the parameters of machine learning algorithms. Segment Representation (SR) models comprise of different tag sets used for representing the annotated data, such as IOB2, IOE2 and IOBES. In this paper, we propose an extension of IOBES model to improve the performance of BioNER. The proposed SR model, FROBES, improves the representation of multi-word entities. We used Bidirectional Long Short-Term Memory (BiLSTM) network; an instance of Recurrent Neural Networks (RNN), to design a baseline system for BioNER and evaluated the new SR model on two datasets, i2b2/VA 2010 challenge dataset and JNLPBA 2004 shared task dataset. The proposed SR model outperforms other models for multi-word entities with length greater than two. Further, the outputs of different SR models have been combined using majority voting ensemble method which outperforms the baseline models performance.

LGNov 10, 2018
Playing by the Book: An Interactive Game Approach for Action Graph Extraction from Text

Ronen Tamari, Hiroyuki Shindo, Dafna Shahaf et al.

Understanding procedural text requires tracking entities, actions and effects as the narrative unfolds. We focus on the challenging real-world problem of action-graph extraction from material science papers, where language is highly specialized and data annotation is expensive and scarce. We propose a novel approach, Text2Quest, where procedural text is interpreted as instructions for an interactive game. A learning agent completes the game by executing the procedure correctly in a text-based simulated lab environment. The framework can complement existing approaches and enables richer forms of learning compared to static texts. We discuss potential limitations and advantages of the approach, and release a prototype proof-of-concept, hoping to encourage research in this direction.

CLOct 4, 2018
A Span Selection Model for Semantic Role Labeling

Hiroki Ouchi, Hiroyuki Shindo, Yuji Matsumoto

We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is to allow us to design and use span-level features, that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves the state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.

CLJun 8, 2018
Representation Learning of Entities and Documents from Knowledge Base Descriptions

Ikuya Yamada, Hiroyuki Shindo, Yoshiyasu Takefuji

In this paper, we describe TextEnt, a neural network model that learns distributed representations of entities and documents directly from a knowledge base (KB). Given a document in a KB consisting of words and entity annotations, we train our model to predict the entity that the document describes and map the document and its target entity close to each other in a continuous vector space. Our model is trained using a large number of documents extracted from Wikipedia. The performance of the proposed model is evaluated using two tasks, namely fine-grained entity typing and multiclass text classification. The results demonstrate that our model achieves state-of-the-art performance on both tasks. The code and the trained representations are made available online for further academic research.

LGMay 8, 2018
Interpretable Adversarial Perturbation in Input Embedding Space for Text

Motoki Sato, Jun Suzuki, Hiroyuki Shindo et al.

Following great success in the image processing field, the idea of adversarial training has been applied to tasks in the natural language processing (NLP) field. One promising approach directly applies adversarial training developed in the image processing field to the input word embedding space instead of the discrete input space of texts. However, this approach abandons such interpretability as generating adversarial texts to significantly improve the performance of NLP tasks. This paper restores interpretability to such methods by restricting the directions of perturbations toward the existing words in the input embedding space. As a result, we can straightforwardly reconstruct each input with perturbations to an actual text by considering the perturbations to be the replacement of words in the sentence while maintaining or even improving the task performance.

CLMar 23, 2018
Studio Ousia's Quiz Bowl Question Answering System

Ikuya Yamada, Ryuji Tamaki, Hiroyuki Shindo et al.

In this chapter, we describe our question answering system, which was the winning system at the Human-Computer Question Answering (HCQA) Competition at the Thirty-first Annual Conference on Neural Information Processing Systems (NIPS). The competition requires participants to address a factoid question answering task referred to as quiz bowl. To address this task, we use two novel neural network models and combine these models with conventional information retrieval models using a supervised machine learning model. Our system achieved the best performance among the systems submitted in the competition and won a match against six top human quiz experts by a wide margin.

CLMay 6, 2017
Learning Distributed Representations of Texts and Entities from Knowledge Base

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda et al.

We describe a neural network model that jointly learns distributed representations of texts and knowledge base (KB) entities. Given a text in the KB, we train our proposed model to predict entities that are relevant to the text. Our model is designed to be generic with the ability to address various NLP tasks with ease. We train the model using a large corpus of texts and their entity annotations extracted from Wikipedia. We evaluated the model on three important NLP tasks (i.e., sentence textual similarity, entity linking, and factoid question answering) involving both unsupervised and supervised settings. As a result, we achieved state-of-the-art results on all three of these tasks. Our code and trained models are publicly available for further academic research.

CLMar 15, 2017
Ensemble of Neural Classifiers for Scoring Knowledge Base Triples

Ikuya Yamada, Motoki Sato, Hiroyuki Shindo

This paper describes our approach for the triple scoring task at the WSDM Cup 2017. The task required participants to assign a relevance score for each pair of entities and their types in a knowledge base in order to enhance the ranking results in entity retrieval tasks. We propose an approach wherein the outputs of multiple neural network classifiers are combined using a supervised machine learning model. The experimental results showed that our proposed method achieved the best performance in one out of three measures (i.e., Kendall's tau), and performed competitively in the other two measures (i.e., accuracy and average score difference).

CLJan 6, 2016
Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda et al.

Named Entity Disambiguation (NED) refers to the task of resolving multiple named entity mentions in a document to their correct references in a knowledge base (KB) (e.g., Wikipedia). In this paper, we propose a novel embedding method specifically designed for NED. The proposed method jointly maps words and entities into the same continuous vector space. We extend the skip-gram model by using two models. The KB graph model learns the relatedness of entities using the link structure of the KB, whereas the anchor context model aims to align vectors such that similar words and entities occur close to one another in the vector space by leveraging KB anchors and their context words. By combining contexts based on the proposed embedding with standard NED features, we achieved state-of-the-art accuracy of 93.1% on the standard CoNLL dataset and 85.2% on the TAC 2010 dataset.