Shuntaro Yada

CL
h-index23
7papers
911citations
Novelty23%
AI Score24

7 Papers

CLJul 4, 2024Code
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs

LLM-jp, Akiko Aizawa, Eiji Aramaki et al.

This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs). LLM-jp aims to develop open-source and strong Japanese LLMs, and as of this writing, more than 1,500 participants from academia and industry are working together for this purpose. This paper presents the background of the establishment of LLM-jp, summaries of its activities, and technical reports on the LLMs developed by LLM-jp. For the latest activities, visit https://llm-jp.nii.ac.jp/en/.

CLMar 27, 2024
A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages

Lisa Raithel, Hui-Syuan Yeh, Shuntaro Yada et al.

User-generated data sources have gained significance in uncovering Adverse Drug Reactions (ADRs), with an increasing number of discussions occurring in the digital world. However, the existing clinical corpora predominantly revolve around scientific articles in English. This work presents a multilingual corpus of texts concerning ADRs gathered from diverse sources, including patient fora, social media, and clinical reports in German, French, and Japanese. Our corpus contains annotations covering 12 entity types, four attribute types, and 13 relation types. It contributes to the development of real-world multilingual language models for healthcare. We provide statistics to highlight certain challenges associated with the corpus and conduct preliminary experiments resulting in strong baselines for extracting entities and relations between these entities, both within and across languages.

CLNov 8, 2021
JaMIE: A Pipeline Japanese Medical Information Extraction System

Fei Cheng, Shuntaro Yada, Ribeka Tanaka et al.

We present an open-access natural language processing toolkit for Japanese medical information extraction. We first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with the practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analyzing performance and suggest the satisfactory annotation quality, the effective annotation strategy for targeting report types, and the superiority of the latest contextual embedding models.

CLApr 21, 2021
End-to-end Biomedical Entity Linking with Span-based Dictionary Matching

Shogo Ujiie, Hayate Iso, Shuntaro Yada et al.

Disease name recognition and normalization, which is generally called biomedical entity linking, is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to utilize the mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models, in an end-to-end fashion. Experiments using two major datasets demonstrate that our model achieved competitive results with strong baselines, especially for unseen concepts during training.

CLDec 31, 2020
KART: Parameterization of Privacy Leakage Scenarios from Pre-trained Language Models

Yuta Nakamura, Shouhei Hanaoka, Yukihiro Nomura et al.

For the safe sharing pre-trained language models, no guidelines exist at present owing to the difficulty in estimating the upper bound of the risk of privacy leakage. One problem is that previous studies have assessed the risk for different real-world privacy leakage scenarios and attack methods, which reduces the portability of the findings. To tackle this problem, we represent complex real-world privacy leakage scenarios under a universal parameterization, \textit{Knowledge, Anonymization, Resource, and Target} (KART). KART parameterization has two merits: (i) it clarifies the definition of privacy leakage in each experiment and (ii) it improves the comparability of the findings of risk assessments. We show that previous studies can be simply reviewed by parameterizing the scenarios with KART. We also demonstrate privacy risk assessments in different scenarios under the same attack method, which suggests that KART helps approximate the upper bound of risk under a specific attack or scenario. We believe that KART helps integrate past and future findings on privacy risk and will contribute to a standard for sharing language models.

IRApr 21, 2020
Syndromic surveillance using search query logs and user location information from smartphones against COVID-19 clusters in Japan

Shohei Hisada, Taichi Murayama, Kota Tsubouchi et al.

[Background] Two clusters of coronavirus disease 2019 (COVID-19) were confirmed in Hokkaido, Japan in February 2020. To capture the clusters, this study employs Web search query logs and user location information from smartphones. [Material and Methods] First, we anonymously identified smartphone users who used a Web search engine (Yahoo! JAPAN Search) for the COVID-19 or its symptoms via its companion application for smartphones (Yahoo Japan App). We regard these searchers as Web searchers who are suspicious of their own COVID-19 infection (WSSCI). Second, we extracted the location of the WSSCI via the smartphone application. The spatio-temporal distribution of the number of WSSCI are compared with the actual location of the known two clusters. [Result and Discussion] Before the early stage of the cluster development, we could confirm several WSSCI, which demonstrated the basic feasibility of our WSSCI-based approach. However, it is accurate only in the early stage, and it was biased after the public announcement of the cluster development. For the case where the other cluster-related resources, such as fine-grained population statistics, are not available, the proposed metric would be helpful to catch the hint of emerging clusters.

SIApr 17, 2020
NAIST COVID: Multilingual COVID-19 Twitter and Weibo Dataset

Zhiwei Gao, Shuntaro Yada, Shoko Wakamiya et al.

Since the outbreak of coronavirus disease 2019 (COVID-19) in the late 2019, it has affected over 200 countries and billions of people worldwide. This has affected the social life of people owing to enforcements, such as "social distancing" and "stay at home." This has resulted in an increasing interaction through social media. Given that social media can bring us valuable information about COVID-19 at a global scale, it is important to share the data and encourage social media studies against COVID-19 or other infectious diseases. Therefore, we have released a multilingual dataset of social media posts related to COVID-19, consisting of microblogs in English and Japanese from Twitter and those in Chinese from Weibo. The data cover microblogs from January 20, 2020, to March 24, 2020. This paper also provides a quantitative as well as qualitative analysis of these datasets by creating daily word clouds as an example of text-mining analysis. The dataset is now available on Github. This dataset can be analyzed in a multitude of ways and is expected to help in efficient communication of precautions related to COVID-19.