Toshiyuki Amagasa

CL
4 papers
44 citations
Novelty: 45%
AI Score: 24

4 Papers

CL · Sep 19, 2023
Semi-automatic staging area for high-quality structured data extraction from scientific literature

Luca Foppiano, Tomoya Mato, Kensei Terashima et al.

We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from the literature, called SuperCon2, to enrich the existing manually built superconductor database SuperCon. Here we report our curation interface (SuperCon2 Interface) and a workflow that manages the state transitions of each examined record, in order to validate the dataset of superconductors extracted from PDF documents with Grobid-superconductors in a previous work. This curation workflow allows both automatic and manual operations. The automatic operations include "anomaly detection", which scans new data to identify outliers, and a "training data collector" mechanism that collects training data examples based on manual corrections. This training data collection policy is effective in improving the machine-learning models with a reduced number of examples. For manual operations, the SuperCon2 Interface is designed to increase the efficiency of manual correction by providing a smart interface and an enhanced PDF document viewer. We show that our interface significantly improves curation quality, boosting precision and recall compared with traditional "manual correction". Our semi-automatic approach could provide a solution for achieving a reliable database with text-data mining of scientific documents.
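The "anomaly detection" step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say which outlier test SuperCon2 uses, so the median-absolute-deviation rule, the function name `flag_outliers`, the threshold of 3.5, and the sample values below are all assumptions chosen only for illustration.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return one boolean per value; True marks a suspected outlier.

    Assumption: a modified z-score based on the median absolute
    deviation (MAD). This is NOT taken from the SuperCon2 paper.
    """
    if not values:
        return []
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values share the median except possible outliers.
        return [v != med for v in values]
    # 0.6745 rescales the MAD so scores are comparable to z-scores.
    return [0.6745 * abs(v - med) / mad > threshold for v in values]


# Hypothetical extracted critical temperatures in kelvin; the last
# value mimics an extraction error such as a misplaced decimal point.
extracted_tc = [39.0, 38.5, 40.2, 39.4, 3900.0]
flags = flag_outliers(extracted_tc)
```

Records flagged this way would be routed to a curator rather than rejected outright, which matches the abstract's split between automatic and manual operations.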

CL · Jan 19, 2024
Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Luca Foppiano, Guillaume Lambard, Toshiyuki Amagasa et al.

This study assesses the capabilities of large language models (LLMs) such as GPT-3.5-Turbo, GPT-4, and GPT-4-Turbo in extracting structured information from scientific documents in materials science. To this end, we focus primarily on two critical information extraction tasks: (i) named entity recognition (NER) of studied materials and physical properties and (ii) relation extraction (RE) between these entities. Due to the evident lack of datasets within Materials Informatics (MI), we evaluate on SuperMat, based on superconductor research, and MeasEval, a generic measurement evaluation corpus. The performance of LLMs on these tasks is benchmarked against traditional models based on the BERT architecture and rule-based approaches (baseline). We introduce a novel methodology for the comparative analysis of intricate material expressions, emphasising the standardisation of chemical formulas to tackle the complexities inherent in materials science information assessment. For NER, LLMs fail to outperform the baseline with zero-shot prompting and exhibit only limited improvement with few-shot prompting. However, a GPT-3.5-Turbo fine-tuned with the appropriate strategy for RE outperforms all models, including the baseline. Without any fine-tuning, GPT-4 and GPT-4-Turbo display remarkable reasoning and relationship extraction capabilities after being provided with merely a couple of examples, surpassing the baseline. Overall, the results suggest that although LLMs demonstrate relevant reasoning skills in connecting concepts, specialised models are currently a better choice for tasks that require extracting complex domain-specific entities such as materials. These insights provide initial guidance applicable to other materials science sub-domains in future work.
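The idea of standardising chemical formulas before comparing extracted materials can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not describe the paper's actual procedure, and the helper names `normalise_formula` and `same_material` are hypothetical. The sketch handles flat formulas only (no parentheses, variables such as `x`, or dopant notation).

```python
import re

# One element symbol followed by an optional integer or decimal amount.
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")
_FLAT_FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")


def normalise_formula(formula):
    """Map a flat formula such as 'YBa2Cu3O7' to {element: amount}."""
    compact = formula.replace(" ", "")
    if not _FLAT_FORMULA.fullmatch(compact):
        raise ValueError(f"unsupported formula: {formula!r}")
    amounts = {}
    for symbol, amount in _ELEMENT.findall(compact):
        # A missing amount means one atom; repeated symbols are summed.
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return amounts


def same_material(a, b, tol=1e-6):
    """Compare two formulas by composition, ignoring element order."""
    fa, fb = normalise_formula(a), normalise_formula(b)
    return fa.keys() == fb.keys() and all(abs(fa[k] - fb[k]) <= tol for k in fa)
```

With this kind of normalisation, a predicted entity `Ba2YCu3O7` and a gold entity `YBa2Cu3O7` compare as equal even though a plain string match would count the prediction as wrong.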

CL · May 30, 2023
AdapterEM: Pre-trained Language Model Adaptation for Generalized Entity Matching using Adapter-tuning

John Bosco Mugeni, Steven Lynden, Toshiyuki Amagasa et al.

Entity Matching (EM) involves identifying different data representations referring to the same entity from multiple data sources and is typically formulated as a binary classification problem. It is a challenging problem in data integration due to the heterogeneity of data representations. State-of-the-art solutions have adopted NLP techniques based on pre-trained language models (PrLMs) via the fine-tuning paradigm. However, sequential fine-tuning of overparameterized PrLMs can lead to catastrophic forgetting, especially in low-resource scenarios. In this study, we propose a parameter-efficient paradigm for fine-tuning PrLMs based on adapters, small neural networks encapsulated between layers of a PrLM, by optimizing only the adapter and classifier weights while the PrLM's parameters are frozen. Adapter-based methods have been successfully applied to multilingual speech problems with promising results. However, the effectiveness of these methods when applied to EM is not yet well understood, particularly for generalized EM with heterogeneous data. Furthermore, we explore using (i) pre-trained adapters and (ii) invertible adapters to capture token-level language representations and demonstrate their benefits for transfer learning on the generalized EM benchmark. Our results show that our solution achieves comparable or superior performance to full-scale PrLM fine-tuning and prompt-tuning baselines while using a significantly smaller computational footprint, approximately 13% of the PrLM parameters.
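To make the adapter idea concrete, here is a minimal dependency-free sketch of one adapter layer. It assumes the common bottleneck design (down-projection, ReLU, up-projection, residual connection) without bias terms; the abstract only says adapters are small networks placed between PrLM layers, so the exact shape, the function names, and the sizes used below are assumptions rather than the AdapterEM implementation.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def adapter_forward(hidden, w_down, w_up):
    """Bottleneck adapter: hidden + W_up @ relu(W_down @ hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    # Residual connection: the frozen layer's output passes through unchanged
    # and the adapter only adds a learned correction.
    return [h + u for h, u in zip(hidden, update)]


def adapter_param_count(hidden_size, bottleneck_size):
    """Trainable weights in one bias-free adapter (two projections)."""
    return 2 * hidden_size * bottleneck_size


hidden = [0.5, -1.0, 2.0, 0.0]                          # output of a frozen layer
w_down = [[0.1, 0.0, 0.2, 0.0], [0.0, 0.3, 0.0, 0.1]]   # 4 -> 2
w_up_zero = [[0.0, 0.0] for _ in range(4)]              # 2 -> 4, zero-initialised
w_up = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

unchanged = adapter_forward(hidden, w_down, w_up_zero)
adapted = adapter_forward(hidden, w_down, w_up)
```

With a zero-initialised up-projection the adapter starts as the identity, so training begins from the frozen model's behaviour. Only `w_down`, `w_up`, and the classifier would receive gradient updates, which is where the reduced parameter footprint comes from.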

SE · Aug 25, 2021
Recommending Extract Method Refactoring Based on Confidence of Predicted Method Name

Jinto Yamanaka, Yasuhiro Hayase, Toshiyuki Amagasa

Refactoring is an important activity that is frequently performed in software development, and Extract Method is known to be one of the most frequently performed refactorings. Existing techniques for recommending Extract Method refactoring calculate metrics from the source method and the code fragments to be extracted in order to rank the recommendation candidates. This paper proposes a new technique for accurately recommending Extract Method refactoring that considers, in addition to the metrics used in previous studies, whether code fragments are semantically coherent chunks that can be given clear method names. As a criterion for semantic coherence, the proposed technique uses the probability (i.e., confidence) of the method names that code2seq, a state-of-the-art method name prediction technique, predicts for the code fragments. The evaluation experiment confirmed that the proposed technique recommends refactorings more correctly than the existing techniques.
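One way to picture combining structural metrics with name-prediction confidence is the ranking sketch below. Everything in it is an assumption for illustration: the abstract does not give the scoring formula, so the weighted sum, the `weight` parameter, the `Candidate` fields, and the use of a product of per-token probabilities as the confidence are hypothetical, and no code2seq API is called.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    fragment_id: str
    metric_score: float          # assumed to be normalised to [0, 1]
    name_token_probs: list       # assumed per-token probabilities of the predicted name


def name_confidence(token_probs):
    """Assumed confidence: product of the per-token probabilities."""
    return math.prod(token_probs)


def combined_score(candidate, weight=0.5):
    """Hypothetical weighted sum of metric score and name confidence."""
    confidence = name_confidence(candidate.name_token_probs)
    return (1 - weight) * candidate.metric_score + weight * confidence


def rank_candidates(candidates, weight=0.5):
    """Order candidates from most to least recommended."""
    return sorted(candidates, key=lambda c: combined_score(c, weight), reverse=True)


candidates = [
    Candidate("lines 10-18", metric_score=0.8, name_token_probs=[0.5, 0.5]),
    Candidate("lines 22-30", metric_score=0.7, name_token_probs=[0.9, 0.9]),
]
ranked = [c.fragment_id for c in rank_candidates(candidates)]
metrics_only = [c.fragment_id for c in rank_candidates(candidates, weight=0.0)]
```

In this toy input the second fragment scores slightly lower on metrics alone but receives a far more confident name prediction, so it moves to the top once confidence is taken into account.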