CL · AI · Nov 18, 2024

ByteScience: Bridging Unstructured Scientific Literature and Structured Data with Auto Fine-tuned Large Language Model in Token Granularity

Tong Xie, Hanzhi Zhang, Shaozhou Wang, Yuwei Wan, Imran Razzak, Chunyu Kit, Wenjie Zhang, Bram Hoex

arXiv:2411.12000v21.93 citations · h-index: 39 · Has Code · 2024 IEEE International Conference on Data Mining Workshops (ICDMW)

Originality: Incremental advance

AI Analysis

This tool streamlines data extraction for scientific research, though it appears incremental as it builds on existing fine-tuned models.

The authors tackled the challenge of extracting structured data from scientific literature by introducing ByteScience, an auto fine-tuned LLM platform that achieves high accuracy with minimal annotated articles.

Natural Language Processing (NLP) is widely used to condense long text into summaries and structured information. However, extracting structured knowledge from scientific text with NLP models remains challenging because of the domain-specific nature of the text, the complex data preprocessing it requires, and the granularity of multi-layered, device-level information. To address this, we introduce ByteScience, a non-profit, cloud-based, auto fine-tuned Large Language Model (LLM) platform designed to extract structured scientific data and synthesize new scientific knowledge from vast scientific corpora. The platform builds on DARWIN, an open-source, fine-tuned LLM dedicated to natural science. It is hosted on Amazon Web Services (AWS) and provides an automated, user-friendly workflow for custom model development and data extraction. The platform achieves remarkable accuracy with only a small number of well-annotated articles. This tool streamlines the transition from scientific literature to structured knowledge and data and supports advances in natural informatics.
