The RareDis corpus: a corpus annotated with rare diseases, their signs and symptoms
This work addresses the shortage of annotated datasets for rare diseases in medical NLP, which could support better diagnosis and treatment for patients. The contribution is incremental, since it focuses on data creation rather than novel methods.
The researchers tackled the scarcity of annotated data for rare diseases by creating the RareDis corpus, in which over 5,000 rare diseases and nearly 6,000 clinical manifestations are annotated. Inter-annotator agreement was relatively high, with F1-measures of 83.5% for entities (under exact match criteria) and 81.3% for relations.
In the RareDis corpus, more than 5,000 rare diseases and almost 6,000 clinical manifestations are annotated. Moreover, the inter-annotator agreement evaluation shows relatively high agreement (an F1-measure of 83.5% under exact match criteria for entities and 81.3% for relations). Based on these results, the corpus is of high quality and represents a significant step for the field, since few available corpora are annotated with rare diseases. This could open the door to further NLP applications that would facilitate the diagnosis and treatment of these rare diseases and, in turn, dramatically improve the quality of life of these patients.
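To make the agreement figures concrete, the following is a minimal sketch of how a pairwise F1-measure under exact match could be computed for entity annotations. It is an illustration under stated assumptions, not the procedure the authors report: the function name `exact_match_f1`, the `(start, end, label)` tuple layout, the label strings, and the toy offsets are all hypothetical. One annotator is treated as the reference and the other as the prediction, and two entities count as a match only if both character offsets and the label are identical.

```python
from typing import Set, Tuple

# Hypothetical representation: (start offset, end offset, entity label).
Entity = Tuple[int, int, str]


def exact_match_f1(annotator_a: Set[Entity], annotator_b: Set[Entity]) -> float:
    """Pairwise F1 between two annotators, treating A as the reference.

    Under exact match, an entity agrees only if start, end, and label
    are all identical in both annotation sets.
    """
    if not annotator_a and not annotator_b:
        return 1.0  # both annotators marked nothing: trivial agreement
    matches = len(annotator_a & annotator_b)
    precision = matches / len(annotator_b) if annotator_b else 0.0
    recall = matches / len(annotator_a) if annotator_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (invented offsets): the annotators disagree on the
# boundary of the second entity, so only two of three entities match.
annotator_a = {(0, 15, "rare_disease"), (30, 42, "sign"), (50, 58, "symptom")}
annotator_b = {(0, 15, "rare_disease"), (30, 40, "sign"), (50, 58, "symptom")}

f1 = exact_match_f1(annotator_a, annotator_b)
print(round(f1, 3))  # 0.667
```

Because precision and recall simply swap when the annotators' roles are swapped, this pairwise F1 is symmetric, which is one reason it is a common choice for agreement on span annotations. A relaxed variant would count overlapping spans as matches and would generally yield a higher score than the exact match criterion.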