CL · Apr 25

Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing

Arthur Amalvy, Vincent Labatut, Xavier Bost, Hen-Hsen Huang

arXiv:2604.2341281.4

Predicted impact: top 65% in CL · last 90 days
Originality: Incremental advance

AI Analysis

This work addresses the difficulty of sharing copyrighted corpora among NLP researchers, enabling broader access to diverse data while respecting copyright law.

The paper proposes a method for sharing annotations of copyrighted literary texts by distributing non-reversible hashed versions of the source material, enabling lawful exchange while protecting copyright. The method achieves 98.7-99.79% token alignment accuracy across different editions of novels.

While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully and publicly share the annotations of copyrighted literary texts. The corpus creator shares the annotations in the clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material and apply the same hash function to their own tokens in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator's version. We publicly release novelshare, a Python implementation of our method.
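
The workflow described in the abstract (the creator publishes hashed tokens with annotations, the user hashes their own copy and aligns the two sequences) can be illustrated with a minimal sketch. This is not the novelshare API or the authors' algorithm: the function names `hash_token`, `share`, and `align`, the use of SHA-256, and the use of `difflib.SequenceMatcher` for alignment are all assumptions chosen for illustration, and the sketch makes no claim about how resistant per-token hashes are to guessing.

```python
import hashlib
from difflib import SequenceMatcher


def hash_token(token: str) -> str:
    """Hash one token with SHA-256 (assumed hash; the paper's choice may differ)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def share(tokens, labels):
    """Creator side: pair each hashed token with its plain-text annotation."""
    return [(hash_token(t), label) for t, label in zip(tokens, labels)]


def align(shared, user_tokens):
    """User side: hash the user's own tokens and copy labels from matching runs.

    Tokens in the user's edition with no counterpart in the shared
    corpus receive None.
    """
    shared_hashes = [h for h, _ in shared]
    user_hashes = [hash_token(t) for t in user_tokens]
    matcher = SequenceMatcher(a=shared_hashes, b=user_hashes, autojunk=False)
    labels = [None] * len(user_tokens)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.b + k] = shared[block.a + k][1]
    return labels


# Hypothetical example: the user's edition has one extra token.
creator_tokens = ["Alice", "left", "the", "village", "."]
creator_labels = ["B-PER", "O", "O", "B-LOC", "O"]
user_tokens = ["Alice", "quietly", "left", "the", "village", "."]

shared = share(creator_tokens, creator_labels)
print(align(shared, user_tokens))
# ['B-PER', None, 'O', 'O', 'B-LOC', 'O']
```

Only the hashes and labels in `shared` would be distributed; the user recovers the annotations by supplying their own copy of the text, and small edition differences show up as unlabeled tokens rather than as a failed alignment.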
