Validating UTF-8 In Less Than One Instruction Per Byte
The paper presents a SIMD-based algorithm for validating UTF-8 that is more than 10 times faster than existing routines used in many libraries and languages.
It targets the UTF-8 validation bottleneck in software that ingests large amounts of text.
The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.
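
To make concrete what "validated on ingestion" involves, here is a minimal scalar sketch of the checks any UTF-8 validator has to enforce. It is not the lookup algorithm from the paper and uses no SIMD; it is the kind of byte-at-a-time routine that a vectorized approach aims to outperform. The function name `is_valid_utf8` is hypothetical and chosen for illustration only.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Scalar, byte-at-a-time UTF-8 validation (illustrative baseline, not SIMD)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                       # ASCII: single byte
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:              # 2-byte lead (0xC0/0xC1 are always overlong)
            length, min_cp, cp = 2, 0x80, b & 0x1F
        elif 0xE0 <= b <= 0xEF:            # 3-byte lead
            length, min_cp, cp = 3, 0x800, b & 0x0F
        elif 0xF0 <= b <= 0xF4:            # 4-byte lead (0xF5 and above are never valid)
            length, min_cp, cp = 4, 0x10000, b & 0x07
        else:
            return False                   # stray continuation byte or invalid lead byte
        if i + length > n:
            return False                   # sequence truncated at end of input
        for k in range(1, length):
            c = data[i + k]
            if c & 0xC0 != 0x80:
                return False               # expected a continuation byte (10xxxxxx)
            cp = (cp << 6) | (c & 0x3F)
        if cp < min_cp:
            return False                   # overlong encoding
        if 0xD800 <= cp <= 0xDFFF:
            return False                   # UTF-16 surrogate, not allowed in UTF-8
        if cp > 0x10FFFF:
            return False                   # beyond the Unicode code point range
        i += length
    return True


if __name__ == "__main__":
    print(is_valid_utf8("héllo €".encode("utf-8")))
    print(is_valid_utf8(b"\xc0\x80"))
```

Each input byte costs several branches and arithmetic operations in a loop like this, which is why a routine that handles many bytes per SIMD instruction can come out far ahead.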