CL · DB · IR · Jul 11, 2023

Duncode Characters Shorter

arXiv:2307.05414v1 · h-index: 1 · Has Code
Originality: Incremental advance
AI Analysis

Duncode addresses the need for more compact text encoding in applications where storage or bandwidth is limited. It is an incremental improvement over existing encoders.

The paper tackles the problem of encoding the entire Unicode character set efficiently by introducing Duncode, a method that achieves higher space efficiency than UTF-8, though with reduced self-synchronizing capabilities.
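To make the self-synchronization trade-off concrete, here is a minimal sketch of how a reader can recover a character boundary in UTF-8 from an arbitrary byte offset. It illustrates UTF-8 only, not Duncode, whose byte layout is not described here; the helper names `is_utf8_boundary` and `resync` are hypothetical.

```python
def is_utf8_boundary(byte: int) -> bool:
    """Return True if this byte can start a UTF-8 character."""
    # UTF-8 continuation bytes always have the bit pattern 10xxxxxx.
    return (byte & 0b1100_0000) != 0b1000_0000


def resync(data: bytes, offset: int) -> int:
    """Advance from an arbitrary offset to the next character boundary."""
    while offset < len(data) and not is_utf8_boundary(data[offset]):
        offset += 1
    return offset


text = "aé中"
data = text.encode("utf-8")   # 1 + 2 + 3 bytes
start = resync(data, 2)       # offset 2 falls inside the two-byte "é"
print(start, data[start:].decode("utf-8"))
```

Because every UTF-8 byte identifies itself as a lead or continuation byte, a decoder dropped into the middle of a stream skips at most a few bytes before it can resume. An encoding that spends fewer bits on this identification can be more compact but is harder to resynchronize.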

This paper investigates encoders that transform text by converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode a limited set of characters in fewer bytes, and universal encoders such as UTF-8 and UTF-16, which can encode the complete Unicode set at the cost of more space and are gaining widespread acceptance. Other encoders, including SCSU, BOCU-1, and binary encoders, lack self-synchronizing capabilities. Duncode is introduced as an encoding method that aims to encode the entire Unicode character set with high space efficiency, akin to local encoders. It can compress multiple characters of a string into a single Duncode unit using fewer bytes. Although it carries less self-synchronizing identification information, Duncode surpasses UTF-8 in space efficiency. The application is available at https://github.com/laohur/duncode. Additionally, we have developed a benchmark for evaluating character encoders across different languages. It covers 179 languages and can be accessed at https://github.com/laohur/wiki2txt.
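The space gap between local and universal encoders can be checked with Python's built-in codecs. The sketch below does not implement Duncode; it only compares the encoders named in the abstract that ship with the standard library, and the helper name `encoded_sizes` is hypothetical.

```python
from typing import Dict, Optional


def encoded_sizes(text: str) -> Dict[str, Optional[int]]:
    """Byte length of `text` under several codecs, or None if unencodable."""
    sizes: Dict[str, Optional[int]] = {}
    for codec in ("ascii", "gb2312", "utf-8", "utf-16-le"):
        try:
            sizes[codec] = len(text.encode(codec))
        except UnicodeEncodeError:
            # Local encoders cover only part of Unicode.
            sizes[codec] = None
    return sizes


for sample in ("hello", "中文"):
    print(sample, encoded_sizes(sample))
```

For the two-character Chinese sample, GB2312 needs 4 bytes where UTF-8 needs 6, while ASCII cannot encode it at all. This is the gap that a universal but compact encoder would aim to close.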

Code Implementations: 2 repos
