A Few-shot Learning Approach for Historical Ciphered Manuscript Recognition
This work addresses the problem of digitizing and analyzing historical encrypted documents for historians and archivists. It represents a domain-specific, incremental advance.
The paper tackles the problem of automatically recognizing historical ciphered manuscripts, which is challenging due to varying cipher alphabets, a lack of annotated data, and touching symbols. The result is a novel few-shot learning approach that, when fine-tuned on a few labeled pages, surpasses existing methods for cipher recognition.
Encoded (or ciphered) manuscripts are a special type of historical document that contains encrypted text. The automatic recognition of this kind of document is challenging because: 1) the cipher alphabet changes from one document to another, 2) there is a lack of annotated corpora for training, and 3) touching symbols make symbol segmentation difficult and complex. To overcome these difficulties, we propose a novel method for handwritten cipher recognition based on few-shot object detection. Our method first detects all symbols of a given alphabet in a line image, and then a decoding step maps the symbol similarity scores to the final sequence of transcribed symbols. By training on synthetic data, we show that the proposed architecture is able to recognize handwritten ciphers with unseen alphabets. In addition, if a few labeled pages with the same alphabet are used for fine-tuning, our method surpasses existing unsupervised and supervised HTR methods for cipher recognition.
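
The decoding step mentioned above, which turns per-symbol similarity scores into a transcribed sequence, can be illustrated with a minimal sketch. It assumes the detector has already produced a similarity matrix with one row per alphabet symbol and one column per horizontal position in the line image. The function name `decode_line`, the parameters `threshold` and `min_gap`, and the greedy peak-suppression rule are all illustrative assumptions; this is not necessarily the decoding algorithm used in the paper.

```python
import numpy as np


def decode_line(similarity, alphabet, threshold=0.5, min_gap=2):
    """Greedy decoding of a symbol-similarity matrix (hypothetical sketch).

    similarity: array of shape (num_symbols, width); entry [s, x] is the
                assumed similarity between alphabet symbol s and the line
                image at horizontal position x.
    alphabet:   list of symbol labels, one per row of `similarity`.
    threshold:  minimum score for a position to count as a detection.
    min_gap:    minimum distance (in columns) between two kept detections.
    """
    sim = np.asarray(similarity, dtype=float)
    best_symbol = sim.argmax(axis=0)  # most similar symbol at each column
    best_score = sim.max(axis=0)      # its similarity score

    # Visit columns from highest to lowest score and keep a column only if
    # it is far enough from every detection already kept (1-D suppression).
    order = np.argsort(-best_score, kind="stable")
    kept = []
    for col in order:
        if best_score[col] < threshold:
            break  # all remaining columns score even lower
        if all(abs(int(col) - k) >= min_gap for k in kept):
            kept.append(int(col))

    # Reading order is left to right along the line image.
    kept.sort()
    return [alphabet[int(best_symbol[c])] for c in kept]


if __name__ == "__main__":
    # Toy scores for a three-symbol alphabet over eight columns.
    scores = np.zeros((3, 8))
    scores[0, 1] = 0.9  # strong "A" peak
    scores[0, 2] = 0.6  # weaker neighbour of the same peak, suppressed
    scores[1, 4] = 0.8  # "B"
    scores[2, 6] = 0.7  # "C"
    scores[1, 6] = 0.3  # loses to "C" at the same column
    print(decode_line(scores, ["A", "B", "C"]))
```

On this toy input the sketch prints `['A', 'B', 'C']`: the weaker response at column 2 is suppressed because it lies within `min_gap` of the stronger peak at column 1. A greedy rule of this kind is only one simple option; sequence-level decoders are a common alternative when symbols touch or overlap.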