Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?

arXiv:2601.09433v1 | h-index: 11

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for automated analysis that helps researchers and collectors understand ancient coins, but it is incremental: it applies an existing method (ViT) to a new domain.

The paper tackles the identification of semantic elements on ancient Roman coins by training Vision Transformer (ViT) models on multi-modal data (images and unstructured text), and finds that the ViT models outperform newly trained CNN models in accuracy.

Automated analysis of ancient coins has the potential to help researchers extract more historical insights from large collections of coins and to help collectors understand what they are buying or selling. Recent research in this area has shown promise by using convolutional neural networks (CNNs) to identify the semantic elements commonly depicted on ancient coins. This paper is the first to apply the recently proposed Vision Transformer (ViT) deep learning architecture to the identification of semantic elements on coins, using fully automatic learning from multi-modal data (images and unstructured text). The article summarises previous research in the area, discusses the training and implementation of ViT and CNN models for ancient coin analysis, and evaluates their performance. The ViT models were found to outperform the newly trained CNN models in accuracy.
