Comparison of Different Deep Neural Network Models in the Cultural Heritage Domain
This work addresses the need for effective computer vision models in cultural heritage documentation and visitor experiences, but its contribution is incremental: it compares existing methods on new data.
The study compared six deep neural network architectures (VGG, ResNet, DenseNet, Vision Transformer, Swin Transformer, and PoolFormer) on transferring knowledge from ImageNet to cultural heritage tasks, and found that DenseNet achieved the best efficiency-computability ratio.
The integration of computer vision and deep learning is an essential part of documenting and preserving cultural heritage, as well as improving visitor experiences. In recent years, two deep learning paradigms have become established in computer vision: convolutional neural networks and transformer architectures. The present study compares representatives of these two paradigms on their ability to transfer knowledge from a generic dataset, such as ImageNet, to cultural-heritage-specific tasks. Tests of the VGG, ResNet, DenseNet, Vision Transformer, Swin Transformer, and PoolFormer architectures showed that DenseNet offers the best efficiency-computability ratio.