Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description
This work addresses data deficiency in zero-shot learning for medical image description, though it appears incremental as it builds on existing methods with specific improvements.
The paper tackles overfitting to known classes in zero-shot learning by using category semantic similarity to incorporate unknown classes into the vector space. It also addresses the lack of semantic information in feature extraction by employing ELMo-MCT with self-attention to obtain multimodal visual features, and it reports the best harmonic mean accuracy on three benchmark datasets.
Zero-shot learning is an effective approach to data deficiency. Existing embedding-based zero-shot learning methods use only the known classes to construct the embedding space, so they overfit to the known classes at test time. This work uses category semantic similarity measures for multi-label classification, which allows unknown classes that are semantically similar to known classes to be incorporated into the vector space when it is built. In addition, most existing zero-shot learning algorithms take the deep features of medical images directly as input, and their feature extraction process does not consider semantic information. This work takes ELMo-MCT as the main task and uses a self-attention mechanism to obtain multiple visual features related to the original image. Extensive experiments on three zero-shot learning benchmark datasets show that the method achieves the best harmonic mean accuracy compared with state-of-the-art algorithms.
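To make the two quantities mentioned above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that class semantics are given as plain numeric vectors, that semantic similarity is measured with cosine similarity, and that the harmonic accuracy follows the usual generalized zero-shot convention H = 2SU / (S + U), where S and U are the accuracies on known and unknown classes. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length semantic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_u * norm_v)


def nearest_known_class(unknown_vec, known_vecs):
    """Return (name, similarity) of the known class closest to an unknown class.

    known_vecs maps a class name to its semantic vector (assumed format).
    """
    scored = (
        (name, cosine_similarity(unknown_vec, vec))
        for name, vec in known_vecs.items()
    )
    return max(scored, key=lambda pair: pair[1])


def harmonic_mean_accuracy(acc_known, acc_unknown):
    """Harmonic mean of known-class and unknown-class accuracy."""
    total = acc_known + acc_unknown
    if total == 0:
        return 0.0
    return 2 * acc_known * acc_unknown / total


if __name__ == "__main__":
    # Toy semantic vectors, purely illustrative.
    known = {"class_a": [1.0, 0.0], "class_b": [0.0, 1.0]}
    name, sim = nearest_known_class([1.0, 0.1], known)
    print(name, round(sim, 3))
    print(round(harmonic_mean_accuracy(0.8, 0.4), 4))
```

The harmonic mean rewards balance: a model with high accuracy on known classes but low accuracy on unknown classes scores well below the arithmetic average of the two, which is why it is a common metric when overfitting to known classes is the concern.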