CV · May 31, 2021

Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model

arXiv:2105.15089v3 · 26 citations
Originality: Incremental advance
AI Analysis

This work addresses the challenge of building efficient, flexible models for multi-modal AI tasks. The contribution appears incremental, since it builds on existing transformer structures.

The authors design a unified sequence model for multi-modal tasks by drawing an analogy between Vision Transformers and Evolutionary Algorithms, and propose an EAT model that improves efficiency and flexibility. The approach reports state-of-the-art ImageNet classification results compared with recent vision transformer works, with fewer parameters and higher throughput, and improves rank-1 by +3.7 points over the baseline on the CSS dataset for Text-Based Image Retrieval.

Inspired by biological evolution, we explain the rationale of the Vision Transformer by analogy with the well-established Evolutionary Algorithm (EA) and show that the two share a consistent mathematical representation. Analogous to the dynamic local population in EA, we improve the existing transformer structure and propose a more efficient EAT model, and we design task-related heads to handle different tasks more flexibly. Moreover, we introduce the space-filling curve into the current vision transformer to serialize image data into a uniform sequential format. We can thus design a unified EAT framework for multi-modal tasks that separates the network architecture from data-format adaptation. Our approach achieves state-of-the-art results on the ImageNet classification task compared with recent vision transformer works while having fewer parameters and higher throughput. We further conduct multi-modal tasks to demonstrate the superiority of the unified EAT, e.g., Text-Based Image Retrieval, where our approach improves rank-1 by +3.7 points over the baseline on the CSS dataset.
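
To make the sequencing step concrete, here is a minimal sketch of flattening a 2D grid of image patches into a 1D sequence along a space-filling curve. The abstract does not say which curve is used, so the Z-order (Morton) curve below is an assumption chosen for brevity. The names `morton_index` and `z_order_flatten` are hypothetical and are not taken from the paper's code.

```python
def morton_index(row: int, col: int, bits: int) -> int:
    """Interleave the low `bits` bits of (row, col) into one Z-order index."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)      # column bits fill even positions
        index |= ((row >> b) & 1) << (2 * b + 1)  # row bits fill odd positions
    return index


def z_order_flatten(grid):
    """Flatten a square grid (side length a power of two) along the Z-order curve.

    Assumption: Z-order stands in for whichever space-filling curve the
    authors actually use; the abstract does not specify it.
    """
    side = len(grid)
    is_power_of_two = side > 0 and side & (side - 1) == 0
    if not is_power_of_two or any(len(r) != side for r in grid):
        raise ValueError("grid must be square with a power-of-two side length")
    bits = side.bit_length() - 1
    cells = [
        (morton_index(r, c, bits), grid[r][c])
        for r in range(side)
        for c in range(side)
    ]
    cells.sort(key=lambda pair: pair[0])  # visit cells in curve order
    return [value for _, value in cells]


# A 4x4 grid of patch ids, numbered in ordinary row-major order.
patches = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
sequence = z_order_flatten(patches)
print(sequence)
# → [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Compared with plain row-major flattening, the curve order keeps each 2x2 block of neighbouring patches contiguous in the sequence, which is the usual motivation for using a space-filling curve to serialize images.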

Code Implementations: 1 repo
