AMES: Approximate Multi-modal Enterprise Search via Late Interaction Retrieval
The paper addresses the challenge of efficient multimodal search for enterprise applications, though its contribution is incremental because it builds on existing late interaction methods.
The paper tackles the problem of multimodal enterprise search by introducing AMES, a unified late interaction retrieval architecture that enables cross-modal retrieval without modality-specific logic, achieving competitive ranking performance on the ViDoRe V3 benchmark within a scalable, production-ready Solr-based system.
We present AMES (Approximate Multimodal Enterprise Search), a unified, backend-agnostic multimodal late interaction retrieval architecture. AMES demonstrates that fine-grained multimodal late interaction retrieval can be deployed within a production-grade enterprise search engine without architectural redesign. Text tokens, image patches, and video frames are embedded into a shared representation space using multi-vector encoders, enabling cross-modal retrieval without modality-specific retrieval logic. AMES employs a two-stage pipeline: parallel token-level ANN search with a per-document Top-M MaxSim approximation, followed by accelerator-optimized exact MaxSim re-ranking. Experiments on the ViDoRe V3 benchmark show that AMES achieves competitive ranking performance within a scalable, production-ready, Solr-based system.
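The two-stage pipeline described in the abstract can be illustrated with a minimal sketch in plain Python. This is not the AMES implementation: the function names, the toy two-dimensional vectors, and the brute-force `token_search` (a stand-in for a real ANN index) are all hypothetical. The abstract does not define Top-M precisely, so the sketch assumes M is the number of candidate documents passed to re-ranking, and that a query vector with no retrieved hit for a document contributes nothing to that document's approximate score.

```python
from collections import defaultdict


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def exact_maxsim(query_vecs, doc_vecs):
    """Sum over query vectors of the best dot product with any document vector."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def token_search(query_vec, index, k):
    """Brute-force stand-in for an ANN lookup: the k best (score, doc_id) hits."""
    hits = [(dot(query_vec, vec), doc_id) for doc_id, vec in index]
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:k]


def approximate_candidates(query_vecs, index, k, m):
    """Stage 1 (assumed form): approximate MaxSim from token-level hits only,
    then keep the m highest-scoring documents."""
    scores = defaultdict(float)
    for q in query_vecs:
        best = {}  # best retrieved similarity per document for this query vector
        for score, doc_id in token_search(q, index, k):
            best[doc_id] = max(score, best.get(doc_id, score))
        for doc_id, score in best.items():
            scores[doc_id] += score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:m]


def rerank(query_vecs, corpus, candidates):
    """Stage 2: re-score the candidates with exact MaxSim over all their vectors."""
    scored = [(exact_maxsim(query_vecs, corpus[d]), d) for d in candidates]
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored


# Toy corpus: each document is a list of vectors (tokens, patches, or frames).
corpus = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],
    "doc_b": [[0.7, 0.7], [0.6, 0.8]],
    "doc_c": [[-1.0, 0.0], [0.0, -1.0]],
}
# Flat token-level index of (doc_id, vector) pairs.
index = [(doc_id, vec) for doc_id, vecs in corpus.items() for vec in vecs]
query = [[1.0, 0.0], [0.0, 1.0]]

candidates = approximate_candidates(query, index, k=3, m=2)
results = rerank(query, corpus, candidates)
```

In this sketch the first stage only sees the vectors returned by the token-level search, so its scores are a lower bound on the exact MaxSim whenever similarities are non-negative. The second stage then recomputes the score over every vector of the surviving candidates.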