FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
This work targets hallucinations in VQA models, a concern for real-world deployment, but the contribution appears incremental because it builds on existing methods such as BLIP-VQA and RAG.
The paper tackles hallucinations in Visual Question Answering (VQA) by introducing FilterRAG, a retrieval-augmented framework that grounds answers in external knowledge sources. It reports 36.5% accuracy on the OK-VQA dataset, which the authors present as evidence of fewer incorrect answers.
Visual Question Answering requires models to generate accurate answers by integrating visual and textual understanding. However, VQA models still struggle with hallucinations, producing convincing but incorrect answers, particularly in knowledge-driven and Out-of-Distribution scenarios. We introduce FilterRAG, a retrieval-augmented framework that combines BLIP-VQA with Retrieval-Augmented Generation to ground answers in external knowledge sources like Wikipedia and DBpedia. FilterRAG achieves 36.5% accuracy on the OK-VQA dataset, demonstrating its effectiveness in reducing hallucinations and improving robustness in both in-domain and Out-of-Distribution settings. These findings highlight the potential of FilterRAG to improve Visual Question Answering systems for real-world deployment.
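The abstract describes the architecture only at a high level, so the following is a minimal sketch of the general retrieve-then-answer pattern, not the authors' implementation. It assumes a visual model (such as BLIP-VQA) has already produced a candidate answer, uses a toy keyword-overlap retriever in place of a real Wikipedia or DBpedia lookup, and takes a hypothetical `generate` callable standing in for whatever language model produces the final answer. All names here (`Passage`, `retrieve`, `build_prompt`, `answer_with_retrieval`, `STOPWORDS`) are illustrative assumptions.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stopword list, only to keep the toy overlap score meaningful.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "in", "what", "this"}


@dataclass
class Passage:
    source: str  # e.g. "Wikipedia" or "DBpedia"
    text: str


def tokenize(text: str) -> set:
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve(query: str, corpus: List[Passage], k: int = 2) -> List[Passage]:
    """Return up to k passages that share at least one term with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(p.text)), i, p) for i, p in enumerate(corpus)]
    scored = [s for s in scored if s[0] > 0]  # drop passages with no overlap
    scored.sort(key=lambda s: (-s[0], s[1]))  # best overlap first, input order on ties
    return [p for _, _, p in scored[:k]]


def build_prompt(question: str, visual_answer: str, passages: List[Passage]) -> str:
    """Combine retrieved context, the question, and the visual model's answer."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    if not context:
        context = "(no passage retrieved)"
    return (
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        f"Visual model answer: {visual_answer}\n"
        "Final answer:"
    )


def answer_with_retrieval(
    question: str,
    visual_answer: str,
    corpus: List[Passage],
    generate: Callable[[str], str],
    k: int = 2,
) -> str:
    """Retrieve passages for the question plus visual answer, then call `generate`."""
    passages = retrieve(f"{question} {visual_answer}", corpus, k)
    return generate(build_prompt(question, visual_answer, passages))


if __name__ == "__main__":
    corpus = [
        Passage("Wikipedia", "The banana is a fruit native to tropical Indomalaya and Australia."),
        Passage("DBpedia", "A fire hydrant is a connection point for firefighters to tap a water supply."),
    ]
    # Identity "generator": prints the grounded prompt a real model would receive.
    print(answer_with_retrieval(
        "What region is this fruit native to?", "banana", corpus, generate=lambda p: p
    ))
```

Passing `generate` as a parameter keeps the sketch independent of any particular model. A real system would likely replace the keyword overlap with a dense or search-based retriever over the knowledge sources.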