Feng Zhao

h-index: 21

Papers: 3

Citations: 891

Novelty: 48%

AI Score: 47

Ranked #31,257 of 194,257 authors (top 16%) · #11,192 in CV (top 19%)

3 Papers

34.6 · CV · Feb 25, 2025 · Code

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Qiuchen Wang, Ruixue Ding, Zehui Chen et al.

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.
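The abstract does not say how the GMM is applied during retrieval. One plausible reading, assumed here purely for illustration, is that a two-component mixture is fitted to the retriever's similarity scores so that the cutoff between relevant and irrelevant candidates adapts to each query instead of being a fixed top-k. The sketch below is not the authors' implementation; the function names, the two-component choice, and the sample scores are all hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(scores, iters=50, min_var=1e-6):
    """Fit a 1-D, two-component Gaussian mixture to the scores with plain EM."""
    lo, hi = min(scores), max(scores)
    means = [lo, hi]  # start one component at each extreme of the scores
    init_var = max(((hi - lo) / 2.0) ** 2, min_var)
    variances = [init_var, init_var]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each score.
        resp = []
        for s in scores:
            dens = [weights[k] * normal_pdf(s, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # component has no support; leave it unchanged
            weights[k] = nk / len(scores)
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            spread = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(spread, min_var)  # floor avoids a collapsed component
    return weights, means, variances


def select_relevant(scores):
    """Return indices of scores that are more likely under the higher-mean component."""
    weights, means, variances = fit_two_component_gmm(scores)
    high = 0 if means[0] > means[1] else 1
    low = 1 - high
    keep = []
    for i, s in enumerate(scores):
        p_high = weights[high] * normal_pdf(s, means[high], variances[high])
        p_low = weights[low] * normal_pdf(s, means[low], variances[low])
        if p_high > p_low:
            keep.append(i)
    return keep


# Hypothetical similarity scores for eight candidate pages of one query.
example_scores = [0.91, 0.88, 0.86, 0.31, 0.29, 0.27, 0.25, 0.22]
kept_indices = select_relevant(example_scores)
```

The number of candidates kept then varies with the shape of the score distribution, which is the property a dynamic cutoff is meant to provide. A hybrid system could apply the same selection to textual and visual scores separately and merge the results, though that step is also an assumption rather than something the abstract states.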

56.1 · CV · Mar 29, 2024 · Code

Are We on the Right Way for Evaluating Large Vision-Language Models?

Lin Chen, Jinsong Li, Xiaoyi Dong et al. · pku

Large vision-language models (LVLMs) have recently achieved rapid progress, sparking numerous studies to evaluate their multi-modal capabilities. However, we dig into current evaluation works and identify two primary issues: 1) Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or from the world knowledge embedded in LLMs. This phenomenon is prevalent across current benchmarks. For instance, GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, and outperforms the random-choice baseline across six benchmarks by over 24% on average. 2) Unintentional data leakage exists in LLM and LVLM training. LLMs and LVLMs can still answer some visual-necessary questions without visual content, indicating that these samples were memorized from large-scale training data. For example, Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone with 17.9%. Both problems lead to misjudgments of actual multi-modal gains and potentially misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 samples meticulously selected by humans. MMStar benchmarks 6 core capabilities and 18 detailed axes, aiming to evaluate LVLMs' multi-modal capacities with carefully balanced and purified samples. These samples are first roughly selected from current benchmarks with an automated pipeline; human review is then applied to ensure that each curated sample exhibits visual dependency, shows minimal data leakage, and requires advanced multi-modal capabilities. Moreover, two metrics are developed to measure data leakage and actual performance gain in multi-modal training. We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.
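The abstract names two metrics, one for data leakage and one for actual multi-modal gain, but does not give their formulas. A minimal sketch of how such metrics could be computed is shown below, assuming (this is an assumption, not taken from the listing) that gain is the accuracy difference with and without images, and that leakage is how far the image-free LVLM score exceeds its LLM backbone's score, floored at zero. The function names and the sample numbers are hypothetical.

```python
def multimodal_gain(score_with_images, score_without_images):
    """Accuracy points the LVLM gains from actually seeing the images (assumed definition)."""
    return score_with_images - score_without_images


def multimodal_leakage(lvlm_score_without_images, llm_backbone_score):
    """Accuracy points the image-free LVLM scores above its LLM backbone (assumed definition).

    A positive value suggests benchmark samples were seen during multi-modal
    training; the result is floored at zero so a weaker LVLM does not yield
    negative leakage.
    """
    return max(0.0, lvlm_score_without_images - llm_backbone_score)


# Hypothetical accuracies in percent, not results from the paper.
gain = multimodal_gain(score_with_images=60.0, score_without_images=40.0)
leakage = multimodal_leakage(lvlm_score_without_images=40.0, llm_backbone_score=25.0)
```

Under these assumed definitions, a model with high benchmark accuracy but low gain is relying mostly on text or memorized samples rather than on visual understanding.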

1.6 · LG · Nov 6, 2013

Structural Learning for Template-free Protein Folding

Feng Zhao

This thesis aims to solve the template-free protein folding problem by tackling two important components: efficient sampling in the vast conformation space, and the design of knowledge-based potentials with high accuracy. We have proposed the first-order and second-order CRF-Sampler to sample structures from the continuous local dihedral angle space by modeling the lower- and higher-order conditional dependency between neighboring dihedral angles given the primary sequence information. A framework combining Conditional Random Fields and the energy function is introduced to guide the local conformation sampling using long-range constraints with the energy function. The relationship between the sequence profile and the local dihedral angle distribution is nonlinear. Hence we proposed the CNF-Folder to model this complex relationship by applying a novel machine learning model, Conditional Neural Fields, which combines a structured graphical model with a neural network. CRF-Samplers and CNF-Folder performed very well in CASP8 and CASP9. Further, a novel pairwise distance statistical potential (EPAD) is designed to capture the dependency of the energy profile on the positions of the interacting amino acids as well as the types of those amino acids, contrary to the common assumption that this energy profile depends only on the types of amino acids. EPAD has also been successfully applied in the CASP10 Free Modeling experiment with CNF-Folder, performing especially well on some targets with uncommon structures.