Adi Raz Goldfarb

IR
h-index: 40
6 papers
66 citations
Novelty: 48%
AI Score: 47

6 Papers

IR · May 28
FLASH-MAXSIM: IO-Aware Fused Kernels for Late-Interaction Scoring

Roi Pony, Adi Raz Goldfarb, Idan Friedman et al.

Late-interaction retrieval (ColBERT, ColPali) scores a query against a document with the MaxSim operator: for every query token, take the maximum similarity over the document tokens, then sum over query tokens. The standard implementation materializes the full query-token × document-token similarity tensor in GPU memory; for visual ColPali at 10K documents this tensor alone is 21 GB in FP16, created only to be reduced to one score per document and discarded. It exhausts a 40 GB GPU and bounds the achievable batch size in both inference and training. We present Flash-MaxSim, an IO-aware fused GPU kernel that computes exactly the same scores without ever materializing the tensor, by streaming query and document tiles through on-chip SRAM and folding the row-maximum reduction into the same pass. We extend the IO-aware principle through the training backward pass (an inverse-grid CSR construction that reuses the forward argmax for an atomic-free, destination-owned gradient reduction) and through INT8×INT8 quantization and variable-length (padding-free) scoring. Flash-MaxSim is up to 3.9× faster on an A100 (4.7× on an H100) than naive PyTorch at matched precision, uses up to 16× less inference memory and ~28× less training memory, unlocks corpus and batch sizes that exhaust PyTorch entirely, preserves the exact ranking (100% top-20 agreement with an FP32 reference)
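To make the MaxSim definition in this abstract concrete, here is a minimal pure-Python sketch. It is not the Flash-MaxSim kernel: it only contrasts a version that builds the full similarity matrix with one that visits document tokens in tiles and keeps a single running maximum per query token. The function names and the `tile` parameter are hypothetical, chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product used as the token-level similarity."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_naive(query: List[Vector], doc: List[Vector]) -> float:
    """Materialize the full |query| x |doc| similarity matrix, then reduce it."""
    sim = [[dot(q, d) for d in doc] for q in query]
    return sum(max(row) for row in sim)


def maxsim_tiled(query: List[Vector], doc: List[Vector], tile: int = 2) -> float:
    """Same score, but only one running maximum per query token is stored.

    Document tokens are visited in blocks of `tile` tokens (a hypothetical
    parameter), loosely mirroring the idea of streaming tiles and folding
    the row-maximum into the same pass. Assumes a non-empty document.
    """
    running_max = [float("-inf")] * len(query)
    for start in range(0, len(doc), tile):
        block = doc[start:start + tile]
        for i, q in enumerate(query):
            for d in block:
                s = dot(q, d)
                if s > running_max[i]:
                    running_max[i] = s
    return sum(running_max)
```

Both functions return the same score for the same inputs. The tiled version needs memory proportional to the number of query tokens rather than to the product of query and document token counts.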

IR · Jun 2
Col-Bandit: Query-Time Top-$K$ Estimation for Late-Interaction Retrieval

Roi Pony, Adi Raz Goldfarb, Oshri Naparstek et al.

Multi-vector late-interaction retrievers such as ColBERT achieve state-of-the-art quality, but their query-time cost is dominated by exhaustively computing token-level MaxSim interactions for every candidate document. The MaxSim scores of $N$ candidates against $T$ query tokens form an $N\times T$ matrix whose row-sums are the late-interaction scores, and identifying the top-$K$ rarely requires every entry. We introduce Col-Bandit, a query-time estimator of the exhaustive-MaxSim top-$K$: it reveals matrix entries in batches, maintains a finite-population Bernstein-Serfling confidence interval on each candidate's score, and permanently drops any document whose upper bound falls below the $K$-th largest lower bound, computing only the cells needed to separate the top-$K$. A single relaxation knob $\alpha_{\mathrm{ef}}\in(0,1]$ tunes the compute-fidelity trade-off. We deploy $\alpha_{\mathrm{ef}}{=}0.2$, while $\alpha_{\mathrm{ef}}{=}1$ admits a $\delta$-PAC guarantee under a simplified radius. On BEIR and REAL-MM-RAG, Col-Bandit preserves $\geq 90\%$ fidelity to the exhaustive top-$5$ on every corpus while cutting MaxSim FLOPs by up to ${\sim}8\times$, for up to ${\sim}13\times$ single-thread CPU speedups across x86 and ARM. A drop-in reranking layer, it needs no retraining or index changes.
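The elimination rule described in this abstract (drop a candidate once its upper bound falls below the $K$-th largest lower bound) can be sketched as follows. This is not Col-Bandit itself: it swaps the Bernstein-Serfling confidence interval for a simple deterministic bound, which assumes every MaxSim entry lies in a known range `[lo, hi]`, and it reveals one full column per round instead of adaptive batches. The function name and all parameters are hypothetical.

```python
from typing import List, Tuple


def topk_with_elimination(
    H: List[List[float]], k: int, lo: float = 0.0, hi: float = 1.0
) -> Tuple[List[int], int]:
    """Return (top-k row indices by row-sum, number of cells revealed).

    H is the N x T matrix of MaxSim entries, revealed one column per round.
    ASSUMPTION: every entry lies in [lo, hi]. The bounds below are
    deterministic, not the statistical interval named in the abstract.
    """
    n, t = len(H), len(H[0])
    alive = set(range(n))
    partial = [0.0] * n  # sum of the revealed entries for each candidate
    cells = 0
    for j in range(t):
        for i in alive:
            partial[i] += H[i][j]
            cells += 1
        remaining = t - (j + 1)  # unrevealed cells per surviving candidate
        if len(alive) > k:
            lower = {i: partial[i] + remaining * lo for i in alive}
            upper = {i: partial[i] + remaining * hi for i in alive}
            kth_lower = sorted(lower.values(), reverse=True)[k - 1]
            # Permanently drop candidates that can no longer reach the top-k.
            alive = {i for i in alive if upper[i] >= kth_lower}
    # Survivors are fully revealed; rank by score, breaking ties by index.
    ranked = sorted(alive, key=lambda i: (-partial[i], i))
    return ranked[:k], cells
```

Because the bounds here are deterministic, the result always matches the exhaustive top-$K$ when the range assumption holds. A statistical interval, as in the abstract, would typically be tighter and eliminate candidates sooner, at the cost of a probabilistic rather than exact guarantee.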

CV · Feb 14, 2025 · Code
Granite Vision: a lightweight, open-source multimodal model for enterprise Intelligence

Granite Vision Team, Leonid Karlinsky, Assaf Arbelle et al.

We introduce Granite Vision, a lightweight large language model with vision capabilities, specifically designed to excel in enterprise use cases, particularly in visual document understanding. Our model is trained on a comprehensive instruction-following dataset, including document-related tasks, such as content extraction from tables, charts, diagrams, sketches, and infographics, as well as general image tasks. The architecture of Granite Vision is centered around visual modality alignment with a decoder-only, 2 billion parameter Granite large language model. Additionally, we introduce a dedicated test-time safety classification approach that leverages a sparse set of attention vectors to identify potentially harmful inputs. Despite its lightweight architecture, Granite Vision achieves strong results on standard benchmarks for visual document understanding, as well as on the LiveXiv benchmark, which is designed to avoid test set contamination by using a constantly updated corpus of recently published arXiv papers. We are releasing the model under the Apache 2.0 license, allowing for both research and commercial use, while offering complete visibility into the training data and other relevant details. See https://huggingface.co/ibm-granite/ for model weights.

IR · Feb 17, 2025
REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark

Navve Wasserman, Roi Pony, Oshri Naparstek et al.

Accurate multi-modal document retrieval is crucial for Retrieval-Augmented Generation (RAG), yet the design of existing benchmarks does not fully capture real-world challenges. We introduce REAL-MM-RAG, an automatically generated benchmark designed to address four key properties essential for real-world retrieval: (i) multi-modal documents, (ii) enhanced difficulty, (iii) realistic RAG queries, and (iv) accurate labeling. Additionally, we propose a multi-difficulty-level scheme based on query rephrasing to evaluate models' semantic understanding beyond keyword matching. Our benchmark reveals significant model weaknesses, particularly in handling table-heavy documents and in robustness to query rephrasing. To mitigate these shortcomings, we curate a rephrased training set and introduce a new finance-focused, table-heavy dataset. Fine-tuning on these datasets enables models to achieve state-of-the-art retrieval performance on the REAL-MM-RAG benchmark. Our work offers a better way to evaluate and improve retrieval in multi-modal RAG systems while also providing training data and models that address current limitations.

IR · May 1, 2024
KVP10k: A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

Oshri Naparstek, Roi Pony, Inbar Shapira et al.

In recent years, extracting information from business documents has emerged as a critical task with applications across numerous domains. This effort has attracted substantial interest from both industry and academia, highlighting its significance in the current technological landscape. Most datasets in this area focus primarily on Key Information Extraction (KIE), where extraction relies on a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k, a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10,707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE and KVP extraction in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents.

CV · Nov 28, 2021
CHARTER: heatmap-based multi-type chart data extraction

Joseph Shtok, Sivan Harary, Ophir Azulai et al.

The digital conversion of information stored in documents is a great source of knowledge. In contrast to the document text, the conversion of embedded document graphics, such as charts and plots, has been much less explored. We present a method and a system for end-to-end conversion of document charts into a machine-readable tabular data format, which can be easily stored and analyzed in the digital domain. Our approach extracts and analyzes charts along with their graphical elements and supporting structures such as legends, axes, titles, and captions. Our detection system is based on neural networks trained solely on synthetic data, eliminating the limiting factor of data collection. As opposed to previous methods, which detect graphical elements using bounding boxes, our networks feature auxiliary domain-specific heatmap prediction, enabling the precise detection of pie charts and of line and scatter plots, which do not fit the rectangular bounding-box presumption. Qualitative and quantitative results show high robustness and precision, improving upon previous works on popular benchmarks.