DB · Mar 3, 2024
ReMatch: Retrieval Enhanced Schema Matching with LLMs
Eitam Sheetrit, Menachem Brief, Moshik Mishaeli et al. · microsoft-research
Schema matching is a crucial task in data integration, involving the alignment of a source schema with a target schema to establish correspondence between their elements. This task is challenging due to textual and semantic heterogeneity, as well as differences in schema sizes. Although machine-learning-based solutions have been explored in numerous studies, they often suffer from low accuracy, require manual mapping of the schemas for model training, or need access to source schema data which might be unavailable due to privacy concerns. In this paper we present a novel method, named ReMatch, for matching schemas using retrieval-enhanced Large Language Models (LLMs). Our method avoids the need for predefined mapping, any model training, or access to data in the source database. Our experimental results on large real-world schemas demonstrate that ReMatch is an effective matcher. By eliminating the requirement for training data, ReMatch becomes a viable solution for real-world scenarios.
CL · Apr 8, 2025
Knowledge-Instruct: Effective Continual Pre-training from Limited Data using Instructions
Oded Ovadia, Meni Brief, Rachel Lemberg et al. · microsoft-research
While Large Language Models (LLMs) acquire vast knowledge during pre-training, they often lack domain-specific, new, or niche information. Continual pre-training (CPT) attempts to address this gap but suffers from catastrophic forgetting and inefficiencies in low-data regimes. We introduce Knowledge-Instruct, a novel approach to efficiently inject knowledge from limited corpora through pure instruction-tuning. By generating information-dense synthetic instruction data, it effectively integrates new knowledge while preserving general reasoning and instruction-following abilities. Knowledge-Instruct demonstrates superior factual memorization, minimizes catastrophic forgetting, and remains scalable by leveraging synthetic data from relatively small language models. Additionally, it enhances contextual understanding, including complex multi-hop reasoning, facilitating integration with retrieval systems. We validate its effectiveness across diverse benchmarks, including Companies, a new dataset that we release to measure knowledge injection capabilities.
AI · Apr 6, 2025
SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities
Noga Ben Yoash, Meni Brief, Oded Ovadia et al. · microsoft-research
We introduce SECQUE, a comprehensive benchmark for evaluating large language models (LLMs) in financial analysis tasks. SECQUE comprises 565 expert-written questions covering SEC filings analysis across four key categories: comparison analysis, ratio calculation, risk assessment, and financial insight generation. To assess model performance, we develop SECQUE-Judge, an evaluation mechanism leveraging multiple LLM-based judges, which demonstrates strong alignment with human evaluations. Additionally, we provide an extensive analysis of various models' performance on our benchmark. By making SECQUE publicly available, we aim to facilitate further research and advancements in financial AI.
LG · May 14, 2023
Predicting Unplanned Readmissions in the Intensive Care Unit: A Multimodality Evaluation
Eitam Sheetrit, Menachem Brief, Oren Elisha
A hospital readmission occurs when a patient who was discharged from the hospital is admitted again for the same or related care within a certain period. Hospital readmissions are a significant problem in the healthcare domain, as they lead to increased hospitalization costs, decreased patient satisfaction, and increased risk of adverse outcomes such as infections, medication errors, and even death. The problem of hospital readmissions is particularly acute in intensive care units (ICUs), due to the severity of the patients' conditions and the substantial risk of complications. Predicting unplanned readmissions in ICUs is a challenging task, as it involves analyzing different data modalities, such as static data, unstructured free text, sequences of diagnoses and procedures, and multivariate time-series. Here, we investigate the effectiveness of each data modality, first separately and then in combination with the others, using state-of-the-art machine learning approaches in time-series analysis and natural language processing. Using our evaluation process, we are able to determine the contribution of each data modality and, for the first time in the context of readmission, establish a hierarchy of their predictive value. Additionally, we demonstrate the impact of Temporal Abstractions in enhancing the performance of time-series approaches to readmission prediction. Due to conflicting definitions in the literature, we also provide a clear definition of the term Unplanned Readmission to enhance the reproducibility and consistency of future research and to prevent misunderstandings that could result from diverse interpretations of the term. Our experimental results on a large benchmark clinical data set show that Discharge Notes written by physicians have better capabilities for readmission prediction than all other modalities.
LG · Sep 6, 2017
Temporal Pattern Discovery for Accurate Sepsis Diagnosis in ICU Patients
Eitam Sheetrit, Nir Nissim, Denis Klimov et al.
Sepsis is a condition caused by the body's overwhelming and life-threatening response to infection, which can lead to tissue damage, organ failure, and finally death. Common signs and symptoms include fever, increased heart rate, increased breathing rate, and confusion. Sepsis is difficult to predict, diagnose, and treat. Patients who develop sepsis have an increased risk of complications and death and face higher health care costs and longer hospitalization. Today, sepsis is one of the leading causes of mortality among populations in intensive care units (ICUs). In this paper, we look at the problem of early detection of sepsis by using temporal data mining. We focus on the use of knowledge-based temporal abstraction to create meaningful interval-based abstractions, and on time-interval mining to discover frequent interval-based patterns. We used 2,560 cases derived from the MIMIC-III database. We found that the distribution of temporal patterns with frequency above 10%, discovered in the records of septic patients during the last 6 and 12 hours before the onset of sepsis, differs significantly from the distribution in the records of non-septic patients over an equivalent time window during hospitalization. This discovery is encouraging for the purpose of performing an early diagnosis of sepsis using the discovered patterns as constructed features.
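As a rough illustration of what an interval-based state abstraction can look like, the sketch below maps time-stamped heart-rate samples to symbolic states and merges consecutive samples with the same state into intervals. It is a minimal sketch under stated assumptions, not the paper's method: the function names, the cut-off values (60 and 100 bpm), and the sample data are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]          # (time in hours, heart rate in bpm)
Interval = Tuple[float, float, str]   # (start time, end time, state label)


def heart_rate_state(bpm: float) -> str:
    """Map a raw value to a symbolic state.

    The 60 and 100 bpm cut-offs are illustrative assumptions,
    not thresholds reported in the abstract.
    """
    if bpm < 60:
        return "low"
    if bpm <= 100:
        return "normal"
    return "high"


def abstract_intervals(samples: List[Sample]) -> List[Interval]:
    """Merge consecutive samples that share a state into one interval."""
    intervals: List[Interval] = []
    for t, value in samples:
        state = heart_rate_state(value)
        if intervals and intervals[-1][2] == state:
            # Same state as the previous sample: extend the open interval.
            start, _, _ = intervals[-1]
            intervals[-1] = (start, t, state)
        else:
            # State changed (or first sample): open a new interval.
            intervals.append((t, t, state))
    return intervals


# Made-up samples, assumed to be sorted by time.
samples = [(0, 72), (1, 80), (2, 105), (3, 118), (4, 95)]
intervals = abstract_intervals(samples)
print(intervals)
```

Intervals of this kind are the sort of input that a time-interval mining step could then search for frequent patterns; that mining step is not shown here.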