IR, AI | Mar 21, 2019

Empirical Evaluations of Seed Set Selection Strategies for Predictive Coding

Christian J. Mahoney, Nathaniel Huber-Fliflet, Katie Jensen, Haozhen Zhao, Robert Neary, Shi Ye

arXiv:1903.08816v13.12 citations | h-index: 48

Originality: Synthesis-oriented

AI Analysis

This work addresses the need for better training document selection strategies for attorneys using predictive coding in legal review. Its contribution is incremental, building on a limited body of prior research.

The paper tackles the problem of selecting effective seed sets for predictive coding in legal document review. It finds that the choice of seed set selection strategy can significantly affect model precision, based on an evaluation of eight strategies using documents from four real legal cases, with performance varying across strategies.

Training documents have a significant impact on the performance of predictive models in the legal domain. Yet there is limited research that explores the effectiveness of the training document selection strategy, in particular the strategy used to select the seed set, that is, the set of documents an attorney reviews first to establish an initial model. Since there is limited research on this important component of predictive coding, we set out to identify strategies that consistently perform well. Our research demonstrated that the seed set selection strategy can have a significant impact on the precision of a predictive model. Equipping attorneys with the results of this study will allow them to initiate the most effective predictive modeling process to comb through the terabytes of data typically present in modern litigation. This study used documents from four actual legal cases to evaluate eight different seed set selection strategies. Attorneys can use the results contained within this paper to enhance their approach to predictive coding.
