Umut Acar

IRFeb 12

SPIRE: Structure-Preserving Interpretable Retrieval of Evidence

Mike Rainey, Umut Acar, Muhammed Sezer

Retrieval-augmented generation over semi-structured sources such as HTML is constrained by a mismatch between document structure and the flat, sequence-based interfaces of today's embedding and generative models. Retrieval pipelines often linearize documents into fixed-size chunks before indexing, which obscures section structure, lists, and tables, and makes it difficult to return small, citation-ready evidence without losing the surrounding context that makes it interpretable. We present a structure-aware retrieval pipeline that operates over tree-structured documents. The core idea is to represent candidates as subdocuments: precise, addressable selections that preserve structural identity while deferring the choice of surrounding context. We define a small set of document primitives--paths and path sets, subdocument extraction by pruning, and two contextualization mechanisms. Global contextualization adds the non-local scaffolding needed to make a selection intelligible (e.g., titles, headers, list and table structure). Local contextualization expands a seed selection within its structural neighborhood to obtain a compact, context-rich view under a target budget. Building on these primitives, we describe an embedding-based candidate generator that indexes sentence-seeded subdocuments and a query-time, document-aware aggregation step that amortizes shared structural context. We then introduce a contextual filtering stage that re-scores retrieved candidates using locally contextualized views. Across experiments on HTML question-answering benchmarks, we find that preserving structure while contextualizing selections yields higher-quality, more diverse citations under fixed budgets than strong passage-based baselines, while maintaining scalability.

CYMay 31, 2020

Analyzing Student Strategies In Blended Courses Using Clickstream Data

Nil-Jana Akpinar, Aaditya Ramdas, Umut Acar

Educational software data promises unique insights into students' study behaviors and drivers of success. While much work has been dedicated to performance prediction in massive open online courses, it is unclear if the same methods can be applied to blended courses and a deeper understanding of student strategies is often missing. We use pattern mining and models borrowed from Natural Language Processing (NLP) to understand student interactions and extract frequent strategies from a blended college course. Fine-grained clickstream data is collected through Diderot, a non-commercial educational support system that spans a wide range of functionalities. We find that interaction patterns differ considerably based on the assessment type students are preparing for, and many of the extracted features can be used for reliable performance prediction. Our results suggest that the proposed hybrid NLP methods can provide valuable insights even in the low-data setting of blended courses given enough data granularity.

Umut Acar

2 Papers