Reference String Extraction Using Line-Based Conditional Random Fields
This work addresses one specific step in citation extraction, of interest to researchers and librarians. Its contribution is incremental: it builds on existing two-step approaches and modifies the model structure.
The paper tackles the problem of extracting individual reference strings from scientific publications. It proposes a line-based conditional random fields model that treats each line as a potential part of a reference string, which reduces model complexity while exploiting the dependencies typical of reference sections.
The extraction of individual reference strings from the reference section of scientific publications is an important step in the citation extraction pipeline. Current approaches divide this task into two steps by first detecting the reference section areas and then grouping the text lines in such areas into reference strings. We propose a classification model that considers every line in a publication as a potential part of a reference string. Because we apply line-based conditional random fields rather than constructing the graphical model over individual words, dependencies and patterns that are typical of reference sections provide strong features, while the overall complexity of the model is reduced.
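To make the line-based view concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical three-label scheme (B-REF for the first line of a reference string, I-REF for a continuation line, O for any other line) and a handful of illustrative per-line features; neither the label names nor the features are taken from the paper. The function `line_features` shows the kind of input a line-level sequence model might receive, and `group_reference_strings` shows how per-line labels could be merged into reference strings.

```python
import re


def line_features(lines, i):
    """Illustrative features for line i (assumed, not the paper's feature set)."""
    line = lines[i].strip()
    prev = lines[i - 1].strip() if i > 0 else ""
    return {
        # Line opens with a marker such as "[12] " or "12. "
        "starts_with_marker": bool(re.match(r"^(\[\d+\]|\d+\.)\s", line)),
        # Line contains a plausible four-digit publication year
        "has_year": bool(re.search(r"\b(19|20)\d{2}\b", line)),
        "ends_with_period": line.endswith("."),
        # Context from the neighbouring line, available because the
        # sequence unit is the line rather than the word
        "prev_ends_with_period": prev.endswith("."),
        "is_blank": line == "",
    }


def group_reference_strings(lines, labels):
    """Merge lines labeled B-REF / I-REF into reference strings.

    Lines labeled O close any open reference. An I-REF line with no
    open reference is ignored.
    """
    refs, current = [], []
    for line, label in zip(lines, labels):
        if label == "B-REF":
            if current:
                refs.append(" ".join(current))
            current = [line.strip()]
        elif label == "I-REF" and current:
            current.append(line.strip())
        else:
            if current:
                refs.append(" ".join(current))
            current = []
    if current:
        refs.append(" ".join(current))
    return refs


# Usage with hand-written labels standing in for model predictions
lines = [
    "References",
    "[1] A. Author. A title that",
    "continues here. 2015.",
    "[2] B. Writer. Another title. 2016.",
    "",
]
labels = ["O", "B-REF", "I-REF", "B-REF", "O"]
refs = group_reference_strings(lines, labels)
```

In a full system the labels would come from a trained linear-chain model over the sequence of lines rather than being written by hand. Because every line in the publication is labeled, no separate reference-section detection step is needed in this sketch.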