Mining Measured Information from Text
This work addresses the need for efficient search and exploration of measured data in scientific and technical domains, though it appears incremental because it builds on existing extraction techniques.
The paper tackles the problem of extracting measured information (numeric values together with their units and the properties being measured) from text such as scientific documents. It reports that the extraction is robust to errors introduced by document format conversion, and it presents MQSearch, a search engine built on this functionality.
We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains, especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we present MQSearch, a search engine with full support for measured information.
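
To make the notion of a measured quantity concrete, the following is a minimal Python sketch of a rule-based extractor. It is not the paper's implementation: the names `UNITS`, `Quantity`, and `extract`, the seven-entry unit table, and the list of connective phrases are all assumptions chosen for illustration. The sketch pairs a numeric value with a known unit, tolerates missing or extra whitespace between them (one kind of damage that format conversion can introduce), and looks for a property name only when it precedes the value.

```python
import re
from typing import List, NamedTuple, Optional

# Hypothetical unit table mapping surface forms to a canonical name.
# A real extractor would cover a far larger set of units.
UNITS = {
    "°C": "degree Celsius",
    "degrees C": "degree Celsius",
    "kg/m^2": "kilogram per square metre",
    "kg/m2": "kilogram per square metre",
    "μm": "micrometre",   # Greek small letter mu
    "µm": "micrometre",   # micro sign, a common conversion variant
    "um": "micrometre",   # ASCII fallback
}


class Quantity(NamedTuple):
    value: float
    unit: str               # canonical unit name
    prop: Optional[str]     # measured property, if one was found


# Try longer unit strings first so "kg/m^2" is preferred over shorter forms.
_UNIT_ALT = "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))

# A number, optional whitespace, then a known unit not followed by a letter
# (so "3 umbrellas" is not read as 3 micrometres).
_QUANTITY = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>" + _UNIT_ALT + r")(?![A-Za-z])"
)

# Assumed heuristic: an article, a property of one to three words, then a
# connective phrase ending right where the quantity starts.
_PROPERTY_BEFORE = re.compile(
    r"\b(?:an?|the)\s+(?P<prop>\w+(?:[ -]\w+){0,2})\s+"
    r"(?:as high as|as low as|greater than|less than|up to|of)\s*$"
)


def extract(text: str) -> List[Quantity]:
    """Return every (value, unit, property) triple found in text."""
    results = []
    for m in _QUANTITY.finditer(text):
        before = _PROPERTY_BEFORE.search(text[: m.start()])
        results.append(
            Quantity(
                value=float(m.group("value")),
                unit=UNITS[m.group("unit")],
                prop=before.group("prop") if before else None,
            )
        )
    return results


if __name__ == "__main__":
    for phrase in (
        "a pixel pitch as high as 352 μm",
        "a BMI greater than 29.9 kg/m^2",
        "a 1370 degrees C melting point",
    ):
        print(phrase, "->", extract(phrase))
```

In this sketch the third phrase yields a quantity with no property, because the property ("melting point") follows the value and the heuristic only looks backwards. Handling that word order would require an additional rule.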