CL · LG · May 25, 2023

LFTK: Handcrafted Features in Computational Linguistics

arXiv:2305.15878v2234 citations · Has Code
Originality: Incremental advance
AI Analysis

This work addresses a practical bottleneck for researchers in computational linguistics by providing a standardized, expandable tool to reduce redundancy and confusion in feature extraction.

The paper tackles the problem of inconsistent and inaccessible handcrafted linguistic features in computational linguistics by collecting over 220 features, analyzing their correlations, and developing LFTK, an open-source multilingual extraction system that is the largest of its kind.

Past research has identified a rich set of handcrafted linguistic features that can potentially assist various tasks. However, their extensive number makes it difficult to effectively select and utilize existing handcrafted features. Coupled with the problem of inconsistent implementation across research works, there has been no categorization scheme or generally-accepted feature names. This creates unwanted confusion. Also, most existing handcrafted feature extraction libraries are not open-source or not actively maintained. As a result, a researcher often has to build such an extraction system from the ground up. We collect and categorize more than 220 popular handcrafted features grounded on past literature. Then, we conduct a correlation analysis study on several task-specific datasets and report the potential use cases of each feature. Lastly, we devise a multilingual handcrafted linguistic feature extraction system in a systematically expandable manner. We open-source our system for public access to a rich set of pre-implemented handcrafted features. Our system is coined LFTK and is the largest of its kind. Find it at github.com/brucewlee/lftk.
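The extract-then-correlate workflow described in the abstract can be illustrated with a minimal sketch. This is not LFTK's API: the function names (`extract_features`, `correlate_features`), the four features chosen, and the regex tokenizer are all assumptions made for illustration, and only the Python standard library is used. It computes a few surface-level handcrafted features per text and then reports each feature's Pearson correlation with a numeric task label.

```python
import math
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic word forms only.
    return re.findall(r"[a-z']+", text.lower())


def extract_features(text):
    # Illustrative handcrafted features; the names are assumptions,
    # not the feature keys used by LFTK.
    words = tokenize(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "total_words": float(n_words),
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


def pearson(xs, ys):
    # Pearson correlation coefficient; returns 0.0 when either
    # variable is constant (the coefficient is undefined there).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def correlate_features(texts, labels):
    # Correlate every extracted feature with the task labels.
    rows = [extract_features(t) for t in texts]
    return {
        name: pearson([row[name] for row in rows], labels)
        for name in rows[0]
    }
```

Calling `correlate_features(texts, labels)` on a small labeled corpus (for example, texts paired with readability grades) returns one coefficient per feature, which is the kind of per-feature signal a correlation study would rank. A real system would replace the regex tokenizer with a proper NLP pipeline and cover far more features.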

Code Implementations: 1 repo