CL Mar 26, 2021
Data Augmentation in Natural Language Processing: A Novel Text Generation Approach for Long and Short Text Classifiers
Markus Bayer, Marc-André Kaufhold, Björn Buchhold et al.
In many machine learning settings, research suggests that the development of training data can matter more than the choice and modelling of the classifiers themselves. Data augmentation methods have therefore been developed to improve classifiers with artificially created training data. In NLP, the challenge is to establish universal rules for text transformations that provide new linguistic patterns. In this paper, we present and evaluate a text generation method suited to increasing the performance of classifiers for long and short texts. We achieved promising improvements on both short and long text tasks when augmenting with our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. Because such constructed regimes are not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we evaluate the method from many perspectives (11 datasets in total), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.
IR Dec 21, 2017
Overview of the Triple Scoring Task at the WSDM Cup 2017
Hannah Bast, Björn Buchhold, Elmar Haussmann
This paper provides an overview of the triple scoring task at the WSDM Cup 2017, including a description of the task and the dataset, an overview of the participating teams and their results, and a brief account of the methods employed. In a nutshell, the task was to compute relevance scores for knowledge-base triples from relations, where such scores make sense. Due to the way the ground truth was constructed, scores were required to be integers from the range 0..7. For example, reasonable scores for the triples "Tim Burton profession Director" and "Tim Burton profession Actor" would be 7 and 2, respectively, because Tim Burton is well-known as a director, but he acted only in a few lesser known movies. The triple scoring task attracted considerable interest, with 52 initial registrations and 21 teams who submitted a valid run before the deadline. The winning team achieved an accuracy of 87%, that is, for that fraction of the triples from the test set (which was revealed only after the deadline) the difference to the score from the ground truth was at most 2. The best result for the average difference from the test set scores was 1.50.
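The two evaluation measures named in the abstract can be sketched as follows. This is a minimal illustration, not the official evaluation script of the task: the function names are hypothetical, and apart from the two Tim Burton scores (7 and 2) all numbers in the example are made up.

```python
from typing import Sequence


def accuracy_within(predicted: Sequence[int], truth: Sequence[int],
                    tolerance: int = 2) -> float:
    """Fraction of triples whose predicted score differs from the
    ground-truth score by at most `tolerance` (2 in the abstract)."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty score lists of equal length")
    hits = sum(1 for p, t in zip(predicted, truth) if abs(p - t) <= tolerance)
    return hits / len(truth)


def average_score_difference(predicted: Sequence[int],
                             truth: Sequence[int]) -> float:
    """Mean absolute difference between predicted and ground-truth scores."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


# Ground truth: the first two values are the Tim Burton scores from the
# abstract (Director = 7, Actor = 2); the rest is hypothetical.
truth = [7, 2, 5, 0]
predicted = [6, 5, 5, 3]  # hypothetical system output, scores in 0..7

print(accuracy_within(predicted, truth))           # differences 1, 3, 0, 3
print(average_score_difference(predicted, truth))
```

For the first measure higher is better; for the average difference lower is better.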
IR Jul 11, 2012
Broccoli: Semantic Full-Text Search at your Fingertips
Hannah Bast, Florian Bäurle, Björn Buchhold et al.
We present Broccoli, a fast and easy-to-use search engine for what we call semantic full-text search. Semantic full-text search combines the capabilities of standard full-text search and ontology search. The search operates on four kinds of objects: ordinary words (e.g., edible), classes (e.g., plants), instances (e.g., Broccoli), and relations (e.g., occurs-with or native-to). Queries are trees, where nodes are arbitrary bags of these objects, and arcs are relations. The user interface guides the user in incrementally constructing such trees by instant (search-as-you-type) suggestions of words, classes, instances, or relations that lead to good hits. Both standard full-text search and pure ontology search are included as special cases. In this paper, we describe the query language of Broccoli, the main idea behind a new kind of index that enables fast processing of queries from that language as well as fast query suggestion, the natural language processing required, and the user interface. We evaluated query times and result quality on the full version of the English Wikipedia (40 GB XML dump) combined with the YAGO ontology (26 million facts). We have implemented a fully functional prototype based on our ideas and provide a web application to reproduce our quality experiments. Both are accessible via http://broccoli.informatik.uni-freiburg.de/repro-corr/ .
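The query model described in the abstract (trees whose nodes are bags of objects and whose arcs are relations) could be represented roughly as below. This is only a sketch of the data structure under simplifying assumptions: the class `QueryNode`, its fields, and its methods are hypothetical and do not reflect Broccoli's actual query language or implementation, relations are modelled only as arc labels, and the instance "Europe" is an invented example value.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QueryNode:
    """One tree node: a bag of words, classes, and instances."""
    words: List[str] = field(default_factory=list)
    classes: List[str] = field(default_factory=list)
    instances: List[str] = field(default_factory=list)
    # Outgoing arcs as (relation name, child node) pairs.
    arcs: List[Tuple[str, "QueryNode"]] = field(default_factory=list)

    def add_arc(self, relation: str, child: "QueryNode") -> "QueryNode":
        """Attach a child node via a relation and return this node."""
        self.arcs.append((relation, child))
        return self

    def count_nodes(self) -> int:
        """Number of nodes in the subtree rooted at this node."""
        return 1 + sum(child.count_nodes() for _, child in self.arcs)

    def relations(self) -> List[str]:
        """All relation names in the subtree, in depth-first order."""
        found: List[str] = []
        for relation, child in self.arcs:
            found.append(relation)
            found.extend(child.relations())
        return found


# Hypothetical query: plants that occur with the word "edible"
# and are native to Europe.
root = QueryNode(classes=["plants"])
root.add_arc("occurs-with", QueryNode(words=["edible"]))
root.add_arc("native-to", QueryNode(instances=["Europe"]))

print(root.count_nodes())
print(root.relations())
```

In this representation a tree with a single node holding only words corresponds to standard full-text search, one of the special cases the abstract mentions.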