79.0CLMay 30
French parsing enhanced with a word clustering method based on a syntactic lexiconAnthony Sigogne, Matthieu Constant, Eric Laporte
This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeillé et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006).
2.7CLJun 2
Lexicons and grammars for language processing: industrial or handcrafted products?Eric Laporte
During the recent years, the use of linguistic data for language processing increased progressively. Such data are now commonly called language resources. Most of the language resources used for this purpose are collections of texts as the Brown Corpus and the Penn Treebank, but electronic lexicons (WordNet, FrameNet, VerbNet, ComLex, Lexicon-Grammar...) and formal grammars (TAG...) developed recently. Most processes of construction of lexicons and grammars are manual, whereas the construction of corpora has always been highly automated. However, more and more specialists of language processing realize that the information content of lexicons and grammars is richer than that of corpora, and hence the former make more elaborate processing possible. The difference in construction time is likely to be connected with the difference in information content: the handcrafting of lexicons and grammars by linguists would make them more informative than automatically generated data. This situation can evolve into two directions: either specialists of language technology get progressively used to handling manually constructed resources, which are more informative and more complex, or the process of construction of lexicons and grammars is automated and industrialized, which is the mainstream perspective. Both evolutions are already in progress, and a tension exists between them. The relation between linguists and computer scientists depends on the future of these evolutions, since the first implies training and hiring numerous linguists, whereas the other depends essentially on solutions elaborated by computer engineers. The aim of this article is to analyse practical examples of the language resources in question, and to discuss about which of the two trends, handcrafting or generating industrially, or a combination of both, can give the best results or is the most realistic.
81.2CLJun 3
A French Corpus Annotated for Multiword Expressions with Adverbial FunctionEric Laporte, Takuya Nakamura, Stavroula Voyatzi
This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. This corpus is designed for investigation on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit which kind of MWEs we annotated, we describe the resources and methods we used for the annotation, and we briefly comment the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.
73.4CLMay 28
Classification of non-analyzable word types in web documents to implement an effective Korean e-learning systemSang-Taek Park, Ae-Lim Ahn, Eric Laporte et al.
E-learning systems should deliver contents that reflect various phenomena of the language as it is used. In addition to formal Korean, e-learning systems that would include real-world Korean expressions such as those in web documents, mobile text messages, or twitter posts, would be useful to high-level learners. We construct two types of corpora: one is made of formal documents like online news articles; the other is made of informal documents like customer reviews about new products in web blogs. By comparing these corpora, we show how expressions differ in these two types of corpora. We survey the main characteristics of the informal corpus. Given that a significant proportion of text is informal, we propose Local Grammar Graphs (LGG) as an appropriate model to treat them effectively in Korean e-learning systems.
75.2CLMay 27
A new semantically annotated corpus with syntactic-semantic and cross-lingual sensesMyriam Rakho, Eric Laporte, Matthieu Constant
We describe a new sense-tagged corpus for word sense disambiguation. The corpus is constituted of instances of 20 French polysemous verbs. Each verb instance is annotated with three sense labels: (1) the actual translation of the verb in the english version of this instance in a parallel corpus, (2) an entry of the verb in a computational dictionary of French (the Lexicon-Grammar tables) and (3) a fine-grained sense label resulting from the concatenation of the translation and the Lexicon-Grammar entry.
55.5CLMay 26
Formalization of Malagasy conjugationJoro Ny Aina Ranaivoarison, Eric Laporte, Baholisoa Simone Ralalaoherivony
This paper reports the core linguistic work performed to construct a dictionary-based morphological analyser for Malagasy simple verbs. It uses the Unitex platform and comprised the contruction of an electronic dictionary for Malagasy simple verbs. The data is encoded on the basis of morphological features. The morphological variations of verb stems and their combination with inflectional affixes are formalized in finite-state transducers represented by editable graphs. 78 transducers allow Unitex to generate a dictionary of allomorphs of stems. 271 other transducers are used by the morphological analyser of Unitex to recognize the stem and the affixes in conjugated verbs. The design of the dictionary and transducers prioritizes readability, so that they can be extended and updated by linguists.
2.3CLMay 8Code
Generating training datasets for legal chatbots in KoreanChanghoe Hwang, Jee-Sun Nam, Eric Laporte
Chatbots are robots that can communicate with humans using text or voice signals. Legal chatbots improve access to justice, since legal representation and legal advice by lawyers come with a high cost that excludes disadvantaged and vulnerable people. However, capturing the diversity of actual user input in datasets for deep-learning dialog systems (chatbots) is a technical challenge. Diversity requires large volumes of data, which must also be labelled in order to classify the user's intent, while the cost of labelling datasets increases with volume. Instead of labelling large volumes of authentic data from users, our approach consists in jointly generating large volumes of utterances and high-quality labels. The generator of labelled datasets is based on language resources that take the form of local grammar graphs (LGG), which capture and generalize the vocabulary and local syntax observed by linguists in text. The LGGs associate labels to the utterances according to a domain-specific classification system. We tested this approach by implementing LIGA, a legal chatbot in Korean. The chatbot answers users' conversational queries on legal situations by providing information on similar legal cases, made publicly available by the Korean government. We generated labelled utterances from the LGGs with the aid of the open-source Unitex platform. This process produced 700 million utterances. We trained a DIET classifier on a dataset made of these utterances, and the trained model reached 91% f1-score performance. We implemented a chatbot called LIGA, which uses the results of the model to select a link to a web page that documents similar legal cases.
3.0CLMay 21
Pattern-and-root inflectional morphology: the Arabic broken pluralAlexis Amid Neme, Eric Laporte
We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3 200 entries of BP nouns.
39.4CLMay 14
Conversion of Lexicon-Grammar tables to LMF. Application to FrenchEric Laporte, Elsa Tolone, Mathieu Constant
We describe the first experiment of conversion of Lexicon-Grammar tables for French verbs into the Lexical Markup Framework (LMF) format. The Lexicon-Grammar of the French language is currently one of the major sources of lexical and syntactic information for French. Its conversion into an interoperable representation format according to the LMF standard makes it usable in different contexts, thus contributing to the standardization and interoperability of natural language processing dictionaries. We briefly introduce the Lexicon-Grammar and the derived dictionaries; we analyse the main difficulties faced during the conversion; and we describe the resulting resource.
13.3CLMay 11
Building Korean linguistic resource for NLU data generation of banking app CS dialog systemJeongwoo Yoon, On-yu Park, Changhoe Hwang et al.
Natural language understanding (NLU) is integral to task-oriented dialog systems, but demands a considerable amount of annotated training data to increase the coverage of diverse utterances. In this study, we report the construction of a linguistic resource named FIAD (Financial Annotated Dataset) and its use to generate a Korean annotated training data for NLU in the banking customer service (CS) domain. By an empirical examination of a corpus of banking app reviews, we identified three linguistic patterns occurring in Korean request utterances: TOPIC (ENTITY, FEATURE), EVENT, and DISCOURSE MARKER. We represented them in LGGs (Local Grammar Graphs) to generate annotated data covering diverse intents and entities. To assess the practicality of the resource, we evaluate the performances of DIET-only (Intent: 0.91 /Topic [entity+feature]: 0.83), DIET+ HANBERT (I:0.94/T:0.85), DIET+ KoBERT (I:0.94/T:0.86), and DIET+ KorBERT (I:0.95/T:0.84) models trained on FIAD-generated data to extract various types of semantic items.
42.1CLMay 12
Choosing features for classifying multiword expressionsEric Laporte
Multiword expressions (MWEs) are a heterogeneous set with a glaring need for classifications. Designing a satisfactory classification involves choosing features. In the case of MWEs, many features are a priori available. Not all features are equal in terms of how reliably MWEs can be assigned to classes. Accordingly, resulting classifications may be more or less fruitful for computational use. I outline an enhanced classification. In order to increase its suitability for many languages, I use previous works taking into account various languages.
34.2CLMay 12
Concordance Comparison as a Means of Assembling Local GrammarsJuliana Pirovani, Elias de Oliveira, Eric Laporte
Named Entity Recognition for person names is an important but non-trivial task in information extraction. This article uses a tool that compares the concordances obtained from two local grammars (LG) and highlights the differences. We used the results as an aid to select the best of a set of LGs. By analyzing the comparisons, we observed relationships of inclusion, intersection and disjunction within each pair of LGs, which helped us to assemble those that yielded the best results. This approach was used in a case study on extraction of person names from texts written in Portuguese. We applied the enhanced grammar to the Gold Collection of the Second HAREM. The F-Measure obtained was 76.86, representing a gain of 6 points in relation to the state-of-the-art for Portuguese.
6.4CLMay 11
DECO-MWE: building a linguistic resource of Korean multiword expressions for feature-based sentiment analysisJaeho Han, Changhoe Hwang, Seongyong Choi et al.
This paper aims to construct a linguistic resource of Korean Multiword Expressions for Feature-Based Sentiment Analysis (FBSA): DECO-MWE. Dealing with multiword expressions (MWEs) has been a critical issue in FBSA since many constructs reveal lexical idiosyncrasy. To construct linguistic resources of sentiment MWEs efficiently, we utilize the Local Grammar Graph (LGG) methodology: DECO-MWE is formalized as a Finite-State Transducer that represents lexical-syntactic restrictions on MWEs. In this study, we built a corpus of cosmetics review texts, which show particularly frequent occurrences of MWEs. Based on an empirical examination of the corpus, four types of MWEs have been distinguished. The DECO-MWE thus covers the following four categories: Standard Polarity MWEs (SMWEs), Domain-Dependent Polarity MWEs (DMWEs), Compound Named Entity MWEs (EMWEs) and Compound Feature MWEs (FMWEs). The retrieval performance of the DECO-MWE shows 0.806 f-measure in the test corpus. This study brings a twofold outcome: first, a sizeable general-purpose polarity MWE lexicon, which may be broadly used in FBSA; second, a finite-state methodology adopted in this study to treat domain-dependent MWEs such as idiosyncratic polarity expressions, named entity expressions or feature expressions, and which may be reused in describing linguistic properties of other corpus domains.
27.3CLMay 11
Where do aspectual variants of light verb constructions belong?Aggeliki Fotopoulou, Eric Laporte, Takuya Nakamura
Expressions with an aspectual variant of a light verb, e.g. 'take on debt' vs. 'have debt', are frequent in texts but often difficult to classify between verbal idioms, light verb constructions or compositional phrases. We investigate the properties of such expressions with a disputed membership and propose a selection of features that determine more satisfactory boundaries between the three categories in this zone, assigning the expressions to one of them.
1.0CLMay 8
SSP-based construction of evaluation-annotated data for fine-grained aspect-based sentiment analysisSuwon Choi, Shinwoo Kim, Changhoe Hwang et al.
We report the construction of a Korean evaluation-annotated corpus, hereafter called 'Evaluation Annotated Dataset (EVAD)', and its use in Aspect-Based Sentiment Analysis (ABSA) extended in order to cover e-commerce reviews containing sentiment and non-sentiment linguistic patterns. The annotation process uses Semi-Automatic Symbolic Propagation (SSP). We built extensive linguistic resources formalized as a Finite-State Transducer (FST) to annotate corpora with detailed ABSA components in the fashion e-commerce domain. The ABSA approach is extended, in order to analyze user opinions more accurately and extract more detailed features of targets, by including aspect values in addition to topics and aspects, and by classifying aspectvalue pairs depending whether values are unary, binary, or multiple. For evaluation, the KoBERT and KcBERT models are trained on the annotated dataset, showing robust performances of F1 0.88 and F1 0.90, respectively, on recognition of aspect-value pairs.
CLApr 19, 2019
Recognizing the vocabulary of Brazilian popular newspapers with a free-access computational dictionaryMaria José Finatto, Oto Vale, Eric Laporte
We report an experiment to check the identification of a set of words in popular written Portuguese with two versions of a computational dictionary of Brazilian Portuguese, DELAF PB 2004 and DELAF PB 2015. This dictionary is freely available for use in linguistic analyses of Brazilian Portuguese and other researches, which justifies critical study. The vocabulary comes from the PorPopular corpus, made of popular newspapers Di{á}rio Ga{ú}cho (DG) and Massa! (MA). From DG, we retained a set of texts with 984.465 words (tokens), published in 2008, with the spelling used before the Portuguese Language Orthographic Agreement adopted in 2009. From MA, we examined papers of 2012, 2014 e 2015, with 215.776 words (tokens), all with the new spelling. The checking involved: a) generating lists of words (types) occurring in DG and MA; b) comparing them with the entry lists of both versions of DELAF PB; c) assessing the coverage of this vocabulary; d) proposing ways of incorporating the items not covered. The results of the work show that an average of 19% of the types in DG were not found in DELAF PB 2004 or 2015. In MA, this average is 13%. Switching versions of the dictionary affected slightly the performance in recognizing the words.
CLApr 7, 2014
Intégration des données d'un lexique syntaxique dans un analyseur syntaxique probabilisteAnthony Sigogne, Matthieu Constant, Eric Laporte
This article reports the evaluation of the integration of data from a syntactic-semantic lexicon, the Lexicon-Grammar of French, into a syntactic parser. We show that by changing the set of labels for verbs and predicational nouns, we can improve the performance on French of a non-lexicalized probabilistic parser.