44.0 · CL · May 28
Refining Word-Based Grammatical Error Annotation for L2 Korean
Jungyeul Park, Kyungtae Lim, Wonjun Oh et al.
Korean grammatical error correction (K-GEC) presents a structural mismatch between word-based evaluation and the morpheme-level locus of many learner errors. Postpositions and verbal endings are bound to lexical hosts, but they encode grammatical relations that must be represented in correction and evaluation. This paper refines word-based grammatical error annotation for L2 Korean by addressing three connected problems in existing resources: surface target realization, Korean-specific edit annotation, and single-reference evaluation. We reconstruct target sentences from the National Institute of Korean Language (NIKL) L2 corpus under morphologically constrained realization rules and convert its morpheme-level annotations into word-level \texttt{m2} edits. We then define a Korean ERRANT-style annotation scheme that preserves the MRU core while distinguishing functional morpheme errors, spelling errors, word boundary errors, and word order errors. We also augment the KoLLA corpus with an additional reference correction, yielding a multi-reference evaluation setting for Korean GEC. Empirical validation shows that the refined NIKL targets yield lower perplexity, the converted \texttt{m2} files achieve higher agreement with source-target edit representations, and the refined resources improve KoBART-based correction under the same model setting. Multi-reference KoLLA evaluation further reduces the penalty imposed on valid corrections that diverge from a single reference, especially for neural and prompted GEC systems. These results show that Korean GEC evaluation depends not only on correction models, but also on reference data and edit annotations that reflect Korean morphology, spacing, and correction variability.
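To illustrate why a second reference reduces the penalty on valid corrections, the following is a minimal sketch of edit-level scoring against several references. It is not the paper's evaluation code: the edit representation `(start, end, correction)`, the function names, and the per-sentence "best reference" rule are assumptions made for illustration, and the Korean word forms are invented examples.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts; beta=0.5 weights precision over recall."""
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_against(hyp_edits, ref_edits):
    """Score one hypothesis edit set against one reference edit set."""
    tp = len(hyp_edits & ref_edits)   # edits the system and reference share
    fp = len(hyp_edits - ref_edits)   # system edits absent from the reference
    fn = len(ref_edits - hyp_edits)   # reference edits the system missed
    return f_beta(tp, fp, fn)


def multi_reference_score(hyp_edits, references):
    """Assumed rule: keep the score of the best-matching reference."""
    return max(score_against(hyp_edits, ref) for ref in references)


# Hypothetical edits: (start word index, end word index, corrected word).
hypothesis = {(1, 2, "학교에")}
reference_a = {(1, 2, "학교로")}   # a different but also valid correction
reference_b = {(1, 2, "학교에")}   # the added reference matches the system

single = score_against(hypothesis, reference_a)
multi = multi_reference_score(hypothesis, [reference_a, reference_b])
```

With only `reference_a`, the system's valid correction counts as both a false positive and a false negative; adding `reference_b` lets the same output receive full credit.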
CL · Apr 6, 2022
Distributed Transition Systems with Tags for Privacy Analysis
Siva Anantharaman, Sabine Frittella, Benjamin Nguyen
We present a logical framework that formally models how private information P stored in a database D can be progressively captured by an agent/adversary who queries the database repeatedly. Named DLTTS (Distributed Labeled Tagged Transition System), the framework borrows ideas from several domains: Segala's Probabilistic Automata, Probabilistic Concurrent Systems, and Probabilistic Labelled Transition Systems. Every node of a DLTTS carries a tag that represents the 'current' knowledge of the adversary, acquired from the responses of the DBMS answering mechanism to his/her queries at the nodes traversed earlier along any given run; this knowledge is completed at the same node with further relational deductions, possibly in combination with 'public' information from other databases given in advance. A 'blackbox' mechanism is also part of a DLTTS and acts as an oracle: its role is to tell whether the adversary has deduced the private information at the current node and, if so, to terminate the run. An additional feature is that the blackbox also indicates how 'close' or how 'far' the adversary's knowledge is from the private information P at the current node. A metric is defined for that purpose on the set of all 'type compatible' tuples from the given database, the data themselves being typed with the headers of the database. Despite the transition-system flavor of our framework, this metric is not 'behavioral' in the sense presented in some other works. It is exclusively database oriented, and it allows us to define new notions of adjacency and of indistinguishability between databases that are more general than those usually based on the Hamming metric (and a restricted notion of adjacency). Examples are given throughout to illustrate how our framework works. Keywords: Database, Privacy, Transition System, Probability, Distribution.
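For readers unfamiliar with the Hamming-based baseline that the abstract contrasts itself with, here is a minimal sketch. The first two functions show the usual notion, in which two databases are adjacent when they differ in exactly one row. The third, `attribute_distance`, is a purely hypothetical finer-grained tuple distance added for contrast; it is not the metric defined in the paper, and all names and data are assumptions.

```python
def hamming_distance(db1, db2):
    """Number of row positions at which two equally sized databases differ."""
    if len(db1) != len(db2):
        raise ValueError("databases must have the same number of rows")
    return sum(1 for row1, row2 in zip(db1, db2) if row1 != row2)


def are_adjacent(db1, db2):
    """Usual restricted adjacency: the databases differ in exactly one row."""
    return hamming_distance(db1, db2) == 1


def attribute_distance(tuple1, tuple2):
    """Hypothetical tuple distance: fraction of attributes that differ.

    Illustrative only; it assumes both tuples share the same headers.
    """
    if len(tuple1) != len(tuple2):
        raise ValueError("tuples must have the same number of attributes")
    return sum(a != b for a, b in zip(tuple1, tuple2)) / len(tuple1)


# Invented example rows: (name, age, diagnosis).
db_a = [("alice", 30, "flu"), ("bob", 41, "cold")]
db_b = [("alice", 30, "flu"), ("bob", 41, "asthma")]

distance = hamming_distance(db_a, db_b)
adjacent = are_adjacent(db_a, db_b)
row_gap = attribute_distance(db_a[1], db_b[1])
```

The Hamming view only records that one row changed, whereas a tuple-level distance can also say how much that row changed, which is the kind of refinement a 'close'/'far' oracle needs.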
CR · Jan 8, 2020
Techniques d'anonymisation tabulaire : concepts et mise en oeuvre (Tabular Anonymization Techniques: Concepts and Implementation)
Benjamin Nguyen, Claude Castelluccia
In this document, we present the state of the art of anonymization techniques for classical tabular datasets. It is aimed at a technical audience with a basic university background in mathematics and computer science, but with no specialist knowledge of anonymization. The objective of this document is to explain the concepts needed to sanitize a tabular dataset and to compute the re-identification risk. The document consists largely of examples that help the reader understand how to carry out the calculations.
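As a small illustration of the kind of calculation involved, the sketch below groups records by their quasi-identifiers and derives a worst-case re-identification risk from the smallest group. It assumes the standard k-anonymity view, in which a record hidden among k indistinguishable records is re-identified with probability at most 1/k; the abstract does not say which measures the document uses, and the column names and data are invented.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count records sharing the same values on the quasi-identifier columns."""
    return Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest equivalence class (the k of k-anonymity)."""
    return min(equivalence_class_sizes(records, quasi_identifiers).values())


def max_reidentification_risk(records, quasi_identifiers):
    """Worst-case risk under the assumed model: 1 / smallest class size."""
    return 1 / k_anonymity(records, quasi_identifiers)


# Invented, already generalized table.
table = [
    {"age": "30-39", "zip": "750**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "750**", "diagnosis": "cold"},
    {"age": "30-39", "zip": "750**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "690**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "690**", "diagnosis": "cold"},
]

k = k_anonymity(table, ["age", "zip"])
risk = max_reidentification_risk(table, ["age", "zip"])
```

Here the smallest group has two records, so the table is 2-anonymous on `age` and `zip` and the worst-case risk under this model is 0.5.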
CR · Sep 11, 2015
Key Exchange Protocol in the Trusted Data Servers Context
Quoc-Cuong To, Benjamin Nguyen, Philippe Pucheral
The aim of this technical report is to complement the work in [To et al. 2014] by proposing a Group Key Exchange protocol that lets the Querier and the TDSs (and the TDSs among themselves) securely create and exchange a shared key. The security of this protocol is then formally proved in the game-based model. Finally, we compare this protocol with related work.