CL | Apr 8, 2022 | Code
Enhance Incomplete Utterance Restoration by Joint Learning Token Extraction and Text Generation
Shumpei Inoue, Tsungwei Liu, Nguyen Hong Son et al.
This paper introduces a model for incomplete utterance restoration (IUR) called JET (Joint learning token Extraction and Text generation). Unlike prior studies that work only on extraction or abstraction datasets, we design a simple but effective model that works for both scenarios of IUR. Our design simulates the nature of IUR, where omitted tokens from the context contribute to restoration. From this, we construct a Picker that identifies the omitted tokens. To support the Picker, we design two label creation methods (soft and hard labels), which can work when no annotation data is available for the omitted tokens. The restoration is done by a Generator that is jointly trained with the Picker. Promising results on four benchmark datasets in extraction and abstraction scenarios show that our model is better than the pretrained T5 and non-generative language model methods in both rich and limited training data settings. The code is available at https://github.com/shumpei19/JET.
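To make the idea of label creation without annotation concrete, here is a minimal sketch of one way hard labels for a picker could be derived. It is an illustrative assumption, not the authors' procedure: the function name `hard_labels`, the whitespace-level tokens, and the multiset-difference rule are all hypothetical. A context token is labeled 1 if it appears in the restored utterance but is missing from the incomplete one.

```python
from collections import Counter


def hard_labels(context_tokens, utterance_tokens, restored_tokens):
    """Return a 0/1 label per context token (hypothetical labeling rule).

    A token is labeled 1 when the restored utterance contains it more often
    than the incomplete utterance does, i.e. the restoration "added" it.
    """
    # Multiset difference: tokens the restoration added to the utterance.
    added = Counter(restored_tokens) - Counter(utterance_tokens)
    labels = []
    for tok in context_tokens:
        if added[tok] > 0:
            labels.append(1)
            added[tok] -= 1  # consume one occurrence so repeats are not over-labeled
        else:
            labels.append(0)
    return labels


context = ["do", "you", "like", "jazz", "music"]
utterance = ["yes", "i", "love", "it"]
restored = ["yes", "i", "love", "jazz", "music"]
print(hard_labels(context, utterance, restored))
```

In this toy dialogue, only "jazz" and "music" are marked as omitted tokens; the paper's own soft and hard labeling methods may differ in detail.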
CL | May 26, 2022
Jointly Learning Span Extraction and Sequence Labeling for Information Extraction from Business Documents
Nguyen Hong Son, Hieu M. Vu, Tuan-Anh D. Nguyen et al.
This paper introduces a new information extraction model for business documents. Unlike prior studies, which rely on either span extraction or sequence labeling alone, the model takes advantage of both. The combination allows the model to deal with long documents with sparse information (only a small amount of information to extract). The model is trained end-to-end to jointly optimize the two tasks in a unified manner. Experimental results on four business datasets in English and Japanese show that the model achieves promising results and is significantly faster than the standard span-based extraction method. The code is also available.
AI | Jun 2, 2021
A Span Extraction Approach for Information Extraction on Visually-Rich Documents
Tuan-Anh D. Nguyen, Hieu M. Vu, Nguyen Hong Son et al.
Information extraction (IE) for visually-rich documents (VRDs) has achieved SOTA performance recently thanks to the adaptation of Transformer-based language models, which shows the great potential of pre-training methods. In this paper, we present a new approach to improve the capability of language model pre-training on VRDs. Firstly, we introduce a new query-based IE model that employs span extraction instead of using the common sequence labeling approach. Secondly, to further extend the span extraction formulation, we propose a new training task that focuses on modelling the relationships among semantic entities within a document. This task enables target spans to be extracted recursively and can be used to pre-train the model or as an IE downstream task. Evaluation on three datasets of popular business documents (invoices, receipts) shows that our proposed method achieves significant improvements compared to existing models. The method also provides a mechanism for knowledge accumulation from multiple downstream IE tasks.
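As a rough illustration of what span extraction means in contrast to per-token sequence labeling, the sketch below decodes a single answer span from per-token start and end scores. This is the generic decoding step used in many span-based extractors, not necessarily the one in this paper; the function name `best_span`, the `max_len` limit, and the toy scores are assumptions made for the example.

```python
import math


def best_span(start_scores, end_scores, max_len=10):
    """Pick the (start, end) token indices maximizing start + end score.

    Only spans with start <= end and at most max_len tokens are considered.
    """
    best, best_score = (0, 0), -math.inf
    for i, s in enumerate(start_scores):
        # End index is inclusive and bounded by the maximum span length.
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score


# Toy scores for a four-token document and one query (e.g. "total amount").
start = [0.1, 2.0, 0.3, 0.0]
end = [0.0, 0.5, 1.5, 3.0]
print(best_span(start, end, max_len=2))
```

A query-based model would produce one such pair of score vectors per query, so each field is extracted by asking for it rather than by tagging every token with a class.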
IR | Mar 6, 2020
Transfer Learning for Information Extraction with Limited Data
Minh-Tien Nguyen, Viet-Anh Phan, Le Thai Linh et al.
This paper presents a practical approach to fine-grained information extraction. Drawing on the authors' practical experience in applying information extraction to business process automation, we identify two fundamental technical challenges: (i) the availability of labeled data is usually limited and (ii) highly detailed classification is required. The main idea of our proposal is to leverage transfer learning, that is, to reuse pre-trained deep neural network models, in combination with common statistical classifiers to determine the class of each extracted term. To do that, we first exploit BERT to deal with the limitation of training data in real scenarios, then stack BERT with Convolutional Neural Networks to learn hidden representations for classification. To validate our approach, we applied our model to an actual case of document processing: the competitive bidding process for government projects in Japan. We used 100 documents for training and testing and confirmed that the model can extract fine-grained named entities at the level of detail required by the targeted business process, such as the department name of an application receiver.
CR | Dec 4, 2019
The method of detecting online password attacks based on high-level protocol analysis and clustering techniques
Nguyen Hong Son, Ha Thanh Dung
Although many solutions have been applied, the safety challenges related to password security mechanisms have not diminished. The reason is that the means and tools supporting password attacks are becoming more and more abundant, while the number of transaction systems on the Internet is increasing and new service systems keep appearing. For example, IoT also uses password-based authentication. In this context, consolidating password-based authentication mechanisms is critical, but monitoring measures for timely detection of attacks also play an important role. The password attack detection solutions in use need to be supplemented and improved to meet the new situation. In this paper, we propose a solution that automatically detects online password attacks based solely on the network, using unsupervised learning techniques and an orientation toward the protected application. Our solution therefore minimizes dependence on the factors that affect host-based or supervised learning solutions. The reliability of the solution comes from using the results of an in-depth analysis of attack characteristics to build its detection capability. The solution was implemented experimentally on a real system and gave positive results.