CV | Mar 27, 2023 | Code
TabIQA: Table Questions Answering on Business Document Images
Phuc Nguyen, Nam Tuan Ly, Hideaki Takeda et al.
Answering questions about tables in business documents poses many challenges: it requires understanding tabular structures, cross-document referencing, and numeric computations beyond simple search queries. This paper introduces a novel pipeline, named TabIQA, to answer questions about business document images. TabIQA combines state-of-the-art deep learning techniques 1) to extract table content and structural information from images and 2) to answer various questions related to numerical data, text-based information, and complex queries from structured tables. The evaluation results on the VQAonBD 2023 dataset demonstrate the effectiveness of TabIQA in achieving promising performance in answering table-related questions. The TabIQA repository is available at https://github.com/phucty/itabqa.
LG | May 6 | Code
Robustness of Graph Self-Supervised Learning to Real-World Noise: A Case Study on Text-Driven Biomedical Graphs
Othmane Kabal, Mounira Harzallah, Fabrice Guillet et al.
Graph Self-Supervised Learning (GSSL) offers a powerful paradigm for learning graph representations without labeled data. However, existing work assumes clean, manually curated graphs. Recent advances in NLP enable the large-scale automatic extraction of knowledge graphs from text, opening new opportunities for GSSL while introducing substantial real-world noise. This type of noise remains largely unexplored, as prior robustness studies typically rely on synthetic perturbations. To address this gap, we present the first comprehensive evaluation of GSSL methods on text-driven graphs for unsupervised term typing. We introduce Noise-Aware Text-Driven Graph GSSL (NATD-GSSL), a unified framework that combines automatic graph construction, graph refinement, and GSSL. Our evaluation follows a dual-graph protocol that contrasts a noisy graph derived from MedMentions with a clean Unified Medical Language System (UMLS) reference graph, aligned through a shared gold standard. Our results reveal variability in robustness across both pretext tasks and Graph Neural Network (GNN) architectures. Relation reconstruction is highly sensitive to noise and benefits from well-defined schemas, whereas feature reconstruction is considerably more robust, achieving performance comparable to clean-graph settings. Contrastive objectives are generally less affected by noise but depend strongly on alignment with downstream tasks. GNN architecture also plays a critical role: bidirectional relational message-passing designs are better suited to noisy, text-driven graphs, while unidirectional relational ones perform best on clean graphs. Overall, NATD-GSSL provides practical guidance for applying GSSL to real-world, noisy graphs and achieves up to a 7% improvement over pretrained language model baselines. All code and benchmarks are publicly available at https://github.com/OthmaneKabal/MC2GAE.
LG | May 6
A Unified Benchmark for Evaluating Knowledge Graph Construction Methods and Graph Neural Networks
Othmane Kabal, Mounira Harzallah, Fabrice Guillet et al.
Knowledge graphs automatically constructed from text are increasingly used in real-world applications. However, their inherent noise, fragmentation, and semantic inconsistencies significantly affect the performance of Graph Neural Networks (GNNs) on downstream tasks. Assessing their performance and robustness remains difficult, as it is often unclear whether observed results stem from the learning model or from the quality of the constructed graph itself. In this work, we introduce a dual-purpose benchmark designed to jointly evaluate (i) the performance of GNNs on noisy, text-derived graphs and (ii) the effectiveness of graph construction methods on a downstream task. The benchmark is built in the biomedical domain from a single textual corpus and includes two automatically constructed graphs generated using different extraction methods, alongside a high-quality reference graph curated by experts that serves as an upper performance bound. This design enables controlled comparison of construction methods and systematic evaluation of GNN robustness through semi-supervised node classification. We further provide a standardized, reproducible, and extensible evaluation framework, facilitating the integration of new graph extraction methods and learning models.
CV | Mar 14, 2023
Rethinking Image-based Table Recognition Using Weakly Supervised Methods
Nam Tuan Ly, Atsuhiro Takasu, Phuc Nguyen et al.
Most previous methods for table recognition rely on training datasets containing many richly annotated table images. Detailed table image annotation, e.g., cell or text bounding box annotation, however, is costly and often subjective. In this paper, we propose a weakly supervised model named WSTabNet for table recognition that relies only on HTML (or LaTeX) code-level annotations of table images. The proposed model consists of three main parts: an encoder for feature extraction, a structure decoder for generating table structure, and a cell decoder for predicting the content of each cell in the table. Our system is trained end-to-end by stochastic gradient descent algorithms, requiring only table images and their ground-truth HTML (or LaTeX) representations. To facilitate table recognition with deep learning, we create and release WikiTableSet, the largest publicly available image-based table recognition dataset built from Wikipedia. WikiTableSet contains nearly 4 million English table images, 590K Japanese table images, and 640K French table images with corresponding HTML representations and cell bounding boxes. Extensive experiments on WikiTableSet and two large-scale datasets, FinTabNet and PubTabNet, demonstrate that the proposed weakly supervised model achieves better or similar accuracy compared to the state-of-the-art models on all benchmark datasets.
AI | Sep 6, 2024
An Argumentative Approach for Explaining Preemption in Soft-Constraint Based Norms
Wachara Fungwacharakorn, Kanae Tsushima, Hiroshi Hosobe et al.
Although various aspects of soft-constraint based norms have been explored, it is still challenging to understand preemption. Preemption is a situation where higher-level norms override lower-level norms when new information emerges. To address this, we propose a derivation state argumentation framework (DSA-framework). The DSA-framework incorporates derivation states to explain how preemption arises based on evolving situational knowledge. Based on the DSA-framework, we present an argumentative approach for explaining preemption. We formally prove that, under local optimality, the DSA-framework can explain why a consequence is obligatory or forbidden by soft-constraint based norms represented as logical constraint hierarchies.
CL | Oct 2, 2020 | Code
LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention
Ikuya Yamada, Akari Asai, Hiroyuki Shindo et al.
Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.
CL | Dec 15, 2018 | Code
Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia
Ikuya Yamada, Akari Asai, Jin Sakuma et al.
The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io.
CL | Feb 1, 2024
SPARQL Generation with Entity Pre-trained GPT for KG Question Answering
Diego Bustamante, Hideaki Takeda
The popularity of knowledge graphs has grown rapidly in recent years. All of that knowledge is available for people to query through the many online databases on the internet. However, it would be a great achievement if non-programmer users could access whatever information they want to know. Much effort has gone into solving this task with natural language processing tools, and creativity has been encouraged through many challenges. Our approach assumes correct entity linking on the natural language questions and trains a GPT model to create SPARQL queries from them. We managed to isolate which property of the task can be the most difficult to solve in few-shot or zero-shot settings, and we propose pre-training on all entities (under CWA) to improve performance. We obtained 62.703% accuracy of exact SPARQL matches on testing at 3 shots, an F1 of 0.809 on the entity linking challenge, and an F1 of 0.009 on the question answering challenge.
CL | Nov 7, 2025
Diagnosing and Mitigating Semantic Inconsistencies in Wikidata's Classification Hierarchy
Shixiong Zhao, Hideaki Takeda
Wikidata is currently the largest open knowledge graph on the web, encompassing over 120 million entities. It integrates data from various domain-specific databases and imports a substantial amount of content from Wikipedia, while also allowing users to freely edit its content. This openness has positioned Wikidata as a central resource in knowledge graph research and has enabled convenient knowledge access for users worldwide. However, its relatively loose editorial policy has also led to a degree of taxonomic inconsistency. Building on prior work, this study proposes and applies a novel validation method to confirm the presence of classification errors, over-generalized subclass links, and redundant connections in specific domains of Wikidata. We further introduce a new evaluation criterion for determining whether such issues warrant correction and develop a system that allows users to inspect the taxonomic relationships of arbitrary Wikidata entities, leveraging the platform's crowdsourced nature to its full potential.
CL | Jul 20, 2025
Full Triple Matcher: Integrating all triple elements between heterogeneous Knowledge Graphs
Victor Eiti Yamamoto, Hideaki Takeda
Knowledge graphs (KGs) are powerful tools for representing and reasoning over structured information. Their main components include schema, identity, and context. While schema and identity matching are well-established in ontology and entity matching research, context matching remains largely unexplored. This is particularly important because real-world KGs often vary significantly in source, size, and information density - factors not typically represented in the datasets on which current entity matching methods are evaluated. As a result, existing approaches may fall short in scenarios where diverse and complex contexts need to be integrated. To address this gap, we propose a novel KG integration method consisting of label matching and triple matching. We use string manipulation, fuzzy matching, and vector similarity techniques to align entity and predicate labels. Next, we identify mappings between triples that convey comparable information, using these mappings to improve entity-matching accuracy. Our approach demonstrates competitive performance compared to leading systems in the OAEI competition and against supervised methods, achieving high accuracy across diverse test cases. Additionally, we introduce a new dataset derived from the benchmark dataset to evaluate the triple-matching step more comprehensively.
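The label-matching step described above (string manipulation plus fuzzy matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the normalization rules, the `match_labels` helper, the 0.85 threshold, and the sample labels are all assumptions made for illustration, and the vector-similarity and triple-matching stages are omitted.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase a label and unify separators (assumed normalization rules)."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def similarity(a, b):
    """Fuzzy similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_labels(source, target, threshold=0.85):
    """Map each source label to its most similar target label.

    A pair is kept only if its similarity reaches the (hypothetical) threshold.
    """
    matches = {}
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            matches[s] = best
    return matches


# Hypothetical predicate labels from two knowledge graphs.
source = ["birth_place", "Date-of-Birth", "spouse"]
target = ["birthPlace", "date of birth", "occupation"]
print(match_labels(source, target))
```

In this toy run, `spouse` has no sufficiently similar counterpart and is left unmatched, which is the kind of case a later triple-level comparison would need to resolve.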
AI | Oct 5, 2020
TabEAno: Table to Knowledge Graph Entity Annotation
Phuc Nguyen, Natthawut Kertkeidkachorn, Ryutaro Ichise et al.
In the Open Data era, a large number of table resources have been made available on the Web and data portals. However, it is difficult to directly utilize such data due to the ambiguity of entities, name variations, heterogeneous schemas, and missing or incomplete metadata. To address these issues, we propose a novel approach, namely TabEAno, to semantically annotate table rows toward knowledge graph entities. Specifically, we introduce a "two-cells" lookup strategy based on the assumption that a logical relation exists in the knowledge graph between two close cells in the same row of the table. Despite the simplicity of the approach, TabEAno outperforms state-of-the-art approaches on two standard datasets, T2D and Limaye, and on the large-scale Wikipedia tables dataset.
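One possible reading of the "two-cells" assumption is sketched below: candidate entities for two cells in the same row are kept only when the knowledge graph contains a relation between them. The toy triples, entity IDs, label index, and the `two_cells_lookup` helper are hypothetical and do not reproduce TabEAno's actual lookup.

```python
# Toy knowledge graph as (subject, predicate, object) triples (hypothetical IDs).
TRIPLES = {
    ("Q_paris_fr", "capital_of", "Q_france"),
    ("Q_paris_tx", "located_in", "Q_texas"),
}

# Lowercased surface label -> candidate entity IDs (hypothetical index).
LABELS = {
    "paris": {"Q_paris_fr", "Q_paris_tx"},
    "france": {"Q_france"},
}


def related(a, b):
    """True if some triple links entities a and b in either direction."""
    return any((s == a and o == b) or (s == b and o == a) for s, _, o in TRIPLES)


def two_cells_lookup(cell_a, cell_b):
    """Return candidate entity pairs for two row cells that are related in the graph."""
    cands_a = LABELS.get(cell_a.lower(), set())
    cands_b = LABELS.get(cell_b.lower(), set())
    return sorted((a, b) for a in cands_a for b in cands_b if related(a, b))


# "Paris" alone is ambiguous; pairing it with "France" keeps only one candidate.
print(two_cells_lookup("Paris", "France"))
```

The point of the sketch is that the second cell acts as context: the ambiguous label `paris` has two candidates, but only one of them is connected to a candidate for `france`.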
AI | Oct 1, 2019
MTab: Matching Tabular Data to Knowledge Graph using Probability Models
Phuc Nguyen, Natthawut Kertkeidkachorn, Ryutaro Ichise et al.
This paper presents the design of our system, namely MTab, for the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching (SemTab 2019). MTab combines a voting algorithm and probability models to solve critical problems of the matching tasks. Results on SemTab 2019 show that MTab obtains promising performance on the three matching tasks.
DB | Jun 26, 2018
EmbNum: Semantic labeling for numerical values with deep metric learning
Phuc Nguyen, Khai Nguyen, Ryutaro Ichise et al.
Semantic labeling for numerical values is the task of assigning semantic labels to unknown numerical attributes. The semantic labels could be numerical properties in ontologies, instances in knowledge bases, or labeled data that are manually annotated by domain experts. In this paper, we treat semantic labeling as a retrieval setting where the label of an unknown attribute is taken from the most relevant attribute in the labeled data. One of the greatest challenges is that an unknown attribute rarely has the same set of values as a similar one in the labeled data. To overcome this issue, the statistical interpretation of the value distribution is taken into account. However, existing studies assume a specific form of distribution, which is not appropriate for open data in particular, where there is no prior knowledge of the data. To address these problems, we propose a neural numerical embedding model (EmbNum) to learn useful representation vectors for numerical attributes without prior assumptions on the distribution of data. Then, the "semantic similarities" between the attributes are measured on these representation vectors by the Euclidean distance. Our empirical experiments on City Data and Open Data show that EmbNum significantly outperforms state-of-the-art methods for the task of numerical attribute semantic labeling regarding effectiveness and efficiency.
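The retrieval setting described above, where an unknown attribute takes the label of its nearest labeled attribute under Euclidean distance, can be sketched as follows. The embedding vectors here are hand-written placeholders; in EmbNum they would come from the learned neural embedding model, which this sketch does not implement. The `label_by_nearest` helper and the sample labels are assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def label_by_nearest(query_vec, labeled):
    """Return the label whose attribute embedding is closest to query_vec.

    `labeled` is a list of (label, embedding) pairs.
    """
    best_label, _ = min(labeled, key=lambda item: euclidean(query_vec, item[1]))
    return best_label


# Placeholder embeddings for labeled numerical attributes (not learned values).
labeled = [
    ("population", [0.9, 0.1, 0.0]),
    ("elevation", [0.1, 0.8, 0.2]),
    ("area", [0.2, 0.1, 0.9]),
]
query = [0.8, 0.2, 0.1]  # embedding of an unknown attribute (placeholder)
print(label_by_nearest(query, labeled))
```

Because labeling reduces to a nearest-neighbor search over fixed vectors, new labeled attributes can be added to the index without changing the distance computation.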
CL | May 6, 2017
Learning Distributed Representations of Texts and Entities from Knowledge Base
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda et al.
We describe a neural network model that jointly learns distributed representations of texts and knowledge base (KB) entities. Given a text in the KB, we train our proposed model to predict entities that are relevant to the text. Our model is designed to be generic with the ability to address various NLP tasks with ease. We train the model using a large corpus of texts and their entity annotations extracted from Wikipedia. We evaluated the model on three important NLP tasks (i.e., sentence textual similarity, entity linking, and factoid question answering) involving both unsupervised and supervised settings. As a result, we achieved state-of-the-art results on all three of these tasks. Our code and trained models are publicly available for further academic research.
CL | Jan 6, 2016
Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation
Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda et al.
Named Entity Disambiguation (NED) refers to the task of resolving multiple named entity mentions in a document to their correct references in a knowledge base (KB) (e.g., Wikipedia). In this paper, we propose a novel embedding method specifically designed for NED. The proposed method jointly maps words and entities into the same continuous vector space. We extend the skip-gram model by using two models. The KB graph model learns the relatedness of entities using the link structure of the KB, whereas the anchor context model aims to align vectors such that similar words and entities occur close to one another in the vector space by leveraging KB anchors and their context words. By combining contexts based on the proposed embedding with standard NED features, we achieved state-of-the-art accuracy of 93.1% on the standard CoNLL dataset and 85.2% on the TAC 2010 dataset.