DB · Apr 26
Time travel for knowledge graphs: live queries over RDF change histories
Arcangelo Massari, Silvio Peroni
Performing time-traversal queries on RDF datasets remains unsupported in the most extensive knowledge graphs. Existing solutions either require offline ingestion, which prevents concurrent querying and updating, or operate live but with limited query coverage or triplestore dependency. This article presents the Time Agnostic Library, a Python library for performing temporal SPARQL queries live on any SPARQL-compliant triplestore, supporting all six temporal retrieval needs identified in the literature and concurrent updates. The methodology builds on the OpenCitations Data Model (OCDM), which records provenance using the Provenance Ontology (PROV-O) and SPARQL UPDATE operations. The library supports version materialization, single-version and cross-version structured queries, delta materialization, and single-delta and cross-delta structured queries over multi-triple patterns. Evaluation on the BEAR-B benchmark shows sub-linear scaling in both execution time and memory consumption as the number of versions increases. While preprocessing-based systems such as OSTRICH achieve faster query times, they require offline ingestion and cannot handle concurrent data updates. Against R43ples, the closest live system in architecture, the Time Agnostic Library is faster across all query types.
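As a rough illustration of version materialization over a change history, the sketch below rebuilds a past state of an entity by starting from its current triples and undoing every change recorded after the requested time. It is not the Time Agnostic Library's API: `Snapshot`, `materialize_at` and the in-memory triple sets are hypothetical stand-ins for OCDM provenance snapshots and the SPARQL UPDATE operations they store.

```python
from dataclasses import dataclass, field
from datetime import datetime

Triple = tuple[str, str, str]


@dataclass
class Snapshot:
    """Hypothetical stand-in for a provenance snapshot: when it was made and what it changed."""
    generated_at: datetime
    added: set[Triple] = field(default_factory=set)
    deleted: set[Triple] = field(default_factory=set)


def materialize_at(current: set[Triple], history: list[Snapshot], when: datetime) -> set[Triple]:
    """Return the triples valid at `when` by inverting every later delta, newest first."""
    state = set(current)
    for snap in sorted(history, key=lambda s: s.generated_at, reverse=True):
        if snap.generated_at <= when:
            break  # this change and all older ones were already in effect at `when`
        state -= snap.added    # undo insertions
        state |= snap.deleted  # restore deletions
    return state


# Illustrative data only: a title that was replaced in 2022.
history = [
    Snapshot(datetime(2021, 1, 1), added={("ex:paper", "dcterms:title", "Old title")}),
    Snapshot(
        datetime(2022, 1, 1),
        added={("ex:paper", "dcterms:title", "New title")},
        deleted={("ex:paper", "dcterms:title", "Old title")},
    ),
]
current = {("ex:paper", "dcterms:title", "New title")}
past = materialize_at(current, history, datetime(2021, 6, 1))
```

Working backwards from the present state means only the current data and the deltas need to be stored, which is one plausible reason a live approach can coexist with concurrent updates; the actual library runs this kind of reconstruction through SPARQL against the triplestore rather than in memory.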
DL · Apr 23
OpenCitations Meta
Arcangelo Massari, Fabio Mariani, Ivan Heibi et al.
OpenCitations Meta is a new database for open bibliographic metadata of scholarly publications involved in the citations indexed by the OpenCitations infrastructure, adhering to Open Science principles and published under a CC0 license to promote maximum reuse. It presently incorporates bibliographic metadata for publications recorded in Crossref, DataCite and PubMed, making it the largest bibliographic metadata source using Semantic Web technologies. It assigns new globally persistent identifiers (PIDs), known as OpenCitations Meta Identifiers (OMIDs), to all bibliographic resources, enabling it both to disambiguate publications described using different external PIDs (e.g., a DOI in Crossref and a PMID in PubMed) and to handle citations involving publications lacking external PIDs. By hosting bibliographic metadata internally, OpenCitations Meta eliminates its former reliance on API calls to external resources and thus enhances performance in response to user queries. Its automated data curation, following the OpenCitations Data Model, includes deduplication, error correction, metadata enrichment and full provenance tracking, ensuring transparency and traceability of data and bolstering confidence in data integrity, a feature unparalleled in other bibliographic databases. Its commitment to Semantic Web standards ensures superior interoperability compared to other machine-readable formats, with availability via a SPARQL endpoint, REST APIs and data dumps.
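The role of an internal identifier in disambiguation can be sketched as follows: records that share any external PID collapse onto one internal identifier, and a record with no external PID still receives its own. This is only an illustration under stated assumptions, not the OpenCitations Meta pipeline; `assign_ids`, the `internal-N` labels (deliberately not real OMIDs) and the sample records are all hypothetical.

```python
def assign_ids(records: list[dict[str, str]]) -> list[str]:
    """Give records that share any external PID the same internal identifier (union-find)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen: dict[tuple[str, str], int] = {}
    for idx, record in enumerate(records):
        for scheme, value in record.items():
            key = (scheme, value.lower())  # DOIs are case-insensitive
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    labels: dict[int, str] = {}
    assigned = []
    for idx in range(len(records)):
        root = find(idx)
        if root not in labels:
            labels[root] = f"internal-{len(labels) + 1}"  # placeholder, not a real OMID
        assigned.append(labels[root])
    return assigned


# Illustrative records only.
records = [
    {"doi": "10.1000/abc"},                   # as described in one source
    {"doi": "10.1000/ABC", "pmid": "12345"},  # same work with an extra PID
    {"pmid": "12345"},                        # reachable only through the PMID
    {},                                       # no external PID at all
]
ids = assign_ids(records)
```

Here the first three records end up under one identifier because they are linked transitively through the DOI and the PMID, while the fourth keeps a separate one.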
DL · May 3 · Code
HERITRACE: a domain-agnostic framework for SHACL-driven RDF curation with provenance and change tracking
Arcangelo Massari, Silvio Peroni
HERITRACE is an open-source web application that enables users without Semantic Web expertise to curate RDF data through form-based interfaces with automatic provenance documentation and change tracking in RDF. It uses SHACL for data model definition and form generation, connects to existing SPARQL-accessible stores without data migration, and records every modification as a provenance snapshot that can be browsed and restored. HERITRACE is domain-agnostic: adapting it to a new collection requires only SHACL shapes and YAML display rules, without code changes. This paper describes the software architecture and provides the first empirical evaluation. HERITRACE is deployed in production for the ParaText project, where classical philologists curate bibliographic data about ancient Greek exegetical traditions, and is planned as the editing interface for OpenCitations and as the curation layer for the Social Sciences and Humanities Citation Index within the GRAPHIA Horizon Europe project. Since it operates on any SPARQL-accessible store without data migration, its adoption potential extends to any domain maintaining RDF data. HERITRACE is publicly available on GitHub under the ISC license, archived on Zenodo and Software Heritage Archive, and documented for deployment with a pre-built Docker image.
DL · Apr 15
Assessing and Comparing the Coverage of Italian Publications in OpenCitations: a Study within Six Italian Universities
Erica Andreose, Ivan Heibi, Silvio Peroni et al.
Recent initiatives advocating responsible, transparent research assessment have intensified the call to use open research information rather than proprietary databases. This study evaluates how well OpenCitations, a community-owned open infrastructure, covers the publications recorded in the Current Research Information Systems (CRIS) of six Italian universities, all instances of the IRIS software platform, and how it represents their citations. Using persistent identifiers (DOIs, PMIDs, and ISBNs) specified in the IRIS installations involved, we matched the publications recorded in OpenCitations Meta and extracted the related citation links from the OpenCitations Index. Results show that OpenCitations covers, on average, over 40% of IRIS publications, a figure quantitatively comparable to those reported for Scopus and Web of Science in another study. However, gaps persist, particularly for publication types prevalent in the Social Sciences and Humanities, such as monographs and critical editions. Overall, the findings demonstrate the growing maturity of OpenCitations and, more broadly, of Open Science infrastructures as viable alternative sources of research information, while highlighting areas where further metadata enrichment and interoperability efforts are needed.
CL · Jul 18, 2024
CiteFusion: An Ensemble Framework for Citation Intent Classification Harnessing Dual-Model Binary Couples and SHAP Analyses
Lorenzo Paolini, Sahar Vahdati, Angelo Di Iorio et al.
Understanding the motivations underlying scholarly citations is essential to evaluate research impact and promote transparent scholarly communication. This study introduces CiteFusion, an ensemble framework designed to address the multi-class Citation Intent Classification task on two benchmark datasets: SciCite and ACL-ARC. The framework employs a one-vs-all decomposition of the multi-class task into class-specific binary subtasks, leveraging complementary pairs of independently tuned SciBERT and XLNet models for each citation intent. The outputs of these base models are aggregated through a feedforward neural network meta-classifier to reconstruct the original classification task. To enhance interpretability, SHAP (SHapley Additive exPlanations) is employed to analyze token-level contributions and interactions among base models, providing transparency into the classification dynamics of CiteFusion and insights into the kinds of misclassifications the ensemble makes. In addition, this work investigates the semantic role of structural context by incorporating section titles, as framing devices, into input sentences and assessing their positive impact on classification accuracy. CiteFusion ultimately demonstrates robust performance in imbalanced and data-scarce scenarios: experimental results show that it achieves state-of-the-art performance, with Macro-F1 scores of 89.60% on SciCite and 76.24% on ACL-ARC. Furthermore, to ensure interoperability and reusability, citation intents from both datasets' schemas are mapped to Citation Typing Ontology (CiTO) object properties, highlighting some overlaps. Finally, we describe and release a web-based application that classifies citation intents leveraging the CiteFusion models developed on SciCite.
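Because the reported scores are Macro-F1, it may help to recall how that metric treats imbalanced classes: F1 is computed per class and then averaged with equal weight, so a rare intent counts as much as a frequent one. The sketch below is a plain-Python illustration of the standard definition; the toy labels are invented for the example and are not taken from SciCite or ACL-ARC.

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)


# Toy, imbalanced example (illustrative labels only).
gold = ["background", "background", "background", "background", "method", "result"]
pred = ["background", "background", "background", "method", "method", "result"]
score = macro_f1(gold, pred)
```

In this toy case the per-class F1 values are 6/7, 2/3 and 1, so the single error on the majority class lowers two class scores at once, which is why Macro-F1 is a demanding measure in data-scarce settings.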
DL · Apr 24
Mapping bibliographic metadata collections: the case of OpenCitations Meta and OpenAlex
Elia Rizzetto, Silvio Peroni
This study describes the methodology and analyses the results of the process of mapping entities between two large open bibliographic metadata collections, OpenCitations Meta and OpenAlex. The primary objective of this mapping is to integrate OpenAlex internal identifiers into the existing metadata of bibliographic resources in OpenCitations Meta, thereby interlinking and aligning these collections. Furthermore, analysing the output of the mapping provides a unique perspective on the consistency and accuracy of bibliographic metadata, offering a valuable tool for identifying potential inconsistencies in the processed data.
AI · Jan 24, 2022 · Code
A Knowledge Graph Embeddings based Approach for Author Name Disambiguation using Literals
Cristian Santini, Genet Asefa Gesese, Silvio Peroni et al.
Scholarly data is growing continuously and contains information about articles from a plethora of venues, including conferences, journals, etc. Many initiatives have been taken to make scholarly data available as Knowledge Graphs (KGs). These efforts to standardize the data and make them accessible have also led to many challenges, such as the exploration of scholarly articles, ambiguous authors, etc. This study specifically targets the problem of Author Name Disambiguation (AND) on scholarly KGs and presents a novel framework, Literally Author Name Disambiguation (LAND), which utilizes Knowledge Graph Embeddings (KGEs) using multimodal literal information generated from these KGs. This framework is based on three components: 1) multimodal KGEs, 2) a blocking procedure, and finally, 3) Hierarchical Agglomerative Clustering. Extensive experiments have been conducted on two newly created KGs: (i) a KG containing information from Scientometrics Journal from 1978 onwards (OC-782K), and (ii) a KG extracted from a well-known benchmark for AND provided by AMiner (AMiner-534K). The results show that our proposed architecture outperforms our baselines by 8-14% in terms of F1 score and shows competitive performance on a challenging benchmark such as AMiner. The code and the datasets are publicly available through GitHub: https://github.com/sntcristian/and-kge and Zenodo: https://doi.org/10.5281/zenodo.6309855 respectively.
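The last two components, blocking followed by Hierarchical Agglomerative Clustering, can be sketched in a few lines. This is a minimal sketch and not the LAND implementation: the blocking key (last name plus first initial), average linkage, cosine distance, the 0.3 threshold and the two-dimensional vectors standing in for KG embeddings are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def block_key(name: str) -> str:
    """Hypothetical blocking key: last name plus first initial, lower-cased."""
    parts = name.lower().split()
    return f"{parts[-1]}_{parts[0][0]}"


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


def agglomerate(vectors: dict[str, list[float]], threshold: float) -> list[set[str]]:
    """Average-linkage HAC: merge the closest clusters until none is within `threshold`."""
    clusters = [{mention} for mention in vectors]

    def linkage(c1: set[str], c2: set[str]) -> float:
        total = sum(cosine_distance(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        distance, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if distance > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i] |= merged
    return clusters


def disambiguate(mentions: dict[str, tuple[str, list[float]]], threshold: float = 0.3) -> list[set[str]]:
    """Group mentions into blocks by name, then cluster each block independently."""
    blocks: dict[str, dict[str, list[float]]] = defaultdict(dict)
    for mention_id, (name, vector) in mentions.items():
        blocks[block_key(name)][mention_id] = vector
    result: list[set[str]] = []
    for block in blocks.values():
        result.extend(agglomerate(block, threshold))
    return result


# Illustrative mentions: (author name string, stand-in embedding).
mentions = {
    "m1": ("Jane Smith", [1.0, 0.0]),
    "m2": ("J. Smith", [0.9, 0.1]),
    "m3": ("Jane Smith", [0.0, 1.0]),
    "m4": ("John Doe", [1.0, 0.0]),
}
clusters = disambiguate(mentions)
```

Blocking keeps the quadratic clustering step confined to mentions with compatible names, so `m4` is never compared with the Smith mentions even though its vector is identical to that of `m1`.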
DL · Apr 26
Are Digital Humanities really committed to open? An exploratory study on the availability of methodological workflows and open peer review practices
Silvio Peroni
Open Science has become a central framework for promoting transparency, accessibility, and inclusiveness in scholarly research. While the Digital Humanities (DH) community has long embraced openness in terms of research outputs, less attention seems to have been paid to the openness of the methodological and evaluative processes underlying knowledge production. This paper presents an exploratory study that investigates the current state of openness in DH research practices, focusing specifically on research data management documentation and peer review processes. In particular, this study addresses two research questions: (1) to what extent DH publications that describe data explicitly reference external documentation detailing data creation and management processes; and (2) how widely open peer review practices are adopted across DH conferences and journals. The results revealed a limited adoption of open methodological practices. Only a small fraction of the analysed articles provided explicit, reusable documentation of data creation workflows, and no references to data management plans or formal research data management documentation were found. An even more critical picture emerges from the analysis of peer review practices: the vast majority of DH venues continue to rely on traditional single- or double-blind review models, with open peer review adopted in only a few isolated cases.
DL · Dec 23, 2024
Recent Developments in Deep Learning-based Author Name Disambiguation
Francesca Cappelli, Giovanni Colavizza, Silvio Peroni
Author Name Disambiguation (AND) is a critical task for digital libraries aiming to link existing authors with their respective publications. Due to the lack of persistent identifiers used by researchers and the presence of intrinsic linguistic challenges, such as homonymy, the development of Deep Learning algorithms to address this issue has become widespread. Many AND deep learning methods have been developed, and surveys exist comparing the approaches in terms of techniques, complexity, and performance. However, none explicitly addresses deep learning-based AND methods from recent years (i.e., 2016-2024). In this paper, we provide a systematic review of state-of-the-art AND techniques based on deep learning, highlighting recent improvements, challenges, and open issues in the field. We find that DL methods have significantly impacted AND by enabling the integration of structured and unstructured data, and that hybrid approaches effectively balance supervised and unsupervised learning.
DL · Nov 9, 2021
A quantitative and qualitative open citation analysis of retracted articles in the humanities
Ivan Heibi, Silvio Peroni
In this article, we show and discuss the results of a quantitative and qualitative analysis of open citations to retracted publications in the humanities domain. Our study was conducted by selecting retracted papers in the humanities domain and marking their main characteristics (e.g., retraction reason). Then, we gathered the citing entities and annotated their basic metadata (e.g., title, venue, subject, etc.) and the characteristics of their in-text citations (e.g., intent, sentiment, etc.). Using these data, we performed a quantitative and qualitative study of retractions in the humanities, presenting descriptive statistics and a topic modeling analysis of the citing entities' abstracts and the in-text citation contexts. Among our main findings, we noticed that there was no drop in the overall number of citations after the year of retraction, and that few citing entities either mentioned the retraction or expressed a negative sentiment toward the cited publication. In addition, on several occasions, citing entities belonging to the health sciences domain showed greater concern or awareness about citing a retracted publication than those in the humanities and social science domains. Philosophy, arts, and history are the humanities areas that showed the greatest concern toward the retraction.
DL · Jun 23, 2021
BiblioDAP: The 1st Workshop on Bibliographic Data Analysis and Processing
Zeyd Boukhers, Philipp Mayr, Silvio Peroni
Automatic processing of bibliographic data has become very important in digital libraries, data science and machine learning, both because it is needed to keep pace with the significant yearly increase in published papers and because of the inherent challenges it poses. This processing has several aspects, including but not limited to I) automatic extraction of references from PDF documents, II) building an accurate citation graph, III) author name disambiguation, etc. Bibliographic data is heterogeneous by nature and occurs in both structured (e.g. citation graph) and unstructured (e.g. publications) formats. Therefore, it requires data science and machine learning techniques to be processed and analysed. Here we introduce BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing.
AI · Nov 25, 2020
The Landscape of Ontology Reuse Approaches
Valentina Anita Carriero, Marilena Daquino, Aldo Gangemi et al.
Ontology reuse aims to foster interoperability and facilitate knowledge reuse. Several approaches are typically evaluated by ontology engineers when bootstrapping a new project. However, current practices are often motivated by subjective, case-by-case decisions, which hamper the definition of a recommended behaviour. In this chapter we argue that to date there are no effective solutions for supporting developers' decision-making process when deciding on an ontology reuse strategy. The objective is twofold: (i) to survey current approaches to ontology reuse, presenting motivations, strategies, benefits and limits, and (ii) to analyse two representative approaches and discuss their merits.