AIOct 6, 2022
A Review of Multilingualism in and for OntologiesFrances Gillis-Webber, C. Maria Keet
The Multilingual Semantic Web has been in focus for over a decade. Multilingualism in Linked Data and RDF has shown substantial adoption, but this is unclear for ontologies since the last review 15 years ago. One of the design goals for OWL was internationalisation, with the aim that an ontology is usable across languages and cultures. Much research to improve on multilingual ontologies has taken place in the meantime, and presumably multilingual linked data could use multilingual ontologies. Therefore, this review seeks to (i) elucidate and compare the modelling options for multilingual ontologies, (ii) examine extant ontologies for their multilingualism, and (iii) evaluate ontology editors for their ability to manage a multilingual ontology. Nine different principal approaches for modelling multilinguality in ontologies were identified, which fall into either of the following approaches: using multilingual labels, linguistic models, or a mapping-based approach. They are compared on design by means of an ad hoc visualisation mode of modelling multilingual information for ontologies, shortcomings, and what issues they aim to solve. For the ontologies, we extracted production-level and accessible ontologies from BioPortal and the LOV repositories, which had, at best, 6.77% and 15.74% multilingual ontologies, respectively, where most of them have only partial translations and they all use a labels-based approach only. Based on a set of nine tool requirements for managing multilingual ontologies, the assessment of seven relevant ontology editors showed that there are significant gaps in tooling support, with VocBench 3 nearest of meeting them all. This stock-taking may function as a new baseline and motivate new research directions for multilingual ontologies.
CLAug 1, 2023
CoSMo: A constructor specification language for Abstract Wikipedia's content selection processKutz Arrieta, Pablo R. Fillottrani, C. Maria Keet
Representing snippets of information abstractly is a task that needs to be performed for various purposes, such as database view specification and the first stage in the natural language generation pipeline for generative AI from structured input, i.e., the content selection stage to determine what needs to be verbalised. For the Abstract Wikipedia project, requirements analysis revealed that such an abstract representation requires multilingual modelling, content selection covering declarative content and functions, and both classes and instances. There is no modelling language that meets either of the three features, let alone a combination. Following a rigorous language design process inclusive of broad stakeholder consultation, we created CoSMo, a novel {\sc Co}ntent {\sc S}election {\sc Mo}deling language that meets these and other requirements so that it may be useful both in Abstract Wikipedia as well as other contexts. We describe the design process, rationale and choices, the specification, and preliminary evaluation of the language.
CLOct 21, 2022
Bootstrapping NLP tools across low-resourced African languages: an overview and prospectsC. Maria Keet
Computing and Internet access are substantially growing markets in Southern Africa, which brings with it increasing demands for local content and tools in indigenous African languages. Since most of those languages are low-resourced, efforts have gone into the notion of bootstrapping tools for one African language from another. This paper provides an overview of these efforts for Niger-Congo B (`Bantu') languages. Bootstrapping grammars for geographically distant languages has been shown to still have positive outcomes for morphology and rules or grammar-based natural language generation. Bootstrapping with data-driven approaches to NLP tasks is difficult to use meaningfully regardless geographic proximity, which is largely due to lexical diversity due to both orthography and vocabulary. Cladistic approaches in comparative linguistics may inform bootstrapping strategies and similarity measures might serve as proxy for bootstrapping potential as well, with both fertile ground for further research.
CLSep 29, 2023
Contextualising Levels of Language Resourcedness that affect NLP tasksC. Maria Keet, Langa Khumalo
Several widely used software applications involve some form of processing of natural language, with tasks ranging from digitising hardcopies and text processing to speech generation. Varied language resources are used to develop software systems to accomplish a wide range of natural language processing (NLP) tasks, such as the ubiquitous spellcheckers and chatbots. Languages are typically characterised as either low (LRL) or high resourced languages (HRL) with African languages having been characterised as resource-scarce languages and English by far the most well-resourced language. But what lies in-between? We argue that the dichotomous typology of LRL and HRL for all languages is problematic. Through a clear understanding of language resources situated in a society, a matrix is developed that characterises languages as Very LRL, LRL, RL, HRL and Very HRL. The characterisation is based on the typology of contextual features for each category, rather than counting tools. The motivation is provided for each feature and each characterisation. The contextualisation of resourcedness, with a focus on African languages in this paper, and an increased understanding of where on the scale the language used in a project is, may assist in, among others, better planning of research and implementation projects. We thus argue in this paper that the characterisation of language resources within a given scale in a project is an indispensable component, particularly for those in the lower half of the scale.
CYJul 21, 2020Code
Why a computer program is a functional wholeC. Maria Keet
Sharing, downloading, and reusing software is common-place, some of which is carried out legally with open source software. When it is not legal, it is unclear just how many copyright infringements and trade secret violations have taken place: does an infringement count for the artefact as a whole or perhaps for each file of the program? To answer this question, it must first be established whether a program should be considered as an integral whole, a collection, or a mere set of distinct files, and why. We argue that a program is a functional whole, availing of, and combining, arguments from mereology, granularity, modularity, unity, and function to substantiate the claim. The argumentation and answer contributes to the ontology of software artefacts, may assist industry in litigation cases, and demonstrates that the notion of unifying relation is operationalisable. Indirectly, it provides support for continued modular design of artefacts following established engineering practices.
AIDec 18, 2024
Discerning and Characterising Types of Competency Questions for OntologiesC. Maria Keet, Zubeida Casmod Khan
Competency Questions (CQs) are widely used in ontology development by guiding, among others, the scoping and validation stages. However, very limited guidance exists for formulating CQs and assessing whether they are good CQs, leading to issues such as ambiguity and unusable formulations. To solve this, one requires insight into the nature of CQs for ontologies and their constituent parts, as well as which ones are not. We aim to contribute to such theoretical foundations in this paper, which is informed by analysing questions, their uses, and the myriad of ontology development tasks. This resulted in a first Model for Competency Questions, which comprises five main types of CQs, each with a different purpose: Scoping (SCQ), Validating (VCQ), Foundational (FCQ), Relationship (RCQ), and Metaproperty (MpCQ) questions. This model enhances the clarity of CQs and therewith aims to improve on the effectiveness of CQs in ontology development, thanks to their respective identifiable distinct constituent elements. We illustrate and evaluate them with a user story and demonstrate where which type can be used in ontology development tasks. To foster use and research, we created an annotated repository of 438 CQs, the Repository of Ontology Competency QuestionS (ROCQS), incorporating an existing CQ dataset and new CQs and CQ templates, which further demonstrate distinctions among types of CQs.
CLJun 6, 2021
An evaluation of template and ML-based generation of user-readable text from a knowledge graphZola Mahlaza, C. Maria Keet, Jarryd Dunn et al.
Typical user-friendly renderings of knowledge graphs are visualisations and natural language text. Within the latter HCI solution approach, data-driven natural language generation systems receive increased attention, but they are often outperformed by template-based systems due to suffering from errors such as content dropping, hallucination, or repetition. It is unknown which of those errors are associated significantly with low quality judgements by humans who the text is aimed for, which hampers addressing errors based on their impact on improving human evaluations. We assessed their possible association with an experiment availing of expert and crowdsourced evaluations of human authored text, template generated text, and sequence-to-sequence model generated text. The results showed that there was no significant association between human authored texts with errors and the low human judgements of naturalness and quality. There was also no significant association between machine learning generated texts with dropped or hallucinated slots and the low human judgements of naturalness and quality. Thus, both approaches appear to be viable options for designing a natural language interface for knowledge graphs.
AIJan 20, 2021
Bias in ontologies -- a preliminary assessmentC. Maria Keet
Logical theories in the form of ontologies and similar artefacts in computing and IT are used for structuring, annotating, and querying data, among others, and therewith influence data analytics regarding what is fed into the algorithms. Algorithmic bias is a well-known notion, but what does bias mean in the context of ontologies that provide a structuring mechanism for an algorithm's input? What are the sources of bias there and how would they manifest themselves in ontologies? We examine and enumerate types of bias relevant for ontologies, and whether they are explicit or implicit. These eight types are illustrated with examples from extant production-level ontologies and samples from the literature. We then assessed three concurrently developed COVID-19 ontologies on bias and detected different subsets of types of bias in each one, to a greater or lesser extent. This first characterisation aims contribute to a sensitisation of ethical aspects of ontologies primarily regarding representation of information and knowledge.
CYMar 2, 2020
Toward equipping Artificial Moral Agents with multiple ethical theoriesGeorge Rautenbach, C. Maria Keet
Artificial Moral Agents (AMA's) is a field in computer science with the purpose of creating autonomous machines that can make moral decisions akin to how humans do. Researchers have proposed theoretical means of creating such machines, while philosophers have made arguments as to how these machines ought to behave, or whether they should even exist. Of the currently theorised AMA's, all research and design has been done with either none or at most one specified normative ethical theory as basis. This is problematic because it narrows down the AMA's functional ability and versatility which in turn causes moral outcomes that a limited number of people agree with (thereby undermining an AMA's ability to be moral in a human sense). As solution we design a three-layer model for general normative ethical theories that can be used to serialise the ethical views of people and businesses for an AMA to use during reasoning. Four specific ethical norms (Kantianism, divine command theory, utilitarianism, and egoism) were modelled and evaluated as proof of concept for normative modelling. Furthermore, all models were serialised to XML/XSD as proof of support for computerisation.
AIJul 17, 2019
CLaRO: a Data-driven CNL for Specifying Competency QuestionsC. Maria Keet, Zola Mahlaza, Mary-Jane Antia
Competency Questions (CQs) for an ontology and similar artefacts aim to provide insights into the contents of an ontology and to demarcate its scope. The absence of a controlled natural language, tooling and automation to support the authoring of CQs has hampered their effective use in ontology development and evaluation. The few question templates that exists are based on informal analyses of a small number of CQs and have limited coverage of question types and sentence constructions. We aim to fill this gap by proposing a template-based CNL to author CQs, called CLaRO. For its design, we exploited a new dataset of 234 CQs that had been processed automatically into 106 patterns, which we analysed and used to design a template-based CNL, with an additional CNL model and XML serialisation. The CNL was evaluated with a subset of questions from the original dataset and with two sets of newly sourced CQs. The coverage of CLaRO, with its 93 main templates and 41 linguistic variants, is about 90% for unseen questions. CLaRO has the potential to facilitate streamlining formalising ontology content requirements and, given that about one third of the competency questions in the test sets turned out to be invalid questions, assist in writing good questions.
AIDec 14, 2018
More Effective Ontology Authoring with Test-Driven DevelopmentC. Maria Keet, Kieren Davies, Agnieszka Lawrynowicz
Ontology authoring is a complex process, where commonly the automated reasoner is invoked for verification of newly introduced changes, therewith amounting to a time-consuming test-last approach. Test-Driven Development (TDD) for ontology authoring is a recent {\em test-first} approach that aims to reduce authoring time and increase authoring efficiency. Current TDD testing falls short on coverage of OWL features and possible test outcomes, the rigorous foundation thereof, and evaluations to ascertain its effectiveness. We aim to address these issues in one instantiation of TDD for ontology authoring. We first propose a succinct, logic-based model of TDD testing and present novel TDD algorithms so as to cover also any OWL 2 class expression for the TBox and for the principal ABox assertions, and prove their correctness. The algorithms use methods from the OWL API directly such that reclassification is not necessary for test execution, therewith reducing ontology authoring time. The algorithms were implemented in TDDonto2, a Protégé plugin. TDDonto2 was evaluated on editing efficiency and by users. The editing efficiency study demonstrated that it is faster than a typical ontology authoring interface, especially for medium size and large ontologies. The user evaluation demonstrated that modellers make significantly less errors with TDDonto2 compared to the standard Protégé interface and complete their tasks better using less time. Thus, the results indicate that Test-Driven Development is a promising approach in an ontology development methodology.
AINov 23, 2018
Competency Questions and SPARQL-OWL Queries Dataset and AnalysisDawid Wisniewski, Jedrzej Potoniec, Agnieszka Lawrynowicz et al.
Competency Questions (CQs) are natural language questions outlining and constraining the scope of knowledge represented by an ontology. Despite that CQs are a part of several ontology engineering methodologies, we have observed that the actual publication of CQs for the available ontologies is very limited and even scarcer is the publication of their respective formalisations in terms of, e.g., SPARQL queries. This paper aims to contribute to addressing the engineering shortcomings of using CQs in ontology development, to facilitate wider use of CQs. In order to understand the relation between CQs and the queries over the ontology to test the CQs on an ontology, we gather, analyse, and publicly release a set of 234 CQs and their translations to SPARQL-OWL for several ontologies in different domains developed by different groups. We analysed the CQs in two principal ways. The first stage focused on a linguistic analysis of the natural language text itself, i.e., a lexico-syntactic analysis without any presuppositions of ontology elements, and a subsequent step of semantic analysis in order to find patterns. This increased diversity of CQ sources resulted in a 5-fold increase of hitherto published patterns, to 106 distinct CQ patterns, which have a limited subset of few patterns shared across the CQ sets from the different ontologies. Next, we analysed the relation between the found CQ patterns and the 46 SPARQL-OWL query signatures, which revealed that one CQ pattern may be realised by more than one SPARQL-OWL query signature, and vice versa. We hope that our work will contribute to establishing common practices, templates, automation, and user tools that will support CQ formulation, formalisation, execution, and general management.
AISep 9, 2018
Evidence-based lean logic profiles for conceptual data modelling languagesPablo Rubén Fillottrani, C. Maria Keet
Multiple logic-based reconstructions of conceptual data modelling languages such as EER, UML Class Diagrams, and ORM exist. They mainly cover various fragments of the languages and none are formalised such that the logic applies simultaneously for all three modelling language families as unifying mechanism. This hampers interchangeability, interoperability, and tooling support. In addition, due to the lack of a systematic design process of the logic used for the formalisation, hidden choices permeate the formalisations that have rendered them incompatible. We aim to address these problems, first, by structuring the logic design process in a methodological way. We generalise and extend the DSL design process to apply to logic language design more generally and, in particular, by incorporating an ontological analysis of language features in the process. Second, we specify minimal logic profiles availing of this extended process, including the ontological commitments embedded in the languages, of evidence gathered of language feature usage, and of computational complexity insights from Description Logics (DL). The profiles characterise the essential logic structure needed to handle the semantics of conceptual models, therewith enabling the development of interoperability tools. There is no known DL language that matches exactly the features of those profiles and the common core is small (in the tractable DL $\mathcal{ALNI}$). Although hardly any inconsistencies can be derived with the profiles, it is promising for scalable runtime use of conceptual data models.
CLDec 20, 2016
Grammar rules for the isiZulu complex verbC. Maria Keet, Langa Khumalo
The isiZulu verb is known for its morphological complexity, which is a subject for on-going linguistics research, as well as for prospects of computational use, such as controlled natural language interfaces, machine translation, and spellcheckers. To this end, we seek to answer the question as to what the precise grammar rules for the isiZulu complex verb are (and, by extension, the Bantu verb morphology). To this end, we iteratively specify the grammar as a Context Free Grammar, and evaluate it computationally. The grammar presented in this paper covers the subject and object concords, negation, present tense, aspect, mood, and the causative, applicative, stative, and the reciprocal verbal extensions, politeness, the wh-question modifiers, and aspect doubling, ensuring their correct order as they appear in verbs. The grammar conforms to specification.
CLAug 10, 2016
An assessment of orthographic similarity measures for several African languagesC. Maria Keet
Natural Language Interfaces and tools such as spellcheckers and Web search in one's own language are known to be useful in ICT-mediated communication. Most languages in Southern Africa are under-resourced, however. Therefore, it would be very useful if both the generic and the few language-specific NLP tools could be reused or easily adapted across languages. This depends on the notion, and extent, of similarity between the languages. We assess this from the angle of orthography and corpora. Twelve versions of the Universal Declaration of Human Rights (UDHR) are examined, showing clusters of languages, and which are thus more or less amenable to cross-language adaptation of NLP tools, which do not match with Guthrie zones. To examine the generalisability of these results, we zoom in on isiZulu both quantitatively and qualitatively with four other corpora and texts in different genres. The results show that the UDHR is a typical text document orthographically. The results also provide insight into usability of typical measures such as lexical diversity and genre, and that the same statistic may mean different things in different documents. While NLTK for Python could be used for basic analyses of text, it, and similar NLP tools, will need considerable customization.
AIDec 19, 2015
Test-Driven Development of ontologies (extended version)C. Maria Keet, Agnieszka Lawrynowicz
Emerging ontology authoring methods to add knowledge to an ontology focus on ameliorating the validation bottleneck. The verification of the newly added axiom is still one of trying and seeing what the reasoner says, because a systematic testbed for ontology authoring is missing. We sought to address this by introducing the approach of test-driven development for ontology authoring. We specify 36 generic tests, as TBox queries and TBox axioms tested through individuals, and structure their inner workings in an `open box'-way, which cover the OWL 2 DL language features. This is implemented as a Protege plugin so that one can perform a TDD test as a black box test. We evaluated the two test approaches on their performance. The TBox queries were faster, and that effect is more pronounced the larger the ontology is. We provide a general sequence of a TDD process for ontology engineering as a foundation for a TDD methodology.
AIDec 19, 2014
KF metamodel formalizationPablo R. Fillottrani, C. Maria Keet
The KF metamodel is a comprehensive unifying metamodel covering the static structural entities and constraints of UML Class Diagrams (v2.4.1), ER, EER, ORM, and ORM2, and intended to boost interoperability of common conceptual data modelling languages. It was originally designed in UML with textual constraints, and in this report we present its formalisations in FOL and OWL, which accompanies the paper that describes, discusses, and analyses the KF metamodel in detail. These new formalizations contribute to give a precise meaning to the metamodel, to understand its complexity properties and to provide a basis for future implementations.
CLJun 7, 2014
Toward verbalizing ontologies in isiZuluC. Maria Keet, Langa Khumalo
IsiZulu is one of the eleven official languages of South Africa and roughly half the population can speak it. It is the first (home) language for over 10 million people in South Africa. Only a few computational resources exist for isiZulu and its related Nguni languages, yet the imperative for tool development exists. We focus on natural language generation, and the grammar options and preferences in particular, which will inform verbalization of knowledge representation languages and could contribute to machine translation. The verbalization pattern specification shows that the grammar rules are elaborate and there are several options of which one may have preference. We devised verbalization patterns for subsumption, basic disjointness, existential and universal quantification, and conjunction. This was evaluated in a survey among linguists and non-linguists. Some differences between linguists and non-linguists can be observed, with the former much more in agreement, and preferences depend on the overall structure of the sentence, such as singular for subsumption and plural in other cases.