Benjamin Adams

h-index21

9papers

1,225citations

Novelty25%

AI Score31

Ranked #132,979 of 194,257 authors (top 68%)#24,033 in CL (top 78%)

9 Papers

21.4CLAug 20, 2023

cantnlp@LT-EDI-2023: Homophobia/Transphobia Detection in Social Media Comments using Spatio-Temporally Retrained Language Models

Sidney G. -J. Wong, Matthew Durward, Benjamin Adams et al.

This paper describes our multiclass classification system developed as part of the LTEDI@RANLP-2023 shared task. We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions: English, Spanish, Hindi, Malayalam, and Tamil. We retrained a transformer-based crosslanguage pretrained language model, XLMRoBERTa, with spatially and temporally relevant social media language data. We also retrained a subset of models with simulated script-mixed social media language data with varied performance. We developed the best performing seven-label classification system for Malayalam based on weighted macro averaged F1 score (ranked first out of six) with variable performance for other language and class-label conditions. We found the inclusion of this spatio-temporal data improved the classification performance for all language and task conditions when compared with the baseline. The results suggests that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.

0.9CLAug 21, 2023

Comparing Measures of Linguistic Diversity Across Social Media Language Data and Census Data at Subnational Geographic Areas

Sidney G. -J. Wong, Jonathan Dunn, Benjamin Adams

This paper describes a preliminary study on the comparative linguistic ecology of online spaces (i.e., social media language data) and real-world spaces in Aotearoa New Zealand (i.e., subnational administrative areas). We compare measures of linguistic diversity between these different spaces and discuss how social media users align with real-world populations. The results from the current study suggests that there is potential to use online social media language data to observe spatial and temporal changes in linguistic diversity at subnational geographic areas; however, further work is required to understand how well social media represents real-world behaviour.

24.1CLMar 16, 2024

Pre-Trained Language Models Represent Some Geographic Populations Better Than Others

Jonathan Dunn, Benjamin Adams, Harish Tayyar Madabushi

This paper measures the skew in how well two families of LLMs represent diverse geographic populations. A spatial probing task is used with geo-referenced corpora to measure the degree to which pre-trained language models from the OPT and BLOOM series represent diverse populations around the world. Results show that these models perform much better for some populations than others. In particular, populations across the US and the UK are represented quite well while those in South and Southeast Asia are poorly represented. Analysis shows that both families of models largely share the same skew across populations. At the same time, this skew cannot be fully explained by sociolinguistic factors, economic factors, or geographic factors. The basic conclusion from this analysis is that pre-trained models do not equally represent the world's population: there is a strong skew towards specific geographic populations. This finding challenges the idea that a single model can be used for all populations.

9.4LGJul 5, 2025

KEA Explain: Explanations of Hallucinations using Graph Kernel Analysis

Reilly Haskins, Benjamin Adams

Large Language Models (LLMs) frequently generate hallucinations: statements that are syntactically plausible but lack factual grounding. This research presents KEA (Kernel-Enriched AI) Explain: a neurosymbolic framework that detects and explains such hallucinations by comparing knowledge graphs constructed from LLM outputs with ground truth data from Wikidata or contextual documents. Using graph kernels and semantic clustering, the method provides explanations for detected hallucinations, ensuring both robustness and interpretability. Our framework achieves competitive accuracy in detecting hallucinations across both open- and closed-domain tasks, and is able to generate contrastive explanations, enhancing transparency. This research advances the reliability of LLMs in high-stakes domains and provides a foundation for future work on precision improvements and multi-source knowledge integration.

1.2DLSep 16, 2021

Quoka Atlas of Scholarly Knowledge Production: An Interactive Sensemaking Tool for Exploring the Outputs of Research Institutions

Benjamin Adams, Richard Hosking

The vast amount of research produced at institutions world-wide is extremely diverse, and coarse-grained quantitative measures of impact often obscure the individual contributions of these institutions to specific research fields and topics. We show that by applying an information retrieval model to index research articles which are faceted by institution and time, we can develop tools to rank institutions given a keyword query. We present an interactive atlas, Quoka, designed to enable a user to explore these rankings contextually by geography and over time. Through a set of use cases we demonstrate that the atlas can be used to perform sensemaking tasks to learn and collect information about the relationships between institutions and scholarly knowledge production.

35.0CLApr 3, 2021

Measuring Linguistic Diversity During COVID-19

Jonathan Dunn, Tom Coupe, Benjamin Adams

Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-referenced social media and web data. The goal, however, has been to describe these corpora themselves rather than to make inferences about underlying populations. This paper shows that a difference-in-differences method based on the Herfindahl-Hirschman Index can identify the bias in digital corpora that is introduced by non-local populations. These methods tell us where significant changes have taken place and whether this leads to increased or decreased diversity. This is an important step in aligning digital corpora like social media with the real-world populations that have produced them.

1.2SIOct 18, 2020

We Need to Rethink How We Describe and Organize Spatial Information: Instrumenting and Observing the Community of Users to Improve Data Description and Discovery

Benjamin Adams, Mark Gahegan

In Spatial Data Infrastructure or Cyber Infrastructure, the description of geographic data semantics is intended to support data discovery, reuse and integration. In the vast majority of cases the producers of these data generate descriptions based on particular understandings of what uses the data are good for. This producer-oriented perspective means that the descriptions often do not help to answer the question of whether a data set is of use for a consumer who might want to apply it in a different context. In this paper, we discuss the role geographic information observatories can play in providing an infrastructure for observing the context of data use by consumers. These observations of data pragmatics lead to operational statistical methods that will support better fitness-for-use assessment. Finally, we highlight some of the challenges to building these observatories, and briefly discuss strategies to address those challenges.

1.3CLApr 2, 2020

Mapping Languages and Demographics with Georeferenced Corpora

Jonathan Dunn, Ben Adams

This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r=0.60 (social media) and r=0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.

2.4AIJun 27, 2012

The observational roots of reference of the semantic web

Simon Scheider, Krzysztof Janowicz, Benjamin Adams

Shared reference is an essential aspect of meaning. It is also indispensable for the semantic web, since it enables to weave the global graph, i.e., it allows different users to contribute to an identical referent. For example, an essential kind of referent is a geographic place, to which users may contribute observations. We argue for a human-centric, operational approach towards reference, based on respective human competences. These competences encompass perceptual, cognitive as well as technical ones, and together they allow humans to inter-subjectively refer to a phenomenon in their environment. The technology stack of the semantic web should be extended by such operations. This would allow establishing new kinds of observation-based reference systems that help constrain and integrate the semantic web bottom-up.