CL · May 27
Challenges in Explaining Pretrained Clinical Text Classifiers
Kristian Miok, Matej Klemen, Blaz Škrlj et al.
Explaining the predictions of neural models in clinical NLP remains a significant challenge, especially for complex tasks involving long, unstructured medical texts. While post-hoc methods like LIME and SHAP are widely used, they often fall short when applied to clinical narratives. In this paper, we identify core limitations of token-level and perturbation-based explanation techniques through targeted demonstrations on a hospital length-of-stay prediction task. Our findings reveal issues such as overemphasis on non-informative tokens, instability in attributions, and high-confidence predictions for incoherent input variants. These results underscore the need for explanation strategies that are clinically meaningful, semantically grounded, and robust to linguistic noise.
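To make the idea of a token-level, perturbation-based explanation concrete, the following is a minimal sketch of leave-one-out (occlusion) attribution. It is neither LIME nor SHAP, and it is not the authors' setup: `toy_predict_proba` and its keyword set are hypothetical stand-ins for a real clinical classifier.

```python
from typing import Callable, List, Sequence


def occlusion_attributions(
    tokens: Sequence[str],
    predict_proba: Callable[[Sequence[str]], float],
) -> List[float]:
    """Score each token by how much the prediction drops when it is removed."""
    base = predict_proba(tokens)
    scores = []
    for i in range(len(tokens)):
        perturbed = list(tokens[:i]) + list(tokens[i + 1:])
        scores.append(base - predict_proba(perturbed))
    return scores


def toy_predict_proba(tokens: Sequence[str]) -> float:
    """Hypothetical classifier: probability rises with each keyword hit."""
    keywords = {"sepsis", "ventilator"}  # assumption, for illustration only
    hits = sum(token in keywords for token in tokens)
    return min(1.0, 0.2 + 0.3 * hits)


if __name__ == "__main__":
    note = ["patient", "with", "sepsis"]
    for token, score in zip(note, occlusion_attributions(note, toy_predict_proba)):
        print(f"{token}: {score:+.2f}")
```

Because each score depends on a single deletion, methods of this kind can react strongly to small changes in the input, which is one way to probe the instability the abstract describes.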
CL · Feb 2
Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages
Tjaša Arčon, Matej Klemen, Marko Robnik-Šikonja et al.
Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.
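The evaluation protocol named above (accuracy and macro F1, compared against majority-class and chance baselines) can be sketched in a few lines. This is an illustrative sketch, not the benchmark's code: the labels in the example are invented, and the chance baseline assumes uniform guessing over the labels that occur in the gold data.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def majority_baseline(gold: Sequence[str]) -> List[str]:
    """Predict the most frequent gold label for every item."""
    label = Counter(gold).most_common(1)[0][0]
    return [label] * len(gold)


def chance_accuracy(gold: Sequence[str]) -> float:
    """Expected accuracy of uniform random guessing over the gold labels."""
    return 1 / len(set(gold))


if __name__ == "__main__":
    gold = ["SVO", "SVO", "SVO", "SOV", "VSO"]  # invented example labels
    pred = ["SVO", "SOV", "SVO", "SOV", "SVO"]
    print("model accuracy:", accuracy(gold, pred))
    print("model macro F1:", macro_f1(gold, pred))
    print("majority accuracy:", accuracy(gold, majority_baseline(gold)))
    print("chance accuracy:", chance_accuracy(gold))
```

On skewed label distributions the majority-class baseline can match or exceed a model's accuracy while scoring poorly on macro F1, which is why the two metrics are reported together.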
CL · Aug 22, 2022
Review of Natural Language Processing in Pharmacology
Dimitar Trajanov, Vangel Trajkovski, Makedonka Dimitrieva et al.
Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adverse drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.
CL · May 21, 2024
Code-mixed Sentiment and Hate-speech Prediction
Anjali Yadav, Tanya Garg, Matej Klemen et al.
Code-mixed discourse combines multiple languages in a single text. It is commonly used in informal discourse in countries with several official languages, but also in many other countries in combination with English or neighboring languages. As large language models have recently come to dominate most natural language processing tasks, we investigated their performance in code-mixed settings for relevant tasks. We first created four new bilingual pre-trained masked language models for the English-Hindi and English-Slovene language pairs, specifically aimed at supporting informal language. Then we evaluated monolingual, bilingual, few-lingual, and massively multilingual models on several languages, using two tasks that frequently contain code-mixed text: sentiment analysis and offensive language detection in social media texts. The results show that the most successful classifiers are fine-tuned bilingual models and multilingual models specialized for social media texts, followed by non-specialized massively multilingual and monolingual models, while huge generative models are not competitive. For our affective problems, the models mostly perform slightly better on code-mixed data than on non-code-mixed data.
CL · Nov 28, 2025
Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis
Matej Klemen, Tjaša Arčon, Luka Terčon et al.
Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance across three complementary dimensions - dominant-order accuracy, order-coverage completeness, and distributional fidelity - which reflect how well the system generalizes, identifies, and quantifies word-order variations. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.
CL · Oct 30, 2024
Neural spell-checker: Beyond words with synthetic data generation
Matej Klemen, Martin Božič, Špela Arhar Holdt et al.
Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts. Recent improvements in deep learning, and in particular in large language models, have opened new opportunities to extend traditional spell-checkers with functionalities that assess not only spelling correctness but also the suitability of a word for a given context. In our work, we present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets. The first spell-checker is a traditional, fast, word-based approach, based on a morphological lexicon with a significantly larger word list than existing spell-checkers. The second approach uses a language model trained on a large corpus with synthetically inserted errors. We present the training data construction strategies, which turn out to be a crucial component of neural spell-checkers. Further, the proposed neural model significantly outperforms all existing spell-checkers for Slovene in both precision and recall.
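As a rough illustration of training a model on text with synthetically inserted errors, the sketch below corrupts words with random character edits and emits per-token error labels. The edit operations, the `error_rate` parameter, and the ASCII-only alphabet are assumptions made for this example; the abstract does not specify the authors' construction strategies.

```python
import random
import string
from typing import List, Tuple


def corrupt_word(word: str, rng: random.Random) -> str:
    """Apply one random character edit: delete, insert, substitute, or transpose."""
    if len(word) < 2:
        return word
    op = rng.choice(["delete", "insert", "substitute", "transpose"])
    i = rng.randrange(len(word))
    letter = rng.choice(string.ascii_lowercase)  # assumption: ASCII letters only
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + letter + word[i:]
    if op == "substitute":
        return word[:i] + letter + word[i + 1:]
    j = min(i, len(word) - 2)  # keep the swapped pair inside the word
    return word[:j] + word[j + 1] + word[j] + word[j + 2:]


def make_training_example(
    sentence: str, error_rate: float, rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Return (noisy tokens, labels), where label 1 marks a changed token."""
    tokens, labels = [], []
    for word in sentence.split():
        noisy = corrupt_word(word, rng) if rng.random() < error_rate else word
        tokens.append(noisy)
        labels.append(int(noisy != word))  # an edit can leave the word unchanged
    return tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)
    print(make_training_example("the quick brown fox jumps", 0.5, rng))
```

Labels are derived by comparing the noisy token with the original, so an edit that happens to reproduce the same word is not marked as an error.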
CL · Nov 13, 2021
Extracting and filtering paraphrases by bridging natural language inference and paraphrasing
Matej Klemen, Marko Robnik-Šikonja
Paraphrasing is a useful natural language processing task that can contribute to more diverse generated or translated texts. Natural language inference (NLI) and paraphrasing share some similarities and can benefit from a joint approach. We propose a novel methodology for the extraction of paraphrasing datasets from NLI datasets and cleaning existing paraphrasing datasets. Our approach is based on bidirectional entailment; namely, if two sentences can be mutually entailed, they are paraphrases. We evaluate our approach using several large pretrained transformer language models in the monolingual and cross-lingual setting. The results show high quality of extracted paraphrasing datasets and surprisingly high noise levels in two existing paraphrasing datasets.
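The bidirectional-entailment criterion described above can be sketched as a simple filter: a sentence pair is kept only if entailment holds in both directions. The `entails` callable stands in for an NLI model's decision, and `toy_entails` is a deliberately crude, hypothetical placeholder used only to make the example runnable.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]


def extract_paraphrases(
    pairs: Iterable[Pair],
    entails: Callable[[str, str], bool],
) -> List[Pair]:
    """Keep pairs (a, b) where a entails b and b entails a."""
    return [(a, b) for a, b in pairs if entails(a, b) and entails(b, a)]


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Hypothetical stand-in for an NLI model: word-set containment."""
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


if __name__ == "__main__":
    candidates = [
        ("a dog runs fast", "fast runs a dog"),  # entailment in both directions
        ("a dog runs fast", "a dog runs"),       # entailment in one direction only
    ]
    print(extract_paraphrases(candidates, toy_entails))
```

In practice `entails` would wrap a pretrained NLI classifier and threshold its entailment probability; the same function can filter an existing paraphrase dataset by dropping pairs that fail the mutual check.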
CL · Nov 24, 2020
Enhancing deep neural networks with morphological information
Matej Klemen, Luka Krsnik, Marko Robnik-Šikonja
Deep learning approaches are superior in NLP due to their ability to extract informative features and patterns from languages. The two most successful neural architectures are LSTM and transformers, used in large pretrained language models such as BERT. While cross-lingual approaches are on the rise, most current NLP techniques are designed for and applied to English, and less-resourced languages are lagging behind. In morphologically rich languages, information is conveyed through morphology, e.g., through affixes modifying stems of words. Existing neural approaches do not explicitly use the information on word morphology. We analyse the effect of adding morphological features to LSTM and BERT models. As a testbed, we use three tasks available in many less-resourced languages: named entity recognition (NER), dependency parsing (DP), and comment filtering (CF). We construct baselines involving LSTM and BERT models, which we adjust by adding additional input in the form of part of speech (POS) tags and universal features. We compare models across several languages from different language families. Our results suggest that adding morphological features has mixed effects depending on the quality of features and the task. The features improve the performance of LSTM-based models on the NER and DP tasks, while they do not benefit the performance on the CF task. For BERT-based models, the morphological features only improve the performance on DP when they are of high quality, while showing no practical improvement when they are predicted. Even for high-quality features, the improvements are less pronounced in language-specific BERT variants than in massively multilingual BERT models. Because manually checked features are not available in the NER and CF datasets, we only experiment with predicted features there and find that they do not cause any practical improvement in performance.
SI · Jun 22, 2019
Predicting kills in Game of Thrones using network properties
Jaka Stavanja, Matej Klemen, Lovro Šubelj
TV series such as HBO's Game of Thrones have attracted many dedicated followers, mostly due to the dramatic murders of the most important characters. In our work, we try to predict killer and victim pairs using data about previous kills and additional metadata. We construct a network where two character nodes are linked if one killed the other and use a link prediction framework to evaluate different techniques for kill predictions. Lastly, we compute various network properties on a social network of characters and use them as features in conjunction with classic data mining techniques. Due to the small size of the dataset and the somewhat random kill distribution, we cannot predict much with standard indices alone, although using them in conjunction with additional rules based on degrees works surprisingly well. The features we compute on the social network help the classic machine learning approaches, but do not yield very accurate predictions. The best results overall are achieved using indices that rely on simple degree information, the best of which achieves an area under the ROC curve (AUC) of 0.875.
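A degree-based link prediction index and its AUC evaluation can be sketched as follows. The example uses the standard preferential attachment index (the product of node degrees) on an undirected toy graph; this is an illustrative choice, not necessarily the index or the degree rules used in the paper, and the node names are placeholders.

```python
from collections import Counter
from itertools import product
from typing import Iterable, Sequence, Tuple

Edge = Tuple[str, str]


def degrees(edges: Iterable[Edge]) -> Counter:
    """Count how many observed edges touch each node (undirected view)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def preferential_attachment(deg: Counter, u: str, v: str) -> int:
    """Score a candidate link by the product of its endpoints' degrees."""
    return deg[u] * deg[v]  # Counter returns 0 for unseen nodes


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Probability that a held-out link outscores a non-link (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos_scores, neg_scores)
    )
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    observed = [("A", "B"), ("A", "C"), ("D", "E")]  # placeholder training edges
    held_out = [("A", "D")]                          # links hidden for evaluation
    non_links = [("B", "E"), ("C", "E")]             # sampled pairs without a link
    deg = degrees(observed)
    pos = [preferential_attachment(deg, u, v) for u, v in held_out]
    neg = [preferential_attachment(deg, u, v) for u, v in non_links]
    print("AUC:", auc(pos, neg))
```

The AUC here is computed by exhaustive pairwise comparison, which is practical for a small network such as the one described.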