LGJul 13, 2023Code
GRAN is superior to GraphRNN: node orderings, kernel- and graph embeddings-based metrics for graph generatorsOusmane Touat, Julian Stier, Pierre-Edouard Portier et al.
A wide variety of generative models for graphs have been proposed. They are used in drug discovery, road networks, neural architecture search, and program synthesis. Generating graphs has theoretical challenges, such as isomorphic representations -- evaluating how well a generative model performs is difficult. Which model to choose depending on the application domain? We extensively study kernel-based metrics on distributions of graph invariants and manifold-based and kernel-based metrics in graph embedding space. Manifold-based metrics outperform kernel-based metrics in embedding space. We use these metrics to compare GraphRNN and GRAN, two well-known generative models for graphs, and unveil the influence of node orderings. It shows the superiority of GRAN over GraphRNN - further, our proposed adaptation of GraphRNN with a depth-first search ordering is effective for small-sized graphs. A guideline on good practices regarding dataset selection and node feature initialization is provided. Our work is accompanied by open-source code and reproducible experiments.
CLAug 3, 2023
Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text RatingsVeronika Hackl, Alexandra Elena Müller, Michael Granitzer et al.
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4, a state-of-the-art artificial intelligence language model, across multiple iterations, time spans and stylistic variations. The model rated responses to tasks within the Higher Education (HE) subject domain of macroeconomics in terms of their content and style. Statistical analysis was conducted in order to learn more about the interrater reliability, consistency of the ratings across iterations and the correlation between ratings in terms of content and style. The results revealed a high interrater reliability with ICC scores ranging between 0.94 and 0.99 for different timespans, suggesting that GPT-4 is capable of generating consistent ratings across repetitions with a clear prompt. Style and content ratings show a high correlation of 0.87. When applying a non-adequate style the average content ratings remained constant, while style ratings decreased, which indicates that the large language model (LLM) effectively distinguishes between these two criteria during evaluation. The prompt used in this study is furthermore presented and explained. Further research is necessary to assess the robustness and reliability of AI models in various use cases.
CLMay 20Code
GradeLegal: Automated Grading for German Legal CasesAbdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn et al.
Grading German legal exam solutions faces growing volumes and a shortage of qualified graders, delaying feedback and creating a bottleneck. At the same time, it is a high-stakes expert task, since state exam grades strongly influence career outcomes in Germany. Despite this practical relevance, literature lacks systematic studies on effective methods for grading legal exams. To address this gap, we investigate whether large language models (LLMs) can support the automated grading of German legal case solutions in criminal and public law, thereby enabling scalable feedback and student self-testing. We present a systematic evaluation of 27 proprietary and open-source LLMs, benchmarking prompting strategies that incrementally add task-related information, such as a sample solution and a grading rubric. Using quadratic weighted kappa (QWK), reasoning-oriented LLMs can approximate expert grading in public law when given a sample solution and a grading rubric (up to 0.91), compared to 0.60 in criminal law, suggesting a harder grading task in criminal law. Beyond single-model grading, ensembling improves agreement by up to 0.15 over its best member and can offer an alternative to stronger closed-source single models. In addition, our findings suggest that effective prompt design and model selection are necessary for reliable LLM-based grading of legal exams.
CLMar 7, 2023
German BERT Model for Legal Named Entity RecognitionHarshil Darji, Jelena Mitrović, Michael Granitzer
The use of BERT, one of the most popular language models, has led to improvements in many Natural Language Processing (NLP) tasks. One such task is Named Entity Recognition (NER) i.e. automatic identification of named entities such as location, person, organization, etc. from a given text. It is also an important base step for many NLP tasks such as information extraction and argumentation mining. Even though there is much research done on NER using BERT and other popular language models, the same is not explored in detail when it comes to Legal NLP or Legal Tech. Legal NLP applies various NLP techniques such as sentence similarity or NER specifically on legal data. There are only a handful of models for NER tasks using BERT language models, however, none of these are aimed at legal documents in German. In this paper, we fine-tune a popular BERT language model trained on German data (German BERT) on a Legal Entity Recognition (LER) dataset. To make sure our model is not overfitting, we performed a stratified 10-fold cross-validation. The results we achieve by fine-tuning German BERT on the LER dataset outperform the BiLSTM-CRF+ model used by the authors of the same LER dataset. Finally, we make the model openly available via HuggingFace.
LGApr 11, 2022
The Importance of Future Information in Credit Card Fraud DetectionVan Bach Nguyen, Kanishka Ghosh Dastidar, Michael Granitzer et al.
Fraud detection systems (FDS) mainly perform two tasks: (i) real-time detection while the payment is being processed and (ii) posterior detection to block the card retrospectively and avoid further frauds. Since human verification is often necessary and the payment processing time is limited, the second task manages the largest volume of transactions. In the literature, fraud detection challenges and algorithms performance are widely studied but the very formulation of the problem is never disrupted: it aims at predicting if a transaction is fraudulent based on its characteristics and the past transactions of the cardholder. Yet, in posterior detection, verification often takes days, so new payments on the card become available before a decision is taken. This is our motivation to propose a new paradigm: posterior fraud detection with "future" information. We start by providing evidence of the on-time availability of subsequent transactions, usable as extra context to improve detection. We then design a Bidirectional LSTM to make use of these transactions. On a real-world dataset with over 30 million transactions, it achieves higher performance than a regular LSTM, which is the state-of-the-art classifier for fraud detection that only uses the past context. We also introduce new metrics to show that the proposal catches more frauds, more compromised cards, and based on their earliest frauds. We believe that future works on this new paradigm will have a significant impact on the detection of compromised cards.
IRMar 28Code
Beyond the Click: A Framework for Inferring Cognitive Traces in SearchSaber Zerhoudi, Michael Granitzer
User simulators are essential for evaluating search systems, but they primarily reproduce user actions without modeling the underlying thought process. Large-scale interaction logs record what users do, but not what they might be thinking or feeling, such as confusion or satisfaction. We present a framework for inferring cognitive traces from behavioral logs. Our method uses a multi-agent LLM system grounded in Information Foraging Theory (IFT) and validated by human experts. We annotate three public datasets (AOL, Stack Overflow, and MovieLens), producing over 530,000 cognitive labels across 50,000 sessions. A cross-dataset evaluation with a shuffled-label control reveals that cognitive labels provide the strongest signal where behavioral features are weakest: on MovieLens, the cognitive model improves F1 by up to 6.6% over the behavioral baseline and 1.8% above the shuffled control, while on AOL, where click patterns are highly predictive, improvements are near zero. We release the annotation collection on HuggingFace, an open-source annotation tool, and all experimental code to support future work on cognitively aware user simulation.
AIJul 28, 2024Code
Mixture of Modular Experts: Distilling Knowledge from a Multilingual Teacher into Specialized Modular Language ModelsMohammed Al-Maamari, Mehdi Ben Amor, Michael Granitzer
This research combines Knowledge Distillation (KD) and Mixture of Experts (MoE) to develop modular, efficient multilingual language models. Key objectives include evaluating adaptive versus fixed alpha methods in KD and comparing modular MoE architectures for handling multi-domain inputs and preventing catastrophic forgetting. KD compresses large language models (LLMs) into smaller, efficient models, while MoE enhances modularity with specialized tasks. Experiments showed similar performance for both KD methods, with marginal improvements from adaptive alpha. A combined loss approach provided more stable learning. The router, trained to classify input sequences into English, French, German, or Python, achieved 99.95% precision, recall, and F1 score, with Logistic Regression being the most effective classifier. Evaluations of modular MoE architectures revealed that Pre-trained Language Experts (PLE) and Joint Expert Embedding Training (JEET) performed similarly, while the MoE with Common Expert (MoE-CE) setup showed slightly lower performance. Including a common expert in MoE-CE improved its performance. Studies on catastrophic forgetting indicated that sequential training led to significant forgetting, while single-session training with balanced batches and the MoE approach mitigated this issue. The MoE architecture preserved knowledge across multiple languages effectively. The research contributes open-sourced resources including the dataset (https://zenodo.org/doi/10.5281/zenodo.12677631), a balanced dataset creation tool (https://github.com/padas-lab-de/multi-language-dataset-creator), and the research codebase (https://github.com/ModMaamari/mixture-modular-experts).
CLApr 26, 2023
Technical Report: Impact of Position Bias on Language Models in Token ClassificationMehdi Ben Amor, Michael Granitzer, Jelena Mitrović
Language Models (LMs) have shown state-of-the-art performance in Natural Language Processing (NLP) tasks. Downstream tasks such as Named Entity Recognition (NER) or Part-of-Speech (POS) tagging are known to suffer from data imbalance issues, particularly regarding the ratio of positive to negative examples and class disparities. This paper investigates an often-overlooked issue of encoder models, specifically the position bias of positive examples in token classification tasks. For completeness, we also include decoders in the evaluation. We evaluate the impact of position bias using different position embedding techniques, focusing on BERT with Absolute Position Embedding (APE), Relative Position Embedding (RPE), and Rotary Position Embedding (RoPE). Therefore, we conduct an in-depth evaluation of the impact of position bias on the performance of LMs when fine-tuned on token classification benchmarks. Our study includes CoNLL03 and OntoNote5.0 for NER, English Tree Bank UD\_en, and TweeBank for POS tagging. We propose an evaluation approach to investigate position bias in transformer models. We show that LMs can suffer from this bias with an average drop ranging from 3\% to 9\% in their performance. To mitigate this effect, we propose two methods: Random Position Shifting and Context Perturbation, that we apply on batches during the training process. The results show an improvement of $\approx$ 2\% in the performance of the model on CoNLL03, UD\_en, and TweeBank.
CRSep 12, 2022
SmartKex: Machine Learning Assisted SSH Keys Extraction From The Heap DumpChristofer Fellicious, Stewart Sentanoe, Michael Granitzer et al.
Digital forensics is the process of extracting, preserving, and documenting evidence in digital devices. A commonly used method in digital forensics is to extract data from the main memory of a digital device. However, the main challenge is identifying the important data to be extracted. Several pieces of crucial information reside in the main memory, like usernames, passwords, and cryptographic keys such as SSH session keys. In this paper, we propose SmartKex, a machine-learning assisted method to extract session keys from heap memory snapshots of an OpenSSH process. In addition, we release an openly available dataset and the corresponding toolchain for creating additional data. Finally, we compare SmartKex with naive brute-force methods and empirically show that SmartKex can extract the session keys with high accuracy and high throughput. With the provided resources, we intend to strengthen the research on the intersection between digital forensics, cybersecurity, and machine learning.
CLMar 18Code
Toward Generalized Cross-Lingual Hateful Language Detection with Web-Scale Data and Ensemble LLM AnnotationsDang H. Dang, Jelena Mitrovi, Michael Granitzer
We study whether large-scale unlabelled web data and LLM-based synthetic annotations can improve multilingual hate speech detection. Starting from texts crawled via OpenWebSearch.eu~(OWS) in four languages (English, German, Spanish, Vietnamese), we pursue two complementary strategies. First, we apply continued pre-training to BERT models by continuing masked language modelling on unlabelled OWS texts before supervised fine-tuning, and show that this yields an average macro-F1 gain of approximately 3% over standard baselines across sixteen benchmarks, with stronger gains in low-resource settings. Second, we use four open-source LLMs (Mistral-7B, Llama3.1-8B, Gemma2-9B, Qwen2.5-14B) to produce synthetic annotations through three ensemble strategies: mean averaging, majority voting, and a LightGBM meta-learner. The LightGBM ensemble consistently outperforms the other strategies. Fine-tuning on these synthetic labels substantially benefits a small model (Llama3.2-1B: +11% pooled F1), but provides only a modest gain for the larger Qwen2.5-14B (+0.6%). Our results indicate that the combination of web-scale unlabelled data and LLM-ensemble annotations is the most valuable for smaller models and low-resource languages.
LGNov 11, 2025Code
Benevolent Dictators? On LLM Agent Behavior in Dictator GamesAndreas Einwiller, Kanishka Ghosh Dastidar, Artur Romazanov et al.
In behavioral sciences, experiments such as the ultimatum game are conducted to assess preferences for fairness or self-interest of study participants. In the dictator game, a simplified version of the ultimatum game where only one of two players makes a single decision, the dictator unilaterally decides how to split a fixed sum of money between themselves and the other player. Although recent studies have explored behavioral patterns of AI agents based on Large Language Models (LLMs) instructed to adopt different personas, we question the robustness of these results. In particular, many of these studies overlook the role of the system prompt - the underlying instructions that shape the model's behavior - and do not account for how sensitive results can be to slight changes in prompts. However, a robust baseline is essential when studying highly complex behavioral aspects of LLMs. To overcome previous limitations, we propose the LLM agent behavior study (LLM-ABS) framework to (i) explore how different system prompts influence model behavior, (ii) get more reliable insights into agent preferences by using neutral prompt variations, and (iii) analyze linguistic features in responses to open-ended instructions by LLM agents to better understand the reasoning behind their behavior. We found that agents often exhibit a strong preference for fairness, as well as a significant impact of the system prompt on their behavior. From a linguistic perspective, we identify that models express their responses differently. Although prompt sensitivity remains a persistent challenge, our proposed framework demonstrates a robust foundation for LLM agent behavior studies. Our code artifacts are available at https://github.com/andreaseinwiller/LLM-ABS.
SIMar 17
Form Without Function: Agent Social Behavior in the Moltbook NetworkSaber Zerhoudi, Kanishka Ghosh Dastidar, Felix Klement et al.
Moltbook is a social network where every participant is an AI agent. We analyze 1,312,238 posts, 6.7~million comments, and over 120,000 agent profiles across 5,400 communities, collected over 40 days (January 27 to March 9, 2026). We evaluate the platform through three layers. At the interaction layer, 91.4% of post authors never return to their own threads, 85.6% of conversations are flat (no reply ever receives a reply), the median time-to-first-comment is 55 seconds, and 97.3% of comments receive zero upvotes. Interaction reciprocity is 3.3%, compared to 22-60% on human platforms. An argumentation analysis finds that 64.6% of comment-to-post relations carry no argumentative connection. At the content layer, 97.9% of agents never post in a community matching their bio, 92.5% of communities contain every topic in roughly equal proportions, and over 80% of shared URLs point to the platform's own infrastructure. At the instruction layer, we use 41 Wayback Machine snapshots to identify six instruction changes during the observation window. Hard constraints (rate limit, content filters) produce immediate behavioral shifts. Soft guidance (``upvote good posts'', ``stay on topic'') is ignored until it becomes an explicit step in the executable checklist. The platform also poses technological risks. We document credential leaks (API keys, JWT tokens), 12,470 unique Ethereum addresses with 3,529 confirmed transaction histories, and attack discourse ranging from template-based SSH brute-forcing to multi-agent offensive security architectures. These persist unmoderated because the quality-filtering mechanisms are themselves non-functional. Moltbook is a socio-technical system where the technical layer responds to changes, but the social layer largely fails to emerge. The form of social media is reproduced in full. The function is absent.
IRApr 29Code
AgentSim: A Platform for Verifiable Agent-Trace SimulationSaber Zerhoudi, Michael Granitzer, Jelena Mitrovic
Training trustworthy agentic LLMs requires data that shows the grounded reasoning process, not just the final answer. Existing datasets fall short: question-answering data is outcome-only, chain-of-thought data is not tied to specific documents, and web-agent datasets track interface actions rather than the core retrieval and synthesis steps of a RAG workflow. We introduce AgentSim, an open-source platform for simulating RAG agents. It generates verifiable, stepwise traces of agent reasoning over any document collection. AgentSim uses a policy to ensure the agent widely explores the document set. It combines a multi-model validation pipeline with an active human-in-the-loop process. This approach focuses human effort on difficult steps where models disagree. Using AgentSim, we construct and release the Agent-Trace Corpus (ATC), a large collection of grounded reasoning trajectories spanning three established IR benchmarks. We make three contributions: (1) the AgentSim platform with two mechanisms, Corpus-Aware Seeding and Active Validation, that improve trace diversity and quality; (2) the Agent-Trace Corpus (ATC), over 103,000 verifiable reasoning steps spanning three IR benchmarks, with 100% grounding rate on substantive answers; and (3) a comparative behavioral analysis revealing systematic differences in how state-of-the-art models approach information seeking. Platform, toolkit, and corpus are publicly available.
LGJul 9, 2024
DriftGAN: Using historical data for Unsupervised Recurring Drift DetectionChristofer Fellicious, Sahib Julka, Lorenz Wendlinger et al.
In real-world applications, input data distributions are rarely static over a period of time, a phenomenon known as concept drift. Such concept drifts degrade the model's prediction performance, and therefore we require methods to overcome these issues. The initial step is to identify concept drifts and have a training method in place to recover the model's performance. Most concept drift detection methods work on detecting concept drifts and signalling the requirement to retrain the model. However, in real-world cases, there could be concept drifts that recur over a period of time. In this paper, we present an unsupervised method based on Generative Adversarial Networks(GAN) to detect concept drifts and identify whether a specific concept drift occurred in the past. Our method reduces the time and data the model requires to get up to speed for recurring drifts. Our key results indicate that our proposed model can outperform the current state-of-the-art models in most datasets. We also test our method on a real-world use case from astrophysics, where we detect the bow shock and magnetopause crossings with better results than the existing methods in the domain.
IRApr 25
IIRSim Studio: A Dashboard for User SimulationSaber Zerhoudi, Adam Roegiest, Michael Granitzer
User simulation is a valuable methodology for evaluation in Information Retrieval (IR), enabling low-cost experimentation and counterfactual analysis. However, existing simulation frameworks are primarily code-centric libraries that require substantial setup effort, which limits adoption and hinders reproducibility. The bottleneck is not the simulation engines themselves, but the lack of infrastructure connecting experiment design, execution, and sharing into a single verifiable workflow. This paper introduces IIRSim Studio, a web-based workbench that addresses this gap through four contributions: (1) a visual environment for composing simulation pipelines on top of simulation frameworks, serving both novices learning simulation concepts and experts piloting large-scale experiments; (2) a component lifecycle that supports authoring, versioning, and sharing custom simulation components through Git-backed storage and runtime injection; (3) a provenance model based on experiment bundles and environment templates that makes the scope of replication explicit; and (4) a shared-task workflow, demonstrated through the re-deployment of a Sim4IA micro-task. IIRSim Studio is available as a hosted service and as a portable containerized deployment.
IRJul 5, 2024
Challenges and Considerations in Annotating Legal Data: A Comprehensive OverviewHarshil Darji, Jelena Mitrović, Michael Granitzer
The process of annotating data within the legal sector is filled with distinct challenges that differ from other fields, primarily due to the inherent complexities of legal language and documentation. The initial task usually involves selecting an appropriate raw dataset that captures the intricate aspects of legal texts. Following this, extracting text becomes a complicated task, as legal documents often have complex structures, footnotes, references, and unique terminology. The importance of data cleaning is magnified in this context, ensuring that redundant information is eliminated while maintaining crucial legal details and context. Creating comprehensive yet straightforward annotation guidelines is imperative, as these guidelines serve as the road map for maintaining uniformity and addressing the subtle nuances of legal terminology. Another critical aspect is the involvement of legal professionals in the annotation process. Their expertise is valuable in ensuring that the data not only remains contextually accurate but also adheres to prevailing legal standards and interpretations. This paper provides an expanded view of these challenges and aims to offer a foundational understanding and guidance for researchers and professionals engaged in legal data annotation projects. In addition, we provide links to our created and fine-tuned datasets and language models. These resources are outcomes of our discussed projects and solutions to challenges faced while working on them.
IRFeb 26
Generative Agents Navigating Digital LibrariesSaber Zerhoudi, Michael Granitzer
In the rapidly evolving field of digital libraries, the development of large language models (LLMs) has opened up new possibilities for simulating user behavior. This innovation addresses the longstanding challenge in digital library research: the scarcity of publicly available datasets on user search patterns due to privacy concerns. In this context, we introduce Agent4DL, a user search behavior simulator specifically designed for digital library environments. Agent4DL generates realistic user profiles and dynamic search sessions that closely mimic actual search strategies, including querying, clicking, and stopping behaviors tailored to specific user profiles. Our simulator's accuracy in replicating real user interactions has been validated through comparisons with real user data. Notably, Agent4DL demonstrates competitive performance compared to existing user search simulators such as SimIIR 2.0, particularly in its ability to generate more diverse and context-aware user behaviors.
CRFeb 18, 2025Code
Malware Detection based on API callsChristofer Fellicious, Manuel Bischof, Kevin Mayer et al.
Malware attacks pose a significant threat in today's interconnected digital landscape, causing billions of dollars in damages. Detecting and identifying families as early as possible provides an edge in protecting against such malware. We explore a lightweight, order-invariant approach to detecting and mitigating malware threats: analyzing API calls without regard to their sequence. We publish a public dataset of over three hundred thousand samples and their function call parameters for this task, annotated with labels indicating benign or malicious activity. The complete dataset is above 550GB uncompressed in size. We leverage machine learning algorithms, such as random forests, and conduct behavioral analysis by examining patterns and anomalies in API call sequences. By investigating how the function calls occur regardless of their order, we can identify discriminating features that can help us identify malware early on. The models we've developed are not only effective but also efficient. They are lightweight and can run on any machine with minimal performance overhead, while still achieving an impressive F1-Score of over 85\%. We also empirically show that we only need a subset of the function call sequence, specifically calls to the ntdll.dll library, to identify malware. Our research demonstrates the efficacy of this approach through empirical evaluations, underscoring its accuracy and scalability. The code is open source and available at Github along with the dataset on Zenodo.
CLDec 20, 2024Code
On the Suitability of pre-trained foundational LLMs for Analysis in German Legal EducationLorenz Wendlinger, Christian Braun, Abdullah Al Zubaer et al.
We show that current open-source foundational LLMs possess instruction capability and German legal background knowledge that is sufficient for some legal analysis in an educational context. However, model capability breaks down in very specific tasks, such as the classification of "Gutachtenstil" appraisal style components, or with complex contexts, such as complete legal opinions. Even with extended context and effective prompting strategies, they cannot match the Bag-of-Words baseline. To combat this, we introduce a Retrieval Augmented Generation based prompt example selection method that substantially improves predictions in high data availability scenarios. We further evaluate the performance of pre-trained LLMs on two standard tasks for argument mining and automated essay scoring and find it to be more adequate. Throughout, pre-trained LLMs improve upon the baseline in scenarios with little or no labeled data with Chain-of-Thought prompting further helping in the zero-shot case.
LGNov 5, 2024Code
SUDS: A Strategy for Unsupervised Drift SamplingChristofer Fellicious, Lorenz Wendlinger, Mario Gancarski et al.
Supervised machine learning often encounters concept drift, where the data distribution changes over time, degrading model performance. Existing drift detection methods focus on identifying these shifts but often overlook the challenge of acquiring labeled data for model retraining after a shift occurs. We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms, thereby enhancing model adaptability to evolving data. SUDS seamlessly integrates with current drift detection techniques. We also introduce the Harmonized Annotated Data Accuracy Metric (HADAM), a metric that evaluates classifier performance in relation to the quantity of annotated data required to achieve the stated performance, thereby taking into account the difficulty of acquiring labeled data. Our contributions are twofold: SUDS combines drift detection with strategic sampling to improve the retraining process, and HADAM provides a metric that balances classifier performance with the amount of labeled data, ensuring efficient resource utilization. Empirical results demonstrate the efficacy of SUDS in optimizing labeled data use in dynamic environments, significantly improving the performance of machine learning applications in real-world scenarios. Our code is open source and available at https://github.com/cfellicious/SUDS/
IRFeb 19
WebFAQ 2.0: A Multilingual QA Dataset with Mined Hard Negatives for Dense RetrievalMichael Dinzinger, Laura Caspari, Ali Salman et al.
We introduce WebFAQ 2.0, a new version of the WebFAQ dataset, containing 198 million FAQ-based natural question-answer pairs across 108 languages. Compared to the previous version, it significantly expands multilingual coverage and the number of bilingual aligned QA pairs to over 14.3M, making it the largest FAQ-based resource. Unlike the original release, WebFAQ 2.0 uses a novel data collection strategy that directly crawls and extracts relevant web content, resulting in a substantially more diverse and multilingual dataset with richer context through page titles and descriptions. In response to community feedback, we also release a hard negatives dataset for training dense retrievers, with 1.25M queries across 20 languages. These hard negatives were mined using a two-stage retrieval pipeline and include cross-encoder scores for 200 negatives per query. We further show how this resource enables two primary fine-tuning strategies for dense retrievers: Contrastive Learning with MultipleNegativesRanking loss, and Knowledge Distillation with MarginMSE loss. WebFAQ 2.0 is not a static resource but part of a long-term effort. Since late 2025, structured FAQs are being regularly released through the Open Web Index, enabling continuous expansion and refinement. We publish the datasets and training scripts to facilitate further research in multilingual and cross-lingual IR. The dataset itself and all related resources are publicly available on GitHub and HuggingFace.
LGFeb 4
Robust Generalizable Heterogeneous Legal Link PredictionLorenz Wendlinger, Simon Alexander Nonn, Abdullah Al Zubaer et al.
Recent work has applied link prediction to large heterogeneous legal citation networks \new{with rich meta-features}. We find that this approach can be improved by including edge dropout and feature concatenation for the learning of more robust representations, which reduces error rates by up to 45%. We also propose an approach based on multilingual node features with an improved asymmetric decoder for compatibility, which allows us to generalize and extend the prediction to more, geographically and linguistically disjoint, data from New Zealand. Our adaptations also improve inductive transferability between these disjoint legal systems.
CLApr 2, 2024
LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource LanguagesNataliia Kholodna, Sahib Julka, Mohammad Khodadadi et al.
Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling, rendering them rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
IRApr 30
NuggetIndex: Governed Atomic Retrieval for Maintainable RAGSaber Zerhoudi, Michael Granitzer, Jelena Mitrovic
Retrieval-augmented generation (RAG) systems are frequently evaluated via fact-based metrics, yet standard implementations retrieve passages or static propositions. This unit mismatch between evaluation and retrieval objects hinders maintenance when corpora evolve and fails to capture superseded facts or source disagreements. We propose NuggetIndex, a retrieval system that stores atomic information units as managed records, so called nuggets. Each record maintains links to evidence, a temporal validity interval, and a lifecycle state. By filtering invalid or deprecated nuggets prior to ranking, the system prevents the inclusion of outdated information. We evaluate the approach using a nuggetized MS MARCO subset, a temporal Wikipedia QA dataset, and a multi-hop QA task. Against passage and unmanaged proposition retrieval baselines, NuggetIndex improves nugget recall by 42%, increases temporal correctness by 9 percentage points without the recall collapse observed in time-filtered baselines, and reduces conflict rates by 55%. The compact nugget format reduces generator input length by 64% while enabling lightweight index structures suitable for browser-based and resource-constrained deployment. We release our implementation, datasets, and evaluation scripts
LGDec 5, 2023
Towards Measuring Representational Similarity of Large Language ModelsMax Klabunde, Mehdi Ben Amor, Michael Granitzer et al.
Understanding the similarity of the numerous released large language models (LLMs) has many uses, e.g., simplifying model selection, detecting illegal model reuse, and advancing our understanding of what makes LLMs perform well. In this work, we measure the similarity of representations of a set of LLMs with 7B parameters. Our results suggest that some LLMs are substantially different from others. We identify challenges of using representational similarity measures that suggest the need of careful study of similarity scores to avoid false conclusions.
NEApr 24, 2024
Efficient NAS with FaDE on Hierarchical SpacesSimon Neumeyer, Julian Stier, Michael Granitzer
Neural architecture search (NAS) is a challenging problem. Hierarchical search spaces allow for cheap evaluations of neural network sub modules to serve as surrogate for architecture evaluations. Yet, sometimes the hierarchy is too restrictive or the surrogate fails to generalize. We present FaDE which uses differentiable architecture search to obtain relative performance predictions on finite regions of a hierarchical NAS space. The relative nature of these ranks calls for a memory-less, batch-wise outer search algorithm for which we use an evolutionary algorithm with pseudo-gradient descent. FaDE is especially suited on deep hierarchical, respectively multi-cell search spaces, which it can explore by linear instead of exponential cost and therefore eliminates the need for a proxy search space. Our experiments show that firstly, FaDE-ranks on finite regions of the search space correlate with corresponding architecture performances and secondly, the ranks can empower a pseudo-gradient evolutionary search on the complete neural architecture search space.
CRMar 7, 2025
Bridging the Semantic Gap in Virtual Machine Introspection and Forensic Memory AnalysisChristofer Fellicious, Hans P. Reiser, Michael Granitzer
Forensic Memory Analysis (FMA) and Virtual Machine Introspection (VMI) are critical tools for security in a virtualization-based approach. VMI and FMA involves using digital forensic methods to extract information from the system to identify and explain security incidents. A key challenge in both FMA and VMI is the "Semantic Gap", which is the difficulty of interpreting raw memory data without specialized tools and expertise. In this work, we investigate how a priori knowledge, metadata and engineered features can aid VMI and FMA, leveraging machine learning to automate information extraction and reduce the workload of forensic investigators. We choose OpenSSH as our use case to test different methods to extract high level structures. We also test our method on complete physical memory dumps to showcase the effectiveness of the engineered features. Our features range from basic statistical features to advanced graph-based representations using malloc headers and pointer translations. The training and testing are carried out on public datasets that we compare against already recognized baseline methods. We show that using metadata, we can improve the performance of the algorithm when there is very little training data and also quantify how having more data results in better generalization performance. The final contribution is an open dataset of physical memory dumps, totalling more than 1 TB of different memory state, software environments, main memory capacities and operating system versions. Our methods show that having more metadata boosts performance with all methods obtaining an F1-Score of over 80%. Our research underscores the possibility of using feature engineering and machine learning techniques to bridge the semantic gap.
CLFeb 28, 2025
WebFAQ: A Multilingual Collection of Natural Q&A Datasets for Dense RetrievalMichael Dinzinger, Laura Caspari, Kanishka Ghosh Dastidar et al.
We present WebFAQ, a large-scale collection of open-domain question answering datasets derived from FAQ-style schema.org annotations. In total, the data collection consists of 96 million natural question-answer (QA) pairs across 75 languages, including 47 million (49%) non-English samples. WebFAQ further serves as the foundation for 20 monolingual retrieval benchmarks with a total size of 11.2 million QA pairs (5.9 million non-English). These datasets are carefully curated through refined filtering and near-duplicate detection, yielding high-quality resources for training and evaluating multilingual dense retrieval models. To empirically confirm WebFAQ's efficacy, we use the collected QAs to fine-tune an in-domain pretrained XLM-RoBERTa model. Through this process of dataset-specific fine-tuning, the model achieves significant retrieval performance gains, which generalize - beyond WebFAQ - to other multilingual retrieval benchmarks evaluated in zero-shot setting. Last but not least, we utilize WebFAQ to construct a set of QA-aligned bilingual corpora spanning over 1000 language pairs using state-of-the-art bitext mining and automated LLM-assessed translation evaluation. Due to our advanced, automated method of bitext dataset generation, the resulting bilingual corpora demonstrate higher translation quality compared to similar datasets. WebFAQ and all associated resources are publicly available on GitHub and HuggingFace.
CLDec 18, 2024
Enhancing Rhetorical Figure Annotation: An Ontology-Based Web Application with RAG IntegrationRamona Kühn, Jelena Mitrović, Michael Granitzer
Rhetorical figures play an important role in our communication. They are used to convey subtle, implicit meaning, or to emphasize statements. We notice them in hate speech, fake news, and propaganda. By improving the systems for computational detection of rhetorical figures, we can also improve tasks such as hate speech and fake news detection, sentiment analysis, opinion mining, or argument mining. Unfortunately, there is a lack of annotated data, as well as qualified annotators that would help us build large corpora to train machine learning models for the detection of rhetorical figures. The situation is particularly difficult in languages other than English, and for rhetorical figures other than metaphor, sarcasm, and irony. To overcome this issue, we develop a web application called "Find your Figure" that facilitates the identification and annotation of German rhetorical figures. The application is based on the German Rhetorical ontology GRhOOT which we have specially adapted for this purpose. In addition, we improve the user experience with Retrieval Augmented Generation (RAG). In this paper, we present the restructuring of the ontology, the development of the web application, and the built-in RAG pipeline. We also identify the optimal RAG settings for our application. Our approach is one of the first to practically use rhetorical ontologies in combination with RAG and shows promising results.
LGDec 16, 2024
Krony-PT: GPT2 compressed with Kronecker ProductsMohamed Ayoub Ben Ayad, Jelena Mitrovic, Michael Granitzer
We introduce Krony-PT, a compression technique for GPT-2 based on Kronecker products. We specifically target the feed-forward weights of each transformer block, and systematically compress the feed-forward layer matrices to various degrees. We introduce a modified Van Loan decomposition to initialize new Kronecker factors, and also propose a new pruning-based initialization technique. Our method compresses the original 124M-parameter GPT-2 to various smaller models, ranging from 80M to 96M. Our 81M model variant outperforms DistilGPT2 on next-token prediction across all standard language modeling datasets, and shows competitive or comparable performance with significantly larger Kronecker-based compressions of GPT-2.
LGOct 6, 2025
Compressed Concatenation of Small Embedding ModelsMohamed Ayoub Ben Ayad, Michael Dinzinger, Kanishka Ghosh Dastidar et al.
Embedding models are central to dense retrieval, semantic search, and recommendation systems, but their size often makes them impractical to deploy in resource-constrained environments such as browsers or edge devices. While smaller embedding models offer practical advantages, they typically underperform compared to their larger counterparts. To bridge this gap, we demonstrate that concatenating the raw embedding vectors of multiple small models can outperform a single larger baseline on standard retrieval benchmarks. To overcome the resulting high dimensionality of naive concatenation, we introduce a lightweight unified decoder trained with a Matryoshka Representation Learning (MRL) loss. This decoder maps the high-dimensional joint representation to a low-dimensional space, preserving most of the original performance without fine-tuning the base models. We also show that while concatenating more base models yields diminishing gains, the robustness of the decoder's representation under compression and quantization improves. Our experiments show that, on a subset of MTEB retrieval tasks, our concat-encode-quantize pipeline recovers 89\% of the original performance with a 48x compression factor when the pipeline is applied to a concatenation of four small embedding models.
CLJun 24, 2024
Computational Approaches to the Detection of Lesser-Known Rhetorical Figures: A Systematic Survey and Research ChallengesRamona Kühn, Jelena Mitrović, Michael Granitzer
Rhetorical figures play a major role in our everyday communication as they make text more interesting, more memorable, or more persuasive. Therefore, it is important to computationally detect rhetorical figures to fully understand the meaning of a text. We provide a comprehensive overview of computational approaches to lesser-known rhetorical figures. We explore the linguistic and computational perspectives on rhetorical figures, emphasizing their significance for the domain of Natural Language Processing. We present different figures in detail, delving into datasets, definitions, rhetorical functions, and detection approaches. We identified challenges such as dataset scarcity, language limitations, and reliance on rule-based methods.
CVMay 12, 2023
Knowledge distillation with Segment Anything (SAM) model for Planetary Geological MappingSahib Julka, Michael Granitzer
Planetary science research involves analysing vast amounts of remote sensing data, which are often costly and time-consuming to annotate and process. One of the essential tasks in this field is geological mapping, which requires identifying and outlining regions of interest in planetary images, including geological features and landforms. However, manually labelling these images is a complex and challenging task that requires significant domain expertise and effort. To expedite this endeavour, we propose the use of knowledge distillation using the recently introduced cutting-edge Segment Anything (SAM) model. We demonstrate the effectiveness of this prompt-based foundation model for rapid annotation and quick adaptability to a prime use case of mapping planetary skylights. Our work reveals that with a small set of annotations obtained with the right prompts from the model and subsequently training a specialised domain decoder, we can achieve satisfactory semantic segmentation on this task. Key results indicate that the use of knowledge distillation can significantly reduce the effort required by domain experts for manual annotation and improve the efficiency of image segmentation tasks. This approach has the potential to accelerate extra-terrestrial discovery by automatically detecting and segmenting Martian landforms.
LGNov 12, 2021
deepstruct -- linking deep learning and graph theoryJulian Stier, Michael Granitzer
deepstruct connects deep learning models and graph theory such that different graph structures can be imposed on neural networks or graph structures can be extracted from trained neural network models. For this, deepstruct provides deep neural network models with different restrictions which can be created based on an initial graph. Further, tools to extract graph structures from trained models are available. This step of extracting graphs can be computationally expensive even for models of just a few dozen thousand parameters and poses a challenging problem. deepstruct supports research in pruning, neural architecture search, automated network design and structure analysis of neural networks.
LGJul 27, 2021
Experiments on Properties of Hidden Structures of Sparse Neural NetworksJulian Stier, Harshil Darji, Michael Granitzer
Sparsity in the structure of Neural Networks can lead to less energy consumption, less memory usage, faster computation times on convenient hardware, and automated machine learning. If sparsity gives rise to certain kinds of structure, it can explain automatically obtained features during learning. We provide insights into experiments in which we show how sparsity can be achieved through prior initialization, pruning, and during learning, and answer questions on the relationship between the structure of Neural Networks and their performance. This includes the first work of inducing priors from network theory into Recurrent Neural Networks and an architectural performance prediction during a Neural Architecture Search. Within our experiments, we show how magnitude class blinded pruning achieves 97.5% on MNIST with 80% compression and re-training, which is 0.5 points more than without compression, that magnitude class uniform pruning is significantly inferior to it and how a genetic search enhanced with performance prediction achieves 82.4% on CIFAR10. Further, performance prediction for Recurrent Networks learning the Reber grammar shows an $R^2$ of up to 0.81 given only structural information.
CVMar 21, 2021
Conditional Generative Adversarial Networks for Speed Control in Trajectory SimulationSahib Julka, Vishal Sowrirajan, Joerg Schloetterer et al.
Motion behaviour is driven by several factors -- goals, presence and actions of neighbouring agents, social relations, physical and social norms, the environment with its variable characteristics, and further. Most factors are not directly observable and must be modelled from context. Trajectory prediction, is thus a hard problem, and has seen increasing attention from researchers in the recent years. Prediction of motion, in application, must be realistic, diverse and controllable. In spite of increasing focus on multimodal trajectory generation, most methods still lack means for explicitly controlling different modes of the data generation. Further, most endeavours invest heavily in designing special mechanisms to learn the interactions in latent space. We present Conditional Speed GAN (CSG), that allows controlled generation of diverse and socially acceptable trajectories, based on user controlled speed. During prediction, CSG forecasts future speed from latent space and conditions its generation based on it. CSG is comparable to state-of-the-art GAN methods in terms of the benchmark distance metrics, while being simple and useful for simulation and data augmentation for different contexts such as fast or slow paced environments. Additionally, we compare the effect of different aggregation mechanisms and show that a naive approach of concatenation works comparable to its attention and pooling alternatives.
IMOct 29, 2020
Lessons Learned from the 1st ARIEL Machine Learning Challenge: Correcting Transiting Exoplanet Light Curves for Stellar SpotsNikolaos Nikolaou, Ingo P. Waldmann, Angelos Tsiaras et al.
The last decade has witnessed a rapid growth of the field of exoplanet discovery and characterisation. However, several big challenges remain, many of which could be addressed using machine learning methodology. For instance, the most prolific method for detecting exoplanets and inferring several of their characteristics, transit photometry, is very sensitive to the presence of stellar spots. The current practice in the literature is to identify the effects of spots visually and correct for them manually or discard the affected data. This paper explores a first step towards fully automating the efficient and precise derivation of transit depths from transit light curves in the presence of stellar spots. The methods and results we present were obtained in the context of the 1st Machine Learning Challenge organized for the European Space Agency's upcoming Ariel mission. We first present the problem, the simulated Ariel-like data and outline the Challenge while identifying best practices for organizing similar challenges in the future. Finally, we present the solutions obtained by the top-5 winning teams, provide their code and discuss their implications. Successful solutions either construct highly non-linear (w.r.t. the raw data) models with minimal preprocessing -deep neural networks and ensemble methods- or amount to obtaining meaningful statistics from the light curves, constructing linear models on which yields comparably good predictive performance.
CLOct 23, 2020
HateBERT: Retraining BERT for Abusive Language Detection in EnglishTommaso Caselli, Valerio Basile, Jelena Mitrović et al.
In this paper, we introduce HateBERT, a re-trained BERT model for abusive language detection in English. The model was trained on RAL-E, a large-scale dataset of Reddit comments in English from communities banned for being offensive, abusive, or hateful that we have collected and made available to the public. We present the results of a detailed comparison between a general pre-trained language model and the abuse-inclined version obtained by retraining with posts from the banned communities on three English datasets for offensive, abusive language and hate speech detection tasks. In all datasets, HateBERT outperforms the corresponding general BERT model. We also discuss a battery of experiments comparing the portability of the generic pre-trained language model and its corresponding abusive language-inclined counterpart across the datasets, indicating that portability is affected by compatibility of the annotated phenomena.
LGJul 14, 2020
ADSAGE: Anomaly Detection in Sequences of Attributed Graph Edges applied to insider threat detection at fine-grained levelMathieu Garchery, Michael Granitzer
Previous works on the CERT insider threat detection case have neglected graph and text features despite their relevance to describe user behavior. Additionally, existing systems heavily rely on feature engineering and audit data aggregation to detect malicious activities. This is time consuming, requires expert knowledge and prevents tracing back alerts to precise user actions. To address these issues we introduce ADSAGE to detect anomalies in audit log events modeled as graph edges. Our general method is the first to perform anomaly detection at edge level while supporting both edge sequences and attributes, which can be numeric, categorical or even text. We describe how ADSAGE can be used for fine-grained, event level insider threat detection in different audit logs from the CERT use case. Remarking that there is no standard benchmark for the CERT problem, we use a previously proposed evaluation setting based on realistic recall-based metrics. We evaluate ADSAGE on authentication, email traffic and web browsing logs from the CERT insider threat datasets, as well as on real-world authentication events. ADSAGE is effective to detect anomalies in authentications, modeled as user to computer interactions, and in email communications. Simple baselines give surprisingly strong results as well. We also report performance split by malicious scenarios present in the CERT datasets: interestingly, several detectors are complementary and could be combined to improve detection. Overall, our results show that graph features are informative to characterize malicious insider activities, and that detection at fine-grained level is possible.
LGJun 7, 2020
DeepGG: a Deep Graph GeneratorJulian Stier, Michael Granitzer
Learning distributions of graphs can be used for automatic drug discovery, molecular design, complex network analysis, and much more. We present an improved framework for learning generative models of graphs based on the idea of deep state machines. To learn state transition decisions we use a set of graph and node embedding techniques as memory of the state machine. Our analysis is based on learning the distribution of random graph generators for which we provide statistical tests to determine which properties can be learned and how well the original distribution of graphs is represented. We show that the design of the state machine favors specific distributions. Models of graphs of size up to 150 vertices are learned. Code and parameters are publicly available to reproduce our results.
LGFeb 17, 2020
Investigating Extensions to Random Walk Based Graph EmbeddingJoerg Schloetterer, Martin Wehking, Fatemeh Salehi Rizi et al.
Graph embedding has recently gained momentum in the research community, in particular after the introduction of random walk and neural network based approaches. However, most of the embedding approaches focus on representing the local neighborhood of nodes and fail to capture the global graph structure, i.e. to retain the relations to distant nodes. To counter that problem, we propose a novel extension to random walk based graph embedding, which removes a percentage of least frequent nodes from the walks at different levels. By this removal, we simulate farther distant nodes to reside in the close neighborhood of a node and hence explicitly represent their connection. Besides the common evaluation tasks for graph embeddings, such as node classification and link prediction, we evaluate and compare our approach against related methods on shortest path approximation. The results indicate, that extensions to random walk based methods (including our own) improve the predictive performance only slightly - if at all.
SIFeb 16, 2020
Global and Local Feature Learning for Ego-Network AnalysisFatemeh Salehi Rizi, Michael Granitzer, Konstantin Ziegler
In an ego-network, an individual (ego) organizes its friends (alters) in different groups (social circles). This social network can be efficiently analyzed after learning representations of the ego and its alters in a low-dimensional, real vector space. These representations are then easily exploited via statistical models for tasks such as social circle detection and prediction. Recent advances in language modeling via deep learning have inspired new methods for learning network representations. These methods can capture the global structure of networks. In this paper, we evolve these techniques to also encode the local structure of neighborhoods. Therefore, our local representations capture network features that are hidden in the global representation of large networks. We show that the task of social circle prediction benefits from a combination of global and local features generated by our technique.
SIFeb 16, 2020
Predicting event attendance exploring social influenceFatemeh Salehi Rizi, Michael Granitzer
The problem of predicting people's participation in real-world events has received considerable attention as it offers valuable insights for human behavior analysis and event-related advertisement. Today social networks (e.g. Twitter) widely reflect large popular events where people discuss their interest with friends. Event participants usually stimulate friends to join the event which propagates a social influence in the network. In this paper, we propose to model the social influence of friends on event attendance. We consider non-geotagged posts besides structures of social groups to infer users' attendance. To leverage the information on network topology we apply some of recent graph embedding techniques such as node2vec, HARP and Poincar`e. We describe the approach followed to design the feature space and feed it to a neural network. The performance evaluation is conducted using two large music festivals datasets, namely the VFestival and Creamfields. The experimental results show that our classifier outperforms the state-of-the-art baseline with 89% accuracy observed for the VFestival dataset.
LGFeb 12, 2020
Shortest path distance approximation using deep learning techniquesFatemeh Salehi Rizi, Joerg Schloetterer, Michael Granitzer
Computing shortest path distances between nodes lies at the heart of many graph algorithms and applications. Traditional exact methods such as breadth-first-search (BFS) do not scale up to contemporary, rapidly evolving today's massive networks. Therefore, it is required to find approximation methods to enable scalable graph processing with a significant speedup. In this paper, we utilize vector embeddings learnt by deep learning techniques to approximate the shortest paths distances in large graphs. We show that a feedforward neural network fed with embeddings can approximate distances with relatively low distortion error. The suggested method is evaluated on the Facebook, BlogCatalog, Youtube and Flickr social networks.
LGDec 9, 2019
Parallel Total Variation Distance Estimation with Neural Networks for Merging Over-ClusteringsChristian Reiser, Jörg Schlötterer, Michael Granitzer
We consider the initial situation where a dataset has been over-partitioned into $k$ clusters and seek a domain independent way to merge those initial clusters. We identify the total variation distance (TVD) as suitable for this goal. By exploiting the relation of the TVD to the Bayes accuracy we show how neural networks can be used to estimate TVDs between all pairs of clusters in parallel. Crucially, the needed memory space is decreased by reducing the required number of output neurons from $k^2$ to $k$. On realistically obtained over-clusterings of ImageNet subsets it is demonstrated that our TVD estimates lead to better merge decisions than those obtained by relying on state-of-the-art unsupervised representations. Further the generality of the approach is verified by evaluating it on a a point cloud dataset.
LGOct 20, 2019
Policy Learning for Malaria ControlVan Bach Nguyen, Belaid Mohamed Karim, Bao Long Vu et al.
Sequential decision making is a typical problem in reinforcement learning with plenty of algorithms to solve it. However, only a few of them can work effectively with a very small number of observations. In this report, we introduce the progress to learn the policy for Malaria Control as a Reinforcement Learning problem in the KDD Cup Challenge 2019 and propose diverse solutions to deal with the limited observations problem. We apply the Genetic Algorithm, Bayesian Optimization, Q-learning with sequence breaking to find the optimal policy for five years in a row with only 20 episodes/100 evaluations. We evaluate those algorithms and compare their performance with Random Search as a baseline. Among these algorithms, Q-Learning with sequence breaking has been submitted to the challenge and got ranked 7th in KDD Cup.
NEOct 16, 2019
Structural Analysis of Sparse Neural NetworksJulian Stier, Michael Granitzer
Sparse Neural Networks regained attention due to their potential for mathematical and computational advantages. We give motivation to study Artificial Neural Networks (ANNs) from a network science perspective, provide a technique to embed arbitrary Directed Acyclic Graphs into ANNs and report study results on predicting the performance of image classifiers based on the structural properties of the networks' underlying graph. Results could further progress neuroevolution and add explanations for the success of distinct architectures from a structural perspective.
LGSep 3, 2019
Towards automated feature engineering for credit card fraud detection using multi-perspective HMMsYvan Lucas, Pierre-Edouard Portier, Léa Laporte et al.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this framework, we model a sequence of credit card transactions from three different perspectives, namely (i) The sequence contains or doesn't contain a fraud (ii) The sequence is obtained by fixing the card-holder or the payment terminal (iii) It is a sequence of spent amount or of elapsed time between the current and previous transactions. Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sequences is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood to a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. Our multiple perspectives HMM-based approach offers automated feature engineering to model temporal correlations so as to improve the effectiveness of the classification task and allows for an increase in the detection of fraudulent transactions when combined with the state of the art expert based feature engineering strategy for credit card fraud detection. In extension to previous works, we show that this approach goes beyond ecommerce transactions and provides a robust feature engineering over different datasets, hyperparameters and classifiers. Moreover, we compare strategies to deal with structural missing values.
LGJun 17, 2019
Dataset shift quantification for credit card fraud detectionYvan Lucas, Pierre-Edouard Portier, Léa Laporte et al.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However purchase behaviour and fraudster strategies may change over time. This phenomenon is named dataset shift or concept drift in the domain of fraud detection. In this paper, we present a method to quantify day-by-day the dataset shift in our face-to-face credit card transactions dataset (card holder located in the shop) . In practice, we classify the days against each other and measure the efficiency of the classification. The more efficient the classification, the more different the buying behaviour between two days, and vice versa. Therefore, we obtain a distance matrix characterizing the dataset shift. After an agglomerative clustering of the distance matrix, we observe that the dataset shift pattern matches the calendar events for this time period (holidays, week-ends, etc). We then incorporate this dataset shift knowledge in the credit card fraud detection task as a new feature. This leads to a small improvement of the detection.
LGMay 15, 2019
Multiple perspectives HMM-based feature engineering for credit card fraud detectionYvan Lucas, Pierre-Edouard Portier, Léa Laporte et al.
Machine learning and data mining techniques have been used extensively in order to detect credit card frauds. However, most studies consider credit card transactions as isolated events and not as a sequence of transactions. In this article, we model a sequence of credit card transactions from three different perspectives, namely (i) does the sequence contain a Fraud? (ii) Is the sequence obtained by fixing the card-holder or the payment terminal? (iii) Is it a sequence of spent amount or of elapsed time between the current and previous transactions? Combinations of the three binary perspectives give eight sets of sequences from the (training) set of transactions. Each one of these sets is modelled with a Hidden Markov Model (HMM). Each HMM associates a likelihood to a transaction given its sequence of previous transactions. These likelihoods are used as additional features in a Random Forest classifier for fraud detection. This multiple perspectives HMM-based approach enables an automatic feature engineering in order to model the sequential properties of the dataset with respect to the classification task. This strategy allows for a 15% increase in the precision-recall AUC compared to the state of the art feature engineering strategy for credit card fraud detection.