84.2 SI Jun 2
Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics
Thomas Maillart, Thibaut Chataing, Ntorina Antoni et al.
We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs -- the emergence and intensification of links between research concepts -- by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC-AUC in [0.954, 0.967] at all horizons without re-tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors -- particularly Adamic-Adar similarity and degree-based Hadamard measures -- consistently drive accuracy, suggesting that breakthrough-relevant recombinations emerge in tightly connected sub-networks. Two expert-anchored cases, quantum annealing and AI-enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three-layer decision architecture -- detection, expert translation, institutional integration -- that turns these forecasts into evidence-based research strategy and policy, anchored in open data and explainable features.
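The two structural features named in the abstract can be illustrated on a toy concept graph. This is a minimal sketch, not the authors' feature pipeline: the edge list is invented, and treating the "degree-based Hadamard measure" as the product of the two endpoint degrees is an assumption about what that feature means.

```python
import math

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (concept, concept) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def adamic_adar(adj, u, v):
    """Sum of 1 / log(degree) over the common neighbours of u and v."""
    score = 0.0
    for w in adj[u] & adj[v]:
        degree = len(adj[w])
        if degree > 1:  # guard: log(1) = 0 would divide by zero
            score += 1.0 / math.log(degree)
    return score

def hadamard_degree(adj, u, v):
    """Product of endpoint degrees (assumed reading of the Hadamard feature)."""
    return len(adj[u]) * len(adj[v])

# Hypothetical toy network; concept names are illustrative only.
edges = [
    ("qubit", "annealing"),
    ("qubit", "optimization"),
    ("annealing", "optimization"),
    ("optimization", "machine learning"),
    ("qubit", "error correction"),
]
adj = build_adjacency(edges)
pair = ("annealing", "machine learning")
print(adamic_adar(adj, *pair), hadamard_degree(adj, *pair))
```

Rare shared neighbours contribute more to the Adamic-Adar score than hubs do, which is consistent with the abstract's remark that recombinations emerge in tightly connected sub-networks.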
87.5 SI Jun 2
Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing
Thomas Maillart, Thibaut Chataing, David Dosu et al.
Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to $0.78$) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\text{test}} \sim 0.60$--$0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\text{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.
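The abstract does not define diffusion entropy. One plausible reading, assumed here, is the Shannon entropy of how a concept pair's downstream adopters are spread across fields. The sketch below uses invented field labels to show why an entropy rise suggests a widening frontier and a collapse suggests convergence.

```python
import math
from collections import Counter

def diffusion_entropy(field_labels):
    """Shannon entropy (bits) of the distribution of adopters over fields.

    Assumption: `field_labels` holds one field name per downstream work
    that picked up the concept pair; the paper's exact definition may differ.
    """
    counts = Counter(field_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical examples: uptake concentrated in one field vs. spread over four.
concentrated = ["physics"] * 8
dispersed = ["physics", "chemistry", "finance", "logistics"] * 2

print(diffusion_entropy(concentrated), diffusion_entropy(dispersed))
```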
CL Dec 12, 2022 Code
Robust and Explainable Identification of Logical Fallacies in Natural Language Arguments
Zhivar Sourati, Vishnu Priya Prasanna Venkatesh, Darshan Deshpande et al.
The spread of misinformation, propaganda, and flawed argumentation has been amplified in the Internet era. Given the volume of data and the subtlety of identifying violations of argumentation norms, supporting information analytics tasks, like content moderation, with trustworthy methods that can identify logical fallacies is essential. In this paper, we formalize prior theoretical work on logical fallacies into a comprehensive three-stage evaluation framework of detection, coarse-grained, and fine-grained classification. We adapt existing evaluation datasets for each stage of the evaluation. We employ three families of robust and explainable methods based on prototype reasoning, instance-based reasoning, and knowledge injection. The methods combine language models with background knowledge and explainable mechanisms. Moreover, we address data sparsity with strategies for data augmentation and curriculum learning. Our three-stage framework natively consolidates prior datasets and methods from existing tasks, like propaganda detection, serving as an overarching evaluation testbed. We extensively evaluate these methods on our datasets, focusing on their robustness and explainability. Our results provide insight into the strengths and weaknesses of the methods on different components and fallacy classes, indicating that fallacy identification is a challenging task that may require specialized forms of reasoning to capture various classes. We share our open-source code and data on GitHub to support further work on logical fallacy identification.
59.9 CY May 24 Code
Building Digital Societies as Ecosystems: How Recognition and Repeat Relationships Sustain Cross-Community Work in Open Source
Lucia Gomez Tejeiro, Thibaut Chataing, Julian Jang-Jaccard et al.
We measure cross-boundary collaboration in an open-source software (OSS) ecosystem by reconstructing the bipartite contributor-repository graph of 464 cybersecurity projects and 11,372 contributors active over October 2001-May 2022 (Rawsec Cybersecurity Inventory). Louvain community detection identifies 163 non-singleton communities; per-community contributor count scales superlinearly with repository count (n_contributors ~ n_repos^1.4), and community formation follows a logistic trajectory saturating around 2018. Three patterns support a recognition/repeat-relationship account of cross-boundary work. First, cross-community work concentrates in a thin carrier layer: only nine canonical humans span seven or more communities at the commit level, authoring 14% of 4,015 inter-community merged pull requests; the top 50 cross-community contributors produce 54%. Second, boundary friction is a recognition cost, not a fixed boundary property: inter-community pull-request acceptance rises from 42% at breadth k=1 to 87% at k=5-9, with median latency compressing from 147 h to 49 h. Third, community survival is cohort-structured: per-cohort residualisation hazard rises an order of magnitude between pre-2010 and 2018 cohorts, and external community reach predicts survival mainly through size, leaving late cohorts under-served despite a stable carrier layer. The corpus predates mainstream LLM coding assistants; this baseline of carrier-layer thinness, friction gradient, and cohort hazard informs debates on social coding as a template for digital societies and on what AI-mediated OSS ecosystems should not optimise away.
CL Mar 21, 2023
Fundamentals of Generative Large Language Models and Perspectives in Cyber-Defense
Andrei Kucharavy, Zachary Schillaci, Loïc Maréchal et al.
Generative Language Models gained significant attention in late 2022 / early 2023, notably with the introduction of models refined to act consistently with users' expectations of interactions with AI (conversational models). Arguably the focal point of public attention has been one such refinement of the GPT-3 model, ChatGPT, and its subsequent integration with auxiliary capabilities, including search as part of Microsoft Bing. Despite extensive prior research invested in their development, their performance and applicability to a range of daily tasks remained unclear and niche. However, their wider utilization without a requirement for technical expertise, made possible in large part through conversational fine-tuning, revealed the extent of their true capabilities in a real-world environment. This has garnered both public excitement for their potential applications and concerns about their capabilities and potential malicious uses. This review aims to provide a brief overview of the history, state of the art, and implications of Generative Language Models in terms of their principles, abilities, limitations, and future prospects -- especially in the context of cyber-defense, with a focus on the Swiss operational environment.
CY Nov 28, 2022
Beyond S-curves: Recurrent Neural Networks for Technology Forecasting
Alexander Glavackij, Dimitri Percia David, Alain Mermoud et al.
Because of the considerable heterogeneity and complexity of the technological landscape, building accurate forecasting models is a challenging endeavor. Due to their high prevalence in many complex systems, S-curves are a popular forecasting approach in previous work. However, their forecasting performance has not been directly compared to other technology forecasting approaches. Additionally, recent developments in time series forecasting that claim to improve forecasting accuracy are yet to be applied to technological development data. This work addresses both research gaps by comparing the forecasting performance of S-curves to a baseline and by developing an autoencoder approach that employs recent advances in machine learning and time series forecasting. S-curve forecasts largely exhibit a mean absolute percentage error (MAPE) comparable to a simple ARIMA baseline. However, for a minority of emerging technologies, the MAPE increases by two orders of magnitude. Our autoencoder approach improves the MAPE by 13.5% on average over the second-best result. It forecasts established technologies with the same accuracy as the other approaches. However, it is especially strong at forecasting emerging technologies, with a mean MAPE 18% lower than the next best result. Our results imply that a simple ARIMA model is preferable to the S-curve for technology forecasting. Practitioners looking for more accurate forecasts should opt for the presented autoencoder approach.
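For readers unfamiliar with the metric, MAPE can be computed as below. This is a minimal sketch with invented numbers, not the paper's evaluation code; skipping zero-valued observations is an assumption made here to avoid division by zero. The second example shows how tiny actual values, typical of an emerging technology, can inflate the error dramatically.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    Observations with an actual value of zero are skipped (an assumption
    of this sketch), since the percentage error is undefined there.
    """
    pairs = [(a, f) for a, f in zip(actual, forecast) if a != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs((a - f) / a) for a, f in pairs) / len(pairs)

# Hypothetical yearly counts for an established and an emerging technology.
established = mape([100, 200, 400], [110, 180, 400])
emerging = mape([1, 2, 4], [11, 22, 4])

print(established, emerging)
```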
CR Sep 6, 2022
Orchestrating Collaborative Cybersecurity: A Secure Framework for Distributed Privacy-Preserving Threat Intelligence Sharing
Juan R. Trocoso-Pastoriza, Alain Mermoud, Romain Bouyé et al.
Cyber Threat Intelligence (CTI) sharing is an important activity to reduce information asymmetries between attackers and defenders. However, this activity presents challenges due to the tension between data sharing and confidentiality, which results in information retention and often leads to a free-rider problem. Therefore, the information that is shared represents only the tip of the iceberg. Current literature assumes access to centralized databases containing all the information, but this is not always feasible, due to the aforementioned tension. This results in unbalanced or incomplete datasets, requiring the use of techniques to expand them; we show how these techniques lead to biased results and misleading performance expectations. We propose a novel framework for extracting CTI from distributed data on incidents, vulnerabilities and indicators of compromise, and demonstrate its use in several practical scenarios, in conjunction with the Malware Information Sharing Platform (MISP). Policy implications for CTI sharing are presented and discussed. The proposed system relies on an efficient combination of privacy-enhancing technologies and federated processing. This lets organizations stay in control of their CTI and minimize the risks of exposure or leakage, while enabling the benefits of sharing, more accurate and representative results, and more effective predictive and preventive defenses.
AI Jan 27, 2023
Case-Based Reasoning with Language Models for Classification of Logical Fallacies
Zhivar Sourati, Filip Ilievski, Hông-Ân Sandlin et al.
The ease and speed of spreading misinformation and propaganda on the Web motivate the need to develop trustworthy technology for detecting fallacies in natural language arguments. However, state-of-the-art language modeling methods exhibit a lack of robustness on tasks like logical fallacy classification that require complex reasoning. In this paper, we propose a Case-Based Reasoning method that classifies new cases of logical fallacy by language-modeling-driven retrieval and adaptation of historical cases. We design four complementary strategies to enrich input representation for our model, based on external information about goals, explanations, counterarguments, and argument structure. Our experiments in in-domain and out-of-domain settings indicate that Case-Based Reasoning improves the accuracy and generalizability of language models. Our ablation studies suggest that representations of similar cases have a strong impact on the model performance, that models perform well with fewer retrieved cases, and that the size of the case database has a negligible effect on the performance. Finally, we dive deeper into the relationship between the properties of the retrieved cases and the model performance.
AI Dec 11, 2022
Multimodal and Explainable Internet Meme Classification
Abhinav Kumar Thakur, Filip Ilievski, Hông-Ân Sandlin et al.
In the current context where online platforms have been effectively weaponized in a variety of geo-political events and social issues, Internet memes make fair content moderation at scale even more difficult. Existing work on meme classification and tracking has focused on black-box methods that do not explicitly consider the semantics of the memes or the context of their creation. In this paper, we pursue a modular and explainable architecture for Internet meme understanding. We design and implement multimodal classification methods that perform example- and prototype-based reasoning over training cases, while leveraging both textual and visual SOTA models to represent the individual cases. We study the relevance of our modular and explainable models in detecting harmful memes on two existing tasks: Hate Speech Detection and Misogyny Classification. We compare the performance between example- and prototype-based methods, and between text, vision, and multimodal models, across different categories of harmfulness (e.g., stereotype and objectification). We devise a user-friendly interface that facilitates the comparative analysis of examples retrieved by all of our models for any given meme, informing the community about the strengths and limitations of these explainable methods.
CL Dec 11, 2022
A Study of Slang Representation Methods
Aravinda Kolla, Filip Ilievski, Hông-Ân Sandlin et al.
Considering the large amount of content created online by the minute, slang-aware automatic tools are critically needed to promote social good, and assist policymakers and moderators in restricting the spread of offensive language, abuse, and hate speech. Despite the success of large language models and the spontaneous emergence of slang dictionaries, it is unclear how far their combination goes in terms of slang understanding for downstream social good tasks. In this paper, we provide a framework to study different combinations of representation learning models and knowledge resources for a variety of downstream tasks that rely on slang understanding. Our experiments show the superiority of models that have been pre-trained on social media data, while the impact of dictionaries is positive only for static word embeddings. Our error analysis identifies core challenges for slang representation learning, including out-of-vocabulary words, polysemy, variance, and annotation disagreements, which can be traced to characteristics of slang as a quickly evolving and highly subjective language.
CL Dec 12, 2023
LLMs Perform Poorly at Concept Extraction in Cyber-security Research Literature
Maxime Würsch, Andrei Kucharavy, Dimitri Percia David et al.
The cybersecurity landscape evolves rapidly and poses threats to organizations. To enhance resilience, one needs to track the latest developments and trends in the domain. Standard bibliometric approaches have been shown to reach their limits in such a fast-evolving domain. We therefore use large language models (LLMs) to extract relevant knowledge entities from cybersecurity-related texts. We use a subset of arXiv preprints on cybersecurity as our data and compare different LLMs in terms of entity recognition (ER) and relevance. The results suggest that LLMs do not produce good knowledge entities that reflect the cybersecurity context, but they show some potential for noun extractors. For this reason, we developed a noun extractor boosted with some statistical analysis to extract specific and relevant compound nouns from the domain. We then tested our model on identifying trends in the LLM domain. We observe some limitations, but the model offers promising results for monitoring the evolution of emergent trends.
CL Oct 29, 2025
Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs
Alexander Sternfeld, Andrei Kucharavy, Dimitri Percia David et al.
Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurrence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017--2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018--2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
CR Dec 10, 2021
TechRank: A Network-Centrality Approach for Informed Cybersecurity-Investment
Anita Mezzetti, Dimitri Percia David, Thomas Maillart et al.
The cybersecurity technological landscape is a complex ecosystem in which entities -- such as companies and technologies -- influence each other in a non-trivial manner. Measuring the influence between entities is essential for informed technological investments in critical infrastructure. To study the mutual influence of companies and technologies from the cybersecurity field, we consider a bipartite graph that links both sets of entities. Each node in this graph is weighted by applying a recursive algorithm based on the method of reflection. This endeavor helps to measure the impact of an entity on the cybersecurity market. Our results help researchers measure more precisely the magnitude of influence of each entity, and allow decision-makers to devise more informed investment strategies, according to their portfolio preferences. Finally, a research agenda is suggested, with the aim of allowing tailor-made investments by arbitrarily calibrating specific features of both types of entities.
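The recursive weighting can be pictured with the plain method-of-reflections recursion on a company-technology incidence matrix. This is a sketch of the generic recursion only, under the assumption that each node's score is repeatedly replaced by the average score of its neighbours on the other side; the TechRank algorithm itself may use a different update rule and calibration parameters. The matrix is invented.

```python
def method_of_reflections(M, n_iter=2):
    """Iterate neighbour-averaged scores on a bipartite incidence matrix.

    M[c][t] is 1 if company c works on technology t, else 0.
    Returns (company_scores, technology_scores) after n_iter reflections.
    """
    n_c, n_t = len(M), len(M[0])
    kc0 = [sum(row) for row in M]                               # company degrees
    kt0 = [sum(M[c][t] for c in range(n_c)) for t in range(n_t)]  # technology degrees
    kc, kt = [float(k) for k in kc0], [float(k) for k in kt0]
    for _ in range(n_iter):
        # Both updates read the previous iteration's scores.
        new_kc = [
            sum(M[c][t] * kt[t] for t in range(n_t)) / kc0[c] if kc0[c] else 0.0
            for c in range(n_c)
        ]
        new_kt = [
            sum(M[c][t] * kc[c] for c in range(n_c)) / kt0[t] if kt0[t] else 0.0
            for t in range(n_t)
        ]
        kc, kt = new_kc, new_kt
    return kc, kt

# Hypothetical 3 companies x 3 technologies incidence matrix.
M = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
company_scores, technology_scores = method_of_reflections(M, n_iter=1)
print(company_scores, technology_scores)
```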
IR Dec 9, 2021
From Scattered Sources to Comprehensive Technology Landscape: A Recommendation-based Retrieval Approach
Chi Thang Duong, Dimitri Percia David, Ljiljana Dolamic et al.
Mapping the technology landscape is crucial for market actors to make informed investment decisions. However, given the large amount of data on the Web and the resulting information overload, manually retrieving information is an ineffective and incomplete approach. In this work, we propose an end-to-end recommendation-based retrieval approach to support automatic retrieval of technologies and their associated companies from raw Web data. This is a two-task setup involving (i) technology classification of entities extracted from a company corpus, and (ii) technology and company retrieval based on the classified technologies. Our proposed framework approaches the first task by leveraging DistilBERT, a state-of-the-art language model. For the retrieval task, we introduce a recommendation-based retrieval technique that simultaneously supports retrieving related companies, technologies related to a specific company, and companies relevant to a technology. To evaluate these tasks, we also construct a data set that includes company documents and entities extracted from these documents, together with company categories and technology labels. Experiments show that our approach is able to return 4 times more relevant companies while outperforming a traditional retrieval baseline in retrieving technologies.
CR Dec 8, 2021
Cyber-Security Investment in the Context of Disruptive Technologies: Extension of the Gordon-Loeb Model
Dimitri Percia David, Alain Mermoud, Sébastien Gillard
Cyber-security breaches inflict significant costs on organizations. Hence, the development of an information-systems defense capability through cyber-security investment is a prerequisite. The question of how to determine the optimal amount to invest in cyber-security has been widely investigated in the literature. In this respect, the Gordon-Loeb model and its extensions have received wide-scale acceptance. However, such models predominantly rely on restrictive assumptions that are not suited to analyzing dynamic aspects of cyber-security investment. Yet, understanding such dynamic aspects is key to studying cyber-security investment in the context of a fast-paced and continuously evolving technological landscape. We propose an extension of the Gordon-Loeb model that considers multiple periods and relaxes the assumption of a continuous security-breach probability function. These theoretical adaptations make it possible to capture dynamic aspects of cyber-security investment, such as the advent of a disruptive technology and its investment consequences. The proposed extension of the Gordon-Loeb model leaves room for a hypothetical decrease of the optimal level of cyber-security investment, due to a potential technological shift. While we believe our framework should be generalizable across the cyber-security milieu, we illustrate our approach in the context of critical-infrastructure protection, where security-cost reductions related to risk events are of paramount importance as potential losses reach unaffordable proportions. Moreover, although some technologies are considered disruptive and thus promising for critical-infrastructure protection, their effects on cyber-security investment have been discussed little.
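As background for the extension, the baseline single-period Gordon-Loeb setting can be sketched as follows. The code uses the first class of breach-probability functions from the original model, S(z, v) = v / (alpha*z + 1)^beta, and its closed-form optimum; the parameter values are invented, and nothing here reproduces the multi-period or discontinuous case that the paper proposes.

```python
import math

def breach_probability(z, v, alpha, beta):
    """Class-I Gordon-Loeb breach probability after investing z."""
    return v / (alpha * z + 1.0) ** beta

def enbis(z, v, loss, alpha, beta):
    """Expected net benefit of investing z in information security."""
    return (v - breach_probability(z, v, alpha, beta)) * loss - z

def optimal_investment(v, loss, alpha, beta):
    """Closed-form maximiser of enbis for the class-I function (clipped at 0)."""
    z_star = ((v * alpha * beta * loss) ** (1.0 / (beta + 1.0)) - 1.0) / alpha
    return max(z_star, 0.0)

# Hypothetical parameters: vulnerability v, potential loss, productivity terms.
v, loss, alpha, beta = 0.5, 1_000_000.0, 1e-5, 1.0
z_star = optimal_investment(v, loss, alpha, beta)

# The classic result bounds the optimum by (1/e) of the expected loss v * loss.
print(round(z_star), round(v * loss / math.e))
```

In this baseline the optimum follows from a smooth first-order condition; relaxing continuity of the breach-probability function, as the abstract describes, is what allows the optimal investment level to shift when a disruptive technology arrives.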
NI Aug 19, 2021
5G System Security Analysis
Gerrit Holtrup, William Lacube, Dimitri Percia David et al.
Fifth generation mobile networks (5G) are currently being deployed by mobile operators around the globe. 5G acts as an enabler for various use cases and also improves the security and privacy over 4G and previous network generations. However, as recent security research has revealed, the standard still has security weaknesses that may be exploitable by attackers. In addition, the migration from 4G to 5G systems is taking place by first deploying 5G solutions in a non-standalone (NSA) manner where the first step of the 5G deployment is restricted to the new radio aspects of 5G, while the control of the user equipment is still based on 4G protocols, i.e. the core network is still the legacy 4G evolved packet core (EPC) network. As a result, many security vulnerabilities of 4G networks are still present in current 5G deployments. This paper presents a systematic risk analysis of standalone and non-standalone 5G networks. We first describe an overview of the 5G system specification and the new security features of 5G compared to 4G. Then, we define possible threats according to the STRIDE threat classification model and derive a risk matrix based on the likelihood and impact of 12 threat scenarios that affect the radio access and the network core. Finally, we discuss possible mitigations and security controls. Our analysis is generic and does not account for the specifics of particular 5G network vendors or operators. Further work is required to understand the security vulnerabilities and risks of specific 5G implementations and deployments.
CR Mar 3, 2021
Blockchain in Cyberdefence: A Technology Review from a Swiss Perspective
Luca Gambazzi, Patrick Schaller, Alain Mermoud et al.
Since the advent of Bitcoin in 2008, the concept of a blockchain has spread widely. Besides cryptocurrencies and trading activities, there is a wide range of potential application areas where blockchains provide the main building block for secure solutions. From a technical point of view, a blockchain involves a set of cryptographic primitives to provide a data structure with security and trust properties. However, a blockchain is not a silver bullet. It may be well suited for some problems, but it is often an inappropriate data structure for many applications. In this paper, we review the high-level concept of a blockchain and present possible applications in the military field. Our review is targeted at readers with little prior domain knowledge, as a support for deciding where it makes sense to use a blockchain and where a blockchain might not be the right tool.
CR Jul 6, 2020
Contact Tracing: An Overview of Technologies and Cyber Risks
Franck Legendre, Mathias Humbert, Alain Mermoud et al.
The 2020 COVID-19 pandemic has led to a global lockdown with severe health and economic consequences. As a result, authorities around the globe have expressed the need for better tools to monitor the spread of the virus and to support human labor. Researchers and technology companies such as Google and Apple have offered to develop such tools in the form of contact tracing applications. The goal of these applications is to continuously track people's proximity and to make smartphone users aware if they have ever been in contact with positively diagnosed people, so that they can self-quarantine and possibly take an infection test. A fundamental challenge with these smartphone-based contact tracing technologies is to ensure the security and privacy of their users. Moving from manual to smartphone-based contact tracing creates new cyber risks that could suddenly affect the entire population. Major risks include, for example, the abuse of people's private data by companies and/or authorities, or the spreading of false alerts by malicious users in order to force individuals into quarantine. In April 2020, the Pan-European Privacy-Preserving Proximity Tracing (PEPP-PT) initiative was announced with the goal of developing and evaluating secure solutions for European countries. However, after a while, several team members left this consortium and created DP-3T, which has led to an international debate among experts. At this time, it is difficult for the non-expert to follow this debate; this report aims to shed light on the various proposed technologies by providing an objective assessment of the cybersecurity and privacy risks. We first review the state of the art in digital contact tracing technologies and then explore the risk-utility trade-offs of the techniques proposed for COVID-19. We focus specifically on the technologies that have already been adopted by certain countries.