83.5 | SI | Jun 2
Explainable Forecasting of Scientific Breakthroughs from Concept Network Dynamics
Thomas Maillart, Thibaut Chataing, Ntorina Antoni et al.
We introduce an explainable machine-learning approach that forecasts the structural precursors of scientific breakthroughs -- the emergence and intensification of links between research concepts -- by modelling how OpenAlex concept networks evolve over time. Using 59 semantic and topological features, a two-stage LightGBM model jointly predicts the formation and the future weight of concept pairs, adding a regression stage that quantifies expected intensity to prior link-existence forecasts. Relative to the state of the art, the approach improves accuracy and explainability at once: comparative validation across four technology and biomedical domains yields ROC-AUC in [0.954, 0.967] at all horizons without re-tuning, exceeding the roughly 0.90 of prior models, while every forecast rests on structural, auditable features rather than opaque embeddings. Classification performance is high (AUC about 0.95) and regression remains stable (RMSLE 0.45 to 0.6 over one to five years). Feature attribution shows that structural factors -- particularly Adamic-Adar similarity and degree-based Hadamard measures -- consistently drive accuracy, suggesting that breakthrough-relevant recombinations emerge in tightly connected sub-networks. Two expert-anchored cases, quantum annealing and AI-enabled quantum architectures, show the model surfacing technological convergence consistent with expert expectations. We then outline a three-layer decision architecture -- detection, expert translation, institutional integration -- that turns these forecasts into evidence-based research strategy and policy, anchored in open data and explainable features.
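The two structural features singled out by the attribution analysis can be illustrated on a toy graph. This is a minimal sketch, not the authors' feature pipeline: the graph, the helper names `adamic_adar` and `degree_hadamard`, and the reading of "degree-based Hadamard measure" as the product of the two endpoint degrees are all assumptions made for illustration.

```python
import math

def adamic_adar(adj, u, v):
    """Adamic-Adar similarity: sum of 1/log(degree) over common neighbours.

    Neighbours of degree 1 are skipped because log(1) = 0.
    """
    common = adj[u] & adj[v]
    return sum(1.0 / math.log(len(adj[w])) for w in common if len(adj[w]) > 1)

def degree_hadamard(adj, u, v):
    """Element-wise (Hadamard) product of the two endpoint degrees (assumed reading)."""
    return len(adj[u]) * len(adj[v])

# Hypothetical concept co-occurrence edges; not taken from OpenAlex.
edges = [("a", "c"), ("b", "c"), ("a", "d"), ("b", "d"), ("d", "e")]
adj = {}
for x, y in edges:
    adj.setdefault(x, set()).add(y)
    adj.setdefault(y, set()).add(x)

# Candidate pair (a, b) is not yet linked but shares neighbours c and d.
aa = adamic_adar(adj, "a", "b")
dh = degree_hadamard(adj, "a", "b")
print(round(aa, 4), dh)
```

A pair that shares many low-degree neighbours scores high on Adamic-Adar, which matches the abstract's observation that recombinations emerge in tightly connected sub-networks.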
86.9 | SI | Jun 2
Forecasting Conceptual Diffusion in Science: The Case of Quantum Computing
Thomas Maillart, Thibaut Chataing, David Dosu et al.
Understanding and anticipating scientific change requires models that distinguish between endogenous consolidation and exogenous diffusion of scientific concepts. Using the quantum computing subtree of concepts in OpenAlex, we construct a temporally resolved concept co-occurrence network and track each concept pair through its upstream citation lineage and downstream diffusion. We train LightGBM models on distributional and diversity-aware features to predict four outcomes: endogenous reinforcement, exogenous diffusion, their ratio, and diffusion entropy. After controlling for overall publication growth of the scientific body, endogenous reinforcement proves largely unpredictable in the primary quantum-computing benchmark. In contrast, exogenous diffusion and entropy are strongly predictable ($R^2$ up to $0.78$) and are driven by upstream heterogeneity, citation breadth, and distributional dispersion, as shown by SHAP analyses; replications on robotics, advanced materials, and neuro implants confirm that exogenous diffusion remains the top-ranked target across fields ($R^2_{\mathrm{test}} \sim 0.60$ to $0.87$), while endogenous predictability rises markedly in neuro implants ($R^2_{\mathrm{test}} = 0.83$), indicating that the quantum-computing asymmetry does not generalise uniformly. Case studies reveal that sharp entropy increases coincide with the opening of new conceptual frontiers, while entropy collapses signal technological convergence or paradigm displacement. These results demonstrate that conceptual diffusion is governed by stable structural regularities embedded in semantic and citation environments. By identifying early diversity-based signals of cross-domain uptake, the approach provides a scalable foundation for anticipatory scientometrics, technology foresight, and innovation-oriented policy analysis in rapidly evolving research fields.
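The abstract does not define diffusion entropy precisely. One plausible reading, assumed here, is the Shannon entropy of how a concept pair's downstream uptake is distributed across fields. The sketch below uses made-up field counts and a hypothetical helper `shannon_entropy` to show why a rise signals broadening uptake and a collapse signals convergence.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as {category: count}."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical downstream citing-work counts per field for one concept pair.
broad = {"physics": 2, "chemistry": 2, "computer science": 2, "biology": 2}
narrow = {"physics": 8}

h_broad = shannon_entropy(broad)    # uptake spread evenly over four fields
h_narrow = shannon_entropy(narrow)  # uptake confined to a single field
print(h_broad, h_narrow)
```

Under this assumed definition, the evenly spread case reaches the maximum of log2(4) bits for four fields, while the single-field case has zero entropy.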
59.5 | CY | May 24 | Code
Building Digital Societies as Ecosystems: How Recognition and Repeat Relationships Sustain Cross-Community Work in Open Source
Lucia Gomez Tejeiro, Thibaut Chataing, Julian Jang-Jaccard et al.
We measure cross-boundary collaboration in an open-source software (OSS) ecosystem by reconstructing the bipartite contributor-repository graph of 464 cybersecurity projects and 11,372 contributors active over October 2001-May 2022 (Rawsec Cybersecurity Inventory). Louvain community detection identifies 163 non-singleton communities; per-community contributor count scales superlinearly with repository count (n_contributors ~ n_repos^1.4), and community formation follows a logistic trajectory saturating around 2018. Three patterns support a recognition/repeat-relationship account of cross-boundary work. First, cross-community work concentrates in a thin carrier layer: only nine canonical humans span seven or more communities at the commit level, authoring 14% of 4,015 inter-community merged pull requests; the top 50 cross-community contributors produce 54%. Second, boundary friction is a recognition cost, not a fixed boundary property: inter-community pull-request acceptance rises from 42% at breadth k=1 to 87% at k=5-9, with median latency compressing from 147 h to 49 h. Third, community survival is cohort-structured: per-cohort residualisation hazard rises an order of magnitude between pre-2010 and 2018 cohorts, and external community reach predicts survival mainly through size, leaving late cohorts under-served despite a stable carrier layer. The corpus predates mainstream LLM coding assistants; this baseline of carrier-layer thinness, friction gradient, and cohort hazard informs debates on social coding as a template for digital societies and on what AI-mediated OSS ecosystems should not optimise away.
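A scaling relation such as n_contributors ~ n_repos^1.4 is commonly estimated by ordinary least squares on log-transformed values. The sketch below assumes that approach (the abstract does not state the fitting method) and runs on synthetic data generated with a known exponent, so the numbers are not the paper's.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = a * x**beta by least squares on (log x, log y); return (beta, a)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    beta = sxy / sxx                           # slope in log-log space
    prefactor = math.exp(mean_y - beta * mean_x)
    return beta, prefactor

# Synthetic, noise-free communities: contributors = 3 * repos**1.4 (illustrative only).
repos = [1, 2, 4, 8, 16]
contributors = [3 * r ** 1.4 for r in repos]

beta, prefactor = fit_power_law(repos, contributors)
print(round(beta, 3), round(prefactor, 3))
```

An exponent above 1 means that doubling a community's repository count more than doubles its contributor count, which is what "superlinear" denotes here.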
SE | Aug 11, 2016 | Code
Aristotle vs. Ringelmann: On Superlinear Production in Open Source Software
Thomas Maillart, Didier Sornette
Organizations exist because they provide additional production gains, in comparison to horizontal ways of allocating resources, such as markets, and the open source movement is deemed to be a new kind of peer-production organization somehow in between hierarchically organized firms and markets. However, to thrive as a new kind of organization, open source must provide production gains, which in turn should be measurable. The open source movement is particularly interesting to study for this reason. Here, we confront and discuss two contrasting views, which were reported in the literature recently. On the one hand, Sornette et al. uncovered a superlinear production mechanism, which quantifies Aristotle's adage: "the whole is more than the sum of its parts". On the other hand, Scholtes et al. found opposite results, and referred to Maximilien Ringelmann, a French agricultural engineer (1861-1931), who discovered the tendency for individual members of a group to become increasingly less productive as the size of their group increases. Since Ringelmann, the topic of collective intelligence has interested many researchers in social sciences and social psychology, as well as practitioners in management aiming at improving the performance of their teams. In most research and practice case studies, the Ringelmann effect has been found to hold, while, in contrast, the superlinear effect found by Sornette et al. is novel and may challenge common wisdom. Here, we compare these two theories, weigh their strengths and weaknesses, and discuss how they have been tested with empirical data. We find that they may not contradict each other as much as was claimed by Scholtes et al.
CR | Dec 10, 2021
TechRank: A Network-Centrality Approach for Informed Cybersecurity-Investment
Anita Mezzetti, Dimitri Percia David, Thomas Maillart et al.
The cybersecurity technological landscape is a complex ecosystem in which entities -- such as companies and technologies -- influence each other in a non-trivial manner. Measuring the influence between entities is essential for informed technological investments in critical infrastructure. To study the mutual influence of companies and technologies from the cybersecurity field, we consider a bipartite graph that links both sets of entities. Each node in this graph is weighted by applying a recursive algorithm based on the method of reflection. This endeavor helps to measure the impact of an entity on the cybersecurity market. Our results help researchers measure more precisely the magnitude of influence of each entity, and allow decision-makers to devise more informed investment strategies, according to their portfolio preferences. Finally, a research agenda is suggested, with the aim of allowing tailor-made investments by arbitrarily calibrating specific features of both types of entities.
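The recursive weighting idea can be sketched with the textbook method of reflections on a bipartite adjacency matrix: each node starts with its degree, and at every step a node's score becomes the average of the previous scores of its neighbours on the other side. This is the generic scheme only, not the TechRank algorithm itself, whose exact update rule and parameters are not given in the abstract; the matrix and the function name `method_of_reflections` are hypothetical.

```python
def method_of_reflections(M, n_iter=2):
    """Iterate neighbour-averaged scores on a bipartite 0/1 matrix M.

    Rows are companies, columns are technologies. Assumes every row and
    column has at least one link, so no degree is zero.
    """
    n_c, n_t = len(M), len(M[0])
    k_c0 = [sum(row) for row in M]                                 # company degrees
    k_t0 = [sum(M[c][t] for c in range(n_c)) for t in range(n_t)]  # technology degrees
    k_c, k_t = k_c0[:], k_t0[:]
    for _ in range(n_iter):
        # Both updates read the previous iteration's scores of the other side.
        new_c = [sum(M[c][t] * k_t[t] for t in range(n_t)) / k_c0[c] for c in range(n_c)]
        new_t = [sum(M[c][t] * k_c[c] for c in range(n_c)) / k_t0[t] for t in range(n_t)]
        k_c, k_t = new_c, new_t
    return k_c, k_t

# Hypothetical company-by-technology links (illustrative only).
M = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
]
companies_1, technologies_1 = method_of_reflections(M, n_iter=1)
print(companies_1, technologies_1)
```

After one reflection, a company's score is the average ubiquity of the technologies it works on, so the company linked only to the most widespread technology gets the highest value at this step.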
CR | Aug 11, 2016
Given Enough Eyeballs, All Bugs Are Shallow? Revisiting Eric Raymond with Bug Bounty Programs
Thomas Maillart, Mingyi Zhao, Jens Grossklags et al.
Bug bounty programs offer a modern platform for organizations to crowdsource their software security and for security researchers to be fairly rewarded for the vulnerabilities they find. Little is known, however, about the incentives set by bug bounty programs: how they drive new bug discoveries, and how they supposedly improve security through the progressive exhaustion of discoverable vulnerabilities. Here, we recognize that bug bounty programs create tensions, for organizations running them on the one hand, and for security researchers on the other hand. At the level of one bug bounty program, security researchers face a sort of St. Petersburg paradox: the probability of finding additional bugs decays fast, and thus can hardly be matched with a sufficient increase of monetary rewards. Furthermore, bug bounty program managers have an incentive to gather the largest possible crowd to ensure a larger pool of expertise, which in turn increases competition among security researchers. As a result, we find that researchers have high incentives to switch to newly launched programs, for which a reserve of low-hanging fruit vulnerabilities is still available. Our results shed light on the technical and economic mechanisms underlying the dynamics of bug bounty program contributions, and may in turn help improve the mechanism design of bug bounty programs, which are increasingly adopted by cybersecurity-savvy organizations.