LGMar 13, 2022
Automated fault tree learning from continuous-valued sensor data: a case study on domestic heatersBart Verkuil, Carlos E. Budde, Doina Bucur
Many industrial sectors have been collecting big sensor data. With recent technologies for processing big data, companies can exploit this for automatic failure detection and prevention. We propose the first completely automated method for failure analysis, machine-learning fault trees from raw observational data with continuous variables. Our method scales well and is tested on a real-world, five-year dataset of domestic heater operations in The Netherlands, with 31 million unique heater-day readings, each containing 27 sensor and 11 failure variables. Our method builds on two previous procedures: the C4.5 decision-tree learning algorithm, and the LIFT fault tree learning algorithm from Boolean data. C4.5 pre-processes each continuous variable: it learns an optimal numerical threshold which distinguishes between faulty and normal operation of the top-level system. These thresholds discretise the variables, thus allowing LIFT to learn fault trees which model the root failure mechanisms of the system and are explainable. We obtain fault trees for the 11 failure variables, and evaluate them in two ways: quantitatively, with a significance score, and qualitatively, with domain specialists. Some of the fault trees learnt have almost maximum significance (above 0.95), while others have medium-to-low significance (around 0.30), reflecting the difficulty of learning from big, noisy, real-world sensor data. The domain specialists confirm that the fault trees model meaningful relationships among the variables.
SIApr 13, 2022
Large-scale multi-objective influence maximisation with network downscalingElia Cunegatti, Giovanni Iacca, Doina Bucur
Finding the most influential nodes in a network is a computationally hard problem with several possible applications in various kinds of network-based problems. While several methods have been proposed for tackling the influence maximisation (IM) problem, their runtime typically scales poorly when the network size increases. Here, we propose an original method, based on network downscaling, that allows a multi-objective evolutionary algorithm (MOEA) to solve the IM problem on a reduced scale network, while preserving the relevant properties of the original network. The downscaled solution is then upscaled to the original network, using a mechanism based on centrality metrics such as PageRank. Our results on eight large networks (including two with $\sim$50k nodes) demonstrate the effectiveness of the proposed method with a more than 10-fold runtime gain compared to the time needed on the original network, and an up to $82\%$ time reduction compared to CELF.
LGJul 4, 2024
gFlora: a topology-aware method to discover functional co-response groups in soil microbial communitiesNan Chen, Merlijn Schram, Doina Bucur
We aim to learn the functional co-response group: a group of taxa whose co-response effect (the representative characteristic of the group showing the total topological abundance of taxa) co-responds (associates well statistically) to a functional variable. Different from the state-of-the-art method, we model the soil microbial community as an ecological co-occurrence network with the taxa as nodes (weighted by their abundance) and their relationships (a combination from both spatial and functional ecological aspects) as edges (weighted by the strength of the relationships). Then, we design a method called gFlora which notably uses graph convolution over this co-occurrence network to get the co-response effect of the group, such that the network topology is also considered in the discovery process. We evaluate gFlora on two real-world soil microbiome datasets (bacteria and nematodes) and compare it with the state-of-the-art method. gFlora outperforms this on all evaluation metrics, and discovers new functional evidence for taxa which were so far under-studied. We show that the graph convolution step is crucial to taxa with relatively low abundance (thus removing the bias towards taxa with higher abundance), and the discovered bacteria of different genera are distributed in the co-occurrence network but still tightly connected among themselves, demonstrating that topologically they fill different but collaborative functional roles in the ecological community.
LGMay 26, 2023Code
Understanding Sparse Neural Networks from their Topology via Multipartite Graph RepresentationsElia Cunegatti, Matteo Farina, Doina Bucur et al.
Pruning-at-Initialization (PaI) algorithms provide Sparse Neural Networks (SNNs) which are computationally more efficient than their dense counterparts, and try to avoid performance degradation. While much emphasis has been directed towards \emph{how} to prune, we still do not know \emph{what topological metrics} of the SNNs characterize \emph{good performance}. From prior work, we have layer-wise topological metrics by which SNN performance can be predicted: the Ramanujan-based metrics. To exploit these metrics, proper ways to represent network layers via Graph Encodings (GEs) are needed, with Bipartite Graph Encodings (BGEs) being the \emph{de-facto} standard at the current stage. Nevertheless, existing BGEs neglect the impact of the inputs, and do not characterize the SNN in an end-to-end manner. Additionally, thanks to a thorough study of the Ramanujan-based metrics, we discover that they are only as good as the \emph{layer-wise density} as performance predictors, when paired with BGEs. To close both gaps, we design a comprehensive topological analysis for SNNs with both linear and convolutional layers, via (i) a new input-aware Multipartite Graph Encoding (MGE) for SNNs and (ii) the design of new end-to-end topological metrics over the MGE. With these novelties, we show the following: (a) The proposed MGE allows to extract topological metrics that are much better predictors of the accuracy drop than metrics computed from current input-agnostic BGEs; (b) Which metrics are important at different sparsity levels and for different architectures; (c) A mixture of our topological metrics can rank PaI algorithms more effectively than Ramanujan-based metrics. The codebase is publicly available at https://github.com/eliacunegatti/mge-snn.
CVFeb 16, 2022
Bias in Automated Image Colorization: Metrics and Error TypesFrank Stapel, Floris Weers, Doina Bucur
We measure the color shifts present in colorized images from the ADE20K dataset, when colorized by the automatic GAN-based DeOldify model. We introduce fine-grained local and regional bias measurements between the original and the colorized images, and observe many colorization effects. We confirm a general desaturation effect, and also provide novel observations: a shift towards the training average, a pervasive blue shift, different color shifts among image categories, and a manual categorization of colorization errors in three classes.
LGNov 9, 2021
Look back, look around: a systematic analysis of effective predictors for new outlinks in focused Web crawlingThi Kim Nhung Dang, Doina Bucur, Berk Atil et al.
Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly, dynamic network) features, we identify best predictors for new outlinks. Our main conclusion is that most informative features are the recent history of new outlinks on a page itself, and of its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model, that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models including those that use a most complete set of features. One of the learners we use, is the recent NGBoost method that assumes a Poisson distribution for the number of new outlinks on a page, and learns its parameters. This connects the two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling. All experiments were carried out on an original dataset, made available by a commercial focused crawler.
SIOct 19, 2021
The network signature of constellation line figuresDoina Bucur
In traditional astronomies across the world, groups of stars in the night sky were linked into constellations -- symbolic representations rich in meaning and with practical roles. In some sky cultures, constellations are represented as line (or connect-the-dot) figures, which are spatial networks drawn over the fixed background of stars. We analyse 1802 line figures from 56 sky cultures spanning all continents, in terms of their network, spatial, and brightness features, and ask what associations exist between these visual features and culture type or sky region. First, an embedded map of constellations is learnt, to show clusters of line figures. We then form the network of constellations (as linked by their similarity), to study how similar cultures are by computing their assortativity (or homophily) over the network. Finally, we measure the diversity (or entropy) index for the set of constellations drawn per sky region. Our results show distinct types of line figures, and that many folk astronomies with oral traditions have widespread similarities in constellation design, which do not align with cultural ancestry. In a minority of sky regions, certain line designs appear universal, but this is not the norm: in the majority of sky regions, the line geometries are diverse.
CVJun 1, 2021
Independent Prototype Propagation for Zero-Shot CompositionalityFrank Ruis, Gertjan Burghouts, Doina Bucur
Humans are good at compositional zero-shot reasoning; someone who has never seen a zebra before could nevertheless recognize one when we tell them it looks like a horse with black and white stripes. Machine learning systems, on the other hand, usually leverage spurious correlations in the training data, and while such correlations can help recognize objects in context, they hurt generalization. To be able to deal with underspecified datasets while still leveraging contextual clues during classification, we propose ProtoProp, a novel prototype propagation graph method. First we learn prototypical representations of objects (e.g., zebra) that are conditionally independent w.r.t. their attribute labels (e.g., stripes) and vice versa. Next we propagate the independent prototypes through a compositional graph, to learn compositional prototypes of novel attribute-object combinations that reflect the dependencies of the target distribution. The method does not rely on any external data, such as class hierarchy graphs or pretrained word embeddings. We evaluate our approach on AO-Clever, a synthetic and strongly visual dataset with clean labels, and UT-Zappos, a noisy real-world dataset of fine-grained shoe types. We show that in the generalized compositional zero-shot setting we outperform state-of-the-art results, and through ablations we show the importance of each part of the method and their contribution to the final results.
SIJun 13, 2020
Top influencers can be identified universally by combining classical centralitiesDoina Bucur
Information flow, opinion, and epidemics spread over structured networks. When using individual node centrality indicators to predict which nodes will be among the top influencers or spreaders in a large network, no single centrality has consistently good ranking power. We show that statistical classifiers using two or more centralities as input are instead consistently predictive over many diverse, static real-world topologies. Certain pairs of centralities cooperate particularly well in statistically drawing the boundary between the top spreaders and the rest: local centralities measuring the size of a node's neighbourhood benefit from the addition of a global centrality such as the eigenvector centrality, closeness, or the core number. This is, intuitively, because a local centrality may rank highly some nodes which are located in dense, but peripheral regions of the network---a situation in which an additional global centrality indicator can help by prioritising nodes located more centrally. The nodes selected as superspreaders will usually jointly maximise the values of both centralities. As a result of the interplay between centrality indicators, training classifiers with seven classical indicators leads to a nearly maximum average precision function (0.995) across the networks in this study.