Andres R. Masegosa

LG
h-index: 52
6 papers
63 citations
Novelty: 44%
AI Score: 38

6 Papers

AI · Apr 27, 2016 · Code
Probabilistic Graphical Models on Multi-Core CPUs using Java 8

Andres R. Masegosa, Ana M. Martinez, Hanen Borchani

In this paper, we discuss software design issues related to the development of parallel computational intelligence algorithms on multi-core CPUs, using the new Java 8 functional programming features. In particular, we focus on probabilistic graphical models (PGMs) and present the parallelisation of a collection of algorithms that deal with inference and learning of PGMs from data, namely maximum likelihood estimation, importance sampling, and greedy search for solving combinatorial optimisation problems. Through these concrete examples, we tackle the problem of defining efficient data structures for PGMs and parallel processing of same-size batches of data sets using Java 8 features. We also provide straightforward techniques to code parallel algorithms that seamlessly exploit multi-core processors. The experimental analysis, carried out using our open source AMIDST (Analysis of MassIve Data STreams) Java toolbox, shows the merits of the proposed solutions.
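The batch-parallel pattern mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the AMIDST code and does not use Java 8 streams; the function names, the choice of a categorical distribution, and the thread pool are all assumptions made for illustration. The idea it shows is that sufficient statistics (here, value counts) are additive, so each same-size batch can be processed independently and the partial results merged afterwards.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from itertools import islice


def batches(items, size):
    """Yield consecutive batches of at most `size` items (hypothetical helper)."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def batch_counts(batch):
    """Map step: sufficient statistics (value counts) for one batch."""
    return Counter(batch)


def mle_categorical(samples, batch_size=4, workers=4):
    """Maximum likelihood estimate of a categorical distribution.

    Batches are counted independently (map), then the counts are
    summed (reduce) and normalised into relative frequencies.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(batch_counts, batches(samples, batch_size)))
    total = reduce(lambda a, b: a + b, partial, Counter())
    n = sum(total.values())
    return {value: count / n for value, count in total.items()}
```

In CPython the thread pool mainly illustrates the map-and-reduce structure rather than delivering a multi-core speed-up for pure-Python counting; the point is only that the per-batch work has no shared state.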

LG · Nov 4, 2024
Revisiting K-mer Profile for Effective and Scalable Genome Representation Learning

Abdulkadir Celikkanat, Andres R. Masegosa, Thomas D. Nielsen

Obtaining effective representations of DNA sequences is crucial for genome analysis. Metagenomic binning, for instance, relies on genome representations to cluster complex mixtures of DNA fragments from biological samples with the aim of determining their microbial compositions. In this paper, we revisit k-mer-based representations of genomes and provide a theoretical analysis of their use in representation learning. Based on the analysis, we propose a lightweight and scalable model for performing metagenomic binning at the genome read level, relying only on the k-mer compositions of the DNA fragments. We compare the model to recent genome foundation models and demonstrate that while the models are comparable in performance, the proposed model is significantly more effective in terms of scalability, a crucial aspect for performing metagenomic binning of real-world datasets.
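For readers unfamiliar with k-mer compositions, the following minimal Python sketch computes a normalised k-mer frequency vector for a DNA fragment. It is an illustrative assumption, not the model proposed in the paper: the function name, the default `k`, and the handling of ambiguous bases are all choices made here.

```python
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=2, alphabet="ACGT"):
    """Return normalised k-mer frequencies in a fixed lexicographic order.

    Windows containing characters outside `alphabet` (for example N)
    are ignored. The vector has len(alphabet) ** k entries.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count every length-k window of the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

For example, `kmer_profile("ACGT", k=2)` yields a 16-dimensional vector whose only non-zero entries correspond to `AC`, `CG`, and `GT`. Some binning tools additionally merge each k-mer with its reverse complement; this sketch does not.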

LG · Sep 30, 2025
UncertainGen: Uncertainty-Aware Representations of DNA Sequences for Metagenomic Binning

Abdulkadir Celikkanat, Andres R. Masegosa, Mads Albertsen et al.

Metagenomic binning aims to cluster DNA fragments from mixed microbial samples into their respective genomes, a critical step for downstream analyses of microbial communities. Existing methods rely on deterministic representations, such as k-mer profiles or embeddings from large language models, which fail to capture the uncertainty inherent in DNA sequences arising from inter-species DNA sharing and from fragments with highly similar representations. We present the first probabilistic embedding approach, UncertainGen, for metagenomic binning, representing each DNA fragment as a probability distribution in latent space. Our approach naturally models sequence-level uncertainty, and we provide theoretical guarantees on embedding distinguishability. This probabilistic embedding framework expands the feasible latent space by introducing a data-adaptive metric, which in turn enables more flexible separation of bins/clusters. Experiments on real metagenomic datasets show that UncertainGen improves over deterministic k-mer and LLM-based embeddings on the binning task while remaining scalable and lightweight enough for large-scale metagenomic analysis.

SE · Dec 13, 2021
From Anecdote to Evidence: The Relationship Between Personality and Need for Cognition of Developers

Daniel Russo, Andres R. Masegosa, Klaas-Jan Stol

There is considerable anecdotal evidence suggesting that software engineers enjoy engaging in solving puzzles and other cognitive efforts. A tendency to engage in and enjoy effortful thinking is referred to as a person's 'need for cognition.' In this article, we study the relationship between software engineers' personality traits and their need for cognition. Through a large-scale sample study of 483 respondents, we collected data to capture the six 'bright' personality traits of the HEXACO model of personality, and three 'dark' personality traits. Data were analyzed using several methods, including a multiple Bayesian linear regression analysis. The results indicate that ca. 33% of variation in developers' need for cognition can be explained by personality traits. The Bayesian analysis suggests four traits to be of particular interest in predicting need for cognition: openness to experience, conscientiousness, honesty-humility, and emotionality. Further, we find that the need for cognition of software engineers is, on average, higher than in the general population, based on a comparison with prior studies. Given the importance of human factors for software engineers' performance in general, and problem-solving skills in particular, our findings suggest several implications for recruitment, working behavior, and teaming.

LG · Dec 18, 2019
Learning under Model Misspecification: Applications to Variational and Ensemble methods

Andres R. Masegosa

Virtually no model we use in machine learning to make predictions perfectly represents reality, so most learning happens under model misspecification. In this work, we present a novel analysis of the generalization performance of Bayesian model averaging under model misspecification and i.i.d. data using a new family of second-order PAC-Bayes bounds. This analysis shows, in simple and intuitive terms, that Bayesian model averaging provides suboptimal generalization performance when the model is misspecified. Consequently, we provide strong theoretical arguments showing that Bayesian methods are not optimal for learning predictive models unless the model class is perfectly specified. Using novel second-order PAC-Bayes bounds, we derive a new family of Bayesian-like algorithms, which can be implemented as variational and ensemble methods. The output of these algorithms is a new posterior distribution, different from the Bayesian posterior, which induces a posterior predictive distribution with better generalization performance. Experiments with Bayesian neural networks illustrate these findings.

LG · Oct 2, 2014
Stochastic Discriminative EM

Andres R. Masegosa

Stochastic discriminative EM (sdEM) is an online-EM-type algorithm for discriminative training of probabilistic generative models belonging to the exponential family. In this work, we introduce and justify this algorithm as a stochastic natural gradient descent method, i.e. a method which accounts for the information geometry in the parameter space of the statistical model. We show how this learning algorithm can be used to train probabilistic generative models by minimizing different discriminative loss functions, such as the negative conditional log-likelihood and the hinge loss. The resulting models trained by sdEM are always generative (i.e. they define a joint probability distribution) and, consequently, can deal with missing data and latent variables in a principled way, both when being learned and when making predictions. The performance of this method is illustrated on several text classification problems for which a multinomial naive Bayes classifier and a latent Dirichlet allocation based classifier are learned using different discriminative loss functions.