GN LG ML | Dec 23, 2022

Neural Networks beyond explainability: Selective inference for sequence motifs

Antoine Villié, Philippe Veber, Yohann de Castro, Laurent Jacob

arXiv:2212.12542v1 | 1.2 | h-index: 20

Originality: Incremental advance

AI Analysis

This work provides a more rigorous statistical framework for analyzing neural networks in regulatory genomics, potentially enhancing genome-wide association studies, though it is incremental as it builds on existing selective inference procedures.

The authors address the problem of testing associations between phenotypes and sequence motifs extracted by neural networks. They introduce SEISM, a selective inference procedure that adapts sampling-based methods to motif selection over an infinite set by quantizing that set to a large but finite grid. They illustrate the method's calibration, power, and speed, and discuss its power/speed trade-off against a simpler data-split strategy.

Over the past decade, neural networks have been successful at making predictions from biological sequences, especially in the context of regulatory genomics. As in other fields of deep learning, tools have been devised to extract features such as sequence motifs that can explain the predictions made by a trained network. Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. In particular, we discuss how training a one-layer convolutional network is formally equivalent to selecting motifs maximizing some association score. We adapt existing sampling-based selective inference procedures by quantizing this selection over an infinite set to a large but finite grid. Finally, we show that sampling under a specific choice of parameters is sufficient to characterize the composite null hypothesis typically used for selective inference, a result that goes well beyond our particular framework. We illustrate the behavior of our method in terms of calibration, power and speed and discuss its power/speed trade-off with a simpler data-split strategy. SEISM paves the way to an easier analysis of neural networks used in regulatory genomics, and to more powerful methods for genome-wide association studies (GWAS).
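
To make the data-split baseline mentioned in the abstract concrete, here is a minimal Python sketch. It is not the authors' implementation and it is not SEISM: it selects a motif on one half of the data by maximizing an association score over a finite grid, then tests that motif on the other half. Several choices are assumptions made for illustration only: the grid is the set of all k-mers (a crude stand-in for a quantized motif space), the score is the absolute Pearson correlation, the test is a permutation test, and all function names are hypothetical.

```python
import itertools
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (len, 4) 0/1 matrix."""
    idx = np.array([ALPHABET.index(c) for c in seq])
    return np.eye(4)[idx]

def activation(motif, seq):
    """Convolve a one-hot k-mer filter over the sequence, then max-pool."""
    f, x = one_hot(motif), one_hot(seq)
    k = len(motif)
    return max(float(np.sum(f * x[i:i + k])) for i in range(len(seq) - k + 1))

def score(z, y):
    """Assumed association score: |Pearson correlation|, 0 if either is constant."""
    if np.std(z) == 0 or np.std(y) == 0:
        return 0.0
    return abs(float(np.corrcoef(z, y)[0, 1]))

def select_motif(seqs, y, k=3):
    """Pick the k-mer on the finite grid with the highest association score."""
    grid = ["".join(p) for p in itertools.product(ALPHABET, repeat=k)]
    scores = [score(np.array([activation(m, s) for s in seqs]), y) for m in grid]
    return grid[int(np.argmax(scores))]

def data_split_pvalue(seqs, y, k=3, n_perm=999, seed=0):
    """Select on the first half, run a permutation test on the second half."""
    rng = np.random.default_rng(seed)
    half = len(seqs) // 2
    motif = select_motif(seqs[:half], y[:half], k)
    z = np.array([activation(motif, s) for s in seqs[half:]])
    y_test = y[half:]
    observed = score(z, y_test)
    null = [score(z, rng.permutation(y_test)) for _ in range(n_perm)]
    # The +1 terms keep the p-value strictly positive and valid.
    p = (1 + sum(s >= observed for s in null)) / (n_perm + 1)
    return motif, p
```

Because the motif is chosen without looking at the test half, the permutation p-value is not biased by the selection step, at the cost of using only part of the data for each step. Selective inference, as described in the abstract, instead accounts for the selection event so that the same data can serve both steps; this is the source of the power/speed trade-off the authors discuss.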
