GN · Dec 9, 2022 · Code
TargetCall: Eliminating the Wasted Computation in Basecalling via Pre-Basecalling Filtering
Meryem Banu Cavlak, Gagandeep Singh, Mohammed Alser et al.
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, i.e., reads. State-of-the-art basecallers employ complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, the majority of reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads; and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31x while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality compared to prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
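The two-stage idea in the abstract above (a cheap, noisy basecall followed by a similarity check against the target reference) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_light_call` is a stand-in for the LightCall neural network, the k-mer overlap test is a simplified substitute for TargetCall's actual Similarity Check, and the function names, `k`, and `min_frac` are hypothetical.

```python
from typing import Callable, List, Sequence, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity_check(noisy_read: str, ref_kmers: Set[str],
                     k: int = 5, min_frac: float = 0.5) -> bool:
    """Label a noisy read on-target if enough of its k-mers occur in the reference.

    Hypothetical stand-in: the real Similarity Check matches reads to the
    target reference; this toy only compares k-mer sets.
    """
    read_kmers = kmers(noisy_read, k)
    if not read_kmers:
        return False
    shared = len(read_kmers & ref_kmers)
    return shared / len(read_kmers) >= min_frac


def target_filter(signals: Sequence[Sequence[int]],
                  light_call: Callable[[Sequence[int]], str],
                  reference: str, k: int = 5,
                  min_frac: float = 0.5) -> List[Sequence[int]]:
    """Keep only the raw signals whose cheap, noisy basecall looks on-target."""
    ref_kmers = kmers(reference, k)
    return [s for s in signals
            if similarity_check(light_call(s), ref_kmers, k, min_frac)]


def toy_light_call(signal: Sequence[int]) -> str:
    """Toy 'basecaller': maps integer signal levels 0..3 to A, C, G, T."""
    return "".join("ACGT"[level] for level in signal)


def encode(seq: str) -> List[int]:
    """Inverse of toy_light_call, used only to build example signals."""
    return ["ACGT".index(base) for base in seq]


reference = "ACGTTGCAACGGTCA"
on_target = encode("GTTGCAACGG")   # substring of the reference
off_target = encode("TTTTTTTTTT")  # shares no 5-mer with the reference
kept = target_filter([on_target, off_target], toy_light_call, reference)
```

Only the signals in `kept` would then be passed to the expensive, high-accuracy basecaller, which is where the runtime saving comes from.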
GN · Sep 11, 2024 · Code
AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes
Jeremie S. Kim, Can Firtina, Meryem Banu Cavlak et al.
AirLift is the first read remapping tool that enables users to quickly and comprehensively remap a read set that had previously been mapped to one reference genome to another, similar reference. Users can then quickly run a downstream analysis of read sets for each latest reference release. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces the overall execution time to remap read sets between two reference genome versions by up to 27.4x. We validate our remapping results with GATK and find that AirLift provides high accuracy in identifying ground truth SNP/INDEL variants. The AirLift source code and a readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
QM · Jun 14, 2022 · Code
COVIDHunter: COVID-19 pandemic wave prediction and mitigation via seasonality-aware modeling
Mohammed Alser, Jeremie S. Kim, Nour Almadhoun Alserr et al.
Early detection and isolation of COVID-19 patients are essential for successful implementation of mitigation strategies and eventually curbing the disease spread. With a limited number of daily COVID-19 tests performed in every country, simulating the COVID-19 spread along with the potential effect of each mitigation strategy currently remains one of the most effective ways to manage the healthcare system and guide policy-makers. We introduce COVIDHunter, a flexible and accurate COVID-19 outbreak simulation model that evaluates the current mitigation measures that are applied to a region, predicts COVID-19 statistics (the daily number of cases, hospitalizations, and deaths), and provides suggestions on what strength the upcoming mitigation measure should be. The key idea of COVIDHunter is to quantify the spread of COVID-19 in a geographical region by simulating the average number of new infections caused by an infected person considering the effect of external factors, such as environmental conditions (e.g., climate, temperature, humidity), different variants of concern, vaccination rate, and mitigation measures. Using Switzerland as a case study, COVIDHunter estimates that we are experiencing a deadly new wave that will peak on 26 January 2022, which is very similar in numbers to the wave we had in February 2020. The policy-makers have only one choice: to increase the strength of the currently applied mitigation measures for 30 days. Unlike existing models, the COVIDHunter model accurately monitors and predicts the daily number of cases, hospitalizations, and deaths due to COVID-19. Our model is flexible to configure and simple to modify for modeling different scenarios under different environmental conditions and mitigation measures. We release the source code of the COVIDHunter implementation at https://github.com/CMU-SAFARI/COVIDHunter.
GN · Jan 23 · Code
FASTR: Reimagining FASTQ via Compact Image-inspired Representation
Adrian Tkachenko, Sepehr Salem, Ayotomiwa Ezekiel Adeniyi et al.
Motivation: High-throughput sequencing (HTS) enables population-scale genomics but generates massive datasets, creating bottlenecks in storage, transfer, and analysis. FASTQ, the standard format for over two decades, stores one byte per base and one byte per quality score, leading to inefficient I/O, high storage costs, and redundancy. Existing compression tools can mitigate some issues, but often introduce costly decompression or complex dependency issues. Results: We introduce FASTR, a lossless, computation-native successor to FASTQ that encodes each nucleotide together with its base quality score into a single 8-bit value. FASTR reduces file size by at least 2x while remaining fully reversible and directly usable for downstream analyses. Applying general-purpose compression tools on FASTR consistently yields higher compression ratios, 2.47x, 3.64x, and 4.8x faster compression, and 2.34x, 1.96x, and 1.75x faster decompression than on FASTQ across Illumina, HiFi, and ONT reads. FASTR is machine-learning-ready, allowing reads to be consumed directly as numerical vectors or image-like representations. We provide a highly parallel software ecosystem for FASTQ-FASTR conversion and show that FASTR integrates with existing tools, such as minimap2, with minimal interface changes and no performance overhead. By eliminating decompression costs and reducing data movement, FASTR lays the foundation for scalable genomics analyses and real-time sequencing workflows. Availability and Implementation: https://github.com/ALSER-Lab/FASTR
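To make the "one 8-bit value per base and quality score" idea concrete, here is a minimal sketch of one possible packing. The bit layout is an assumption, not FASTR's documented format: it uses the top 2 bits for the base (A, C, G, T only) and the low 6 bits for a Phred score in the range 0 to 63, decoded from Phred+33 ASCII. Ambiguous bases such as N are not handled, and the function names `pack_read` and `unpack_read` are hypothetical.

```python
from typing import Tuple

BASES = "ACGT"  # assumed 2-bit alphabet; N and other IUPAC codes are not covered


def pack_read(seq: str, qual: str) -> bytes:
    """Pack each (base, Phred+33 quality) pair into one byte.

    Assumed layout: bits 7-6 hold the base index, bits 5-0 hold the score.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    out = bytearray()
    for base, q in zip(seq, qual):
        score = ord(q) - 33
        if not 0 <= score < 64:
            raise ValueError(f"quality score {score} does not fit in 6 bits")
        out.append((BASES.index(base) << 6) | score)
    return bytes(out)


def unpack_read(data: bytes) -> Tuple[str, str]:
    """Invert pack_read, recovering the sequence and Phred+33 quality string."""
    seq = "".join(BASES[value >> 6] for value in data)
    qual = "".join(chr((value & 0x3F) + 33) for value in data)
    return seq, qual


packed = pack_read("ACGT", "!I5+")
restored = unpack_read(packed)
```

In this toy layout, four bases and four quality characters (8 bytes in FASTQ, ignoring headers and separators) become 4 bytes, and the round trip is lossless for the supported alphabet and score range.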
AR · Jul 20, 2022
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi et al.
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures whose states and edges are assigned probabilities. These probabilities are subsequently used to compute the similarity score between a sequence and a pHMM graph. The Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these probabilities to optimize and compute similarity scores. However, the Baum-Welch algorithm is computationally intensive, and existing solutions offer either software-only or hardware-only approaches with fixed pHMM designs. We identify an urgent need for a flexible, high-performance, and energy-efficient HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm for pHMMs. We introduce ApHMM, the first flexible acceleration framework designed to significantly reduce both computational and energy overheads associated with the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in the Baum-Welch algorithm by 1) designing flexible hardware to accommodate various pHMM designs, 2) exploiting predictable data dependency patterns through on-chip memory with memoization techniques, 3) rapidly filtering out negligible computations using a hardware-based filter, and 4) minimizing redundant computations. ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and 27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations in three key bioinformatics applications: 1) error correction, 2) protein family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x - 1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency by 64.24x - 115.46x, 1.75x, and 1.96x.
CV · Feb 24 · Code
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman et al.
Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints. Conventional models often act as opaque "black boxes" and fail to quantify the complex, irregular tumor boundaries that characterize malignant growth. To address these challenges, we present XMorph, an explainable and computationally efficient framework for fine-grained classification of three prominent brain tumor types: glioma, meningioma, and pituitary tumors. We propose an Information-Weighted Boundary Normalization (IWBN) mechanism that emphasizes diagnostically relevant boundary regions alongside nonlinear chaotic and clinically validated features, enabling a richer morphological representation of tumor growth. A dual-channel explainable AI module combines GradCAM++ visual cues with LLM-generated textual rationales, translating model reasoning into clinically interpretable insights. The proposed framework achieves a classification accuracy of 96.0%, demonstrating that explainability and high performance can co-exist in AI-based medical imaging systems. The source code and materials for XMorph are all publicly available at: https://github.com/ALSER-Lab/XMorph.
CV · May 8
NeuroGAN-3D: Enhancing Intrinsic Functional Brain Networks via High-Fidelity 3D Generative Super-Resolution
M. Moein Esfahani, Sepehr Salem Ghahfarokhi, Mohammed Alser et al.
Recent advances in neuroimaging have deepened our understanding of the brain's complex functional and structural organization. Among these, functional Magnetic Resonance Imaging (fMRI) - particularly resting-state fMRI (rs-fMRI) - has emerged as a tool for identifying biomarkers of intrinsic brain connectivity and delineating large-scale neural networks. These networks are typically represented as volumetric spatial maps that capture functionally coherent brain regions and reflect individual differences in brain activity and structure. The spatial resolution of these maps plays an important role, as it determines the ability to localize functional units with precision, perform reliable brain parcellation, and detect subtle, spatially specific neurobiological alterations associated with development, aging, or disease. Therefore, improving the effective resolution of neuroimaging-derived maps holds significant promise for enabling more detailed insights into brain architecture and its relationship to behavior and pathology. To address this need, we propose NeuroGAN-3D, a novel 3D generative super-resolution model tailored to the computational demands of volumetric neuroimaging. Our model leverages a generative adversarial network architecture to enhance the spatial resolution of rs-fMRI spatial maps, significantly outperforming a conventional baseline.
DC · Jun 9, 2021
IChannels: Exploiting Current Management Mechanisms to Create Covert Channels in Modern Processors
Jawad Haj-Yahya, Jeremie S. Kim, A. Giray Yaglikci et al.
To operate efficiently across a wide range of workloads with varying power requirements, a modern processor applies different current management mechanisms, which briefly throttle instruction execution while they adjust voltage and frequency to accommodate power-hungry instructions (PHIs) in the instruction stream. Doing so 1) reduces the power consumption of non-PHI instructions in typical workloads and 2) optimizes system voltage regulators' cost and area for the common use case while limiting current consumption when executing PHIs. However, these mechanisms may compromise a system's confidentiality guarantees. In particular, we observe that multilevel side-effects of throttling mechanisms, due to PHI-related current management mechanisms, can be detected by two different software contexts (i.e., sender and receiver) running on 1) the same hardware thread, 2) co-located Simultaneous Multi-Threading (SMT) threads, and 3) different physical cores. Based on these new observations on current management mechanisms, we develop a new set of covert channels, IChannels, and demonstrate them in real modern Intel processors (which span more than 70% of the entire client and server processor market). Our analysis shows that IChannels provides more than 24x the channel capacity of state-of-the-art power management covert channels. We propose practical and effective mitigations to each covert channel in IChannels by leveraging the insights we gain through a rigorous characterization of real systems.
PE · Feb 6, 2021
COVIDHunter: An Accurate, Flexible, and Environment-Aware Open-Source COVID-19 Outbreak Simulation Model
Mohammed Alser, Jeremie S. Kim, Nour Almadhoun Alserr et al.
Background: Early detection and isolation of COVID-19 patients are essential for successful implementation of mitigation strategies and eventually curbing the disease spread. With a limited number of daily COVID-19 tests performed in every country, simulating the COVID-19 spread along with the potential effect of each mitigation strategy currently remains one of the most effective ways to manage the healthcare system and guide policy-makers. Methods: We introduce COVIDHunter, a flexible and accurate COVID-19 outbreak simulation model that evaluates the current mitigation measures that are applied to a region and provides suggestions on what strength the upcoming mitigation measure should be. The key idea of COVIDHunter is to quantify the spread of COVID-19 in a geographical region by simulating the average number of new infections caused by an infected person considering the effect of external factors, such as environmental conditions (e.g., climate, temperature, humidity) and mitigation measures. Results: Using Switzerland as a case study, COVIDHunter estimates that if the policy-makers relax the mitigation measures by 50% for 30 days, then both the daily capacity need for hospital beds and the daily number of deaths increase exponentially by an average of 5.1x, and the hospitalized patients may occupy ICU beds and ventilators for a period of time. Unlike existing models, the COVIDHunter model accurately monitors and predicts the daily number of cases, hospitalizations, and deaths due to COVID-19. Our model is flexible to configure and simple to modify for modeling different scenarios under different environmental conditions and mitigation measures. Availability: We release the source code of the COVIDHunter implementation at https://github.com/CMU-SAFARI/COVIDHunter and show how to flexibly configure our model for any scenario and easily extend it for different measures and conditions than we account for.
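The key idea stated in the Methods above (scaling the average number of new infections per infected person by mitigation strength and environmental conditions) can be sketched as a toy generation-by-generation simulation. This is not the COVIDHunter implementation: the multiplicative form `r0 * (1 - mitigation) * env_factor`, the one-generation-per-step simplification, and all names and parameters are assumptions chosen for illustration.

```python
from typing import List


def simulate_cases(initial_cases: float, r0: float, mitigation: float,
                   env_factor: float, steps: int) -> List[float]:
    """Toy outbreak model: each step multiplies cases by an effective R.

    Assumptions (hypothetical, for illustration only):
      - mitigation is a strength in [0, 1], where 1 blocks all transmission
      - env_factor scales transmission for environmental conditions
      - every infected person infects others exactly one step later
    """
    if not 0.0 <= mitigation <= 1.0:
        raise ValueError("mitigation must be between 0 and 1")
    r_effective = r0 * (1.0 - mitigation) * env_factor
    cases = [float(initial_cases)]
    for _ in range(steps):
        cases.append(cases[-1] * r_effective)
    return cases


# With R0 = 2, a mitigation strength of 0.5 holds the case count flat,
# while relaxing it to 0.25 lets cases grow by 1.5x per step.
flat = simulate_cases(10, r0=2.0, mitigation=0.5, env_factor=1.0, steps=3)
growing = simulate_cases(10, r0=2.0, mitigation=0.25, env_factor=1.0, steps=2)
```

Even this simplified form shows why relaxing mitigation produces exponential rather than linear growth: the effective reproduction number compounds at every step.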
GN · Feb 12, 2019
Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
Can Firtina, Jeremie S. Kim, Mohammed Alser et al.
Long reads produced by third-generation sequencing technologies are used to construct an assembly (i.e., the subject's genome), which is further used in downstream genome analysis. Unfortunately, long reads have high sequencing error rates and a large proportion of bps in these long reads are incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e., read-to-assembly alignment information). However, assembly polishing algorithms can only polish an assembly using reads either from a certain sequencing technology or from a small assembly. Such technology-dependency and assembly-size dependency require researchers to 1) run multiple polishing algorithms and 2) use small chunks of a large genome to use all available read sets and polish large genomes. We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e., both large and small genomes) using reads from all sequencing technologies (i.e., second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real read sets demonstrate that Apollo is the only algorithm that 1) uses reads from any sequencing technology within a single run and 2) scales well to polish large assemblies without splitting the assembly into multiple parts.
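The final decoding step named in the abstract above, Viterbi decoding of a trained model, can be illustrated on a generic two-state hidden Markov model. This is a textbook sketch, not Apollo's pHMM or its code: the states, observation symbols, and probabilities below are made-up toy values, and the sketch assumes all probabilities are nonzero so that logarithms are defined.

```python
import math
from typing import Dict, List, Sequence


def viterbi(observations: Sequence[str], states: Sequence[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most likely state path for the observations (log space)."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the predecessor that maximizes the path score into s.
            prev = max(states,
                       key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][observations[t]]))
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Toy model (hypothetical values): is each assembly position supported by reads?
states = ["match", "error"]
start_p = {"match": 0.6, "error": 0.4}
trans_p = {"match": {"match": 0.7, "error": 0.3},
           "error": {"match": 0.4, "error": 0.6}}
emit_p = {"match": {"agree": 0.5, "weak": 0.4, "disagree": 0.1},
          "error": {"agree": 0.1, "weak": 0.3, "disagree": 0.6}}
decoded = viterbi(["agree", "weak", "disagree"], states,
                  start_p, trans_p, emit_p)
```

In a polishing setting, the decoded path through the trained model is what determines the corrected sequence; here the path simply labels each toy observation.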