PhaGO: Protein function annotation for bacteriophages by integrating the genomic context
PhaGO addresses the challenge of annotating phage proteins, which are highly diverse and for which few annotated examples exist. This is critical for understanding phage biology and microbial ecology, though the work is an incremental advance in domain-specific bioinformatics.
The authors tackled the problem of limited protein function annotation for bacteriophages by developing PhaGO, a tool that integrates genomic context with protein embeddings. It improves on state-of-the-art methods by 6.78% for diverged proteins and by 13.05% for proteins with uncommon functions.
Bacteriophages are viruses that target bacteria, playing a crucial role in microbial ecology. Phage proteins are important for understanding phage biology, including viral infection, replication, and evolution. Although a large number of new phages have been identified via metagenomic sequencing, many of them have limited protein function annotation. Accurate function annotation of phage proteins presents several challenges, including the inherent diversity of these proteins and the scarcity of annotated examples. Existing tools have yet to fully leverage the unique properties of phages when annotating protein functions. In this work, we propose PhaGO, a new protein function annotation tool for phages that leverages the modular structure of phage genomes. By employing embeddings from the latest protein foundation models and a Transformer to capture contextual information between proteins in phage genomes, PhaGO surpasses state-of-the-art methods in annotating diverged proteins and proteins with uncommon functions by 6.78% and 13.05%, respectively. PhaGO can annotate proteins lacking homology search results, which is critical for characterizing the rapidly accumulating phage genomes. We demonstrate the utility of PhaGO by identifying 688 potential holins in phages, which exhibit high structural conservation with known holins. The results show the potential of PhaGO to extend our understanding of newly discovered phages.
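The core idea of capturing context between proteins in a genome can be illustrated with a minimal sketch. It is not PhaGO's actual architecture: it assumes one fixed-length embedding per protein (random stand-ins here, where a protein foundation model would supply real ones), a single self-attention head with untrained random weights, no positional encoding, and a sigmoid output per function label. All names, sizes, and weights are hypothetical and chosen only for illustration.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vec):
    # Multiply a (rows x cols) matrix by a vector of length cols
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    # Untrained toy weights; a real model would learn these
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def contextualize(embeddings, w_q, w_k, w_v):
    """Single-head self-attention over the proteins of one genome."""
    queries = [matvec(w_q, e) for e in embeddings]
    keys = [matvec(w_k, e) for e in embeddings]
    values = [matvec(w_v, e) for e in embeddings]
    scale = math.sqrt(len(keys[0]))
    contextual = []
    for q in queries:
        # Attention weights of this protein over every protein in the genome
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted mix of value vectors: the context-aware representation
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        contextual.append(mixed)
    return contextual

def predict_terms(contextual, w_out):
    """Independent sigmoid score per (protein, label), i.e. multi-label output."""
    return [[1.0 / (1.0 + math.exp(-z)) for z in matvec(w_out, c)]
            for c in contextual]

rng = random.Random(0)
EMBED_DIM, N_PROTEINS, N_LABELS = 8, 5, 3  # hypothetical toy sizes

# Stand-in for per-protein embeddings, listed in genome order
genome = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
          for _ in range(N_PROTEINS)]
w_q, w_k, w_v = (random_matrix(EMBED_DIM, EMBED_DIM, rng) for _ in range(3))
w_out = random_matrix(N_LABELS, EMBED_DIM, rng)

scores = predict_terms(contextualize(genome, w_q, w_k, w_v), w_out)
```

Each row of `scores` holds one score per function label for one protein, computed from a representation that mixes in information from the other proteins in the same genome. Because the label scores are independent sigmoids rather than a softmax, a protein can receive several function labels at once.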