Bridging the Gaps in Statistical Models of Protein Alignment
This work addresses the need for more accurate statistical models of protein alignment in bioinformatics. It is incremental in that it builds on existing substitution matrices and benchmarks.
The authors tackled the problem of evaluating and improving statistical models for protein alignment by constructing a complete model from time-parameterised substitution matrices and alignment machines, and by inferring its parameters from benchmark datasets. They compared 15 matrices using Minimum Message Length inference and identified a new overall best-performing matrix, MMLSUM, judged by the Shannon information content each matrix assigns to six benchmarks.
This work demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed from a time-parameterised substitution matrix and a time-parameterised 3-state alignment machine. All parameters of such a model can be inferred from any benchmark dataset of aligned protein sequences. This allows us to examine nine well-known substitution matrices on six benchmarks curated using various structural alignment methods; any matrix that does not explicitly model a "time"-dependent Markov process is converted to a corresponding base matrix that does. In addition, a new optimal matrix is inferred for each of the six benchmarks. Using Minimum Message Length (MML) inference, all 15 matrices are compared in terms of the Shannon information content they assign to each benchmark. This has resulted in a new and clear overall best-performing time-dependent Markov matrix, MMLSUM, and its associated 3-state machine, whose properties we have analysed in this work. For standard use, the MMLSUM series of (log-odds) \textit{scoring} matrices derived from the above Markov matrix is available at https://lcb.infotech.monash.edu.au/mmlsum.
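To make the relationship between a time-dependent Markov matrix, its log-odds scoring matrices, and Shannon information content concrete, here is a minimal Python sketch. It is not the authors' implementation and does not reproduce MML inference or the 3-state alignment machine. It assumes that "time" is a positive integer exponent applied to a row-stochastic base matrix, and it uses a made-up 3-letter alphabet (`TOY_BASE`) in place of a real 20x20 amino-acid matrix. All function names are hypothetical.

```python
import numpy as np

# Hypothetical toy base matrix over a 3-letter alphabet (rows sum to 1).
# A real substitution matrix would be 20x20 over the amino acids.
TOY_BASE = np.array([
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.02, 0.97],
])


def matrix_at_time(base, t):
    """Conditional probabilities P_t(b | a), assumed here to be base**t."""
    return np.linalg.matrix_power(base, t)


def stationary_distribution(base):
    """Left eigenvector of base for eigenvalue 1, normalised to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(base.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()


def log_odds_bits(base, t):
    """Log-odds scores log2(P_t(b | a) / pi(b)), in bits."""
    pi = stationary_distribution(base)
    return np.log2(matrix_at_time(base, t) / pi[np.newaxis, :])


def pair_information_bits(base, t, pairs):
    """Shannon information -log2(pi(a) * P_t(b | a)) summed over aligned pairs.

    `pairs` is an iterable of (a, b) integer symbol indices. Gaps are not
    modelled in this sketch.
    """
    pi = stationary_distribution(base)
    pt = matrix_at_time(base, t)
    return float(sum(-np.log2(pi[a] * pt[a, b]) for a, b in pairs))


if __name__ == "__main__":
    print(np.round(log_odds_bits(TOY_BASE, 10), 3))
    print(pair_information_bits(TOY_BASE, 10, [(0, 0), (1, 2), (2, 2)]))
```

Under these assumptions, a matrix that assigns fewer total bits to the aligned pairs of a benchmark explains that benchmark better, which is the sense in which matrices can be ranked by information content. As `t` grows, the rows of the powered matrix approach the stationary distribution and the log-odds scores tend to zero.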