Protein Function Prediction methods exploiting the CATH Protein Domain Classification
Proteins are biological building blocks that carry out crucial functions in an organism. Knowing the function of a protein is essential to understanding the role it plays in an organism and in disease.
As technology advances, the number of sequences available for analysis has been increasing rapidly. While experimental characterisation remains the gold standard for assigning function to proteins, the process is notoriously slow, leading to a large gap between the numbers of annotated and unannotated proteins.
This work describes the development of three computational methods for predicting protein function, which help to bridge the gap between annotated and unannotated proteins. The methods leverage CATH Functional Families (FunFams), which comprise evolutionarily related domains grouped into functionally coherent sets, and score the mapping between each FunFam and the GO Terms of the proteins that map to it.
dcGO4CATH uses a statistical approach to assess the size of the overlap between the FunFams and the proteins' GO Terms. The underlying method was originally developed for SCOP and was adapted here to work with CATH FunFams. SetCATH uses a set-based approach and applies the Jaccard, Sørensen-Dice and Overlap similarity indexes to score the mapping. Finally, FunPredCATH is an ensemble, based on a set-theoretic approach, that combines dcGO4CATH, SetCATH and FunFamer (a predictor developed by the Orengo Group). FunPredCATH capitalises on the strengths of the individual predictors to improve predictions.
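The three similarity indexes named above have standard set-theoretic definitions, which the following minimal Python sketch illustrates. It is not the SetCATH implementation: the choice of which sets to compare (here, the proteins in a FunFam against the proteins annotated with a GO Term), the function names and the protein identifiers are all assumptions made for illustration.

```python
# Minimal sketch of the three set similarity indexes (standard definitions).
# The sets being compared and all identifiers below are hypothetical.

def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen_dice(a: set, b: set) -> float:
    """Sørensen-Dice index: 2|A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def overlap(a: set, b: set) -> float:
    """Overlap index: |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


# Hypothetical example: proteins mapped to one FunFam and
# proteins annotated with one GO Term.
funfam_proteins = {"P1", "P2", "P3", "P4"}
go_term_proteins = {"P3", "P4", "P5"}

scores = {
    "jaccard": jaccard(funfam_proteins, go_term_proteins),
    "sorensen_dice": sorensen_dice(funfam_proteins, go_term_proteins),
    "overlap": overlap(funfam_proteins, go_term_proteins),
}
```

With two shared proteins out of five distinct ones, the Jaccard index is 2/5, the Sørensen-Dice index is 4/7 and the Overlap index is 2/3. The Overlap index is the most lenient of the three when one set is much smaller than the other, because it normalises by the smaller set only.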
The methods were tested using the CAFA3 benchmark. The results show that SetCATH and FunPredCATH perform well, placing in the top tier. Moreover, the methods were applied to proteins involved in the immune response to SARS-CoV-2, showing that they have practical applications in biology.
https://discovery.ucl.ac.uk/id/eprint/10179491/10/Bonello_10179491_Thesis.pdf