Improving diagnosis of genetic disease through computational investigation of splicing
Although an estimated 50% of pathogenic genomic mutations are related to splicing, this inherently complex mechanism is not yet fully understood. Identifying splice disruption is a complicated expert task requiring manual labour and expensive sequencing. With the emergence of Machine Learning for targeted medicine, modelling splicing computationally allows faster and less expensive analysis and, ultimately, treatment. This project curates, analyses, optimises, and utilises Machine Learning datasets and algorithms for splicing-related disease using supervised and unsupervised techniques. A clinical dataset of splice-disrupting variants is curated, processed, and validated to assess algorithmic predictive performance on clinically relevant data. Predictions are improved by data engineering to include isoforms with lower expression. Other avenues, such as including protein binding sites, incorporating genomic conservation, and semantic encoding of DNA data, are explored. CI-SpliceAI, a new algorithm to predict aberrant splicing, is developed and made available to the wider scientific community. Methods for explaining shallow and deep learning models are applied to visualise the feature contributions of otherwise black-box algorithms and to extract new insights about the underlying biological problem.
https://eprints.soton.ac.uk/475951/
https://eprints.soton.ac.uk/475951/1/Thesis_Yaron_Strauch.pdf