Integration of health informatics: ‘big data’ for clinical translation in inflammatory bowel disease
Inflammatory bowel disease (IBD) is a chronic, complex autoimmune disease characterised by relapsing-remitting gastrointestinal tract inflammation. It is considered to arise from interactions between an individual’s genetic susceptibility, environmental factors, immune dysregulation, and gut microbial dysbiosis. Genetics can make a larger contribution to IBD pathology in some patients, and this is thought to be linked to age at diagnosis, with genetic factors having the largest effects in very young children. There are two main subtypes of IBD: ulcerative colitis (UC) and Crohn’s disease (CD). Within subtypes, there are different disease behaviours and severities. One disease behaviour of particular interest is the stricturing endotype, which causes a narrowing of the gastrointestinal tract that often requires surgery. This thesis first examines oxidative stress in IBD patients using assay data. Statistical and machine learning (ML) methods are employed to examine the relationship between the clinical and genomic characteristics of a set of paediatric patients and their measured oxidative stress and antioxidant potential. No results from this work suggested that these assay data could be used as an indicator of these clinical features, or of pathogenic variation in key oxidative stress genes. The predominant focus of this thesis is the use of genomic data and ML to stratify IBD patients. To prepare genomic data for use in ML pipelines, the GenePy algorithm was used. GenePy takes in information regarding zygosity, allele frequency, and predicted deleteriousness for every variant in a gene. The scores for each variant are summed to create an overall gene score, producing a per-gene, per-individual matrix of scores. The two clinical problems analysed here were classifying IBD patients according to their subtype, and stratifying CD patients by the presence or absence of a stricturing endotype.
This was achieved with a random forest classifier. Optimisation of both the input data and the ML algorithm for these classifications was an important aspect of this work. Several gene panels were trialled for these classifications, and an autoimmune gene panel outperformed an IBD gene panel for determining IBD subtype. Stratifying CD patients by their stricturing endotype was subsequently performed with a random survival forest, which combines a random forest with survival analysis methods. This method is better suited to the longitudinal nature of stricturing endotype development. This work demonstrated challenges that arise from the sparsity of genomic data, and required the development of a pipeline that could reduce the sparsity of the features used by the ML algorithm. The patient stratification performed here provided strong evidence for the presence of different genomic variation patterns within IBD subtypes, and within the CD stricturing endotype. With increased dataset sizes, it may be possible to detect and cluster patients more clearly according to their genomic variation. To take full advantage of this knowledge, there is an additional requirement for deep, varied and longitudinal clinical data. Genomic data can then guide each patient’s clinical pathway, providing individuals with more personalised, life-long care.
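
As an illustration of the per-gene scoring step described above, the following is a minimal sketch of how per-variant scores could be summed into a per-gene, per-individual matrix. The abstract states only which inputs are used (zygosity, allele frequency, predicted deleteriousness), so the weighting here (deleteriousness multiplied by the negative log of the product of the two carried allele frequencies) is an assumption made for illustration. The function names, field names, and example values are hypothetical and are not taken from the thesis.

```python
import math
from collections import defaultdict


def gene_scores(variants):
    """Sum per-variant scores into a per-individual, per-gene score.

    Each variant is a dict with hypothetical keys:
      sample          - individual identifier
      gene            - gene symbol
      alt_count       - 0 (not carried), 1 (heterozygous), 2 (homozygous)
      allele_freq     - frequency of the alternate allele, strictly in (0, 1)
      deleteriousness - predicted deleteriousness scaled to [0, 1]
    """
    scores = defaultdict(lambda: defaultdict(float))
    for v in variants:
        alt_count = v["alt_count"]
        if alt_count == 0:
            continue  # the individual does not carry this variant
        f_alt = v["allele_freq"]
        if not 0.0 < f_alt < 1.0:
            raise ValueError("allele_freq must lie strictly between 0 and 1")
        # Frequencies of the two alleles carried at this site (assumed weighting):
        # homozygous -> alt and alt; heterozygous -> alt and reference.
        f_other = f_alt if alt_count == 2 else 1.0 - f_alt
        weight = -math.log10(f_alt * f_other)  # rarer alleles weigh more
        scores[v["sample"]][v["gene"]] += v["deleteriousness"] * weight
    return scores


def to_matrix(scores, samples, genes):
    """Arrange scores as rows of individuals by columns of genes (0.0 if no variants)."""
    return [[scores.get(s, {}).get(g, 0.0) for g in genes] for s in samples]


# Hypothetical example data, not taken from the thesis.
example_variants = [
    {"sample": "P1", "gene": "NOD2", "alt_count": 2,
     "allele_freq": 0.1, "deleteriousness": 1.0},
    {"sample": "P1", "gene": "NOD2", "alt_count": 1,
     "allele_freq": 0.01, "deleteriousness": 0.5},
    {"sample": "P2", "gene": "IL10", "alt_count": 0,
     "allele_freq": 0.2, "deleteriousness": 0.9},
]
matrix = to_matrix(gene_scores(example_variants), ["P1", "P2"], ["NOD2", "IL10"])
```

A matrix of this shape, with one row per individual and one column per gene, is the kind of input a random forest classifier expects. Most entries will be zero or near zero for any given individual, which is one way the sparsity noted above can arise.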