Learning representations of visual semantics for out-of-distribution generalization

This thesis was published by Yue Jiao in January 2023 at the University of Southampton.

Abstract:

Building systems that can understand and represent visual semantic knowledge is one of the fundamental problems on the path towards artificial general intelligence. Much previous work on visual semantic understanding has used visual semantic embedding (VSE) to align well-represented visual features and language features. However, these techniques are insufficient for generalizing beyond the data seen during training. In this thesis, we first revisit the hierarchy of levels between raw media and full semantics. We then propose the hypothesis that learning multimodal knowledge representations which can be recomposed dynamically as needed is a path towards out-of-distribution (OOD) visual semantic understanding.

The first focus of this thesis is the behaviour of current VSE systems. We develop a variety of probing techniques to answer the question: what kind of semantic information from unimodal pre-training is learnt by VSE? We show that static relational information from large text corpora and expert-made knowledge bases does not persist in the semantic space of VSE models. When VSE models learn contextual information with frozen language models, they lack a mutual exclusivity bias, which limits their performance on OOD recognition.

The second focus of this thesis is to understand the ability of pre-trained multimodal models. We try to answer the question: do multimodal models generalize systematically, and to what extent can they understand and produce new compositions from known concepts? Across many experiments, we demonstrate that the language encoder in a pre-trained multimodal model plays an important role both in producing concept compositions and in enhancing the representation of unfamiliar visual concepts.

Based on these two investigations, the final chapter of the thesis confirms that multimodal pre-training plays a core role in OOD semantic understanding. Future research on learning visual semantics for OOD generalization should first develop probing tools to explore how visual concepts emerge in pre-trained multimodal models.
