Graph Neural Networks for Semantic Entity Suggestion
The primary objective of this research project is to enhance the accuracy of entity linking in scientific table data by leveraging a knowledge base. This is achieved by investigating the feasibility of employing machine learning techniques to generate multimodal embeddings for both entity linking and corpus embedding. In contrast to current state-of-the-art methods, which depend heavily on lexicographical features, this project aims to exploit a multimodal embedding approach to improve the suggestion of candidates for entity linking. The main focus is to understand how multimodal embeddings can be used to retrieve relevant entities, taking into account the contextual data within the corpus, for entity linking against a knowledge base.

The thesis is organized into seven chapters. Chapter 1 serves as an introduction to the subject, providing the necessary background and discussing related work that has influenced this project. Chapter 2 describes the datasets used in the project, focusing on their conversion into mention datasets that cover both tabular and textual mentions; it also covers the target knowledge graphs. Chapter 3 presents the model architecture, including the projection head, the mention encoder, the entity encoder, and the dual encoder architecture, and discusses the scoring functions that are used. Chapter 4 outlines the experiments conducted in the project and the evaluation metrics used to assess the results. Chapter 5 presents the results of the experiments and provides a thorough evaluation of them. Finally, Chapter 6 and Chapter 7 conclude the project, summarizing the findings and suggesting potential avenues for future work.

This thesis provides an in-depth exploration of the methodologies employed in the project, focusing on the concepts of text embedding and mention embedding. The input data is categorized into text data and tabular data, each with its own input structure for the BERT tokenizer. The text input structure is based on the methodology proposed by Wu et al., where each mention in the corpus is encapsulated within a specific string format. The tabular input structure is inspired by the work of Trabelsi et al., but uses a simplified format owing to the specific requirements and constraints of the project.

The projection head, a standard component in machine learning models, is used to transform input embeddings into a different space, thereby generating projected embeddings. This transformation is accomplished through a series of operations collectively known as projection layers. The projection head is a simple feed-forward neural network with a ReLU activation that projects the entity and mention embeddings to the same dimensionality; by reducing the dimensionality of the embeddings, it makes them directly comparable.

The Mention Encoder is built on the BERT model with an additional projection head. Its input data is divided into two categories: text data and tabular data. Each textual mention in the corpus is encapsulated within a string structured according to the methodology of Wu et al. For tabular data, the project aims to perform Column Type Annotation, with the input to the BERT tokenizer inspired by the work of Trabelsi et al.
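To make the input structures and the projection head concrete, the following minimal PyTorch sketch illustrates one possible implementation. The marker tokens, the table serialization layout, and all dimensionalities are illustrative assumptions rather than the exact formats used in the thesis, which follow Wu et al. and a simplified variant of Trabelsi et al.

```python
import torch
import torch.nn as nn

# Illustrative mention serialization in the spirit of Wu et al.: the mention is
# wrapped in marker tokens and surrounded by its left/right context. The marker
# tokens [M_START]/[M_END] are assumptions for illustration only.
def serialize_text_mention(left_ctx: str, mention: str, right_ctx: str) -> str:
    return f"{left_ctx} [M_START] {mention} [M_END] {right_ctx}"

# Simplified column serialization for Column Type Annotation, loosely inspired
# by Trabelsi et al.; the actual layout in the thesis may differ.
def serialize_column(header: str, cells: list) -> str:
    return f"[HEADER] {header} [CELLS] " + " [SEP] ".join(cells)

class ProjectionHead(nn.Module):
    """Small feed-forward network with a ReLU that projects encoder outputs
    into a shared, lower-dimensional space (dimensions are assumptions)."""

    def __init__(self, in_dim: int = 768, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim),
            nn.ReLU(),
            nn.Linear(in_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Usage: project a batch of BERT sentence embeddings (batch, 768) to 256 dims.
head = ProjectionHead()
projected = head(torch.randn(4, 768))  # -> shape (4, 256)
```

In this sketch the serialized strings would be passed through the BERT tokenizer and encoder, and the projection head would be applied to the resulting sentence representation.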
The Mention Encoder's architecture thus comprises the BERT model and a projection head, which projects the output of the BERT model into a lower-dimensional space, enabling efficient computation and storage.

The Entity Embedding model architecture is a crucial component of this research. It is divided into two main parts: input encoding and the ontology embedding model. The input format for the ontology embedding process is derived from the various ontologies and knowledge graphs detailed in Chapter 2. The model architecture for ontology embedding is based on the work of Wu et al. and Louis et al. Its flexibility, particularly in the use of Graph Neural Network (GNN) layers, is a key feature that allows it to be tailored to specific requirements and scenarios.

The dual encoder model architecture is another significant aspect of this research. As underscored by Dong et al., there is a variety of dual encoder architectures, including the Siamese Dual Encoder (SDE), the Asymmetric Dual Encoder (ADE), ADE with Shared Token Embedding (ADE-STE), ADE with Frozen Token Embedding (ADE-FTE), and ADE with Shared Projection Layer (ADE-SPL). The primary focus of this project is the ADE architecture, which consists of two main components: the entity encoder and the mention encoder. This architecture facilitates simple modifications to the individual components of the dual encoder, enabling each encoder stack to be adapted to better fit its input.

The loss function plays a critical role in steering both encoders towards aligned representations, so that a mention and its matching entity receive similar embeddings. The score function must be able to evaluate the similarity between mention and entity embeddings. Furthermore, it is essential for practical applications that the entity embeddings can be computed offline: inference should be achievable by computing the mention embedding and retrieving its nearest k neighbors in less than a minute, even for extensive knowledge bases such as DBpedia, which encompasses millions of entities.

Three functions relevant to the thesis are presented, each in its own subsection. The first, cosine similarity (dot product), introduces the cosine embedding loss, a common scoring function that enables dense retrieval based on the angular similarity of entity embeddings. The second, the triplet margin loss, is a common loss function that enables dense retrieval based on the Euclidean similarity of entity embeddings. Viewing embeddings as maps from a high-dimensional space onto a lower-dimensional manifold, the idea behind the triplet loss is to move similar entities closer together and dissimilar entities farther apart. The third, with a slight abuse of notation called cross-entropy, is a scoring function aimed at classification: it maximizes the score of the correct "class" while minimizing the scores of the negative anchors. It can be seen as a generalization of the triplet margin loss in which the positive anchor is pulled closer and all remaining anchors are pushed farther away. Together, these three functions give a comprehensive picture of the scoring and loss functions used in this thesis.

The experimental setup for this study is detailed in Table 4.2.
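Before turning to the results, the sketch below grounds the three functions discussed above by showing how each could be computed on a batch of projected mention and entity embeddings in PyTorch. The batch construction (in-batch negatives obtained by shifting the rows), the margin, the temperature, and the dimensionalities are illustrative assumptions; the actual training setup of the thesis may differ.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Toy batch of projected embeddings: each row pairs a mention with its gold
# entity; all other entities in the batch act as negatives.
mention_emb = F.normalize(torch.randn(8, 256), dim=-1)
entity_emb = F.normalize(torch.randn(8, 256), dim=-1)

# 1) Cosine embedding loss: pull matching pairs towards high angular similarity.
target = torch.ones(8)  # +1 marks every (mention, entity) pair as a positive pair
cosine_loss = nn.CosineEmbeddingLoss()(mention_emb, entity_emb, target)

# 2) Triplet margin loss (Euclidean): anchor = mention, positive = gold entity,
#    negative = another entity from the batch (here simply the next row).
negative_emb = entity_emb.roll(shifts=1, dims=0)
triplet_loss = nn.TripletMarginLoss(margin=0.5)(mention_emb, entity_emb, negative_emb)

# 3) Cross-entropy over in-batch negatives: each mention must score its gold
#    entity (the diagonal of the similarity matrix) above all other entities.
scores = mention_emb @ entity_emb.t() / 0.05   # similarity matrix with temperature
labels = torch.arange(scores.size(0))          # gold entity index = row index
ce_loss = F.cross_entropy(scores, labels)
```

The cross-entropy variant makes the "updated triplet loss" intuition explicit: the diagonal (positive) score is increased while every off-diagonal (negative) score in the batch is decreased.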
The subsequent sections provide an in-depth analysis of the results derived from these experiments, with particular emphasis on the influence of the configuration on the final outcome. Due to resource constraints, each configuration is executed only once; a more robust statistical assessment of the significance of each configuration would require additional runs. Moreover, due to time limitations, not all configurations of the model with bert-base-uncased as the text encoder were executed, with 4 runs remaining incomplete. The results of the executed runs are presented in Table 5.1.

Table 7.1 indicates that while the model is unable to retrieve the correct entities, the suggested entities are semantically related rather than merely lexicographically similar. The model is capable of making suggestions based on semantic similarity, demonstrating the feasibility of entity suggestion based on dense retrieval. However, the model requires further fine-tuning to reach state-of-the-art performance.
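The feasibility claim rests on the offline/online split described earlier: entity embeddings are computed offline, only the mention embedding is computed at query time, and candidate entities are suggested via nearest-k retrieval. The following sketch illustrates that split using FAISS, which is one possible nearest-neighbor library; the index type, embedding dimensionality, and random data are illustrative assumptions, not the setup used in the experiments.

```python
import numpy as np
import faiss  # one possible ANN library; not prescribed by the thesis

# Entity embeddings are assumed to be precomputed offline by the entity encoder
# and L2-normalized so that inner product equals cosine similarity.
dim = 256
entity_embeddings = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(entity_embeddings)

index = faiss.IndexFlatIP(dim)  # exact inner-product search; ANN indexes scale further
index.add(entity_embeddings)

# At inference time only the mention embedding is computed online; the k nearest
# entities are returned as candidate suggestions.
mention_embedding = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(mention_embedding)
scores, entity_ids = index.search(mention_embedding, 10)
```

For knowledge bases of DBpedia scale, an approximate index (e.g. an inverted-file or HNSW index) would typically replace the exact flat index to keep retrieval well under the one-minute budget mentioned above.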