Data Mining Using the Crossing Minimization Paradigm

The thesis was published by Abdullah, Ahsan, in September 2022, University of Stirling.

Abstract:

Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The recorded data is not perfect, as noise is introduced into it from different sources. Some of the basic forms of noise are incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called Data Mining.

Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means. Data Mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, which involves the simultaneous clustering of the records and the attributes.

In this dissertation, a novel, fast and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades the CM paradigm has been used for graph drawing and VLSI (Very Large Scale Integration) circuit design for reducing wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy data as well as real data from agriculture, biology and other domains.
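To make the link between crossing minimization and clustering concrete, a binary data matrix can be read as a bipartite graph (records on one side, attributes on the other, one edge per non-zero entry). Reordering both sides to reduce edge crossings tends to pull related records and attributes together into blocks. The abstract does not describe the thesis's own CM algorithm, so the sketch below swaps in the standard barycenter heuristic from graph drawing purely for illustration; the function names `count_crossings`, `barycenter_pass` and `minimize_crossings` are hypothetical.

```python
from itertools import combinations


def count_crossings(matrix, row_order, col_order):
    """Count edge crossings of the bipartite graph drawn with the given orders."""
    col_pos = {c: i for i, c in enumerate(col_order)}
    # Each non-zero entry is an edge (row position, column position).
    edges = [(ri, col_pos[c])
             for ri, r in enumerate(row_order)
             for c in range(len(matrix[r])) if matrix[r][c]]
    # Two edges cross when their endpoints appear in opposite order on the two sides.
    return sum(1 for (a, b), (c, d) in combinations(edges, 2)
               if (a - c) * (b - d) < 0)


def barycenter_pass(matrix, row_order, col_order):
    """Sort rows by the mean position of their columns, then columns likewise."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    col_pos = {c: i for i, c in enumerate(col_order)}

    def row_key(r):
        nbrs = [col_pos[c] for c in range(n_cols) if matrix[r][c]]
        return sum(nbrs) / len(nbrs) if nbrs else 0.0

    row_order = sorted(row_order, key=row_key)  # sorted() is stable
    row_pos = {r: i for i, r in enumerate(row_order)}

    def col_key(c):
        nbrs = [row_pos[r] for r in range(n_rows) if matrix[r][c]]
        return sum(nbrs) / len(nbrs) if nbrs else 0.0

    col_order = sorted(col_order, key=col_key)
    return row_order, col_order


def minimize_crossings(matrix, max_passes=10):
    """Repeat barycenter passes until the crossing count stops improving."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    best = count_crossings(matrix, rows, cols)
    for _ in range(max_passes):
        new_rows, new_cols = barycenter_pass(matrix, rows, cols)
        crossings = count_crossings(matrix, new_rows, new_cols)
        if crossings >= best:
            break
        rows, cols, best = new_rows, new_cols, crossings
    return rows, cols, best
```

On a small matrix whose two clusters are interleaved, such as `[[1,0,1,0],[0,1,0,1],[1,0,1,0],[0,1,0,1]]`, reading the rows and columns in the returned orders gives a block-diagonal layout. The barycenter heuristic is only a heuristic: it does not guarantee a minimum number of crossings, and this sketch makes no attempt at the noise tolerance or overlapping biclusters the thesis claims.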

Two other interesting and hard problems are also addressed in this dissertation: (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is demonstrated to provide very convincing results on both problems using real public domain data.
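For readers unfamiliar with the BWM problem: the bandwidth of a matrix is the largest distance |i - j| of any non-zero entry from the main diagonal, and the goal is to find a simultaneous row and column permutation that makes it as small as possible. The sketch below only measures bandwidth and applies a given permutation; it does not search for one and is not the thesis's method. The helper names `bandwidth` and `permute` are hypothetical.

```python
def bandwidth(matrix):
    """Largest |i - j| over all non-zero entries (0 for an all-zero matrix)."""
    return max((abs(i - j)
                for i, row in enumerate(matrix)
                for j, value in enumerate(row) if value),
               default=0)


def permute(matrix, order):
    """Apply the same ordering to rows and columns (a symmetric permutation)."""
    return [[matrix[r][c] for c in order] for r in order]


# Adjacency matrix (with unit diagonal) of the path 0 - 3 - 1 - 2,
# labelled so that the neighbours 0 and 3 sit far apart in the matrix.
scrambled = [
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]
# Relabelling the vertices in path order moves every non-zero next to the diagonal.
reordered = permute(scrambled, [0, 3, 1, 2])
```

Here `bandwidth(scrambled)` is 3 and `bandwidth(reordered)` is 1. Finding a good ordering is the hard part of the problem, and that is where a heuristic such as the proposed CM technique would be applied.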

Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly was observed during 1989-97 between cotton yield and pesticide consumption in Pakistan, showing unexpected periods of negative correlation. Applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002) yields a possible explanation of the anomaly, which is presented in this thesis.


