Reducing streaming data storage required for provenance retrieval using Fourier Transform

This thesis by Huang, Zheng was published in January 2023 at the University of Southampton.

Abstract:

In this work, we survey existing approaches to provenance in streaming environments. Despite the various reduction techniques proposed for provenance or stream storage, answering how-provenance queries (“show me the data and operations that led to the output data”) still requires storing the entire source and intermediate streams. This makes the volume of streaming data required for provenance retrieval unworkably large. We investigate a method for manipulating the streams that retains the information needed to answer how-provenance without deciding in advance what to keep and what to remove. We look at the Fourier transform (FT) as a tool for encoding portions of the streaming data for provenance retrieval. We use a real-world respiratory streaming use case to highlight the need for provenance information, then build our stream reduction model and test it against that use case. The experiments show that FT can reduce the size of the streaming data: in our demonstration over the second one-minute time window, it yields a 15.7-fold reduction for eligible streams in a streaming application that computes the respiration rate, and a 36.6-fold reduction for eligible streams in a streaming application that finds the best position. However, the utility of the reduced streaming data for provenance retrieval comes with some limitations. The FT technique does not affect the answerability of queries that require a stream ID but not specific data (such as PQ1 in the use case), queries that require examining data content (such as PQ2), or queries that require the contributing operators (such as PQ3). For queries that must return the figure of the stream (such as PQ4 and PQ5), however, some patterns may be lost.
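
The idea of encoding a time window of a stream with the Fourier transform can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the thesis's implementation: the abstract does not say which coefficients are kept, so the sketch simply keeps the `keep` largest-magnitude ones; the signal is synthetic (a 60-second window sampled at 4 Hz with a 0.25 Hz breathing component); and the names `reduce_window`, `reconstruct`, and `breaths_per_minute` are hypothetical.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform (O(n^2)); adequate for one short window."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def reduce_window(samples, keep):
    """Keep the `keep` largest-magnitude coefficients among bins 0..n/2.

    Assumption: top-k selection stands in for whatever rule the thesis uses.
    Bins above n/2 mirror the lower ones for a real signal, so they are not stored.
    """
    n = len(samples)
    spectrum = dft(samples)
    strongest = sorted(range(n // 2 + 1), key=lambda k: abs(spectrum[k]), reverse=True)
    return n, {k: spectrum[k] for k in sorted(strongest[:keep])}

def reconstruct(n, coeffs):
    """Approximate the original window from the stored coefficients."""
    rebuilt = []
    for t in range(n):
        value = 0.0
        for k, c in coeffs.items():
            term = (c * cmath.exp(2j * cmath.pi * k * t / n)).real
            # Every bin except DC and Nyquist has a conjugate mirror, hence weight 2.
            weight = 1 if k == 0 or (n % 2 == 0 and k == n // 2) else 2
            value += weight * term
        rebuilt.append(value / n)
    return rebuilt

def breaths_per_minute(coeffs, window_seconds):
    """Read the rate off the strongest non-DC bin (bin k means k cycles per window)."""
    dominant = max((k for k in coeffs if k > 0), key=lambda k: abs(coeffs[k]))
    return dominant / window_seconds * 60

# Synthetic one-minute window: offset + 0.25 Hz breathing + a weaker 0.5 Hz harmonic.
SAMPLE_RATE_HZ = 4
WINDOW_SECONDS = 60
signal = [
    1.0
    + math.sin(2 * math.pi * 0.25 * i / SAMPLE_RATE_HZ)
    + 0.3 * math.sin(2 * math.pi * 0.5 * i / SAMPLE_RATE_HZ)
    for i in range(SAMPLE_RATE_HZ * WINDOW_SECONDS)
]

n, coeffs = reduce_window(signal, keep=3)
rebuilt = reconstruct(n, coeffs)
rate = breaths_per_minute(coeffs, WINDOW_SECONDS)
max_error = max(abs(a - b) for a, b in zip(signal, rebuilt))
```

In this toy case three stored coefficients replace 240 samples and the window is recovered almost exactly, because the synthetic signal contains only three frequency components. A real respiration stream is not that sparse, so discarding coefficients would lose some of the waveform's shape, which is consistent with the pattern losses the abstract reports for queries that return the figure of the stream.
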
The post-processing time for respiration sensor data from the second one-minute window (0.873 seconds for the streaming application that computes the respiration rate and 2.816 seconds for the one that finds the best position) is short enough to support more reactive provenance retrieval, which is usually desirable for stream processing. At the 5% significance level, there is no significant difference between the median query times on unreduced and reduced streaming data for PQ1, which does not require specific data. For PQ2, PQ4, and PQ5, which require reconstructing specific contributing data, the median query time shifts upward when moving from the original data to the reduced data, also at the 5% significance level. The FT technique does not affect the query time for PQ3, because the contributing operators can be retrieved from the table that stores the metadata.
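
The kind of median comparison described above can be sketched as follows. The abstract does not name the statistical test that was used, so this sketch substitutes a one-sided paired sign test as a simple stand-in, and it assumes the timings are paired (same query run on original and reduced data). The function name `sign_test_positive_shift` is hypothetical, and the timing values are invented placeholders, not measurements from the thesis.

```python
from math import comb

def sign_test_positive_shift(original, reduced):
    """One-sided paired sign test.

    Null hypothesis: the median difference (reduced - original) is zero.
    Alternative: query times on reduced data tend to be larger.
    Assumption: this stands in for the unnamed test used in the thesis.
    """
    # Ties carry no information about the direction of the shift, so drop them.
    diffs = [r - o for o, r in zip(original, reduced) if r != o]
    n = len(diffs)
    positives = sum(1 for d in diffs if d > 0)
    # P(X >= positives) for X ~ Binomial(n, 0.5)
    p_value = sum(comb(n, i) for i in range(positives, n + 1)) / 2 ** n
    return positives, n, p_value

# Invented placeholder timings in seconds, for illustration only.
original_times = [0.10, 0.12, 0.11, 0.13, 0.10, 0.12, 0.11, 0.14, 0.10, 0.12]
reduced_times = [0.15, 0.16, 0.14, 0.18, 0.13, 0.17, 0.15, 0.19, 0.14, 0.16]

positives, n_pairs, p_value = sign_test_positive_shift(original_times, reduced_times)
shift_detected = p_value < 0.05  # 5% significance level, as in the abstract
```

A rank-based test such as the Wilcoxon signed-rank test would also use the size of each difference and is usually more powerful; the sign test is shown only because it needs nothing beyond the standard library.
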


