Selected Applications in Data Intensive Computing
As science and technology advance, data are generated every day at an exponentially growing rate by scientific instruments, computer simulations, and many other sources. Mining valuable knowledge from such large amounts of data efficiently enough to support informed decisions is challenging. However, the development of distributed computing techniques and high-speed networks provides good opportunities to solve big data problems. In this thesis, I focus on developing data-intensive computing algorithms and applying data mining methods to analyze massive biological and medical data in cloud computing environments.
There are many ways to parallelize an existing data mining algorithm in a cloud computing environment, and achieving better performance by manipulating data intelligently has attracted a lot of attention. In this thesis, I propose two different approaches to parallelizing the existing random decision tree algorithm, both of which have been implemented in the Sector/Sphere cloud environment. I also compare the cost and accuracy of the two implementations and present the results here.
Recently, with the development of ChIP-chip and ChIP-seq technology, huge amounts of genome-wide protein-DNA binding site data have become available for many transcription factors and chromatin regulators in many species. Previous studies have shown that the distribution of their localizations and modifications can offer novel insight into the mechanisms of regulation. Because it is strongly believed that multiple chromatin factors can work together to regulate a common target, I formally define this problem and propose a novel graph-based algorithm called Patterns of Marks (PoM) to efficiently identify these types of geometric patterns in massive genomic data.
In addition, as the amount of data grows, integrating data manually becomes infeasible; I therefore propose two algorithms to automatically integrate big tabular data. I also conduct an experimental study by developing a customizable lightweight web crawler to collect various data from the Internet.
Subject: Data Intensive Computing