Scalable Mining of Large Graphs and Its Applications
In recent years, as the semantics of real-world applications becomes more and more complex, graph has emerged as a basic data manifestation unit in many domains, where its expressiveness can provide the best mathematical model for advanced use cases that involve a set of individual entities plus their inter-relationships. Responding to this trend, it is critically important to transform state-of-the-art data mining and data management systems so that they are able to handle such data sources with rich interconnecting structures, e.g., those representing XML, circuits, cyber-networks, chemical compounds, biological data, social data and the Web. One major challenge faced by the end-users of graph data is the large size of the graphs they want to analyze. Accompanying the arrival of the information era, huge repositories are created due to the constant flowing-in of data. People have been putting great emphasis on developing scalable methods to mine and manage large databases. Nonetheless, scalability is even more urgently expected in the graph scenario, since the structural complexity of graphs can often make computations more costly and thus the efficiency of algorithms is a critical concern. In this thesis, we are going to focus on scalable mining of large graphs. We propose a novel graph summarization technique to reduce the size of graphs, so that the computational complexity associated with processing the data is more manageable. In order to guarantee the result accuracy, we further relate patterns existing on the original graphs with those found on the summarized ones, and prove that they are very close to each other if certain conditions are met. This graph summarization concept has proved to be very useful for many concrete data analysis tasks, even including some cases where the underlying information is not naturally represented as graphs. For these cases, we build graphs to reveal insightful observations of the data from alternative angles, while summarization techniques are applied on the resulted graphs to make the analysis scalable. With efficient methods that are capable of extracting useful knowledge, one can also utilize the extracted knowledge to help other data management tasks on large graphs, as well. For the latter part of this thesis, we shall examine one application of scalable large graph mining algorithms, namely the indexing problem for graph search. We explore our research on both graph containment search and graph location search. In the first case, people care about a binary containment relationship between a query graph q and a target graph g. Focusing on the indexing deficiencies brought up by the large size of graphs, we design and implement a novel indexing scheme called CP-Index, which summarizes graphs for feature extraction and preserves the contact information among features for further pruning opportunities. In the second case where it becomes necessary to locate all identical copies of q in g given graphs with larger size and more complex internal structures, we take advantage of feature-based indexing in containment search to tackle the mining deficiency and pattern recurrence curse. A composite dynamic-and-static indexing (DS-Index) is further proposed, which is proved to be both effective and storage efficient.