[Seminar] Big Graph Mining: Theory, Engineering and Discoveries
What is the diameter of a billion-node Web graph? Can we spot anomalous accounts in the who-follows-whom Twitter graph, with billions of links? Such graphs often do not fit in memory. How to use parallelism for Tera- or Peta-scale graphs?
In this talk, I will present Pegasus, a large scale graph mining system implemented on the top of the Hadoop platform, the open source version of MapReduce.
I will discuss three important algorithms in Pegasus.
First, I will describe GIM-V, a unifying primitive for many different graph mining operations. GIM-V is highly optimized, achieving good scale-up on the number of available machines, linear running time on the number of edges, and ~10 times faster performance over the non-optimized version of GIM-V. I will show surprising discoveries including the 7-degrees of separation in one of the largest publicly available Web graphs, with ~7 billion edges.
Second, I will present HEigen, an eigensolver for spectral analysis on billion-scale graphs.
HEigen is carefully designed to efficiently run on the highly scalable Hadoop environment, and handles matrices more than 1000 times larger than the state of the art.
I will show how to use HEigen to discover anomalous adult advertisers in the who-follows-whom Twitter graph with 3 billion edges.
Finally, I will describe GigaTensor, a scalable tensor decomposition algorithm for very large tensors with millions of sizes and billions of nonzeros.
GigaTensor solves more than 100 times bigger problems than existing methods.
Furthermore, we employ GigaTensor in order to analyze a very large real world, knowledge base tensor and present our findings, which include the discovery of potential synonyms among millions of noun-phrases (e.g. the noun `pollutant' and the noun-phrase `greenhouse gases').
U Kang is an assistant professor in the Department of Computer Science of KAIST.
He received Ph.D. in Computer Science at Carnegie Mellon University, after receiving B.S. in Computer Science and Engineering at Seoul National University.
He won 2013 SIGKDD Doctoral Dissertation Award (Honorable Mention),
2013 New Faculty Award from Microsoft Research Asia, and two best paper awards.
He has published over 30 refereed articles in major data mining and database venues.
He holds four U.S. patents.
His research interests include data mining in big graphs.