In a previous post I summarized some of the plenary talks from the most recent ICDM held in Atlantic city. In this follow up, I will discuss some of the ideas from sessions.
In the main conference track, there were sessions spanning over many of today’s trending topics in computer science: Big Data, social network mining, clustering, spatio-temporal and multilabel learning, classification, dimensionality reduction, and online and social learning. The approaches and applications varied from session to session and talk to talk, but there was, naturally, an overarching theme of efficiently and effectively working with data.
One approach to massive data problems is sampling. Through data sampling, one is able to maintain a subset of the data which roughly represents the “shape” of the full data. We are able to surmise the importance of certain points by how they appear in the sampled set.
Sampling is the guiding technique behind the algorithm of the Best Paper Award recipient, “Diamond Sampling for Approximate Maximum Allpairs Dotproduct (MAD) Search,” by Grey Ballard, Tamara G. Kolda, Ali Pinar, and C. Seshadhri. Their sampling approach is used to identify strong correlations between points by “counting” similarities. This can be applied to various link prediction, collaborative filtering, or recommendation problems. For instance, imagine that a company wants to encourage more product reviews by offering free samples of products to certain customers. The so-called MAD search identifies products to offer for existing reviewers which are similar, but not identical to, previously reviewed products. As another example, say a certain company wants to feature a “two-for-one” sale. The MAD search can identify which products are best to pair for the promotion.
Specifically, MAD is the Maximum All-Pairs Dot-Product problem. That is, given two sets of vectors, find the top pairs of vectors, each vector in the pair from a different set, with the highest dot product. The abstraction is that the vectors correspond to items, and a high dot product between two vectors indicates high similarity or correlation. For instance, if a vector corresponds to a piece of merchandise, say a video game, and the vector indicates how different users rate the video game, the dot product between two vectors is a measure of how similar two video games are.
Rather than computing the dot product directly, however, the authors use the following idea: if two video games are similar it is likely that users will give them high ratings in common. Then, if we are able to find enough users that rate both of these games highly (without finding all of them), we can say with an amount of certainty that the video games are similar. Therefore, these would be two good candidates to pair for a two-for-one sale.
As mentioned above, sampling saves time by working with fewer data points which approximate the full data set. In a similar vein, it is crucial in conserving space requirements. A streaming algorithm focuses on just this: conserving space budget.
In a streaming algorithm, data points are observed one at a time in a “stream.” At the point of observation, one can decide whether or not to sample it – that is, to store it – or not. If the data point is not sampled it is essentially gone forever as we do not have access to points earlier in the stream.
In “Catching the head, tail, and everything in between: a streaming algorithm for the degree distribution,” the authors, C. Seshadhri, Andrew McGregor, and myself, consider a stream of edges in a graph. They show that the degree distribution of the graph can be estimated quite well by sampling certain nodes or edges of the stream, and using only about 1% of the size of the stream (which is the number of edges).
Graph streams are becoming an increasingly relevant model for networks such as Facebook or an Internet topology which are essentially an accumulation of edges, but quickly become too large to store in main memory. Our algorithm estimates the degree distribution of the graph stream which gives an idea of the number of connections nodes have.
It is commonly observed that many real world networks have “heavy tails” – that is, there are very many nodes with small degrees, but there are still some nodes with extremely high degrees. The basic idea behind our algorithm is to maintain two sets of samples. In one, we sample nodes preferential to commonly occurring degrees. From this sample we can estimate the number of nodes with small degrees. In the other, we sample nodes with probability proportional to degree, from which we estimate the very high degrees and number of nodes with very high degrees.
Sampling is a great technique because it is easy to do, and with the flavor of a randomized algorithm the analysis tends to be quite elegant. Best of all it works experimentally!
Of course this only scrapes the surface of the multitude of ideas presented at ICDM – hopefully this whets the reader’s appetite to explore more titles from the conference.