This week, the Simons Institute hosted a workshop entitled Unifying Theory and Experiment for Large-Scale Networks. The goal of the workshop was to bring together researchers working on various large-scale network problems to discuss both the theoretical models and the empirical processes for testing and validating them. Even further, the “unifying” in the title suggests a forum where the two ends of the spectrum may meet.
Many questions came up around the very issue of finding unifying ground. How much should we invest in the theory if we have the empirical results? To what degree do we tune our models to mirror real-world data?
The structure of the workshop was well-suited to such a discussion. The four-day series was divided into sessions, each consisting of four 30-minute talks followed by an open, unrecorded panel discussion. The full schedule, with abstracts and video, can be found here. Theory and experiment were well represented both across and within the sessions. While there was some clear polarity, the talks offered important insights into the reasons for the distance between theory and practice in large-scale network problems. Below are some of the major themes that came up.
- Do we have the model right?
Theory that can be used in practice depends on the theory assuming the right model. But what is the “right” model? In general, there seems to be little agreement on this. First, models that can be nicely abstracted may not accurately describe network data that occurs in nature. How relevant is a stochastic assumption, or i.i.d. sampling, in practice? A poignant example was given by C. Seshadhri in his talk on finding triangles in graphs in an online streaming model. After his conclusions, he mentioned a question posed by the experimental community at his lab: edges may occur more than once in real online streaming networks, so how can these results be applied to multigraphs? No matter how promising the results, if they do not fit a scenario, they cannot be implemented in practice. (For clarification, the results have since been extended to multigraphs by considering the induced simple graph, but these are still not used in production.) Second, many problems seem to lack theory because statistical models change very quickly. The variety of models, metrics, and parameters may even distract from the development of good practical algorithms with theoretical backing. As a result, many theoretical researchers will essentially pick the model most attractive to them and develop the theory there, whether or not there is a practical need.
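The reduction from a multigraph stream to its induced simple graph is easy to illustrate. The sketch below (names and data are mine, not from the talk) filters an edge stream down to distinct, non-loop edges. Note that exact deduplication keeps state proportional to the number of distinct edges seen, which is part of why repeated edges are a genuine complication in the streaming setting.

```python
def induced_simple_graph(edge_stream):
    """Reduce a multigraph edge stream to its induced simple graph
    by dropping repeated edges and self-loops."""
    seen = set()
    for u, v in edge_stream:
        if u == v:
            continue  # drop self-loops
        key = (min(u, v), max(u, v))  # undirected: normalize endpoint order
        if key not in seen:
            seen.add(key)
            yield u, v

# A stream with repeats (in both orientations) and a self-loop:
stream = [(1, 2), (2, 1), (1, 2), (2, 3), (3, 3), (3, 1)]
edges = list(induced_simple_graph(stream))  # [(1, 2), (2, 3), (3, 1)]
```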
- Empirical results beat theoretical bounds.
Even problems with backing theory are generally not applied: most algorithms presented at conferences are not available off-the-shelf. Getting these algorithms ready for prime time will almost certainly require many iterations of improvement and calibration, and a team with a very good understanding of both the theory and the practice. Why isn’t the theory enough? One reason may be the focus on scalability in the theory community. Memory capacity is generally growing more quickly than problem sizes, so it is not necessarily useful in practice for algorithms to be highly scalable. Along similar lines is the question of what actually needs to be achieved. The complexity and accuracy analyses presented in algorithms papers are typically worst-case bounds, but the data we see in real-world scenarios often behave better, even much better, than the worst case. It may be unnecessary to implement robust algorithms when simpler heuristics perform better in practice. Vigna delivered a provocative view early in the workshop on why approximate set representations used to analyze large networks perform better than their theoretical guarantees.
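As one concrete example of an approximate set representation, here is a minimal Bloom filter sketch (for illustration only; not necessarily the structure from Vigna's talk). It answers membership queries with no false negatives and a worst-case false-positive rate controlled by its size, and on typical data the observed error is often far below the bound.

```python
import hashlib

class BloomFilter:
    """A minimal Bloom filter: an approximate set representation that
    may report false positives but never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-1 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
members = [f"node-{i}" for i in range(100)]
for m in members:
    bf.add(m)
hits = all(m in bf for m in members)          # always True: no false negatives
false_pos = sum(f"other-{i}" in bf for i in range(1000))  # near zero here
```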
- How can we improve if we don’t know how we’re doing?
This was the tagline of Salganik’s talk on modeling epidemic networks in Rwanda. Without methods for validation, it is impossible to bridge the gap between theory and practice. Surprisingly, though, some classes of algorithms have achieved success in spite of this. In particular, local partitioning algorithms have demonstrated both theoretical and practical success. I’ll discuss one example next.
Computing overlapping clusters with local computations
The goal of a global clustering is to partition the nodes of a graph into distinct subsets such that there is little communication between the clusters (few crossing edges) and no single cluster is too large. This has clear applications to distributed computing, where large datasets are partitioned roughly equally onto a group of processors, with minimal communication required between them. In an overlapping clustering, the clusters are not required to be disjoint, so more of the graph is stored than is strictly required.
The goal of an overlapping clustering can be formulated in terms of random walks. Consider a walk v1, v2, …, vt on the graph. Let T be a mapping from the vertex set to a set of clusters, and let the corresponding sequence of active clusters containing the vertex at each step be C1 = T(v1), C2 = T(v2), …, Ct = T(vt). The goal of an overlapping clustering is then to divide the graph into clusters in a way that minimizes the number of times the active cluster must change during a random walk.
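This objective can be made concrete with a short sketch (the function and the toy memberships are illustrative, not from the work discussed): with overlapping clusters, a walker keeps its active cluster as long as that cluster still contains the next vertex, and we count only the forced switches.

```python
def cluster_changes(walk, memberships):
    """Count active-cluster changes along a walk. `memberships[v]` is the
    set of (possibly overlapping) clusters containing vertex v."""
    if not walk:
        return 0
    active = next(iter(memberships[walk[0]]))  # pick any cluster of the start vertex
    changes = 0
    for v in walk[1:]:
        if active not in memberships[v]:
            active = next(iter(memberships[v]))  # forced to switch clusters
            changes += 1
    return changes

# Vertex 1 lies in both clusters, so passing through it costs nothing:
memberships = {0: {"A"}, 1: {"A", "B"}, 2: {"B"}}
n_switches = cluster_changes([0, 1, 2, 1, 0], memberships)  # 2
```

Because vertex 1 belongs to both clusters, the walk 0, 1, 1, 0 incurs no switches at all; the overlap absorbs the boundary traffic.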
In this work, overlapping clusters are found by first computing local partitions. Candidate clusters are generated from local clusters computed from each vertex, using a personalized PageRank procedure, for example. Overlapping clusters with upper-bounded volume are then computed by combining the local clusters in a way that minimizes random-walk cluster crossings. The result is a set of clusters, each with bounded volume, for which communication, modeled through the random-walk probability diffusion, is kept minimal.
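To make the local step concrete, here is a generic sketch of a push-style approximate personalized PageRank computation followed by a conductance sweep, the standard way a local cluster is grown from a seed vertex. The parameter names (`alpha`, `eps`) and the specific push variant are my choices, not necessarily those of the work discussed.

```python
from collections import defaultdict

def approximate_ppr(graph, seed, alpha=0.15, eps=1e-4):
    """Push-based approximate personalized PageRank from a seed.
    `graph` maps each vertex to a list of its neighbors."""
    p = defaultdict(float)  # PageRank estimate
    r = defaultdict(float)  # residual mass not yet distributed
    r[seed] = 1.0
    while True:
        heavy = [u for u in list(r) if r[u] >= eps * max(len(graph[u]), 1)]
        if not heavy:
            break
        for u in heavy:
            mass, r[u] = r[u], 0.0
            p[u] += alpha * mass  # keep a teleport fraction at u
            deg = len(graph[u])
            if deg == 0:
                continue
            for v in graph[u]:  # push the rest to the neighbors
                r[v] += (1 - alpha) * mass / deg
    return p

def sweep_cut(graph, ppr):
    """Sweep vertices in decreasing ppr/degree order and return the
    prefix set with the smallest conductance."""
    order = sorted((v for v, val in ppr.items() if val > 0),
                   key=lambda v: ppr[v] / max(len(graph[v]), 1), reverse=True)
    total_vol = sum(len(graph[v]) for v in graph)
    in_set, vol, cut = set(), 0, 0
    best, best_cond = set(), float("inf")
    for v in order:
        internal = sum(1 for u in graph[v] if u in in_set)
        in_set.add(v)
        vol += len(graph[v])
        cut += len(graph[v]) - 2 * internal  # update crossing edges
        denom = min(vol, total_vol - vol)
        if denom > 0 and cut / denom < best_cond:
            best_cond, best = cut / denom, set(in_set)
    return best

# Two triangles joined by a single edge; seed in the left triangle.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster = sweep_cut(graph, approximate_ppr(graph, seed=0))  # {0, 1, 2}
```

The sweep recovers the seed's triangle because cutting the single bridge edge gives by far the smallest conductance of any prefix.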
Local algorithms in practice
Local clustering is an example where the best performing algorithms have the backing of theoretical guarantees. For finding small clusters, for example, local algorithms are the best option in terms of performance and quality. Global clustering, however, is a different story, and is generally too hard to move beyond empirical guarantees. The overlapping clusters algorithm is a nice example of applying a theoretically sound procedure to a problem which generally does not have the theoretical backing.
Even so, the best global clustering algorithms are those based on good empirical performance. So do we need the theory? Will there always be an impenetrable gap between theory and practice? The workshop set the stage for some necessary discussion, but the general consensus seemed to be that the goals of the two ends are irreconcilably different. Of course, theorists need the motivation of practice, and practitioners need the inspiration of theory, but we may be some time away from theoretical results being routinely applied to large-scale problems.
As a footnote, Ravi Kannan is giving an upcoming talk at the Simons Institute on whether theory is necessary in clustering, which might offer some insight.