The post Flex, Regular Expressions, and Lexical Analysis appeared first on XRDS.

A programming language has an alphabet, words, grammar, statements, and semantics, and a computer program is built by organizing all of these. Flex helps developers create a tool called a lexical analyzer, which identifies the words of a program during the compilation/interpretation process.

A compiler takes a text file and reads it character by character, trying to match patterns at each of the aforementioned levels. The initial pass is the lexical analysis, which ensures that there are no lexical errors: characters that don’t contribute to meaningful words.

The meaningful words of a programming language form a regular language. Regular expressions are a notation for describing a regular language, and a flex lexical analyzer is a program that scans text for matches to those regular expressions.

Essentially, programming a flex lexer means defining various regular expressions which detail all of the possible words and characters that are meaningful in a correct program for your language.

To illustrate flex, this post uses an example language, BooleanLogicLanguage, which a flex-based lexer will lexically analyze. For no reason in particular, the purpose of this language is to evaluate Boolean logic expressions.

This diagram is an example of what a correct program would look like in BooleanLogicLanguage. The lexical analyzer should be able to scan it without errors. Note that in this language, ‘INTEGER(’ and ‘INTEGER)’ are the ‘words’ used to separate pieces of code, as opposed to ‘(’, ‘)’, ‘{’, or ‘}’. When creating your own language you are free to do whatever you want, but following convention, to some degree, helps ensure that the tool you are creating is easy to use.

NOTE: A programming language is a tool for using a computer. A GUI is a tool for using a computer. Siri or Cortana are tools for using a computer, etc.

This diagram is a sketch of the regular expressions which will be used in the flex program in order to describe meaningful words. The phrases on the left hand side are token names. Token names are not necessarily important during lexical analysis, but they are of the utmost importance when performing the syntactic analysis (which comes after a lexical analysis during compilation/interpretation — not covered in this post).

The most difficult part of this process is defining tokens and figuring out what sort of regular expression should describe them. For this example, I decided to make a language that would evaluate Boolean logic. Then I started writing potential programs in this language, and once I wrote enough that looked interesting and useful, I defined tokens and regular expressions to ensure those particular programs would be correct. I made code up that I liked and then fit tokens and regex to make them work.

**A Quick Note on Regular Expressions (Regex)**

[…] denotes a single thing, whose identity is dictated by what’s inside the brackets.

A single character is a single thing.

* denotes zero or more of the single thing before it.

? denotes one or zero of the single thing before it.

(…) is just a grouping, usually used with * or ?.

The difference between […] and (…) is that the square brackets represent one of what is inside them, while the parentheses are a grouping. For example, take [abc] and (abc): ‘a’, ‘b’, and ‘c’ each match the former, and only ‘abc’ matches the latter.

- denotes a range when used inside square brackets, and has specific applications that are very useful: A-Z, a-z, 0-9.
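The operators above can be tried out from C before writing any flex. The sketch below is only an illustration: it uses the POSIX `regcomp`/`regexec` functions as a stand-in for flex's own matcher, and the helper name `matches_fully` and the four pattern constants are made up for this example. The patterns also use `+` (one or more of the single thing before it), which the lexer later in this post relies on, plus `^` and `$` to force a whole-string match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Illustrative patterns; ^ and $ anchor the match to the whole string. */
static const char *REAL_RE    = "^-?[0-9]+\\.[0-9]+$"; /* optional sign, digits, dot, digits */
static const char *INTEGER_RE = "^-?[0-9]+$";          /* optional sign, one or more digits  */
static const char *CLASS_RE   = "^[abc]$";             /* exactly one of a, b, c             */
static const char *GROUP_RE   = "^(abc)*$";            /* zero or more copies of "abc"       */

/* Returns 1 if text matches the POSIX extended regular expression, else 0. */
int matches_fully(const char *pattern, const char *text) {
    regex_t re;
    /* REG_EXTENDED gives ?, + and (...) their usual meaning;
       REG_NOSUB says we only want a yes/no answer. */
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0); /* the patterns above are known to compile */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, `matches_fully(CLASS_RE, "b")` is true while `matches_fully(CLASS_RE, "abc")` is false, which is exactly the […] versus (…) distinction described above.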

A flex file is like a C program, except it has a further defined layout.

    %{
    C code declarations
    %}
    definitions of regular expressions
    %%
    regular expression1    code to execute on match (such as a macro)
    regular expression2    other code to execute on match (such as a function)
    %%
    C code functions
    main function

The power of Flex comes from being able to define regular expressions (between ‘%}’ and ‘%%’) and then attach them to code (between ‘%%’ and ‘%%’). The additional areas for C code are both handy and what gives the lexer its true functionality (doing something when a regex is matched).

NOTE: flex essentially turns this flex file (extension ‘.l’) into a C program, which is then compiled like any other C program. The result is an executable that you run on a text file containing programming code. In this case, the program writes its output to a text file (Tokens.txt) and also to stdout (the terminal).

    %{
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int numErrors = 0;
    const char* arg;          /* token name for the lexeme being stored */

    /* singly linked list of (lexeme, token) pairs */
    typedef struct node {
        char* lexeme;
        char* token;
        struct node* next;
    } node_t;

    node_t head;
    node_t* current = &head;

    int yywrap(void);
    void store(char* lexeme);
    void error(void);
    void printStored(void);
    %}

    whitespace          (\t|" "|\r)
    newline             \n
    real                -?[0-9]+\.[0-9]+
    integer             -?[0-9]+
    string              \".*\"
    boolean_op_binary   (" AND "|" OR "|" XOR ")
    boolean_op_unary    "NOT "
    boolean_literal     ("True"|"False")
    identifier          [A-Z]+
    separator_open      ({real}|{integer})\(
    separator_close     \)({real}|{integer})
    assignment          =
    IO                  "print "
    terminal            ;

    %%

    {whitespace}        {ECHO;}
    {newline}           {ECHO;}
    {real}              {ECHO;}
    {integer}           {ECHO;}
    {string}            {ECHO;}
    {boolean_op_binary} {ECHO; arg = "BOOL_OP_BINARY"; store(yytext);}
    {boolean_op_unary}  {ECHO; arg = "BOOL_OP_UNARY"; store(yytext);}
    {boolean_literal}   {ECHO; arg = "BOOL_LITERAL"; store(yytext);}
    {identifier}        {ECHO; arg = "IDENTIFIER"; store(yytext);}
    {separator_open}    {ECHO; arg = "OPEN"; store(yytext);}
    {separator_close}   {ECHO; arg = "CLOSE"; store(yytext);}
    {assignment}        {ECHO; arg = "ASSIGN"; store(yytext);}
    {IO}                {ECHO; arg = "IO"; store(yytext);}
    {terminal}          {ECHO; arg = "TERMINAL"; store(yytext);}
    .                   {ECHO; numErrors++; error();}

    %%

    int yywrap(void) {
        return 1;
    }

    /* copies the lexeme and its token name into the list, then appends an empty node */
    void store(char* lexeme) {
        /* strlen + 1 bytes: room for the text plus the terminating NUL */
        current->lexeme = malloc(strlen(lexeme) + 1);
        strcpy(current->lexeme, lexeme);
        current->token = malloc(strlen(arg) + 1);
        strcpy(current->token, arg);
        /* calloc zeroes the new node, so its next pointer is NULL */
        node_t* temp = calloc(1, sizeof(node_t));
        current->next = temp;
        current = current->next;
    }

    /* marks an unrecognized character in the echoed output */
    void error(void) {
        printf("[e]");
    }

    /* writes one "lexeme<TAB>token" line per stored node */
    void printStored(void) {
        node_t* c = &head;
        FILE* f = fopen("Tokens.txt", "w");
        if (!f) {
            perror("Tokens.txt");
            return;
        }
        while (c->next) {
            fprintf(f, "%s\t%s\n", c->lexeme, c->token);
            c = c->next;
        }
        fclose(f);
        printf("Tokens.txt written.\n");
    }

    int main(int argc, char *argv[]) {
        // ensures number of command line arguments
        if (argc != 2) {
            printf("Please enter one filename as an argument.\n");
            return -1;
        }
        // opens the file with name of second argument
        yyin = fopen(argv[1], "r");
        if (!yyin) {
            perror(argv[1]);
            return -1;
        }
        yylex();
        // close file
        fclose(yyin);
        printf("\nLexicalErrors %d\n", numErrors);
        printStored();
        return 0;
    }

The above code is a flex file which parses the BooleanLogicLanguage.

    CC=gcc
    CFLAGS=

    LexerFile=lexer

    lexer: lex.yy.c
    	$(CC) $(CFLAGS) -o lexer lex.yy.c

    lex.yy.c: $(LexerFile).l
    	flex $(LexerFile).l

The above code is a makefile which, when run in the same directory as the flex file, will create the ‘lexer’ program. This was tested on an Ubuntu 16.04 operating system. The GNU C compiler (gcc) is required in addition to flex.

    P = True;
    R = False;
    Q = 1(NOT P)1 XOR 2(P AND 3(NOT R)3)2;
    print Q;

The above text should be parsed without errors by our lexer. And the lexer should output a Tokens.txt matching each lexeme to the token describing its regex.

    P       IDENTIFIER
    =       ASSIGN
    True    BOOL_LITERAL
    ;       TERMINAL
    R       IDENTIFIER
    =       ASSIGN
    False   BOOL_LITERAL
    ;       TERMINAL
    Q       IDENTIFIER
    =       ASSIGN
    1(      OPEN
    NOT     BOOL_OP_UNARY
    P       IDENTIFIER
    )1      CLOSE
    XOR     BOOL_OP_BINARY
    2(      OPEN
    P       IDENTIFIER
    AND     BOOL_OP_BINARY
    3(      OPEN
    NOT     BOOL_OP_UNARY
    R       IDENTIFIER
    )3      CLOSE
    )2      CLOSE
    ;       TERMINAL
    print   IO
    Q       IDENTIFIER
    ;       TERMINAL

Here is the Tokens.txt for that initial errorless program.

The above is a screenshot of an example of parsing a program with lexical errors.

**What is Next?**

Classically, once you can identify words, including what type of word (token), you can then create a grammar. You could think of a token as ‘noun’ or ‘verb’ or ‘adjective’, if comparing a programming language to a natural language. The step after lexical analysis (checking for correctness of words) is syntactic analysis (checking for correctness of grammar). Making a comparison to natural languages again, an English grammar could be PHRASE: article noun verb (The dog ran, A bird flies, etc). In BooleanLogicLanguage there could be STATEMENT: IDENTIFIER ASSIGN BOOL_LITERAL (Q = True, A = False, etc). Each of those is an example of just one type of phrase or statement; you can have multiple definitions for a production in the grammar for your language.
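To make the STATEMENT production concrete, here is a minimal sketch of checking a token sequence against it in C. This is not how a real parser is built (a parser generator would normally do this work), and the enum `token_kind` and the function `matches_statement` are hypothetical names invented for this illustration; only the token names come from the lexer above.

```c
#include <assert.h>
#include <stddef.h>

/* Token names as produced by the lexer; OTHER stands for everything else. */
typedef enum { IDENTIFIER, ASSIGN, BOOL_LITERAL, TERMINAL, OTHER } token_kind;

/* Returns 1 if the n tokens are exactly IDENTIFIER ASSIGN BOOL_LITERAL, else 0. */
int matches_statement(const token_kind *tokens, size_t n) {
    static const token_kind production[] = { IDENTIFIER, ASSIGN, BOOL_LITERAL };
    const size_t len = sizeof production / sizeof production[0];
    assert(tokens != NULL);
    if (n != len)
        return 0;
    for (size_t i = 0; i < len; i++) {
        if (tokens[i] != production[i])
            return 0; /* wrong kind of word at position i */
    }
    return 1;
}
```

The tokens for `Q = True` pass this check, while the same three tokens in a different order do not, even though every individual word is lexically valid.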

Non-Classically? I don’t know. Natural Language Processing (NLP) is a very active area, but I’m not aware of anyone using those techniques for parsing textual programming languages.


The post How Connected are Quantum Graphs? appeared first on XRDS.

Let’s begin by setting the scene. Let *X* be a complete vector space equipped with an inner product (also known as a *Hilbert space*) and let *T: X → X* be a self-adjoint bounded linear operator on the vector space with domain *D(T)*. (If these terms are foreign to you, don’t worry too much. It is enough to accept that the Laplacians we’ll be talking about come with things called eigenvalues, so keep reading!) The *point spectrum* of *T* is

*σ_{p}(T) = { λ | ∃ some nonzero u in D(T) with Tu = λu }*

We call these *λ* the *eigenvalues* of *T*, and they will be our main players.

Let’s begin with something comfy: discrete graphs. That is, a collection of vertices, *V*, and edges, *E*, connecting the vertices. In the discrete graph world, when we talk about functions on the graph we mean a function over the vertices. Most often, this function will assign a real number to each vertex in the graph, something like this: *φ: V → R*. The graph Laplacian *L* then acts on such a function by

*(Lφ)(v) = ∑_{u~v} (φ(v) − φ(u)) = ((D − A)φ)(v)*

(Take a look at my previous post if this looks like nonsense.)
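For readers who prefer code to formulas, here is a small sketch of the discrete Laplacian *L = D − A* in C. It is only an illustration under simplifying assumptions: the graph is given as an edge list, matrices are stored densely with a hypothetical size cap `MAX_N`, and the function names are made up for this example. It also computes the quadratic form φᵀLφ, which equals the sum over edges of (φ(u) − φ(v))² and is therefore never negative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_N 16 /* hypothetical cap on the number of vertices */

typedef struct { int u, v; } edge;

/* Fills L with D - A for an undirected simple graph on vertices 0..n-1. */
void build_laplacian(int n, const edge *edges, size_t m, double L[MAX_N][MAX_N]) {
    assert(n > 0 && n <= MAX_N);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            L[i][j] = 0.0;
    for (size_t k = 0; k < m; k++) {
        int u = edges[k].u, v = edges[k].v;
        assert(u >= 0 && u < n && v >= 0 && v < n && u != v);
        L[u][u] += 1.0; /* degree matrix D */
        L[v][v] += 1.0;
        L[u][v] -= 1.0; /* minus adjacency matrix A */
        L[v][u] -= 1.0;
    }
}

/* out = L * phi */
void apply_laplacian(int n, double L[MAX_N][MAX_N], const double *phi, double *out) {
    for (int i = 0; i < n; i++) {
        out[i] = 0.0;
        for (int j = 0; j < n; j++)
            out[i] += L[i][j] * phi[j];
    }
}

/* phi^T L phi, i.e. the sum over edges of (phi(u) - phi(v))^2 */
double quadratic_form(int n, double L[MAX_N][MAX_N], const double *phi) {
    double out[MAX_N];
    double q = 0.0;
    assert(n > 0 && n <= MAX_N);
    apply_laplacian(n, L, phi, out);
    for (int i = 0; i < n; i++)
        q += phi[i] * out[i];
    return q;
}
```

On the path graph 0 - 1 - 2, a constant function is sent to zero (so 0 is always an eigenvalue), and φ = (0, 1, 3) gives a quadratic form of (0 − 1)² + (1 − 3)² = 5.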

An important question in spectral geometry is

To what degree does the spectrum of the Laplace operator determine the underlying space?

That is, what can the eigenvalues of the graph Laplacian tell us about the graph itself? In this post, I’ll explore this question and specifically how quantum graph spectra behave in somewhat unexpected ways.

Again, before tackling quantum graphs, let’s become a little more familiar with the discrete setting. If the graph has *n* vertices, the discrete Laplacian is an *n × n* matrix. The discrete Laplacian is positive semidefinite, and so it has exactly *n* non-negative real eigenvalues *0 = λ_{0} ≤ λ_{1} ≤ … ≤ λ_{n-1}*. The second smallest eigenvalue, *λ_{1}*, measures how well connected the graph is: it is positive exactly when the graph is connected, and adding an edge to the graph never decreases it.

This seems simple enough, but already things get wacky with quantum graphs.

A quantum graph is a simple graph (a set of vertices and set of undirected edges with no self-loops or multiedges) now with lengths associated to each edge. A first difference between a quantum graph and a typical graph is that things are no longer discrete. Now when we think of a function on the graph, we think of a function that is defined on the vertices and *all points along the edges*. The last piece that defines a quantum graph is a self-adjoint differential operator, which acts much like the Laplacian on discrete graphs. Now when we speak about the spectrum, we are referring to eigenvalues of this operator.

So what can be said about the connectivity of a quantum graph? Well, let’s think about adding an edge between two existing vertices. In the discrete case, adding an edge essentially made the two endpoints “closer” in graph terms. In the quantum case, however, we are now adding to the total *length* of the graph, since each edge has an associated length and adds all points along the edge. So, when the edge is long enough, we are actually decreasing the connectivity of the graph! That is, by adding this edge we may be shrinking *λ_{1}*. The exact statement is given by Kurasov et al. in this paper, and includes the specific conditions on the edge for which the connectivity shrinks.

So is there a way to change the connectivity without adding length to the graph? The same paper tells us there is! Namely, if we change the graph by picking two vertices and joining them together, but keeping the set of edges the same, we can increase connectivity! Their proof is some clever manipulation of the Rayleigh quotient to show that *λ_{1}* for the newly obtained graph is at least as big as *λ_{1}* for the original graph.

One fun consequence of this is the fact that the most highly connected quantum graph is the flower graph with *n* loops attached to one vertex. Tell that to your sweetheart next time you bring them daisies.

*The flower graph with n loops has the largest spectral gap among all graphs formed by a given set of edges.*

To end, we’ll have just a bit of fun. Last time we asked whether or not, given a set of eigenvalues, it is possible to determine the shape of the graph. This is related to the question of hearing the shape of a drum given its fundamental tone.

We said generally that it is possible to have non-isomorphic graphs share the same spectrum (isospectral). So, unfortunately, it is not possible to hear the shape of a graph (or a drum, for that matter). But how bad are our chances for discrete vs. quantum graphs? Well, for discrete graphs, our chances are pretty bad. It has been shown that there are families of non-isomorphic isospectral graphs whose size grows exponentially in the number of edges. That’s a lot of graphs that sound the same! The quantum case? Well, according to Ralf Rueckriemann, who did his thesis on quantum graph spectra, we at least know that any family of isospectral quantum graphs is finite. So if you really want a better shot at hearing the shape of a graph, go quantum!

The post The Geometric Origins of Spectral Graph Theory appeared first on XRDS.

As I develop my research in spectral graph theory, I am consistently amazed by the fact that many results in spectral graph theory can be seen as discrete analogues of results in spectral geometry. I am not accustomed to thinking of graphs as geometric objects, but in fact graph Laplacian matrices are nicely related to Laplace operators on Riemannian manifolds. In this post, I’d like to discuss a few of these relationships.

Let’s consider a simple example. Imagine a membrane in the $latex xy$ plane with vertical displacement given by $latex z(x, y)$. Then the movement of the membrane can be described by the wave equation:

$latex \Delta z = \frac{1}{c^2}\cdot\frac{\partial^2 z}{\partial t^2}, \ \ \ \ \ \ \ (1)$

where $latex t$ represents time, $latex c$ is the speed of the wave, and $latex \Delta$ is the Laplace operator acting on the function $latex z$. If we assume the membrane moves like a spring, Hooke’s law gives us

$latex \frac{\partial^2 z}{\partial t^2} = -kz,$

where $latex k$ is a constant representing the stiffness of the spring. Then, plugging this in to (1), our wave equation simplifies to

$latex -\Delta z = \frac{k}{c^2}z. \ \ \ \ \ \ \ (2)$

In other words, $latex z$ is an eigenfunction of the Laplace operator $latex \Delta$. We’ll return to this in a bit, but for now let’s do a little magic on the left side of the equation.

The Laplace operator returns the sum of second partial derivatives, so we can expand the above as

$latex \frac{\partial^2 z}{\partial x^2}+\frac{\partial^2 z}{\partial y^2} = \frac{1}{c^2}\cdot\frac{\partial^2 z}{\partial t^2}.$

Suppose we want to approximate the membrane by a discrete grid with spacing $latex w$. That is, we would like to approximate the continuous Laplacian on the left side of the equation. First, we have that

$latex \frac{\partial z(x,y)}{\partial x} \approx \frac{z(x,y) - z(x-w,y)}{w},$

$latex \frac{\partial z(x,y)}{\partial y} \approx \frac{z(x,y) - z(x,y-w)}{w},$

so we can approximate the second partial derivative with respect to $latex x$ by

$latex \frac{\partial^2 z(x,y)}{\partial x^2} \approx \frac{\frac{\partial z(x+w,y)}{\partial x} - \frac{\partial z(x,y)}{\partial x}}{w} = \frac{z(x+w,y) + z(x-w,y) - 2z(x,y)}{w^2},$

and similarly for $latex y$. Then the left side of (2) becomes:

$latex -\Delta z \approx \frac{4z(x,y) - z(x+w,y) - z(x-w,y) - z(x,y+w) - z(x, y-w)}{w^2}.$

(This is the finite difference method.) Now recall, this was derived by approximating the membrane by a discrete mesh. Let’s visualize a few points on the grid:

Suppose we instead consider these points as nodes and edges in a graph, and use the graph Laplacian $latex L = D-A$ as our operator. Then we can see that

$latex Lz(x,y) = 4z(x,y) - z(x+w,y) - z(x-w,y) - z(x,y+w) - z(x, y-w).$

But this is exactly the (unnormalized) approximated discrete Laplacian above, up to the factor $latex 1/w^2$! Remarkable! Now we’re beginning to see that this whole graph “Laplacian” nomenclature has some substance after all: it really is a discrete analogue of the Laplace operator!
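The finite difference formula can be checked numerically. The following C sketch is an illustration only: the function names are invented here, and the test surface $latex z(x,y) = x^2 + y^2$ is an assumption chosen because its Laplacian is known exactly ($latex \Delta z = 4$, so $latex -\Delta z = -4$) and the five-point formula reproduces it without error for quadratics.

```c
#include <assert.h>

/* Hypothetical test surface: z(x, y) = x^2 + y^2, for which -Laplacian(z) = -4. */
double z_quadratic(double x, double y) {
    return x * x + y * y;
}

/* Five-point approximation of -Laplacian(z) at (x, y) on a grid with spacing w. */
double neg_laplacian_5pt(double (*z)(double, double), double x, double y, double w) {
    assert(w > 0.0);
    /* Numerator: the graph Laplacian row for a grid node with four neighbours. */
    double graph_row = 4.0 * z(x, y)
                     - z(x + w, y) - z(x - w, y)
                     - z(x, y + w) - z(x, y - w);
    /* Dividing by w^2 turns the graph Laplacian into the finite difference estimate. */
    return graph_row / (w * w);
}
```

Calling `neg_laplacian_5pt(z_quadratic, 1.0, 2.0, 0.5)` evaluates the stencil at the point (1, 2); the intermediate value `graph_row` is precisely $latex Lz(x,y)$ from the graph point of view.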

Beyond their operations, there are elegant parallels between the Riemannian manifold setting and the discrete graph setting for Laplacians. For example, let $latex U \subset \mathbb{R}^d$ be an open, non-empty set and let $latex L^2(U)$ denote the space of square integrable functions on $latex U$. Then the domain of the Laplace operator $latex D(\Delta)$ is the space of smooth, compactly supported functions on $latex U$, which is dense in $latex L^2(U)$. A first property is that the Laplace operator is symmetric on $latex D(\Delta)$. Similarly, the Laplacian $latex L$ is a symmetric matrix. Another parallel is the range of the spectra of $latex -\Delta$ and $latex L$. On $latex \mathbb{R}^n$, the spectrum of $latex -\Delta$ is $latex [0,\infty)$, and similarly $latex L$ has all non-negative real eigenvalues (that is, it is always positive semidefinite).

According to spectral graph theorist Fan Chung, it is possible to treat the continuous and discrete cases by a universal approach:

The general setting is as follows:

- an underlying space $latex M$ with a finite measure $latex \mu$;
- a well-defined Laplace operator $latex \mathcal{L}$ on functions on $latex M$ [a matrix or a graph] so that $latex \mathcal{L}$ is a self-adjoint operator in $latex L^2(M,\mu)$ with a discrete spectrum;
- if $latex M$ has a boundary then the boundary condition should be chosen so that it does not disrupt self-adjointness of $latex \mathcal{L}$;
- a distance function $latex dist(x,y)$ on $latex M$ so that $latex |\nabla dist| \leq 1$ for an appropriate notion of gradient.

(Chung, Spectral Graph Theory, 48.)

The spectrum of the continuous Laplace operator gained due recognition with the famous question posed by Mark Kac: *can you hear the shape of a drum?* The question essentially asked whether drums can be isospectral, or share eigenvalues.

Specifically, let’s model a drum as a membrane stretched and clamped over a boundary, represented by some domain $latex D$ in the plane. Let $latex \lambda_i$ be the *Dirichlet eigenvalues*, defined by

$latex -\Delta u = \lambda u,$

with the constraint that $latex u = 0$ on the boundary of $latex D$ (consider the membrane from equation (2) with a boundary, for instance). Then these Dirichlet eigenvalues are precisely the fundamental tone and harmonics the drum can produce. The question, then, is: given the set of Dirichlet eigenvalues, can we infer the shape of the drum? That is, do there exist distinct isospectral domains in the plane?

We can describe a similar problem in the discrete case. Let $latex G$ be a graph and $latex S$ an induced subgraph with non-empty vertex boundary (the set of vertices not in $latex S$ but adjacent to vertices in $latex S$). Then we say a function $latex f: V \rightarrow \mathbb{R}$ satisfies the Dirichlet boundary condition when $latex f(v) = 0$ for every vertex $latex v$ in the vertex boundary of $latex S$. Then, for some function $latex f$ satisfying the Dirichlet boundary condition, the Dirichlet eigenvalues of $latex G$ with respect to $latex S$ are the $latex \lambda_i$ satisfying

$latex \mathcal{L}f(v) = \lambda f(v)$

for every $latex v$ in $latex S$. Note here that $latex \mathcal{L}$ is the *normalized Laplacian* given by $latex \mathcal{L} = D^{-1/2}LD^{-1/2}$. Then an analogous question is: given the set of Dirichlet eigenvalues, can we infer the shape of $latex S$?

In both settings it turns out that distinct isospectral shapes do exist, so the spectrum alone does not determine the shape. The first construction of isospectral drums in two dimensions was given by Gordon, Webb, and Wolpert in 1992, and in the late 2000s, Ram Band, Ori Parzanchevski, and Gilad Ben-Shach gave a construction of isospectral drums and graphs (a follow up is here).

So, as it turns out the “Laplacian” name of our star player in spectral graph theory is not so arbitrary, and there are many parallels between the continuous Laplace operator and the discrete graph Laplacian. As I continue to enrich my understanding of the connections between the two cases, I can only hope that the power of the Laplace operator will help me gain intuition about the power of the graph Laplacian. To conclude, I leave the reader with a few words from Chung’s book:

For almost every known result in spectral geometry, a corresponding [question] can be asked: Can the results be translated to graph theory?

Chung, Spectral Graph Theory, 54.

*A special thanks to Kyle Mooney for the images.*

The post Big Data, Communication and Lower Bounds appeared first on XRDS.

However, transferring big data is very expensive. In fact, it is more expensive than the computations on the datasets. Thus, in the distributed model, the amount of communication plays an important role in the total cost of an algorithm, and the aim is to minimize the amount of communication among processors (CPUs). This need, which arises from Big Data processing, is one of the main motivations for studying the theory of Communication Complexity.

Communication Complexity (CC) has a rich theory behind it and exhibits a beautiful mathematical structure which can be explored by using various mathematical tools.

In fact, Communication Complexity can be applied to many different problems from theory of computation to other related fields, making this area a fundamental key in our understanding of computation.

CC studies the number of bits that the participants of a communication system need to exchange in order to perform certain tasks. A very simple model for exploring this type of question was proposed by Yao in 1979 [1]. In this model there are two parties, called Alice and Bob, and their goal is to compute some specific function f(x, y), where x is the input for Alice and y is the input for Bob. The results proven in this model can be generalized to more complicated scenarios as well.
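To get a feel for the two-party model, the C sketch below simulates two very simple protocols and counts the bits exchanged. It is only an illustration: the struct `transcript` and the function names are invented for this example, and both protocols are the obvious ones rather than anything taken from [1]. For EQUALITY (is x equal to y?), Alice sends all n bits of x and Bob replies with the 1-bit answer, for n + 1 bits in total. For the parity of x XOR y, Alice only needs to send the parity of x, so 2 bits suffice no matter how long the inputs are.

```c
#include <assert.h>
#include <stdint.h>

/* Result of running a protocol: the computed value and the bits exchanged. */
typedef struct {
    int value;
    int bits_exchanged;
} transcript;

/* Parity (XOR of all bits) of a 32-bit word. */
static int parity(uint32_t x) {
    int p = 0;
    while (x) {
        p ^= (int)(x & 1u);
        x >>= 1;
    }
    return p;
}

/* EQUALITY on n-bit inputs: Alice sends x in full, Bob answers with one bit. */
transcript equality_trivial(uint32_t x, uint32_t y, int n) {
    transcript t;
    assert(n >= 1 && n <= 32);
    t.bits_exchanged = n;   /* Alice -> Bob: the n bits of x */
    t.value = (x == y);     /* Bob compares locally */
    t.bits_exchanged += 1;  /* Bob -> Alice: the answer */
    return t;
}

/* Parity of (x XOR y): Alice sends one bit, Bob answers with one bit. */
transcript parity_protocol(uint32_t x, uint32_t y) {
    transcript t;
    int alice_bit = parity(x);        /* Alice -> Bob: 1 bit */
    t.value = alice_bit ^ parity(y);  /* Bob combines it with his own parity */
    t.bits_exchanged = 2;             /* plus Bob -> Alice: the answer */
    return t;
}
```

The contrast between the two costs is the point: how much communication is needed depends on the function f, not just on the size of the inputs.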

Although at first glance the field of communication complexity seems mostly related to problems that involve explicit communication, such as distributed computing, its applications are in fact much broader, and in some of them communication does not even appear in the problem. Examples of such problems are the design of Boolean circuits, networks, and data structures, in particular with regard to computing lower bounds on the cost of these types of problems.

It might be surprising and odd that CC can be applied to problems in which communication is not involved. Thus, here I discuss a few basic problems in which communication complexity plays a key role:

**Distributed Learning via CC:**

Let’s consider a framework where the data is distributed among different locations and parties (each having an arbitrary partition of an overall dataset). Our main goal is to learn a low-error hypothesis with respect to the overall distribution of the data, using as little communication and as few rounds of communication as possible. In other words, in distributed learning we are looking for applicable techniques for achieving communication-efficient learning. Different problems such as classification, optimization, and differential privacy have been discussed in this setting in some recent work [2, 3, 4].

**Data Outsourcing and Streaming Interactive Proofs via CC:**

When the dataset is fairly large, the data owner cannot retain all the data, so storage and computation need to be outsourced to a service provider. In such situations, the data owner wants to be assured that the computations performed by the service provider are correct and complete. We can model this scenario as a verification protocol over a data stream, in which there is a resource-limited verifier V and a more powerful prover P. The verifier starts a conversation with the prover, which does the computations and solves the problem. Then the prover sends a proof to show the validity of its answer and convince the verifier to accept its results. The streaming data models the incremental pass over the data by the verifier as it sends the data to the cloud. In this setting, the verifier only needs to keep track of a logarithmic amount of the data, but in exchange information must be communicated among the players. Here, the goal is to design an interactive proof system [5] with logarithmic communication to verify each query: after seeing the input and the proof, the verifier should be able to verify the proof of a correct statement with high probability, and reject every possible proof which is presented for a wrong statement. Note that here we consider a more powerful verifier by allowing probabilistic verification. In this way, the problem of verification in cloud computing for massive data streams is linked to communication complexity theory and Arthur-Merlin games [6]. There has been a series of works on streaming interactive proofs for different problems, which can be found in [7].

**Data Structure Lower Bound via CC:**

Here the golden key is to discover the link between communication complexity and data structures, and then use this connection to prove lower bounds for data structures supporting certain types of queries. For example, suppose we want to design an efficient data structure for answering queries of the type “is i in S?”. To evaluate the quality of an implementation, there are two measures: (1) space, which is the total number of memory cells used; and (2) time, which is the number of accesses (reads or writes) to the memory needed to accomplish a task and answer a query. This data structure problem can be viewed as a communication complexity problem with two parties: one party (Alice) gets as input a set S and the other party (Bob) gets as input an element i. The goal is to check whether i is in S. It can be shown that any implementation of the data structure can be turned into a protocol for the communication problem whose complexity is related to the complexity of the data structure. As a result, a bound on the communication complexity implies a time-space trade-off for the corresponding data structure.

A simple scenario to show this connection is as follows: suppose there is a cell-probe algorithm [8] for a problem which uses a data structure with space s, cells of w bits each, and t probes per query. This results in a communication protocol for the same problem with communication t (w + log s) in the following way. In this simulation Alice plays the query algorithm and Bob holds the memory. When the processor asks for the contents of a memory cell, Alice sends a message of log s bits indicating the index of the desired cell, and Bob answers with w bits describing the content of the cell; this exchange is repeated for t rounds of communication. A nice study of communication complexity techniques for computing data structure lower bounds can be found in [9].
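The cost t (w + log s) of this simulation is easy to tabulate. The C sketch below is only a worked illustration: the function names are made up, and rounding log s up to a whole number of bits is an assumption made here so that the index of a cell fits in the message.

```c
#include <assert.h>

/* Smallest k with 2^k >= s: the number of bits needed to name one of s cells. */
unsigned ceil_log2(unsigned long long s) {
    unsigned k = 0;
    assert(s >= 1 && s <= (1ULL << 62));
    while ((1ULL << k) < s)
        k++;
    return k;
}

/* Total bits exchanged when t probes are simulated, each costing
   ceil(log2 s) bits for the cell index plus w bits for the cell content. */
unsigned long long simulation_cost_bits(unsigned long long t, unsigned w,
                                        unsigned long long s) {
    return t * ((unsigned long long)w + ceil_log2(s));
}
```

For example, with s = 1024 cells of w = 32 bits and t = 5 probes, the protocol exchanges 5 × (32 + 10) = 210 bits. Read in the other direction, a lower bound on the communication forces t, w, or s to be large.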

**Property Testing Lower Bound via CC:**

Property testing was discussed in a previous [post] as a type of sublinear algorithm. To recap, our goal here is to formalize the question “what can be determined about a large object when we have limited access to it?”. Studies show that there is a strong connection between testing and communication complexity [10]. The biggest similarity is that both involve parties (the tester and the communication players) with unbounded computational power and restricted access to their input.

The authors of [10] consider the case where the large object is a Boolean function f on n input bits and the goal is to decide whether this function has the property P. A variety of techniques and algorithms have been developed for testing Boolean functions, but what distinguishes this work is that it proposes techniques for reducing property testing to communication complexity and uses this connection to prove lower bounds for certain types of testing problems.

The main idea behind the reduction from testing to communication complexity is to set up a communication game as follows: Alice has a function f and Bob has a function g as inputs, and they want to check whether the joint function h, which is some combination of f and g, has a particular property P or is ε-far from all the functions which have the property P. In this setting, the link is that the number of queries required for testing whether h has this property is related to the number of bits which Alice and Bob need to communicate to do this task.
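The notion of being ε-far can be made concrete with truth tables. The sketch below uses the standard definition of distance between two Boolean functions (the fraction of inputs on which they disagree); everything else is an assumption for illustration: the function names are invented, the property is given as an explicit finite list of truth tables, and “ε-far” is taken to mean distance at least ε from every member of the property.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of the table_size inputs on which truth tables f and g disagree. */
double distance(const unsigned char *f, const unsigned char *g, size_t table_size) {
    size_t differ = 0;
    assert(table_size > 0);
    for (size_t i = 0; i < table_size; i++) {
        if ((f[i] != 0) != (g[i] != 0))
            differ++;
    }
    return (double)differ / (double)table_size;
}

/* Returns 1 if h is at distance >= eps from every listed function, else 0. */
int is_eps_far(const unsigned char *h, const unsigned char *const *property,
               size_t members, size_t table_size, double eps) {
    for (size_t k = 0; k < members; k++) {
        if (distance(h, property[k], table_size) < eps)
            return 0; /* h is close to some function that has the property */
    }
    return 1;
}
```

On two input bits, XOR differs from OR on one of the four inputs, so XOR is 1/4-far but not 1/2-far from the toy property {AND, OR}. A tester only has to distinguish functions that have the property from functions that are ε-far from it, which is what makes sublinear query counts possible.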

As you can see from what we discussed above, in the cases where communication is not explicitly involved, communication complexity is used for proving lower bounds. The communication complexity framework has been well studied, and there are several basic problems which are known to require a large amount of communication. The hardness of these and related problems has been used to obtain lower bounds in many areas such as streaming algorithms, circuit complexity, data structures, proof complexity, and property testing. The basic idea is as follows: for some specific problem that we would like to bound, instead of starting from “scratch” by studying the structure of the problem, we try to find a connection between it and a hard communication problem whose communication complexity is already well understood. If we can find such a connection, then we can reduce the work involved in proving new bounds, or give simpler proofs of known bounds.

Now maybe the big question here is why we care about computing lower bounds, and what is important about them.

Observe the main difference between upper bounds and lower bounds [11]: Upper bounds show the existence of an efficient solution, while lower bounds must say something about all possible solutions even those which no one has thought of yet. So it’s not surprising that proving some non-trivial lower bound is significantly harder than obtaining some non-trivial upper bound. The natural goal when proving lower bounds is of course to show that the upper bounds we know for some problem are optimal, i.e. there cannot exist a faster data structure than the one we already have.

Now think of big data: after decades of research, we have arrived at efficient solutions, i.e., upper bounds, for most of the well-known problems in the field. However, since we are dealing with massive data sets, even a small improvement in the performance of any key algorithm or data structure would have a huge impact.

Thus researchers strive to improve the known solutions. But when does it end? Can we always improve the solutions we have? Or is there some limit to how efficiently a data structure problem can be solved? This is exactly the question addressed by lower bounds. Lower bounds are mathematical functions putting a limit on the performance of algorithms and data structures [11].

As a concluding remark, the theory of communication complexity and the techniques for proving lower bounds serve as two important tools for improving our ability to design efficient algorithms and data structures for massive data.

The post Big Data, Communication and Lower Bounds appeared first on XRDS.

The post Software Packages for Theoreticians by Theoreticians appeared first on XRDS.

His talk (as well as many of the others) is archived and available thanks to ICERM. I will focus on one highlight, a point that resonated with the conclusion of Richard Peng’s talk: a call for more software implementing these new, fast algorithms. In this light, I’d like to briefly discuss some of the software packages for spectral graph theory and the analysis of large graphs that are being developed by theoreticians active in the area.

Trilinos is a project out of Sandia National Labs for developing robust parallel algorithms and implementing them in general-purpose software. The focus of the project is enabling technologies for large-scale scientific problems, so as to encourage further research in the field of parallel, robust, large-scale algorithms. In recognition of existing software for numerical computation, the developers of Trilinos make use of established packages such as LAPACK (for solving systems of simultaneous linear equations), and provide interfaces for Aztec (a parallel solver for sparse linear systems), SuperLU (high performance LU factorization for solving linear systems), Mumps, and Umfpack, among others. Currently, Trilinos provides robust parallel numerical algorithms for automatic differentiation, partitioning, preconditioning, and solving linear and nonlinear systems, to name a few. The beauty of having a team of algorithmists behind the project is the emphasis on enabling further algorithmic research by building tools for developing tools.

Erik Boman has contributed important work in the area of preconditioners for linear systems and the support theory for preconditioners. Zoltan, one of his projects, is a toolkit, also from Sandia National Labs, comprising combinatorial algorithms for parallel or unstructured applications. It uses dynamic load balancing and partitioning algorithms to parallelize applications whose workloads change over the course of the computation. To deal with the problem of *dynamic* partitioning, a suite of partitioning algorithms is included in the Zoltan toolkit. In particular, it makes use of geometric algorithms (group together objects that are physically close), graph algorithms (minimize a cut dividing groups of objects), and hypergraph algorithms (minimize communication costs between groups of objects) for load-balancing partitioning. Zoltan also provides graph coloring and graph ordering, which in turn can be used for parallel preconditioners and linear solvers.

Sangria is another project focused on developing and implementing parallel geometric and numerical algorithms. Housed at Carnegie Mellon, the software uses parallel algorithms to simulate complex flows with dynamic interfaces while achieving good accuracy.

MatlabBGL is a Matlab package written by David Gleich, designed to work with sparse graphs on hundreds of thousands of nodes. The library includes common graph algorithms such as computing shortest paths (Dijkstra, Bellman-Ford, Floyd-Warshall), finding a minimum spanning tree (Kruskal, Prim), depth-first search, breadth-first search, and max flow – all optimized for efficiency on large graphs. One of the most useful features of MatlabBGL is the visitor feature for monitoring an algorithm, implemented from the Boost Graph Library. Visitors output all the steps taken by the algorithm, and dissecting the output is useful for optimization. A nice illustration of this feature with Dijkstra’s algorithm in the documentation tells us how a graph is explored, vertex by vertex.

Benoît Hudson’s PhD work was on sparse mesh refinement, and SVR (for Sparse Voronoi Refinement) is the implementation of his algorithm for Delaunay refinement. SVR is a provably fast algorithm for producing small meshes, which is useful for when remeshing occurs during simulation due to domain change or refinement.

SpA is a Matlab program for computing the effective resistances in an electrical network, and is an implementation of the Spielman-Srivastava algorithm. As this requires solving linear systems, it uses Combinatorial Multigrid (CMG), a Matlab-based solver for linear systems in symmetric diagonally dominant matrices, written by Yiannis Koutis.

This is by no means a comprehensive list, and I encourage you to add other useful software in the comments.


The post The Thesis appeared first on XRDS.

**Problem Definition**

A typical PhD follows a simple process: read, think, propose, publish, and write the *thesis*. It looks straightforward, and one might imagine that once the rest is done, the write-up would be rather easy. But it is not.

The problem lies mostly in the fact that writing the thesis is a lengthy and lonely act. You have to do it yourself; nobody will come to your aid, except maybe your advisor.

In my case, I faced the following problem: for quite some time, I could not motivate myself to write it down. I would begin writing and, half a page later, I would always stop. I tried everything, but nothing seemed to motivate me. My advisor got uncomfortable and we began talking about a method to track my progress that would motivate me.

**The Idea**

Then I saw it, Georgios Gousios’s Thesis-o-meter (see link below). This was a couple of scripts that posted every day the progress of the PhD in each chapter. I decided to do it myself, introducing some alterations that would work better for me.

First, I had to find a tangible way to measure progress. I thought that was easy: the number of pages. The number of pages of a document is fine if you want to measure the size of the text, but it cannot act as a day-to-day key performance indicator (KPI). Why? Because if you bootstrap your thesis in LaTeX and put in all the standard chapters, bibliography, etc., you will find yourself with at least 15 pages. So, on that day I would show enormous progress. The next day, I would write only text, perhaps one or two pages. The day after, I would write text and add some charts, which would count as three or four pages. Better, huh? This is the problem.

If you are a person like me, you could add one or two figures, and say “Ok, I am good for today, I added two pages!”. This is a nice excuse if you want to procrastinate. I needed something that would present the naked truth. That would make me sit there and make some serious progress.

So, the number of pages was out of the question as a daily metric, but I thought that we could still use it. The number of pages would be the end goal, with a minimum and a maximum. In Greece, a PhD thesis is usually 150 to 200 pages long (in my discipline of course, computer science). So, I thought, this is the goal: a large block of text within those limits.

Then I thought that my metric should be the number of words in the text instead of the number of pages. Since I wrote my thesis in LaTeX, I could just count the words in each file with standard UNIX tools, for example with the command `wc -w myfile.tex`. So, the algorithm has the following steps:

- The goal is set to 150-200 pages in total
- Each day:
  - Count the words for all files
  - Count the pages of the actual thesis file, for example the output PDF
  - Find the word contribution for that day by subtracting the previous day’s word count
  - Find the average number of words per page
  - Finally, provide an estimate for the completion of the thesis

**Experience Report**

I implemented this in Python and shell script. The process worked: each day a report was generated and sent to my advisor. The best thing was that each day I saw the estimate shrink a little. This is the last report I produced:

    10c10
    1899 build/2-meta-programming.tex
    13c13
    1164 build/3-requirements.tex
    60,61c60,61
    < 13931 build/thesis.bib
    14058 build/thesis.bib
    > 55747 total
    ---- Progress ----
    Worked for 167 day(s) ...
    Last submission: 20121025
    Word Count (last version): 55747
    Page Count (last version): 179
    Avg Words per Page (last version): 311
    Last submission effort: 142
    ---- Estimations ----
    Page Count Range (final version): (min, max) = (150, 200)
    Word Count Range (final version): (min, max) = (46650, 62200)
    Avg Effort (Words per Day): 184
    Estimated Completion in: (min, max) = (-50, 35) days, (-2.50, 1.75) months
    Estimated Completion Date: (best, worst) = (2012-08-11, 2012-12-16)

The average was 311 words per page, and I wrote about 184 words each day on average.
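The estimation step of the algorithm can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual thesis-o-meter scripts: the function names `count_words` and `estimate_completion` are hypothetical, and the use of integer (floor) division is an assumption chosen because it is consistent with the figures in the report above.

```python
def count_words(text):
    """Count whitespace-separated words, roughly what `wc -w` reports."""
    return len(text.split())


def estimate_completion(total_words, total_pages, words_per_day,
                        min_pages=150, max_pages=200):
    """Estimate how many days of writing remain (hypothetical helper).

    Returns (words_per_page, (min_words, max_words), (min_days, max_days)).
    A negative number of days means that bound has already been passed.
    """
    # Average words per page, from the current state of the thesis.
    words_per_page = total_words // total_pages
    # Translate the page goal into a word goal.
    min_words = min_pages * words_per_page
    max_words = max_pages * words_per_page
    # Remaining words divided by the average daily effort.
    min_days = (min_words - total_words) // words_per_day
    max_days = (max_words - total_words) // words_per_day
    return words_per_page, (min_words, max_words), (min_days, max_days)


if __name__ == "__main__":
    # Figures taken from the report: 55747 words, 179 pages, 184 words/day.
    print(estimate_completion(55747, 179, 184))
```

Fed the word count, page count and average effort from the report, the sketch gives 311 words per page, a word range of (46650, 62200) and a completion estimate of (-50, 35) days, which agrees with the report. The conversion to months and calendar dates is left out because the post does not describe how it was done.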

**Epilogue**

I wrote my thesis, but I have not submitted it yet (I hope to soon), for a number of practical reasons. Still, the process succeeded: I found my KPIs and they actually led me to finish the work. That much is a fact, and now I have to find another motivation-driven method to do the rest of the required stuff. C’est la vie.

**Related Links and Availability**

I plan to release an open source version of my thesis-o-meter on my GitHub profile soon. I also found various alternative thesis-o-meters:

- Salvatore Scellato, http://www.cl.cam.ac.uk/~ss824/thesisometer.html
- Georgios Gousios, http://www.gousios.gr/sw/tom.html
- Justin Boyan, http://www.cs.cmu.edu/~jab/tom/


The post The Evolution of Local Graph Partitioning appeared first on XRDS.

The goal of a local partitioning algorithm is to identify a community in a massive network. In this context, a community can be loosely defined as a collection of well-connected vertices that are also reasonably well-separated from the rest of the network. The quality of a community given by a subset of vertices $latex S\subseteq V$ can be determined by a number of measures. One common measure is the ratio of the number of edge connections from $latex S$ to the rest of the network to the size of $latex S$, known as the *conductance*. Specifically, let $latex \partial(S)$ be the number of edges with one endpoint in $latex S$ and the other not in $latex S$, and let the *volume* of $latex S$, $latex \textmd{vol}(S) = \sum_{v\in S}deg(v)$, be the sum of the degrees of vertices in $latex S$. Then the conductance of $latex S$ is $latex \phi(S) := \partial(S)/\textmd{vol}(S)$. The goal of a local partitioning algorithm is, given a vertex $latex v\in V$, to identify a subset $latex S$ near $latex v$ with small conductance.
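To make the definition concrete, here is a minimal Python sketch that computes the conductance of a vertex subset exactly as defined above. The adjacency-dictionary representation and the function name `conductance` are assumptions made for illustration, and the example graph (two triangles joined by a single edge) is made up.

```python
def conductance(adj, S):
    """Conductance phi(S) = boundary(S) / vol(S) of an undirected graph.

    adj maps each vertex to the set of its neighbours (no self-loops);
    S is a collection of vertices with at least one incident edge.
    """
    S = set(S)
    # Edges with one endpoint in S and the other outside S.
    boundary = sum(1 for u in S for v in adj[u] if v not in S)
    # Sum of the degrees of the vertices in S.
    volume = sum(len(adj[u]) for u in S)
    return boundary / volume


if __name__ == "__main__":
    # Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    print(conductance(adj, {0, 1, 2}))  # one crossing edge, volume 7
    print(conductance(adj, {0}))        # every edge of vertex 0 crosses
```

In this example the triangle {0, 1, 2} has one boundary edge and volume 7, so its conductance is 1/7, while the single vertex {0} has conductance 1: a good community is one where few of the incident edges leave the set.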

Local algorithms may be used iteratively to find global balanced cuts or a clustering of the entire network. However, the problem is also of independent interest. The ability to identify local communities has important implications in social networks, for instance, when we are only interested in a set of entities with particular features, rather than how these features vary over the entire network. A clear example is advertising. A particular product will be of highest interest to a particular constituency, and advertisers are probably unconcerned with the rest of the market.

In any case, running time determines how these algorithms can be applied to massive networks. In general, local algorithms will have running times in terms of the size of the output – the local cluster – rather than the entire graph. The trend in the last decade or so has been toward developing local partitioning algorithms which run in time nearly linear in the size of the output.

Results for partitioning algorithms can be traced back to the Cheeger inequalities given by Alon and Milman in ’85. Recall that the *edge expansion* of a set, related to the conductance, is defined by $latex h_G(S) = \partial(S)/|S|$, where now we are simply concerned with the number of vertices in $latex S$, $latex |S|$. The edge expansion of a graph, $latex h(G)$, is the minimum edge expansion over all subsets. The Cheeger inequalities give a way to relate $latex h(G)$ to the second smallest eigenvalue of a normalized adjacency matrix. In her book on spectral graph theory, Chung gives an analog of the Cheeger inequalities for the conductance of a set as defined above, this time using the second smallest eigenvalue of the normalized Laplacian. The Cheeger inequalities prove that, in nearly linear time, an $latex O(1/\sqrt{\phi(G)})$-approximation to the conductance of the graph can be computed.
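For reference, the inequality behind that last claim can be written out. This is the standard statement for the normalized Laplacian, given here under two assumptions that are not spelled out in the post: $latex \lambda_2$ denotes the second smallest eigenvalue, and the conductance of a cut is measured against the volume of its smaller side.

$latex \frac{\lambda_2}{2} \leq \phi(G) \leq \sqrt{2\lambda_2}$

A sweep over the corresponding eigenvector finds a set $latex S$ with $latex \phi(S) \leq \sqrt{2\lambda_2} \leq \sqrt{4\phi(G)} = 2\sqrt{\phi(G)}$, where the second step uses $latex \lambda_2 \leq 2\phi(G)$. The ratio $latex \phi(S)/\phi(G)$ is therefore at most $latex 2/\sqrt{\phi(G)}$, which is the $latex O(1/\sqrt{\phi(G)})$ approximation factor.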

Following the Cheeger inequalities, a local partitioning algorithm was studied by Spielman and Teng beginning with their STOC result of 2004. Their result was improved in 2006 by Andersen, Chung, and Lang, later in 2009 by Andersen and Peres, and in 2012 by Gharan and Trevisan.

The intuition behind these algorithms has to do with mixing rates. Let us revisit our notion of a community – a collection of well-connected vertices which are relatively separate from the rest of the graph. This can also be understood in terms of random walks. Namely, if we start a random walk within a high-quality community, we can expect with reasonably high probability to remain within the community after a certain number of steps.

Theoretical results for local partitioning algorithms are typically formulated as follows. Given a set $latex S\subseteq V$, if a starting vertex $latex v$ is sampled from $latex S$, then with certain probability a set $latex T$ can be found with small conductance, in terms of $latex \phi(S)$, in time proportional to the size of $latex T$. The *work/volume* ratio is the work required for a single run of the algorithm divided by the volume of the output set. Spielman and Teng find such a set by examining threshold sets of the probability distribution of a $latex t$-step random walk from the vertex $latex v$ sampled with probability proportional to degree in $latex S$. The set output by their algorithm achieves conductance $latex O(\phi(S)^{1/2}\log^{3/2}n)$ with work/volume ratio $latex O(\phi(S)^{-2} polylog(n))$. This is improved by Andersen, Chung, and Lang using distributions of PageRank vectors to analyze random walks with a probability of being “reset” to the starting vertex at each step. Their algorithm improves the conductance of the output set to $latex O(\phi(S)^{1/2}\log^{1/2}n)$ with work/volume ratio $latex O(\phi(S)^{-1} polylog(n))$.
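The idea of examining threshold sets of a random walk distribution can be illustrated with a short Python sketch. This is a simplified illustration, not the Spielman-Teng algorithm itself: it computes the walk distribution over the whole graph, so it has none of the truncation that makes the real algorithms local, and the function names, the lazy walk with holding probability 1/2, and the adjacency-dictionary format are all assumptions.

```python
def lazy_walk_distribution(adj, seed, steps):
    """Distribution of a lazy random walk after `steps` steps from `seed`.

    At each step the walk stays put with probability 1/2, otherwise it
    moves to a uniformly random neighbour.
    """
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(steps):
        nxt = {v: 0.5 * p[v] for v in adj}
        for u in adj:
            share = 0.5 * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        p = nxt
    return p


def best_threshold_set(adj, p):
    """Sweep over vertices ordered by p(v)/deg(v) and return the prefix
    set with the smallest conductance, together with that conductance."""
    order = sorted((v for v in adj if p[v] > 0),
                   key=lambda v: p[v] / len(adj[v]), reverse=True)
    total_volume = sum(len(adj[v]) for v in adj)
    inside, volume, boundary = set(), 0, 0
    best_set, best_phi = None, float("inf")
    for v in order:
        inside.add(v)
        volume += len(adj[v])
        # Edges from v to vertices already inside stop being boundary
        # edges; the remaining edges of v become boundary edges.
        internal = sum(1 for w in adj[v] if w in inside)
        boundary += len(adj[v]) - 2 * internal
        if volume < total_volume:  # skip the trivial set of all vertices
            phi = boundary / volume
            if phi < best_phi:
                best_set, best_phi = set(inside), phi
    return best_set, best_phi


if __name__ == "__main__":
    # Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
           3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    p = lazy_walk_distribution(adj, seed=0, steps=3)
    print(best_threshold_set(adj, p))
```

On this toy graph, a three-step walk started at vertex 0 keeps most of its probability mass inside the first triangle, and the sweep recovers that triangle as the threshold set of smallest conductance (1/7).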

Andersen and Peres manage to improve the work/volume ratio of the output set. Their methods simulate a volume biased evolving set process which is a Markov chain with states that are subsets of vertices and transition rules that grow or shrink the current set. Specifically, start with a vertex $latex v$ and produce a sequence of sets $latex S_1, S_2, \ldots, S_{\tau}$ with the property that at least one set $latex S_t$ is such that $latex \partial(S_t)/\textmd{vol}(S_t) \leq O(\sqrt{\log(\textmd{vol}(S_{\tau}))/\tau})$. Then if the process constructs sets of volume at most $latex \gamma$ for all sets up to some time $latex T$, they achieve a set of volume at most $latex \gamma$ and conductance $latex O(\sqrt{\log\gamma/T})$.

As mentioned, the results of Andersen and Peres have recently been improved by Gharan and Trevisan in their 2012 FOCS paper. Their main tool is again threshold sets of random walks, with an improved lower bound on the probability that a lazy random walk is entirely contained in a subset $latex S$ after $latex t$ steps. This is an improvement on the work of Spielman and Teng. In their algorithm, they also use an evolving set process, and beat the bounds of Andersen and Peres by performing copies of the evolving set process in parallel. They show that, for a starting vertex $latex v$, a target conductance $latex \phi\in(0,1)$, a target volume $latex \gamma$, and $latex 0 < \epsilon < 1$, their algorithm outputs a set $latex S$ with conductance $latex O(\sqrt{\phi/\epsilon})$ that performs with work/volume ratio $latex O(\gamma^{\epsilon}\phi^{-1/2}\log^2 n)$. The running time is slightly super linear in the size of the optimum.

All the results are summarized in the following table.

This is a hard problem, but a rewarding one. Gharan and Trevisan design a local variant of the Cheeger inequalities that gives a sublinear time algorithm with an approximation guarantee that does not depend on the size of the graph. This opens many doors for algorithms on massive networks. There is already a body of work that applies local partitioning algorithms, for instance, in identifying local alignments in protein networks. However, as I also mentioned in my previous post, there is often some calibration involved in moving these algorithms to production. I think local partitioning algorithms are primed for application, due to their promising theoretical bounds and the simplicity of their implementations (simulating random walks), and may be close to mainstream use.


The post Theory Behind Big Data appeared first on XRDS.

One is more focused on industry and business aspects of big data, and includes many IT companies who work on analytics. These companies believe that the potential of big data lies in its ability to solve business problems and provide new business opportunities. To get the most from big data investments, they focus on questions which companies would like to answer. They view big data not as a technological problem but as a business solution, and their main goals are to visualize, explore, discover and predict.

On the more theoretical side, researchers are interested in the theory behind big data and its use in designing efficient algorithms. To share a personal experience, I have been to several career fairs with companies that work in the area of big data. I expected that we would have much in common to talk about, but when I described my work and the problems that interest us, I realized that the way we look at this problem in academia is different from what industry is looking for.

While there will always be a gap between theory in academia and applications in industry, I feel that since the “big data” problem originates in real-world applications and challenges, this gap should be less pronounced than in other theory fields in Computer Science. The main question that arises is: what does it mean when we say “theory for big data”? How is it different from “classic” theoretical computer science?

There seem to be different perspectives among theoreticians regarding this question. Some researchers consider big data as “bad news” for algorithm design, since this leads to intelligent and sophisticated algorithms being replaced by less clever algorithms that can be applied efficiently to massive data sets (as pointed out by Prabhakar Raghavan in STOC13).

On the other side, many researchers have a more positive view, and think of big data as a great opportunity to rethink classic techniques for algorithm design and underlying theoretical foundations.

Moritz Hardt has an interesting blog post discussing this point. He argues that the starting point is to explore the properties that large data sets exhibit and how they might affect algorithm design.

There is an ongoing effort in the community to make the most of the big data opportunity. As part of the program called “The Theoretical Foundation of Big Data”, which is being held at the Simons Institute for Theoretical Computer Science this year with many visiting scientists working in the area of massive data, there are several workshops and lots of interesting talks on related topics. Olivia has briefly covered one of the recent ones, titled “Unifying Theory and Experiment for Large-Scale Networks”, here.

In my future posts, I will try to discuss some interesting problems which may be considered as the core of theoretic research on big data. I will try to show why studying the theory behind big data is important, and assess how much of this study has been effective and helpful to the main goal – making data processing faster.


The post Ultra-Efficient via Sublinearity appeared first on XRDS.

We can think of sublinear algorithms in the area of big data in three different categories:

Sublinear space algorithms: Here the focus is on algorithms for processing data streams, in which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (sublinear in the input size, and typically polylogarithmic) and also limited processing time per item. In this setting, the algorithm produces an approximate answer using a summary of the data stream kept in memory. [1]
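As one concrete instance of this setting, the sketch below implements the classic Misra-Gries frequency summary in Python. It is chosen here purely as an illustration (the post does not single out any particular streaming algorithm), and the function name `misra_gries` and the parameter `k` are naming assumptions. It makes a single pass, keeps at most k - 1 counters regardless of the stream length, and returns approximate item counts.

```python
def misra_gries(stream, k):
    """One-pass frequency summary using at most k - 1 counters.

    For a stream of n items, every returned estimate is at most the true
    count and undercounts it by at most n / k. Items that are missing
    from the summary have an estimate of 0.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop the zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # A 15-item stream summarized with only two counters (k = 3).
    print(misra_gries("abacabadabacaba", k=3))
```

The summary is approximate by design: any item occurring more than n / k times is guaranteed to survive in the summary, but its reported count may be lower than the true count.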

Sublinear communication algorithms: The scenario in this category is a bit different: data is distributed among multiple machines and the goal is to compute some function on the union of the data sets. To carry out these distributed computations, we let the machines communicate with each other, and of course the goal is to do this with the least amount of communication.

Sublinear time algorithms: Here we are looking for algorithms that do not even need to read the whole input to answer a query on it. Since these algorithms must provide an answer without reading the entire input, they typically depend heavily on randomization and provide approximate answers. In other words, we can look at sublinear time algorithms as a sort of randomized approximation algorithm. Still, there are problems for which deterministic exact sublinear time algorithms are known. But typical algorithms that are exact and yet run in sublinear time use parallel processing or rely on guaranteed assumptions about the input structure (as logarithmic time binary search and many tree maintenance algorithms do). However, the specific term sublinear time algorithm is usually reserved for algorithms that run on classical serial machine models and are not allowed prior assumptions on the input.

Within sublinear time algorithms, there are two main categories of interest: algorithms that compute an approximate value, and algorithms that make an approximate decision, which is known as “property testing”. Informally speaking, in property testing the goal is to design efficient algorithms that decide whether a given mathematical object has a certain property or is ‘far’ from having this property, i.e. is significantly different from any object that has the property. To make this decision, the algorithm can perform local queries on the object, but the final decision concerns a global view of the object, and the decision task should be performed by querying the object in as few places as possible.

More precisely, for a fixed property P and any object O, if O has property P the algorithm should accept with probability at least 2/3, and if O is ε-far from having property P, the algorithm should reject with probability at least 2/3. One-sided error is even more desirable, meaning that the accepting probability is 1 instead of 2/3. In such cases, the algorithm has a non-zero error probability only on inputs that are far from having the property, and never rejects inputs that have the property. This requires the algorithm, whenever it rejects an input, to provide (small) evidence that the input does not have the property.

To determine exactly what it means to be ε-far from having property P, we need to define a distance measure suited to the problem. With Hamming distance, for example, O is ε-far from P if the distance between O and every object O’ having the property P is at least ε|O|. For example, if the property is whether a graph has a k-clique (a clique of size k), then being ε-far from this property means that more than an ε-fraction of the edges must be added to the graph before it has a clique of size k.
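A toy example may help to fix these definitions. The Python sketch below is a one-sided tester for a deliberately simple property, "every entry of a 0/1 list is 0", under Hamming distance, so that an ε-far input has at least an ε-fraction of ones. The property, the function name `all_zeros_tester` and the sample count are illustrative assumptions, not something taken from the literature cited in this post.

```python
import math
import random


def all_zeros_tester(bits, epsilon, rng=random):
    """One-sided tester for the property "every entry is 0".

    Returns (True, None) to accept, or (False, i) to reject, where i is
    the index of a 1 and serves as the evidence for the rejection.
    """
    # If at least an epsilon fraction of the entries are 1, all s samples
    # miss them with probability at most (1 - epsilon)^s <= exp(-epsilon * s),
    # which is at most 1/3 once s >= ln(3) / epsilon.
    samples = math.ceil(math.log(3) / epsilon)
    for _ in range(samples):
        i = rng.randrange(len(bits))
        if bits[i] != 0:
            return False, i
    return True, None


if __name__ == "__main__":
    print(all_zeros_tester([0] * 1_000_000, epsilon=0.1))  # always accepts
    print(all_zeros_tester([0, 1] * 500_000, epsilon=0.1))
```

The number of queries depends only on ε, not on the length of the input. An all-zeros input is always accepted, so the error is one-sided, and every rejection comes with an index that witnesses the violation.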

While each algorithm has features that are specific to the property it tests, there are several common algorithmic and analysis techniques for property testing [2]. Probably the most popular one is applying Szemerédi’s Regularity Lemma, which is a very important tool and central to the analysis of testing graph properties in the dense-graphs model.

Property testing was initially defined by Rubinfeld and Sudan [3] for testing algebraic properties of functions, and was discussed in the context of Program Testing and Probabilistically Checkable Proofs (PCP). In program checking, the idea is to test that the program satisfies certain properties before checking whether it computes a specified function.

Later, Goldreich, Goldwasser and Ron [4] initiated the study of testing properties of graphs and presented some general results on the relations between testing and learning. In recent years there has been a growing body of work dealing with properties of functions, distributions and combinatorial objects such as graphs, strings and sets of points, and many algorithms have been developed with complexity that is sublinear in, or even independent of, the size of the object. But research in this area is still new and there is much left to understand and explore.

If you are interested in following the research trends in the area of sublinear time algorithms and property testing, there is a great blog, [PTReview], which discusses and reports on the latest news, research developments and papers on property testing and sublinear time algorithms. There are also a number of surveys available by researchers working in the area of sublinear time algorithms.

Perhaps the interesting point about property testing algorithms is that while they are decision algorithms, in several cases they can be transformed into optimization algorithms that actually construct approximate solutions, and this is the key link between property testing and “classical” approximation. The main question to ask now is: when is it valuable to think about property testers? Is it restricted to certain problems, and to the cases in which we are dealing with a huge amount of data?

To wrap up, we can summarize the settings of interest for applying property testing algorithms as follows [2]:

– The object is huge and expensive to scan fully. We need to make just an approximate decision.

– The object is not very large, but the property we are looking at is NP-hard. This includes many problems in graph theory, for example coloring.

– The object is not large and the decision problem has a polynomial-time algorithm, but we still want a more efficient algorithm, even at the cost of some accuracy.

– Similar to the last case, the object is not large and the decision problem has a polynomial-time algorithm, but the final decision must be exact. In this case, property testing is useful since we can first run the tester on the data and only if it accepts do we run the exact algorithm. This saves time when the input is far from having the property.
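The last setting, a tester used as a cheap filter in front of an exact algorithm, can be sketched as follows in Python. Sortedness is used only as a stand-in property, all three function names are hypothetical, and the sampling check is deliberately naive: it is not a tester with a proven query bound, only a check with one-sided error, which is all the filtering pattern needs in order to stay exact.

```python
import random


def sortedness_tester(values, samples=20, rng=random):
    """Accept (True) unless a sampled adjacent pair is out of order.

    A sorted list is never rejected, so the error is one-sided.
    """
    if len(values) < 2:
        return True
    for _ in range(samples):
        i = rng.randrange(len(values) - 1)
        if values[i] > values[i + 1]:
            return False
    return True


def is_sorted_exact(values):
    """Exact linear-time check."""
    return all(a <= b for a, b in zip(values, values[1:]))


def decide(values, tester=sortedness_tester, exact=is_sorted_exact):
    """Run the cheap tester first and the exact algorithm only if it accepts."""
    if not tester(values):
        return False  # the tester saw a violation, so "no" is already exact
    return exact(values)


if __name__ == "__main__":
    print(decide(list(range(1000))))         # tester accepts, exact check runs
    print(decide(list(range(1000, 0, -1))))  # rejected after a single sample
```

Because the tester never rejects an input that has the property, a rejection can be trusted and the expensive exact algorithm is skipped; when the tester accepts, the exact algorithm still has the final word.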

As a take-home message, every researcher who wants to start working on algorithm design and the theoretical foundations of large data analysis needs a good feel for the algorithmic and analysis techniques used for sublinear time algorithms.


The post Laying the Foundation for a Common Ground appeared first on XRDS.

Many questions came up around the very issue of finding unifying ground. How much should we invest in the theory if we have the empirical results? To what degree do we tune our models to mirror real-world data?

The structure of the workshop was well-suited to such a discussion. The four-day series was divided into sessions, each of which consisted of four 30-minute talks and then an open, unrecorded panel discussion. The full schedule, with abstracts and video, can be found here. There was a good representation of theory/experiment across and within the sessions. While there was some clear polarity, there were at least some important insights as to the reasons for the distance between theory and practice in large-scale network problems. Below are some of the major themes that came up.

*Do we have the model right?* Theory that can be used in practice depends on the theory assuming the right model. But what is the “right” model? In general, there seems to be little agreement on this. First, models that can be nicely abstracted may not accurately model network data that occurs in nature. How relevant is a stochastic assumption, or i.i.d. sampling, in practice? A poignant example was given by C. Seshadhri in his talk on finding triangles in graphs in an online streaming model [1]. After his conclusions, he mentioned a question posed by the experimental community at his lab: edges may occur more than once in real online streaming networks, so how can these results be applied to multigraphs? No matter how promising the results, if they do not fit a scenario, they cannot be implemented in practice. And, for clarification, the results have been extended to multigraphs by considering the induced simple graph [2] (but these are still not used in production).

Second, many problems seem to be lacking the theory because statistical models change very quickly. The variety of models, metrics, and parameters may even distract from the development of good practical algorithms with theoretical backing. As a result, many theoretical researchers will essentially pick the model most attractive to them and develop the theory there, whether or not there is a practical need.

*Empirical results beat theoretical bounds.* Even problems *with* backing theory are generally not applied. Most algorithms presented at conferences are not available off-the-shelf. Getting these algorithms ready for prime time will almost certainly require many iterations of improvement and calibrating, and a team with a very good understanding of both the theory and practice.

Why isn’t the theory enough? One reason may be the focus on scalability among the theory community. It can be generally observed that memory capacity is growing more quickly than problem sizes, so it is not necessarily useful in practice for algorithms to be highly scalable. Somewhat along these lines also is the question of what needs to be achieved. Complexity and accuracy analyses presented in algorithms papers are typically worst-case bounds. However, it can often be the case that data we see in real-world scenarios behave better, even much better, than the worst case. It may be unnecessary to implement robust algorithms when simpler heuristics perform better in practice. Vigna delivered a provocative view early in the workshop on why approximate set representations used to analyze large networks perform better than their theoretical guarantees.

*How can we improve if we don’t know how we’re doing?* This was the tagline of Salganik’s talk on modeling epidemic networks in Rwanda. Without methods for validation, it is impossible to bridge the gap between theory and practice. Surprisingly, though, some classes of algorithms have achieved success in spite of this. In particular, local partitioning algorithms have demonstrated both theoretical and practical success. I’ll discuss one example next.

Vahab Mirrokni closed the session on clustering with a presentation on his joint work with Reid Andersen and David Gleich on clustering for distributed computing.

The goal of a global clustering is to partition the nodes of a graph into distinct subsets such that there is little communication between the clusters (few crossing edges) and no single cluster is too large. This has clear applications to distributed computing, where large datasets are relatively equally partitioned onto a group of processors, with minimal communication required between processors. In an overlapping clustering, partitions are not required to be disjoint, and more of the graph is stored than is required.

The goal of an overlapping clustering can be formulated in terms of random walks. Consider a walk on the graph v1, v2, …, vt. Let T be a mapping from the vertex set to a set of clusters, and say the corresponding sequence of active clusters containing the vertex at each step is C1 = T(v1), C2 = T(v2), …, Ct = T(vt). Then the goal of an overlapping clustering is to divide the graph into clusters that minimize the number of times the active cluster must be changed during a random walk.
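The objective above can be sketched as a simple counter. This is only an illustration under one assumed rule: the walk keeps its active cluster for as long as it stays inside it, and on leaving it switches to T(v) for the new vertex. The names `count_cluster_swaps`, `clusters`, and `home` are hypothetical; with disjoint clusters the rule reduces to counting changes in the sequence C1, C2, …, Ct.

```python
def count_cluster_swaps(walk, clusters, home):
    """Count how often the active cluster changes along a walk.

    walk     -- sequence of vertices v1, v2, ..., vt
    clusters -- dict: cluster name -> set of member vertices (may overlap)
    home     -- dict: vertex -> cluster name, playing the role of T
    """
    active = home[walk[0]]
    swaps = 0
    for v in walk[1:]:
        # Assumed rule: only switch when the walk leaves the active cluster.
        if v not in clusters[active]:
            active = home[v]
            swaps += 1
    return swaps


# Hypothetical fixed walk on six vertices, used instead of a random one
# so the result is reproducible.
walk = [0, 1, 2, 3, 2, 3, 4, 5]
home = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
disjoint = {"A": {0, 1, 2}, "B": {3, 4, 5}}
overlap = {"A": {0, 1, 2, 3}, "B": {2, 3, 4, 5}}

print(count_cluster_swaps(walk, disjoint, home))  # 3
print(count_cluster_swaps(walk, overlap, home))   # 1
```

Storing the boundary vertices 2 and 3 in both clusters absorbs the back-and-forth steps across the bridge, which is the trade of extra storage for fewer swaps.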

In this work, overlapping clusters are found first through *local* partitions. Candidate clusters are generated from local clusters computed from each vertex, using the PageRank procedure of [3], for example. Overlapping clusters with upper-bounded volume are then computed by combining the local clusters in a way that minimizes the number of cluster crossings in a random walk. The result is a set of clusters, each with bounded volume, for which communication, modeled through the random walk probability diffusion, is kept minimal.
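As a rough illustration of how a PageRank vector can yield a local cluster around a vertex, the sketch below computes a personalized PageRank vector by plain power iteration and then takes a sweep cut by conductance. This is a generic stand-in under stated assumptions, not the procedure of [3] and not the authors' implementation; the function names, the restart probability `alpha`, and the toy graph are all hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=100):
    """Power iteration for PageRank with all restart mass on `seed`."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha  # restart step
        for u, nbrs in adj.items():
            share = (1 - alpha) * p[u] / len(nbrs)
            for w in nbrs:
                nxt[w] += share  # uniform random-walk step
        p = nxt
    return p


def sweep_cut(adj, p):
    """Return the lowest-conductance prefix of vertices ordered by p[v]/deg(v)."""
    order = sorted(adj, key=lambda v: p[v] / len(adj[v]), reverse=True)
    total_vol = sum(len(nbrs) for nbrs in adj.values())
    inside, vol, cut = set(), 0, 0
    best, best_phi = None, float("inf")
    for v in order[:-1]:  # skip the full vertex set
        inside.add(v)
        vol += len(adj[v])
        # Edges from v into the set stop being cut edges; the rest become cut.
        internal = sum(1 for w in adj[v] if w in inside)
        cut += len(adj[v]) - 2 * internal
        phi = cut / min(vol, total_vol - vol)
        if phi < best_phi:
            best, best_phi = set(inside), phi
    return best, best_phi


# Hypothetical toy graph: two triangles joined by the edge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
cluster, phi = sweep_cut(adj, personalized_pagerank(adj, seed=0))
print(cluster)  # {0, 1, 2}
```

Running this from every vertex would give one candidate cluster per seed; how those candidates are then combined into overlapping clusters is the part specific to the work described here and is not sketched.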

Local clustering is an example where the best-performing algorithms have the backing of theoretical guarantees. For finding small clusters, for example, local algorithms are the best option in terms of performance and quality. Global clustering, however, is a different story: it is generally too hard to move beyond empirical guarantees. The overlapping clusters algorithm is a nice example of applying a theoretically sound procedure to a problem that generally lacks theoretical backing.

Even so, the best global clustering algorithms are those chosen for good empirical performance. So do we need the theory? Will there always be an impenetrable gap between theory and practice? The workshop set the stage for some necessary discussion, but the general consensus seemed to be that the goals of these two ends are irreconcilably different. Of course, theorists need the motivation of practice, and practitioners need the inspiration of theory, but we may be some time away from theoretical results being applied to large-scale problems.

As a footnote, Ravi Kannan is giving an upcoming talk at the Simons Institute on whether theory is necessary in clustering, which might offer some insight.

The post Laying the Foundation for a Common Ground appeared first on XRDS.

]]>