For a long time in the area of design and analysis of algorithms, when we have said that an algorithm is efficient we meant that it runs in time polynomial in the input size n and finding a linear time algorithm have been considered as the most efficient way to solve a problem. It’s been because of this assumption that we need at least to consider and read all the input to solve the problem. This way it seems that we cannot do much better! But nowadays the data sets are growing fast in various areas and applications in a way that it hardly fits in storage and in this case even linear time is prohibitive. To work with this massive amount of data, the traditional notions of an efficient algorithm is not sufficient anymore and we need to design more efficient algorithms and data structures. This encourages researchers to ask whether it is possible to solve the problems using just sublinear amount of resources? what does that mean exactly when we say ‘sublinear resources’?

We can think of sublinear algorithms in the area of big data in three different categories:

Sublinear space algorithms: Here we are more focused on algorithms for processing data streams which the input is presented as a sequence of items and can be examined in only a few passes (typically just one). These algorithms have limited memory available to them (much less than the input size which is assumed to be sublinear and is typically polylogarithmic) and also limited processing time per item. Based on this settings, the algorithm produces an approximate answer using a summary of the data stream in memory. [1]

Sublinear communication algorithms: The scenario in this category is a bit different: data is distributed among multiple machines and the goal is to compute some function on the union of data sets. Apparently to do these distributed computations, we let the machines to have communications among each other and of course, the goal is to do this with the least amount of communications.

Sublinear time algorithms: Here we are more looking for algorithms that do not even need to read whole input to answer the query on that. Since these algorithms must provide an answer without reading the entire input, they are typically heavily depend on randomization and provide approximate answer. In other words, we can look at the sublinear time algorithms as sort of randomized approximation algorithms. Still there are problems for which deterministic exact sublinear time algorithms are known. But typical algorithms that are exact and yet run in sub-linear time use parallel processing or alternatively have guaranteed assumptions on the input structure (as the logarithmic time binary search and many tree maintenance algorithms do)- However, the specific term sublinear time algorithm is usually reserved to algorithms that run over classical serial machine models and are not allowed prior assumptions on the input.

In the scope of sublinear time algorithms, there are two main categories of interest: The algorithms which need to compute an approximate value and the ones which require to make an approximation decision and are called as “property testing”. Informally speaking, in property testing the goal is to design efficient algorithms to decide whether a given mathematical object has a certain property or is `far’ from having this property, i.e. is significantly different from any object that has the property. To make this decision, algorithm can perform local queries on the object, but the final decision consider a global view of the object and decision task should be performed by querying the object in as few places as possible.

More precisely, for a fixed property P and any object O, if object O has property P the algorithm should accept with probability at least 2/3, otherwise if the object O is ε-far from having property P, then the algorithm should reject with probability at least 2/3. Here one-sided error is much more desired, means the accepting probability be 1 instead of 2/3. In such cases, the algorithm has a non-zero error probability only on inputs that are far from having the property and never reject inputs that have the property. This necessitate the algorithm in the case of rejecting some input to provide (small) evidence to show that the input does not have the property.

To determine what exactly means to be ε-far from having property P, we need to define the distance measure based on the problem. Then we can interpret it as the Hamming distance between object O and any other object O’ having the property P is at least ε|O|. For example, if the property is in graphs to test whether they have k-clique(a clique of size k), then being ε-far from this property means that more than ε -fraction of edges should be added to the graph so that it have a clique of size k.

While each algorithm has features that are specific to the property it tests, there are several common algorithmic and analysis techniques for property testing[2]. Probably the most popular one is applying the idea of Szemeredi’s Regularity Lemma, which is very important tool and central key to the analysis of testing graph properties in the dense-graphs model.

Property testing initially was defined by Rubinfeld and Sudan[3] for testing algebraic properties of functions and was discussed in the context of Program Testing and Probabilistically Checkable Proofs(PCP). In program checking, the idea is to test that the program satisfies certain properties before checking whether it computes a specified function.

Later, Goldreich, Goldwasser and Ron [4] initiated the study of testing properties of graphs and presented some general results on the relations between testing and learning. In recent years there has been a growing body of work dealing with properties of functions, distributions and combinatorial objects such as graphs, strings, sets of points and many algorithms have been developed with complexity that is sub-linear or even independent of size of the object. But still the research in this area is new and there are much left to understand and explore.

If you are interested to follow the research trends in the area of Sublinear time algorithms and property testing, there is a great blog- [PTReview]- which discusses and report about the latest news, research develops and papers on the property testing and sublinear time algorithms. There are also a bunch of available surveys by researchers working in the area of sublinear time algorithms.

Maybe the interesting point about property testing algorithms is that while they are decision algorithms, in several cases they can be transformed to optimization problems which actually constructs the approximation solutions and this is the key link between property testing and “classical” approximation. Now maybe the main question to ask is that when is it valuable to think about property testers? Is it just restricted to certain problems and just the cases which we are dealing with huge amount of data?

To wrap up, we can summarize the setting of interests for applying property testing algorithms as follows:[2]

– The object is huge and expensive to be fully scanned. We need to make just a approximate decision.

– The object is not very large, but the property we are looking at is NP-hard. This includes many problems in Graph theory, for example coloring.

– The Object is not large and the decision problem has a polynomial-time algorithm. But still we desire to have a more efficient algorithm even by sacrificing some part of accuracy.

– Similar to last case, object is not large and the decision problem has a polynomial-time algorithm, but the final decision must be exact. In this case, the property testing is useful since we can first run it on the data and if it passes the test as accepted, then we run the exact algorithm. This will help us to save time when the input is far from having the property.

As a take-home message, It seems that for every researcher who wants to start working on the area of algorithm design and theoretical foundations of large data analysis, it’s a must to have a good flavor of algorithmic and analysis techniques used for sublinear time algorithms.