As the size of the available data increases, massive data sets cannot be stored in their entirety in memory of a single machine. Furthermore due to the small amount of memory and computation power available on a single node, one needs to distribute the data and computation among multiple machines when processing tasks.
However, transferring the big data is very expensive. In fact, it is more expensive than the computations on the datasets. Thus, in the distributed model, the amount of communication plays an important role in the total cost of an algorithm and the aim is to minimize the amount of communication among processors (CPUs). This is one of the main motivations to study the theory of Communication Complexity, which originates from Big Data processing. Continue reading
As a PhD student who does research on theory and algorithms for massive data analysis, I am interested in exploring current and future challenges in this area, which I’d like to share it here. There are two major points of view when we talk about big data problems:
One is more focused on industry and business aspects of big data, and includes many IT companies who work on analytics. These companies believe that the potential of big data lies in its ability to solve business problems and provide new business opportunities. To get the most from big data investments, they focus on questions which companies would like to answer. They view big data not as a technological problem but as a business solution, and their main goals are to visualize, explore, discover and predict.
For a long time in the area of design and analysis of algorithms, when we have said that an algorithm is efficient we meant that it runs in time polynomial in the input size n and finding a linear time algorithm have been considered as the most efficient way to solve a problem. It’s been because of this assumption that we need at least to consider and read all the input to solve the problem. This way it seems that we cannot do much better! But nowadays the data sets are growing fast in various areas and applications in a way that it hardly fits in storage and in this case even linear time is prohibitive. To work with this massive amount of data, the traditional notions of an efficient algorithm is not sufficient anymore and we need to design more efficient algorithms and data structures. This encourages researchers to ask whether it is possible to solve the problems using just sublinear amount of resources? what does that mean exactly when we say ‘sublinear resources’?
We can think of sublinear algorithms in the area of big data in three different categories: