Theory Behind Big Data

As a PhD student doing research on theory and algorithms for massive data analysis, I am interested in exploring current and future challenges in this area, which I’d like to share here. There are two major points of view when we talk about big data problems:

One is focused on the industry and business aspects of big data, and includes many IT companies that work on analytics. These companies believe that the potential of big data lies in its ability to solve business problems and provide new business opportunities. To get the most from big data investments, they focus on the questions that companies would like to answer. They view big data not as a technological problem but as a business solution, and their main goals are to visualize, explore, discover, and predict.

On the more theoretical side, researchers are interested in the theory behind big data and its use in designing efficient algorithms. From personal experience: I have been to several career fairs with companies that work in the area of big data. I expected we would have a lot in common to talk about, but when I described my work and the problems that interest us, I realized that the way we look at this problem in academia is different from what industry is looking for.

While there will always be a gap between theory in academia and applications in industry, I feel that, since the “big data” problem originates in real-world applications and challenges, this gap should be less pronounced than in other theoretical fields of computer science. The main question that arises is: what do we mean when we say “theory for big data”? How is it different from “classic” theoretical computer science?

There seem to be different perspectives among theoreticians on this question. Some researchers consider big data “bad news” for algorithm design, since it leads to intelligent and sophisticated algorithms being replaced by less clever algorithms that can be applied efficiently to massive data sets (as pointed out by Prabhakar Raghavan at STOC 2013).

On the other hand, many researchers have a more positive view and think of big data as a great opportunity to rethink classic techniques for algorithm design and the underlying theoretical foundations.

Moritz Hardt has an interesting blog post discussing this point. He argues that the starting point is to explore the properties that large data sets exhibit and how they might affect algorithm design.
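To make this a bit more concrete, here is a minimal, hypothetical sketch (not from Hardt’s post or my own work) of the kind of algorithm this regime calls for: the Misra-Gries heavy-hitters algorithm, which finds the approximately most frequent items in a stream using only a fixed number of counters instead of storing the whole data set.

```python
# Hypothetical illustration: Misra-Gries frequency estimation over a data stream.
# With k - 1 counters, any item occurring more than n/k times in a stream of
# length n is guaranteed to survive in the output.

from collections import Counter
from typing import Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Counter:
    """Return approximate counts for the heavy hitters of the stream."""
    counters: Counter = Counter()
    for item in stream:
        if item in counters or len(counters) < k - 1:
            counters[item] += 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Example: in a stream of length 1000, items occurring more than 250 times
# ("a" and "b" here) are guaranteed to be reported.
stream = ["a"] * 500 + ["b"] * 300 + [str(i) for i in range(200)]
print(misra_gries(stream, k=4))
```

The point of the sketch is the trade-off Hardt’s argument hints at: by accepting approximate counts, the algorithm never needs more than a constant amount of memory, regardless of how large the stream grows.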

There is an ongoing effort in the community to make the most of the big data opportunity. As part of the program “Theoretical Foundations of Big Data Analysis”, held this year at the Simons Institute for the Theory of Computing with many visiting scientists working in the area of massive data, there are several workshops and lots of interesting talks on related topics. Olivia has briefly covered one of the recent ones, “Unifying Theory and Experiment for Large-Scale Networks”, here.

In future posts, I will try to discuss some interesting problems that may be considered the core of theoretical research on big data. I will try to show why studying the theory behind big data is important, and to assess how much of this work has been effective and helpful toward the main goal of making data processing faster.


About Samira Daruki

Samira is a third-year PhD student at the University of Utah School of Computing, working in the Algorithms and Theory Lab. Her research focuses on algorithms and the theoretical foundations of massive data analysis, and she works under Suresh Venkatasubramanian. In particular, she is interested in sublinear algorithms, streaming and distributed computation, and communication complexity.
