“Big data” challenges for software engineering evolution

In software engineering, the “big data” catchphrase refers to heterogeneous, large-scale data that can stem from all phases of the software development cycle. Such data include: source code, software bugs and errors, system logs, commits, issues from bug-tracking systems, discussion threads from question-and-answer sites (e.g. stackoverflow.com), emails from mailing lists, developers’ demographic data and characteristics, and user requirements and reviews. Software engineering can benefit from these data in many ways, but there are several challenges regarding their handling.

In a nutshell, both researchers and practitioners can monitor software engineering processes and systems in order to collect data (e.g. logs, bugs, reports, and reviews) and get feedback from developers and users. They can then provide developers with better software tools and methods, and, consequently, developers can write more stable applications that offer a better user experience. In the following paragraphs, we present the challenges of obtaining and handling software data, and discuss cutting-edge data management tools and techniques.

Data collection. Although a plethora of data sources related to the software development cycle exists, there are obstacles to collecting them. The main difficulties in acquiring software data concern data privacy and the gap between the research community and industry. Specifically, it is often infeasible for researchers to obtain data from companies in order to observe how employees work or to access their source code. Consequently, researchers mostly rely on data from open-source projects, which makes it hard to generalize their findings and make recommendations to practitioners. Another problem is that communication channels are restricted even among companies. For instance, the designers of an API (e.g. Android) are not always aware of problems that manifest in client applications (i.e. API designers do not see application crashes and the associated reports). As a result, software engineers develop prototypes (e.g. static analysis tools, IDEs, recommendation systems) that may not be in line with developers’ actual requirements. Clearly, researchers and companies need to be able to access crucial data when necessary.

Data organization. In addition, given the diversity of software data (source code, bugs, logs, reviews), data scientists should know which tools and methodologies to use in each case. For instance, for textual analysis (e.g. comments, reviews, issues, bugs), there are text mining environments and tools, including RapidMiner, Mallet, the Natural Language Toolkit (NLTK) for Python users, and Apache Mahout, as well as powerful Unix tools for data cleaning and filtering, such as grep and awk [1]. In the following, we present some code snippets that show how one can organize the records of a file using simple Unix scripting.

awk '{print $1, $2, $3}' file.txt | sort | uniq > new_file.txt

This command uses the awk Unix tool to extract columns 1, 2, and 3 from file.txt, sorts the records, and keeps only the unique ones. It then writes the output to new_file.txt.

grep -E '[0-9]+' file.txt (1)
grep '^$' file.txt (2)

These commands use the grep Unix tool. In the first case, we search for lines that contain digits, whereas in the second case, we search for blank lines. More details about the regular expressions used can be found in the grep manual page.
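For readers more comfortable in Python, the same two filters can be sketched with the standard re module; the sample lines below are invented for illustration:

```python
import re

lines = ["error 404 at /api", "", "all tests passed", "   "]

# Lines that contain at least one digit (analogous to matching [0-9]+)
with_digits = [l for l in lines if re.search(r"[0-9]+", l)]

# Completely empty lines (analogous to the ^$ pattern);
# note that a whitespace-only line is not considered blank here
blank = [l for l in lines if re.fullmatch(r"", l)]

print(with_digits)  # ['error 404 at /api']
print(blank)        # ['']
```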

Regarding source code analysis, there are dedicated tools, such as FindBugs, PMD, and Soot, that perform static analysis to identify bad patterns and optimize the source code. Finally, as far as quantitative analysis is concerned, there are several statistical packages in R, as well as tools for graphics and visualization, such as gnuplot, matplotlib, and NetworkX.
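For a quick quantitative look before reaching for a full statistical package, Python's built-in statistics module often suffices; the weekly bug-report counts below are made up for illustration:

```python
import statistics

# Hypothetical number of bug reports filed per week for a project
bugs_per_week = [12, 7, 9, 15, 11, 8, 30]

mean = statistics.mean(bugs_per_week)      # average weekly load
median = statistics.median(bugs_per_week)  # robust to the outlier week
stdev = statistics.stdev(bugs_per_week)    # sample standard deviation

print(f"mean={mean:.2f} median={median} stdev={stdev:.2f}")
```

The gap between the mean and the median already hints at the outlier week, which is exactly the kind of signal one would then inspect more closely.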

Recommendations. After data cleaning, filtering, and organization, researchers look for patterns that they can use in predictions and in recommending possible improvements to systems, applications, and the software development process itself. Several approaches exist for this, including machine learning, heuristics, and pattern matching. Some popular machine learning libraries and tools are: scikit-learn for Python users, Apache Mahout, Cloudera Oryx for real-time, large-scale machine learning and predictive analysis of live streaming data, the Weka software, which provides algorithm implementations in Java, and TensorFlow, a “deep learning” framework for numerical computation using data flow graphs.
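As a minimal sketch of the heuristics-and-pattern-matching approach (deliberately simpler than the machine learning pipelines above), one can flag likely bug-fix commits by keyword; the keyword list and commit messages are invented for illustration:

```python
import re

# Naive keyword heuristic: a commit message is treated as a bug fix
# if it mentions any of these terms (word boundaries avoid e.g. "prefix")
BUGFIX_PATTERN = re.compile(r"\b(fix|fixes|fixed|bug|crash|fault)\b", re.IGNORECASE)

def looks_like_bugfix(message: str) -> bool:
    return BUGFIX_PATTERN.search(message) is not None

commits = [
    "Fix crash when parsing empty log files",
    "Add support for dark mode",
    "Fixed issue #42: NPE in report generator",
]

labels = [looks_like_bugfix(m) for m in commits]
print(labels)  # [True, False, True]
```

Such hand-written rules are often used to bootstrap a labeled dataset, which a proper classifier can then refine.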

Performance. Finally, to be effective when dealing with such “big data”, it is crucial to consider performance issues (i.e. memory consumption and processing time). Fortunately, there are appropriate frameworks, such as Hadoop, which implements the MapReduce paradigm, and Apache Lucene for search and indexing, as well as distributed database systems, such as Cassandra and MongoDB.
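The MapReduce idea behind Hadoop can be sketched in a few lines of plain Python: map each record to key-value pairs, then reduce by key. This toy word count illustrates only the programming model, not Hadoop's distributed execution; the log records are invented for illustration:

```python
from collections import defaultdict
from itertools import chain

records = ["error timeout", "error disk full", "timeout"]

# Map phase: emit a (word, 1) pair for every word in every record
mapped = chain.from_iterable(((w, 1) for w in r.split()) for r in records)

# Shuffle/reduce phase: group pairs by key and sum the counts
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(dict(counts))  # {'error': 2, 'timeout': 2, 'disk': 1, 'full': 1}
```

In Hadoop, the map and reduce steps run in parallel across a cluster, and the framework handles the grouping (“shuffle”) between them.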


[1] Diomidis Spinellis. Working with Unix Tools. IEEE Software, 22(6):9–11, November/December 2005. (doi:10.1109/MS.2005.170)
