Topic modeling is an Information Retrieval (IR) technique that discovers representative topics in a collection of documents. It rests on the expectation that logically related words co-occur in the same document more often than words from different topics. For example, a document about space is more likely to contain words such as planet, satellite, universe, galaxy, and asteroid, whereas a document about wildlife is more likely to contain words such as ecosystem, species, animal, plant, and landscape. But why is text classification so useful? In this blog post, we try to explain the importance of topic modeling and its use in software engineering.
With the advent of the “Big Data” and “Data Science” era, organizing data into categories and classes has become a top requirement for scientists in many fields (e.g., bioengineering, social sciences, space sciences, software engineering, and robotics) who need to understand and analyze that data. This is because scientists need to identify patterns in large datasets and make predictions based on known patterns. For this reason, machine learning and natural language processing are now among the most popular methods for studying big data.
In particular, a well-known model used in natural language processing is Latent Dirichlet Allocation (LDA), which is an example of a topic model. LDA can automatically discover the topics that documents or parts of text contain (for more details about LDA, see Blei 2012). We refer to LDA because many tools already implement it, so one can apply it without writing the algorithm from scratch. For example, see topicmodels, MALLET, Mahout, lda-c, and the Stanford Topic Modeling Toolbox.
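To make the idea more concrete, here is a minimal sketch of LDA in plain Python. It fits the model with collapsed Gibbs sampling, which is one common way to do so and not necessarily what the tools listed above use. The function name lda_gibbs, the hyperparameter values (alpha, beta), the iteration count, and the toy corpus are all assumptions chosen for illustration. For real work, one of the existing tools is a better choice.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists. alpha and beta are the Dirichlet
    smoothing parameters; the default values here are assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []                                     # topic of every token

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            update(d, w, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                update(d, w, assignments[d][i], -1)
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                update(d, w, k, +1)
    return doc_topic, topic_word


# Hypothetical toy corpus built from the space and wildlife words used earlier.
docs = [
    "planet satellite galaxy asteroid universe planet".split(),
    "galaxy universe planet satellite".split(),
    "species animal plant ecosystem landscape animal".split(),
    "ecosystem species plant animal".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(3) if c > 0]
    print("topic", k, top)
```

On a corpus this small, the two topics usually separate into space words and wildlife words, but the sampler is random, so a different seed or a different corpus may give less clean results.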
In software engineering, topic modeling can be applied in a variety of studies. For instance, it can find patterns in social media (Twitter, Facebook, etc.) and identify trends in conversations by categorizing posts into different topics. In addition, topic modeling can be used in security to identify malicious applications (i.e., applications that do different things from what their descriptions say). Finally, topic modeling can be used in mining software repositories by analyzing elements that resemble text documents, such as source code, bugs, issue reports, crash reports, mailing lists, and commits.
Topic modeling is a new trend that can offer fast and accurate results when we need to analyze large datasets. However, traditional natural language processing techniques, such as regular expressions, remain powerful for identifying simple patterns and for cleaning the data to be processed.
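As an illustration of the cleaning step, the sketch below uses regular expressions to turn a raw commit message into tokens that a topic model could consume. The patterns, the tiny stop-word list, the minimum token length, and the sample message are all assumptions made for this example. A real pipeline would tune them to its own data.

```python
import re

# Hypothetical cleaning rules; adjust them to the dataset at hand.
URL_RE = re.compile(r"https?://\S+")    # links carry little topical signal
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # digits, punctuation, symbols
STOPWORDS = {"the", "and", "for", "with", "this", "that"}


def clean(text):
    """Lowercase the text, strip URLs and non-letters, and drop short or stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return [t for t in text.split() if len(t) > 2 and t not in STOPWORDS]


message = "Fixed the NullPointerException in Parser.java, see https://example.org/issues/42!"
tokens = clean(message)
print(tokens)
```

The resulting token lists can then be passed to whichever topic modeling tool is in use.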
 David M. Blei. 2012. Probabilistic topic models. Commun. ACM 55, 4 (April 2012), 77-84. DOI=10.1145/2133806.2133826 http://doi.acm.org/10.1145/2133806.2133826