Many years ago (I will not reveal my age), I began working on my PhD thesis in the area of Domain-Specific Languages (DSLs). Research was booming at the time, and many research articles stated in their introductions that DSLs are very useful and increase productivity, for example by reducing lines of code. These claims seemed logical to me, but I always considered them something like urban legends: we all believe they are correct, but we cannot easily prove it. Keeping that in the back of my mind, I searched for a way to bring the “legend” down to measurable facts that would provide solid motivation for the importance of DSLs in everyday programming.

I decided to do a simple experiment that measures DSL usage in open source programs. My plan of attack was simple: I narrowed my data set down to Java programs and measured the usage of packages from application libraries that implement DSLs. I selected Java for two reasons. First, it was easier to find and mine open source projects semi-automatically, with a little scripting (Python), by using the Maven repository. Second, the Java Development Kit (JDK) includes many application libraries that enable DSL support.

First, a snapshot (January 2012) of the Maven repository was downloaded locally. The repository contained various projects and their versions, each of which usually indicates a major release. Older versions were filtered out and only the most recent version of each project was kept. The final corpus included 12,959 projects and more than 110 million lines of code. The table above shows various size metrics for the selected corpus. Note that the Lines of Code (LoC) metric includes blank lines, whereas Source Lines of Code (SLoC) and Comment Lines of Code (CLoC) do not. This is why LoC != SLoC + CLoC.

A set of standard DSL application libraries was identified, and the source was scanned for specific import statements, e.g. java.util.regex, which indicates that the standard package implementing regular expressions, and thus regular expressions themselves, were used. If a package is detected in a project's source code, the project is tagged as using the corresponding DSL. So, if a project has a DSL count of four, four different application libraries were detected during the source code scan.
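
To make the import scan concrete, here is a minimal sketch of how such a detection step might look in Python. It is not the script used in the experiment. The function names and the package table are hypothetical: only java.util.regex is named in this post, and the other entries are illustrative guesses at what a list of DSL packages could contain.

```python
import re

# Hypothetical package-to-DSL table. Only java.util.regex comes from the
# post; the remaining entries are illustrative assumptions.
DSL_PACKAGES = {
    "java.util.regex": "regex",
    "java.sql": "SQL",
    "javax.xml.parsers": "XML",
}

# Matches "import a.b.C;", "import a.b.*;" and "import static a.b.C.m;"
# at the start of a line, capturing the dotted name without a trailing ".*".
IMPORT_RE = re.compile(
    r"^\s*import\s+(?:static\s+)?([\w.]+?)(?:\.\*)?\s*;", re.MULTILINE
)


def detect_dsls(java_source):
    """Return the set of DSL names whose packages are imported in one file."""
    found = set()
    for imported in IMPORT_RE.findall(java_source):
        for package, dsl in DSL_PACKAGES.items():
            # Match the package itself or anything nested below it.
            if imported == package or imported.startswith(package + "."):
                found.add(dsl)
    return found


def dsl_count(java_sources):
    """DSL count of a project: distinct DSLs detected across all its files."""
    project_dsls = set()
    for source in java_sources:
        project_dsls |= detect_dsls(source)
    return len(project_dsls)


if __name__ == "__main__":
    files = [
        "package demo;\nimport java.util.regex.Pattern;\nimport java.util.List;\n",
        "package demo;\nimport java.sql.*;\n",
    ]
    print(dsl_count(files))
```

A line-based scan like this is a simplification: it skips imports hidden behind a line comment, but it would still count an import that appears inside a block comment, and it misses code that uses fully qualified class names without any import.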
The initial goal of this experiment was simply to provide quantifiable results that indicate how much each DSL is used in Java; thus, only files containing Java code were scanned and counted. Build files or other resources that contain other DSLs, e.g. Apache Ant, were not included. One final assumption was also made: if XML Path Language (XPath) or Extensible Stylesheet Language Transformations (XSLT) was found in the source code, the project was also marked as using XML. This is logical, since those two languages are used to query and transform XML Document Object Model (DOM) trees.

The results were not surprising, and the legends seem to be true. More than one-third of the projects (4,655) use one or more DSLs. The most popular DSL is XML, with 75% usage in Java projects, followed by regular expressions with 25%. Another interesting aspect of the results concerns the most popular combinations of DSLs. As expected, XML rules the DSL popularity contest, followed by regular expressions and SQL.

The results of this experiment show that DSLs are commonly used and very popular in the Java software ecosystem. Note that this experiment was not exhaustive. Had the scan looked for more than 20 application libraries and taken peripheral files into account, the measured DSL usage would have been higher. Now I have my motivation: DSLs are really popular, and with modern approaches such as Scala and SugarJ, it seems they will become even more popular in the future.