When a scientific experiment achieves the expected result, researchers rush to draft a manuscript, submit it and cross their fingers for acceptance. When the paper is accepted for publication, they are happy campers! Not much later, however, they discover that it was a fool’s paradise: their work never gets cited by peers. Often this is simply because others cannot reproduce the experiment and therefore cannot compare it to their own. There are a few reasons that block research reproducibility. In this post, I will go over some that frequently appear in the field of computational science.
Reproducibility in Computational Science
Computational science has led to exciting new developments. Widely available datasets allow researchers to run more and more computations and hence more experiments. All steps of the scientific process, from data collection and processing to analysis, visualization and conclusions, depend ever more on computations and algorithms.
The above figure shows a typical scenario in the daily life of researchers who try to work with someone else’s code. Some of the technical challenges we face are:
1. Dependency Hell
Installing or building the software needed to run the source code assumes that you can recreate the original authors’ computational environment. I personally hate spending days installing dependencies for something that might or might not work in the end.
2. Imprecise Documentation
Holes in the documentation are also barriers for “newbies”, who may well be experts in a language other than the one used in the research study.
3. Code Rot
Is the code robust to changes? Software packages receive frequent updates that fix bugs, add features or deprecate old ones, so code that ran at publication time may stop working later. Maintaining multiple versions of the same code to keep up might not be the best idea for your audience.
Studies show that less than 50% of software could be successfully built or installed. Isn’t it time to think about how your new algorithm will survive the coming years?
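To make the dependency and code rot problems a bit more concrete, here is a minimal Python sketch of one low-tech habit: record the exact package versions you ran with, and later check whether the installed versions have drifted. The function names snapshot_environment and find_drift are made up for this illustration, and the sketch only covers Python packages, not system libraries or the operating system.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' lines for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


def find_drift(pinned_lines):
    """Compare pinned 'name==version' lines with what is installed now.

    Returns a dict mapping package name to (pinned, installed), where
    installed is None when the package is missing. Lines that are empty,
    comments, or not pinned with '==' are ignored.
    """
    drift = {}
    for line in pinned_lines:
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, pinned = line.split("==", 1)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != pinned:
            drift[name] = (pinned, installed)
    return drift
```

Saving the output of snapshot_environment() next to your paper’s code gives readers a list of versions to install, and running find_drift() on that list tells them which packages differ on their machine before they spend days debugging.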
There are two major approaches to solving the problem. The first is to use workflow software. These are well-funded, well-designed tools that can graphically connect multiple different tools, and they handle issues like version dependencies. However, they suffer from a low general adoption rate and are mostly used in the context they were developed for. The second approach is to use virtual machines (VMs). A virtual machine captures the whole computational environment, from the operating system through the dependencies up to the user interface code at the top. Nonetheless, VMs are too much of a black box for researchers who want to take a published algorithm and easily modify it.
A more recent approach is to use software containers. Containers wrap the whole computational environment in lightweight virtualization software. The most popular implementation right now is Docker. The advantage of Docker containers is that they can be built and run on any operating system that has Docker Engine installed. There is no need to carry any other dependencies, as they are already included at the time the algorithm is wrapped. You can also share your algorithm’s container through Docker Hub, a freely available repository for shipping containers to any environment.
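As a rough illustration of what “wrapping” an algorithm means, the Python sketch below generates the text of a small Dockerfile. Everything specific in it is an assumption made for the example: the base image python:3.11-slim, a requirements.txt file listing your dependencies, and an entry script called run_algorithm.py. It is not the AlgoRun template, just a generic sketch of the idea.

```python
def render_dockerfile(base_image="python:3.11-slim", entry_script="run_algorithm.py"):
    """Return the text of a minimal Dockerfile for a Python-based algorithm.

    Both default values are placeholders for this sketch; replace them with
    the image and script that match your own project.
    """
    lines = [
        f"FROM {base_image}",  # start from a fixed base image
        "WORKDIR /app",  # directory inside the container
        "COPY requirements.txt .",  # copy the pinned dependency list first
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",  # then copy the algorithm's source code
        f'CMD ["python", "{entry_script}"]',  # command run when the container starts
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the file next to your source code.
    with open("Dockerfile", "w") as handle:
        handle.write(render_dockerfile())
```

With the generated Dockerfile in your project folder, something like docker build -t my-algorithm . followed by docker run --rm my-algorithm would build and start the container, where my-algorithm is just an example image name.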
Now, it’s your turn
At the Center for Quantitative Medicine, we have been working on improving collaboration within our team. As our members are mostly mathematicians and biomedical scientists, we have developed a Docker container template that lets them easily publish their computational algorithms. It takes a few minutes to make your algorithm immediately available for others to use. Jump to the AlgoRun website to start using it. The AlgoRun Docker container template requires no change to your source code. Get your research cited by giving users a single-click web page to test it.
How do you ensure your algorithm can be used by others when published? Share with us in the comments!