On Unpublished Software
I ran across this post at The Tree of Life entitled “Interesting new metagenomics paper w/ one big big big caveat – critical software not available”.
The long and short of it? Paper appears in Science, has fancy new methodology, lacks the software someone else would need to use that methodology. Blog author understandably annoyed. But I have some sympathy with the authors of the paper itself, as much as I prefer the code for an analysis to be available at publication. My thoughts after the jump.
First, this post was born as a comment on the original blog post. For some reason, Blogger hates my WordPress login.
Second, I understand the spirit of the post. In an ideal world, the software would be available. From a stable URL. For every single platform under the sun. Or maybe even a web interface, if the project was feeling particularly fancy. This is How It Should Be ™.
That being said, there are reasons that isn’t always the case. I’ve had one collaborator essentially decide not to give out code because further analysis, destined for future papers, was already baked in. They weren’t going to go through their code line by line to make sure some poor grad student’s project didn’t get scooped by someone reading the code carefully, or to make sure that removing that stuff didn’t otherwise break the software.
But let’s leave that aside for the moment, and say – as the original published paper did – that the purpose of all this is a new technique, with new software, that we’re hoping people use. There are still reasons for the paper to come out and the software not be available yet – legitimate reasons. The development of software and its use in science, while very closely linked, are actually disjoint processes that need not progress at the same pace. Some issues that have happened to me:
Beautiful software, ready to go, sitting idle for months waiting for the right numbers to come in to make it usable.
Output from software that is interesting science on its own, but the software isn’t ready for primetime. Maybe it’s got a hideous command line interface with a dozen opaque arguments that appear in no logical order. Maybe the quick and dirty solution to something that produced interesting results for several datasets is inefficient enough that it needs more memory than something like it should. Maybe the documentation is a series of ad-hoc scribbles on a whiteboard. Or maybe it works on your machine, works on a student’s machine, but utterly breaks the first time you try it on a colleague’s. Or perhaps you simply want to make it better, and while the science is ready to go and won’t improve by festering for six months, in six months you could have a GUI. Or better performance. Or cross-platform software. Or all of the above.
I can understand the blogger’s frustration. And papers that do this should absolutely both provide enough methods detail that you could write your own software if you had the inclination, and focus on those methods, not mysterious code you don’t have access to. But when I read a paper where there’s clearly software to be had, but it’s not available yet, my first thought is “What went wrong?”