sharing code and data
I recently discussed an aspect of open research that is of particular relevance to our lab: releasing the code and data associated with a project (slides). The general topics of open research and open science are already quite broad, and they are evolving. Sharing everything, the end of publishing as we know it (as a reason for not worrying about being scooped), and alternative means of review were all ‘showing up’ at the last SoNYC, which was supposed to be about whether scientists are antisocial.
But my focus for ETS was more applied and narrow. I wanted only to consider these issues: why should we share our code and data; why shouldn’t we; if we’re going to, what are some things we should think about; if we write a plan for sharing, what should it include; where do we share the stuff; and who else out there is thinking about or discussing these things.
Before going to the first point, I want to say that I think sharing is the right thing to do. In saying that, I don’t mean there aren’t reasons not to do it, nor do I mean we are all in a position to share everything. I just mean that it seems right, that we’re better off doing it, and that the field is better off with people doing it.
Reasons to share
Reasons to share include that generating verifiable knowledge is a central goal of science, something that is generally impossible based only on what’s presented at conferences and in papers. Sharing allows future generations to build on the work of previous generations. These are kind of the same point, and they are both part of a larger point that sharing allows others to replicate, validate and correct your results. It’s easy to imagine scenarios where you wouldn’t want that. You just published something, and someone comes along and shows how you messed up. It’s so bad that you need to retract something you just put out. But if you think about it, do you really think it’s the right move to try to protect yourself from your own errors? Is that a ‘strategy for the future’?
Some funding agencies require sharing (e.g., NSF, maybe NIH), as do some journals (e.g., Nature, PLoS). To the previous point, even if you had an interest in preventing others from discovering your mistakes, those who pay for it and who work towards communicating it probably don’t. One issue, though, is what the arsenal of rewards and punishments of these organizations is or whether they even have one. And what happens when they run into conflicts? What happens, for example, if your university doesn’t want you to share something, but your funding does, and of the two publications based on it, one journal requires it to happen one way, and the other another?
Sharing is also cool because it makes people more aware of your stuff. That seems like it’s especially true if you’re known for sharing. I don’t know of evidence that supports it, but it seems obvious that sharing increases the impact of your publications. It might also lead to an ever-growing set of microcitations, which couldn’t hurt.
There are also many other more selfish reasons, such as that it preserves your stuff for your own future use. You’ll be able to identify, retrieve and understand the data long after you’ve lost familiarity with it. As you prepare to share, the habits you build will likely make it easier to go back to a particular point in the history of the project. You might do so for a particular figure. You’ll see what you ran, what parameters you used, what version of the code you used, what data were there, whose data it was, whether you’d whitened it or not, whether you’d discussed it with your lab prior to that point, etc.
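One way to build that habit is to save a small record next to every figure you produce. What follows is only a sketch, not something from the talk: the function name write_run_record, the field names and the example values are all made up for illustration, and you would adapt them to whatever your project actually tracks.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_run_record(figure_path, params, code_version, data_files, notes=""):
    """Write a JSON sidecar next to a figure describing how it was produced.

    All field names here are hypothetical; keep whatever your lab cares about.
    """
    figure_path = Path(figure_path)
    record = {
        "figure": figure_path.name,
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,  # e.g. a version-control revision ID
        "parameters": params,          # whatever was passed to the analysis
        "data_files": [str(p) for p in data_files],
        "notes": notes,                # e.g. "data whitened before fitting"
    }
    # fig3.png -> fig3.run.json, stored in the same directory as the figure.
    sidecar = figure_path.with_suffix(".run.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar


if __name__ == "__main__":
    # Hypothetical example values, written to a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        sidecar = write_run_record(
            Path(tmp) / "fig3.png",
            params={"whiten": True, "n_components": 20},
            code_version="r1234",
            data_files=["data/subject01.mat"],
            notes="discussed at lab meeting before this run",
        )
        print(sidecar.read_text())
```

A plain JSON file is used here because it stays readable without any special software, which matters if you come back to it years later.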
Sharing might also be a great way for your students to learn about these things. Misha made a point here that it could be bad for your students to have code and data from previous projects. He was essentially saying that it’s part of their learning process to get to the point of being able to replicate things themselves. I see the point, but just don’t agree. I’m not convinced anyway that it’s even possible for students, or anyone else, to replicate things based only on what’s out there. I also think it allows them to be better, by seeing what you did and being able to quickly build on it. Many of us have been handed other people’s code. It’s often quite confusing or doesn’t do exactly what we want, but it’s enough to get us going. That’s made easier by sharing and it would be possible for many people, even people we don’t know or who aren’t willing to ask us via email.
Reasons not to share
You no longer have the code or the data or both. One reason for that might be that the code evolved to solve other problems. We all agree that evolving code is a good thing. It’s connected to the unit testing stuff that Rich discussed. But not being able to go back is a bad thing and there are tools that can help you to solve this. These include version control software (e.g., Subversion, Mercurial and Git).
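If the project lives in Git, the revision you ran can be looked up automatically rather than remembered. The sketch below assumes the git command-line tool is installed and that the code sits in a Git repository; the two function names are made up for illustration, and Subversion or Mercurial would need their own equivalent commands.

```python
import subprocess


def current_git_revision(repo_dir="."):
    """Return the commit hash checked out in repo_dir, or None if unknown.

    None is returned when git is not installed, repo_dir is not a
    repository, or the repository has no commits yet.
    """
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def has_uncommitted_changes(repo_dir="."):
    """Return True or False for a Git working tree, or None if unknown."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    # --porcelain prints one line per changed or untracked file.
    return bool(result.stdout.strip())


if __name__ == "__main__":
    print("revision:", current_git_revision())
    print("uncommitted changes:", has_uncommitted_changes())
```

Checking for uncommitted changes matters because a commit hash only identifies the code you ran if nothing was edited after the commit.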
Another thing is that cleaning stuff up is a lot of work and might only be achievable by living in a more rigid world, such as one that includes lab-level standardization. True. But again, being motivated to keep cleaner code behooves you, and while it might seem slower in the near term, it could actually save you time in the long run. You can build on clean code. Other people can help you with clean code, because they can read it. You can adopt clean code more easily. Having said that, even if you have zero time for cleaning what you have, sharing anything is probably quite helpful in and of itself. It would allow others to check the values of everything, including various things you forgot to mention in a paper or couldn’t mention on a poster.
A related point is that your stuff may only run on commercial, proprietary or copyrighted software that cannot be distributed or that takes special hardware to run. True again. Here again, anything is better than nothing. You can release it as text and someone can inspect and rewrite it. And, regardless, this seems unlikely to apply in most cases. The people who would be interested in running your MATLAB code probably have access to MATLAB themselves.
How to do it [first paragraph is from Yale Law School Roundtable on Data and Code Sharing]
When you publish work, provide links to the source code. Assign a unique ID to each version of released code, and update the ID whenever the code or data change. A version-control system can be used in conjunction with a unique identifier (e.g., Universal Numerical Fingerprint) for data. Include a statement describing the computing environment and software version used in the publication, with stable links to the accompanying code and data. Use open licensing for code to facilitate reuse. Use an open access contract for published papers and make preprints available on a site such as arXiv.org, PubMed Central, or Harvard’s Dataverse Network. Publish data and code in nonproprietary formats wherever reasonably concordant with established practices, opting for formats that you believe will be readable well into the future.
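To make the ID and environment-statement ideas concrete, here is a rough sketch. Note that it does not implement the Universal Numerical Fingerprint: it swaps in a plain SHA-256 hash of the file's bytes, which is simpler but changes whenever the file format changes, even if the numbers inside are the same. The function names are made up for illustration.

```python
import hashlib
import platform
import tempfile
from pathlib import Path


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's bytes.

    This is a stand-in for a data ID, not a Universal Numerical Fingerprint.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large data files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def environment_statement():
    """Return a one-line description of the interpreter and platform."""
    return (
        f"Python {platform.python_version()} on "
        f"{platform.system()} {platform.release()} ({platform.machine()})"
    )


if __name__ == "__main__":
    # Hypothetical data file, created in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "responses.csv"
        data_file.write_text("trial,rate\n1,12.5\n2,14.0\n")
        print("data ID:", file_fingerprint(data_file))
    print("environment:", environment_statement())
```

Any change to the file produces a different digest, so the ID updates whenever the data change. A fuller environment statement would also list the versions of the libraries the analysis depends on.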
Besides the above stuff, the code/data sharing plan that you write should answer these questions: what code and data will be shared; who will have access to it (it should be as broad as possible); where will you put it (it should go to places dedicated to hosting it); when will you share it (it should be shared as soon as possible and for as long as possible); and how will people locate and access it.
You should put it on an institutional or university web page. You should put it in an openly accessible third-party archived website designed for this stuff. I only know of some of these places and am still discovering them. For code stuff, I’ve found GitHub (supposedly the one preferred by cool people); SourceForge (like others in this set, more for software, but could work); and Bitbucket. For data stuff, Dryad (seems pretty good, although not much there yet). Some other stuff that I’m not sure about or which may not be relevant: the NeuroCommons Project (Science Commons); Linked Open Data; OAIster; DSpace or Harvard-MIT Data Center (HMDC) (but NYU doesn’t have an equivalent); CODATA (but not clear where to go with stuff); and Google Code.
A survey of willingness to share original data [from Savage and Vickers 2009]

People who are into this stuff
Randall LeVeque (U of Washington)
Roger Peng and Sandrah Eckel (Johns Hopkins+USC)
Sergey Fomel and Jon Claerbout (U of Texas+Stanford)
Kaitlin Thaney (Science Commons)
Michael Friedlander (Virginia Tech)
Ian Mitchell (U of British Columbia)
Lisa Larrimore Ouellette (Yale Law School)
Jelena Kovačević and Martin Vetterli (Berkeley)
Nikolaus Kriegeskorte (Cambridge)