Last week, PLoS published an updated data policy that requires the data behind each published paper to be publicly available. The specific wording was:
authors must make all data publicly available, without restriction, immediately upon publication of the article
Apparently the only change is that the publication is now required to state where the data is available, whereas before this was only suggested. The post got a strong response, and they have updated it to address some of the questions received. However, this is a great opportunity to ask ourselves why there is such strong resistance to sharing data.
Via Twitter, I stumbled upon this post on Neuropolarbear’s blog that lists several objections to the new policy. It is a good starting point for discussing the objections to data sharing that researchers usually have:
1. The policy implies major benefit of data sharing is new discoveries. Authorship on articles resulting is a frequently debated topic. Does PLoS have a policy on whether scoopers need to at least offer middle-position authorship to the people who collected the data?
This will probably be a much-debated issue, but in my opinion there should be no expectation of authorship. You are not a co-author if someone uses your paper in a review, so why should you have this privilege with a bare dataset? Being an author entails both ownership of ideas and responsibility for what is published. I would not want either for someone else’s work, unless they want my opinion or expertise. People can already use your data in meta-analyses. Using the raw data is no different; it just makes it easier to build a more robust meta-analysis.
2. Does PLoS propose any protections for authors who are worried someone will scoop them on reanalysis of their own data? How about a special vault where the data is posted publicly in one year?
Why would you re-analyze the data? PLoS is only requiring the data that relates to the conclusions presented in the paper, not the whole dataset collected as part of a project. Therefore, you have already said most of what you wanted to say about that data in the paper, so what reason would there be to re-analyze it? Furthermore, what is the chance that someone else will arrive at the same ideas, hypotheses, analyses and conclusions that you would from that dataset? If it is an obvious idea, then you should publish it in the same paper or simultaneously in another paper.
The problem with releasing the data one year later is that it defeats the purpose. The people with the most interest in the data will read the paper as soon as it is published but will have to wait a year for unclear reasons.
3. PLoS argues that data sharing makes life easier for authors. I (along with Drugmonkey) think this is wrong; if that proves to be the case, and it becomes clear curation is a large burden, will PLoS rethink their policy?
Why would curation of your dataset be a burden? The dataset had to be organized and stored in some way for the authors to be able to analyze it, so most of the work has already been done. I work with terabytes of sound recordings and databases with millions of rows, and the most burdensome task I have found is writing a metadata file, which is just good practice anyway. The dataset can be messy and noisy; everyone understands that this can happen and that it is part of the process.
5. PLoS’ response to researchers worried about being scooped on follow-ups offers no succor. Does PLoS recommend that researchers who are nonetheless still concerned simply submit to another journal?
If all someone else needs to “scoop” your idea is some data, you will get scooped sooner or later. Publish it to claim your idea.
9. Should I recuse myself from reviewing a paper in which I cannot evaluate the raw data? I currently review lots of MRI papers but raw MRI data may as well be ancient Etruscan as far as I am concerned. From this policy, it would be scientific malpractice for me to even pretend to review an MRI paper under these guidelines.
I found this statement puzzling. This tweet also seems to touch on this idea:
In any field in which you are an expert, you should be able to manage and analyze the data generated by other researchers. Maybe the paper uses a new method or type of data, but then you can use the paper to verify whether the methods are clear enough for a competent researcher to carry out the same analysis. The data should be stored in standard or widely used formats unless there is a good reason to use some exotic format. If not, then you are basing your paper on unsubstantiated analyses or data and it should be suspect.
Erin C. McKiernan posted some thoughts on how this could reduce diversity in a journal like PLoS, because small labs and labs in countries with little research funding might avoid it. While there can be a worry about being scooped, and McKiernan makes it clear there appears to be no data on this problem, we must consider something else. The internet has allowed small companies and ideas to explode by providing a more level playing field for small and huge companies alike. Large labs and senior researchers are already established; it is harder for more junior researchers to get noticed. We use the web to try to promote ourselves. Why not use the data as another way to attract attention to the work we do? In my experience, sharing has led to numerous new interactions and to people discovering my data and papers. I don’t believe scoops would outweigh the positive impacts of sharing the data.
Another blogger that has posted objections is Orac:
One issue that was brought up that probably isn’t a huge consideration is that some datasets are too large to share easily. […] other types of data lack such public databases.
I found this interesting because it points to a problem that should not be solved by PLoS, but by scientific societies. If the data you use requires a particular infrastructure, then it is the responsibility of the researchers in the field to build it. I am facing a similar problem with gigabytes and terabytes of audio data. So far, I’ve been able to use FigShare and DataDryad, but I will probably have to host the bulk of my dissertation data on my own server. This illustrates that it is time for researchers in the field to admit there is a problem and find ways to solve it. “Because it is difficult” cannot be an excuse not to share data.
Several people were linking to Drugmonkey’s arguments. It is unfortunate that they are mixed with a hatred of the humanities and of people who want to use other people’s data. Tweets have gone even further to the hateful side by calling researchers who want to analyze someone else’s data “leeches.” Drugmonkey posted several objections to the new policy:
The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time, to address the legitimate sins of the few.
While fraud-prevention is a good reason to force publishing raw data, this is not the only reason. I’ve read plenty of papers that would have been a lot more useful if I had had the chance to repeat the analysis and learn from it. In addition, young researchers and students usually have no funds to collect lots of data and can put their ideas out there for the benefit of everyone if they can use available data.
This Data Access policy requires much additional data curation which will take time. We all handle data in the way that has proved most effective for us in our operations.
As with the post by Neuropolarbear, this is puzzling. We all understand the problems of collecting data and that it might be in a messy format. A good metadata file will take care of all these problems. Either we use standard formats for the field (wav for bioacoustics; vector and raster formats for landscape ecology; csv files can store database tables; etc.), or we had to create a particular format. Either way, documenting the way the data was collected and organized will help the authors in the future.
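To make the point concrete, here is a minimal sketch of how a metadata file for a CSV table could be generated in Python. The field names (`description`, `collected_by`, and so on) and the `.metadata.json` naming are assumptions made up for this illustration, not a standard required by PLoS or by any repository; a real archive may ask for its own schema.

```python
import csv
import hashlib
import json
from pathlib import Path


def describe_csv(data_path, description, collected_by):
    """Build a metadata record for a CSV file (field names are illustrative)."""
    data_path = Path(data_path)
    with data_path.open(newline="") as handle:
        reader = csv.reader(handle)
        columns = next(reader, [])       # first row is assumed to be the header
        n_rows = sum(1 for _ in reader)  # every remaining row counts as data
    # A checksum lets others confirm they downloaded the same file.
    checksum = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return {
        "file": data_path.name,
        "format": "csv",
        "description": description,
        "collected_by": collected_by,
        "columns": columns,
        "rows": n_rows,
        "sha256": checksum,
    }


def write_metadata(data_path, description, collected_by):
    """Write <name>.metadata.json next to the data file and return its path."""
    data_path = Path(data_path)
    record = describe_csv(data_path, description, collected_by)
    out_path = data_path.with_name(data_path.stem + ".metadata.json")
    out_path.write_text(json.dumps(record, indent=2))
    return out_path
```

Calling `write_metadata("recordings.csv", "Point-count recordings", "J. Doe")` (a hypothetical file and author) would write `recordings.metadata.json` beside the data. The free-text description of how the data was collected still has to be written by hand, but the mechanical part takes only a few lines.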
Maybe the proprietary software we use differs and the smoothest way to manipulate data is different. We use different statistical and graphing programs. Software versions change.
A metadata file will take care of most of these problems. In any case, these are objections to using someone else’s data, not to sharing it.
Some people’s datasets are so large as to challenge the capability of regular-old, desktop computer and storage hardware.
This is equally puzzling. If I want to re-analyze your data or use it in some other way, I should be competent enough to understand the system requirements for it. This is, again, an argument about the use of the data, not about sharing it, and it presumes that other researchers will not be able to figure out these kinds of issues on their own.
This diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that.
The paper must have used a standard method to analyze the data. Some datasets will require more work than others. There is no requirement for a specific way to format the data, just that it is made available. The only exception would be particular fields that have agreed on particular formats or ways to store or collect data (e.g. GenBank).
Drugmonkey’s post suffers from an additional problem: anti-humanist arguments that have nothing to do with data sharing. In particular:
The second incident has to do with accusations of self-plagiarism based on the sorts of default Methods statements or Introduction and/or Discussion points that get repeated. Look there are only so many ways to say “and thus we prove a new facet of how the PhysioWhimple nucleus controls Bunny Hopping”. […] This is why concepts of what is “plagiarism” in science cannot be aligned with concepts of plagiarism in a bit of humanities text.
This has nothing to do with science or humanities, but with the law. A published work is owned by someone and cannot be used without attribution. It is a pain to have to find a new way to explain the same methods, but this can be fixed by proposing standard methods for the field. No one explains what t-tests, ANOVAs, or multiple regressions do. If a method is used so much, it should have a name and a standard reference that can be cited for details.
Are the standards for dropping subjects the same in every possible experiments. (answer: no) Who annotates the files so that any idiot humanities-major on the editorial staff of PLoS can understand that it is complete?
No editorial staff has ever asked me to explain a figure, table, or appendix. That is the work of the experts in the field: the editors and the reviewers. Why would this be different for data?
I have worked with researchers who do not want to share their data because, when they have done so in the past, it was “misused.” This is not a valid reason to withhold data either. Your paper might be misunderstood and therefore cited in the wrong context or to substantiate an idea it cannot support. Why should data be any different? The authors who use a dataset are responsible for using it correctly, not you. If someone publishes a paper using my data to say something the data cannot support, I can (and have done so) contact the editor to request space to publish a response or to request that the paper be retracted. Science must be open and not subjected to the same culture as industry, where money is the only thing that matters.
Another perspective we should consider is that publishing data will help train future scientists. We have all faced many problems when embarking on a new area just because we had no dataset to analyze and were unaware of the limitations or possibilities. If we share more data, students and researchers dipping their toes in a new area will have a better chance at success, since they won’t have to wait months or years to collect data to see if their ideas pan out.
Perhaps this jealousy over data is not about data, but about a larger problem: how research is rewarded. It is our responsibility to make sure the work of a researcher is not reduced to an index, but we all have to be in the same boat for this to work. Research cannot be quantified the same way as factory work because the nature of the work is completely different. Let’s avoid treating research as a means to increase your citation index.
In closing, a new perspective paper in PLoS Biology discusses some of these issues and will be helpful in the further discussions about these problems:
Troubleshooting Public Data Archiving: Suggestions to Increase Participation (DOI: 10.1371/journal.pbio.1001779)
In particular, they say:
In our experience, however, individuals are most concerned about the loss of priority access following PDA, which could generate competition with others when conducting subsequent analyses.
Why is everyone so afraid of being scooped? Yes, it might happen, but I think the fear is exaggerated. We should work on ways to promote and provide incentives for data sharing, to show that the benefits are far larger than any damage, imagined or real, that scooping may cause.
Probably the biggest hurdle in data sharing is culture and lack of incentive. Young researchers can try to convince their senior co-authors of the benefits of sharing data by using someone else’s data to strengthen a point in the paper. Storing the data in public archives that provide DOI numbers allows it to be cited, which provides an incentive if these can be tracked as easily as paper citations.