Some comments on researchers that do not want to share data

Last week, PLoS published an updated data policy requiring that the data behind each published paper be publicly available. The specific wording was:

authors must make all data publicly available, without restriction, immediately upon publication of the article

Apparently the only change is that the publication is now required to state where the data is available, whereas before this was only suggested. The post got a strong response, and they have updated it to address some of the questions received. However, this is a great opportunity to ask ourselves why there is such strong resistance to sharing data.

Via Twitter, I stumbled upon this post in Neuropolarbear’s blog that listed several objections to the new policy. It is a good starting point for the discussion on the objections to data sharing that researchers usually have:

1. The policy implies major benefit of data sharing is new discoveries. Authorship on articles resulting is a frequently debated topic. Does PLoS have a policy on whether scoopers need to at least offer middle-position authorship to the people who collected the data?

This will probably be a much-debated issue, but in my opinion there should be no expectation of authorship. You are not a co-author if someone uses your paper in a review, so why should you have this privilege with a bare dataset? Being an author entails both ownership of ideas and responsibility for what is published. I would not want either for someone else’s work, unless they want my opinion or expertise. People can already use your data in meta-analyses. Using the raw data is no different; it just makes it easier to build a more robust meta-analysis.

2. Does PLoS propose any protections for authors who are worried someone will scoop them on reanalysis of their own data? How about a special vault where the data is posted publicly in one year?

Why would you re-analyze the data? PLoS is only requiring the data that relates to the conclusions presented in the paper, not the whole dataset collected as part of a project. Therefore, you have said most of what you wanted to say in the paper regarding that data, so what reason would there be to re-analyze it? Furthermore, what is the chance that someone else will have the same ideas, hypotheses, analyses and conclusions that you will draw from that dataset? If it is an obvious idea, then you should publish it in the same paper or simultaneously in another paper.

The problem with publishing the data one year later is that it defeats the purpose. The people most interested in the data will read the paper as soon as it is published but will have to wait a year for unclear reasons.

3. PLoS argues that data sharing makes life easier for authors. I (along with Drugmonkey) think this is wrong; if that proves to be the case, and it becomes clear curation is a large burden, will PLoS rethink their policy?

Why would curation of your dataset be a burden? The dataset had to be organized and stored in some way for the authors to be able to analyze it; therefore, most of the work has already been done. I work with terabytes of sound recordings and databases with millions of rows, and the most burdensome task I have found is writing a metadata file, which is just good practice anyway. The dataset can be messy and noisy; everyone understands that this can happen and that it is part of the process.

5. PLoS’ response to researchers worried about being scooped on follow-ups offers no succor. Does PLoS recommend that researchers who are nonetheless still concerned simply submit to another journal?

If all someone else needs to “scoop” your idea is some data, you will get scooped sooner or later. Publish it to claim your idea.

9. Should I recuse myself from reviewing a paper in which I cannot evaluate the raw data? I currently review lots of MRI papers but raw MRI data may as well be ancient Etruscan as far as I am concerned. From this policy, it would be scientific malpractice for me to even pretend to review an MRI paper under these guidelines.

I found this statement puzzling. This tweet also seems to touch on this idea:

Michael Waskom (‏@michaelwaskom): The most literal reading of the @PLOS guidelines means I'll be sharing k-space data in custom-format spiral .pfiles, so have fun with that

In any field in which you are an expert, you should be able to manage and analyze the data generated by other researchers. Maybe the paper uses a new method or type of data, but then you can use the paper to verify whether the methods are clear enough for a competent researcher to carry out the same analysis. The data should be stored in standard or widely used formats unless there is a good reason to use some exotic format. If not, then you are basing your paper on unsubstantiated analyses or data and it should be suspect.

 

Erin C. McKiernan posted some ideas on how this could reduce diversity in a journal like PLoS, because small labs and labs in countries with little research funding might avoid it. While there can be a worry about being scooped, and McKiernan makes it clear there appears to be no data on this problem, we must consider something else. The internet has allowed small companies and ideas to explode by providing a more level playing field for small and huge companies. Large labs and senior researchers are already established; it is harder for more junior researchers to get noticed. We use the web to try to promote ourselves. Why not use the data as another way to attract attention to the work we do? In my experience, the approaches for new interactions that came from the discovery of my data and papers have been numerous. I don’t believe scoops would outnumber the positive impacts of sharing the data.

 

Another blogger that has posted objections is Orac:

One issue that was brought up that probably isn’t a huge consideration is that some datasets are too large to share easily. […] other types of data lack such public databases.

I found this interesting because it points to a problem that should not be solved by PLoS, but by scientific societies. If the data you use requires a particular infrastructure, then it is the responsibility of the researchers in the field to build it. I am facing a similar problem with gigabytes and terabytes of audio data. So far, I’ve been able to use FigShare and DataDryad, but for the bulk of my dissertation data I will probably have to host it on my own server. But this illustrates that it is time for researchers in the field to admit there is a problem and find ways to solve it. “Because it is difficult” can not be an excuse not to share data.

 

Several people were linking to Drugmonkey’s arguments. It is unfortunate that they are mixed with a hatred of the humanities and of people who want to use other people’s data. Tweets have gone even further to the hateful side by calling researchers who want to analyze someone else’s data “leeches.” Drugmonkey posted several objections to the new policy:

The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time, to address the legitimate sins of the few.

While fraud-prevention is a good reason to force publishing raw data, this is not the only reason. I’ve read plenty of papers that would have been a lot more useful if I had had the chance to repeat the analysis and learn from it. In addition, young researchers and students usually have no funds to collect lots of data and can put their ideas out there for the benefit of everyone if they can use available data.

This Data Access policy requires much additional data curation which will take time. We all handle data in the way that has proved most effective for us in our operations.

As with the post by Neuropolarbear, this is puzzling. We all understand the problems of collecting data and that it might be in a messy format. A good metadata file will take care of all these problems. Either we use standard formats for the field (wav for bioacoustics; vector and raster formats for landscape ecology; csv files can store database tables; etc.), or we had to create a particular format. Either way, documenting the way the data was collected and organized will help the authors in the future.
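To give a rough idea of how little work a basic metadata file can be, here is a minimal Python sketch. It walks a data folder and records each file's size, checksum, and, for csv tables, the column names, alongside a free-text description. The output name `metadata.json`, the field names, and the function names are my own assumptions for the example; they are not something the PLoS policy or any repository requires.

```python
import csv
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large recordings are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path: Path) -> dict:
    """Return the name, size in bytes, and checksum of one data file."""
    entry = {
        "file": path.name,
        "bytes": path.stat().st_size,
        "sha256": sha256_of(path),
    }
    # For csv tables, also record the column names from the header row.
    if path.suffix.lower() == ".csv":
        with path.open(newline="") as handle:
            entry["columns"] = next(csv.reader(handle), [])
    return entry


def write_metadata(data_dir: Path, description: str, collection_notes: str) -> Path:
    """Write metadata.json (a hypothetical name) next to the data files."""
    out = data_dir / "metadata.json"
    files = sorted(
        p for p in data_dir.iterdir() if p.is_file() and p.name != out.name
    )
    metadata = {
        "description": description,
        "collection_notes": collection_notes,
        "files": [describe_file(p) for p in files],
    }
    out.write_text(json.dumps(metadata, indent=2))
    return out


# Example call (the folder and the texts are placeholders):
# write_metadata(Path("recordings_2013"),
#                "Automated recordings from three sites",
#                "One minute recorded every ten minutes; see the paper's methods.")
```

The free-text fields are where the real documentation goes: how the data was collected, what each column means, and which software versions were used. The script only saves the tedious part of listing the files.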

Maybe the proprietary software we use differs and the smoothest way to manipulate data is different. We use different statistical and graphing programs. Software versions change.

A metadata file will take care of most of these problems. Anyway, these are objections to using someone else’s data, not to sharing it.

Some people’s datasets are so large as to challenge the capability of regular-old, desktop computer and storage hardware.

This is equally puzzling. If I want to re-analyze your data or use it in some other way, I should be competent enough to understand the system requirements for it. This is, again, an argument about the use of the data, not about sharing it, and it presumes that other researchers will not be able to figure out these kinds of issues on their own.

This diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that.

The paper must have used a standard method to analyze the data. Some datasets will require more work than others. There is no requirement for a specific way to format the data, just that it is made available. The only exception would be particular fields that have agreed on particular formats or ways to store or collect data (e.g. GenBank).

Drugmonkey’s post suffers from an additional problem: anti-humanist arguments that have nothing to do with data sharing. In particular:

The second incident has to do with accusations of self-plagiarism based on the sorts of default Methods statements or Introduction and/or Discussion points that get repeated. Look there are only so many ways to say “and thus we prove a new facet of how the PhysioWhimple nucleus controls Bunny Hopping”. […] This is why concepts of what is “plagiarism” in science cannot be aligned with concepts of plagiarism in a bit of humanities text.

This has nothing to do with science or humanities, but with the law. A published work is owned by someone and cannot be used without attribution. It is a pain to have to find a new way to explain the same methods, but this can be fixed by proposing standard methods for the field. No one explains what t-tests, ANOVAs, or multiple regressions do. If a method is used so much, it should have a name and a standard reference that can be cited for details.

Are the standards for dropping subjects the same in every possible experiments. (answer: no) Who annotates the files so that any idiot humanities-major on the editorial staff of PLoS can understand that it is complete?

No editorial staff has ever asked me to explain a figure, table, or appendix. That is the work of the experts in the field: the editors and the reviewers. Why would this be different for data?

 

I have worked with researchers who do not want to share their data because, when they have done it in the past, it was “misused.” This is not a valid reason to withhold data either. Your paper might be misunderstood and therefore cited in the wrong context or to substantiate an idea it cannot support. Why should data be any different? The authors who reuse a dataset are responsible for using it correctly, not you. If someone publishes a paper using my data to say something it cannot support, I can contact the editor (and I have done so) to request space to publish a response or to request that the paper be retracted. Science must be open and not subjected to the same culture as industry, where money is the only thing that matters.

Another perspective we should consider is that publishing data will help train future scientists. We have all faced many problems when embarking on a new area just because we had no dataset to analyze and were unaware of the limitations or possibilities. If we share more data, students and researchers dipping their toes in a new area will have a better chance at success since they won’t have to wait months or years to collect data to see if their ideas pan out.

Perhaps this jealousy over data is not about data, but about a larger problem: how research is rewarded. It is our responsibility to make sure the work of a researcher is not reduced to an index, but we all have to be in the same boat for this to work. Research cannot be quantified the same way as factory work because the nature of the work is completely different. Let’s avoid treating research as a means to increase your citation index.

 

In closing, a new perspective paper in PLoS Biology discusses some of these issues and will be helpful in the further discussions about these problems:

Troubleshooting Public Data Archiving: Suggestions to Increase Participation (DOI: 10.1371/journal.pbio.1001779)

In particular, they say:

In our experience, however, individuals are most concerned about the loss of priority access following PDA, which could generate competition with others when conducting subsequent analyses.

Why is everyone so afraid of being scooped? Yes, it might happen, but I think the fear is exaggerated. We should work on ways to promote and provide incentives for data sharing, to show that the benefits are far larger than any damage, imagined or real, that scooping may cause.

Probably the biggest hurdle in data sharing is culture and lack of incentive. Young researchers can try to convince their senior co-authors of the benefits of sharing data by using someone else’s data to strengthen a point in the paper. Storing the data in public archives that provide DOI numbers allows it to be cited, which provides an incentive if these can be tracked as easily as paper citations.

 

  • http://smallpondscience.com/ Terry McGlynn

    Your responses to other people’s concerns do not really satisfy their concerns. You merely describe your own practices.

    Worried about being scooped? Just publish faster.

    Worried about the extra work it takes to curate your data? Suck it up, I do it.

    Think it’s a problem that someone else gains from your hard work without adequate credit? Too bad, that’s the way things should be.

    If you look at the concerns of others from their perspectives, and not yours, it might make more sense to you.

  • http://blog.coquipr.com Luis J. Villanueva

    Since they don’t post specific examples, it is possible I can’t think of what they are talking about. However, this is also based on following the discussion of the issue in journals, online, and in person. The people that argue against publishing data are exaggerating the possible problems.

    Part of the problem is a belief of jealous ownership over the data that is not given to the manuscript. We all benefit from others’ hard work by basing our ideas and research on someone else’s papers.

  • http://smallpondscience.com/ Terry McGlynn

    How can you just say that they’re exaggerating the possible problems? That’s not rational or fair argumentation. It’s just an unsupported claim. You aren’t weighing their experiences or fields against yours.

    I humbly suggest avoiding value judgments. Ownership over data doesn’t have to be “jealous.” If you think owning things is jealous, then please send me a large check. It’s not that WE all benefit from others’ hard work. Maybe you do. But I don’t. Please, stop and look at it from the perspective of others.

  • http://blog.coquipr.com Luis J. Villanueva

    Maybe you are misunderstanding me. Yes, there is a possibility of these problems. My argument is that the benefits outweigh them. Yes, this is based on my opinion which is based on the arguments and evidence I’ve seen. Researchers that have shared data for years should be able to say they have had these problems, but they are not. Where is the evidence that data curation is a hassle? Where is the evidence of “scooping”? If someone provides some evidence of this, then I can weigh it against what I’ve seen.

    I use maps, remote sensing data, biodiversity data, species descriptions, etc, from others to do my work. I would have a better time if more people shared their data, so I share mine and try to promote others to do the same. We all benefit from others’ work, that is what citations in a proposal or a paper are.

  • http://smallpondscience.com/ Terry McGlynn

    Actually, I think I understand you, but I could be wrong. You are arguing the benefits outweigh the problems. For *you.*

    This isn’t true for myself, and it isn’t true for a lot of others.

    You can’t generalize from your specific situation to the general. You need to look at many other scientists, and see how their work would benefit from open data sharing. Take a look at my lab website, check out my CV and publications, and build an argument *for me* that the science that I am currently doing would benefit more from other people’s data than I would lose from sharing my own. I don’t think you could make that argument.

    Science might be better off as a whole if all data are free, but that won’t emerge unless individual benefits are accrued.

  • EcoGrad

    “In addition, young researchers and students usually have no funds to
    collect lots of data and can put their ideas out there for the benefit
    of everyone if they can use available data.”

    Many young researchers find those funds and sacrifice their time for free, use undergrads and volunteers and go years without publishing to get those data. Why is it so hard to understand this?

    “Perhaps this jealousy over data is not about data, but a larger problem:
    how research is rewarded. It is our responsibility to make sure the
    work of a researcher is not limited to an index, but we all have to be
    in the same boat for this to work”

    But we are not all in the same boat. Some researchers are not producing any data, but they are eager to use the fruits of others’ labor, and they are very impatient to get their hands on it.

  • http://blog.coquipr.com Luis J. Villanueva

    It is not hard to understand, I just can’t agree that that is a good argument. And yes, parasites will exist in every system, but are they a minor nuisance or a major problem?

  • EcoGrad

    From whose perspective? Yeah I know, I’m all about furthering science! The data collectors are the selfish ones. These petty stamp collectors need to release their data so you can ask “big” questions right? btw you say you release your data, what form is it?

  • http://blog.coquipr.com Luis J. Villanueva

    You are arguing against something I never said and presume things about me without reason. We can agree to disagree unless there is something else to test our opinions against.

    I have been collecting data for more than a decade and have used others’ data too. Also, I recognize that I am not the only one that can get some meaning from it, others will be able to do things I would have never thought about. We should be humble enough to realize this.

    Science is advanced by many, including those that have never worked in the field. Since we are not able to do it all, no matter how long we live, should we be a hindrance to those that want to do something else?

    And yes, I have released tons of data and I’m working on releasing even more in the next few months:

    At FigShare:
    http://figshare.com/authors/Luis_J_Villanueva_Rivera/101547

    At DataDryad:
    http://dx.doi.org/10.5061/dryad.c0g2t
    http://dx.doi.org/10.5061/dryad.g4n13

    I also write software that is available from GitHub:
    https://github.com/ljvillanueva

    Let’s talk about evidence and data, not about making this personal.

  • http://blog.coquipr.com Luis J. Villanueva

    Yes, from my experience, from what I have seen in the many labs I have worked at, and from my colleagues’ experience.

    Please tell me how this isn’t true for you. I can’t make an argument for your specific case because I don’t know you or your work. Also, you are one of many. All I am asking is some concrete evidence or data about the risks because no one that has taken your position has given any. All I see is hypothetical scenarios and FUD, not evidence.

    And yes, individual benefits need to happen, but no system is perfect. This should not be a reason not to try.

  • http://smallpondscience.com/ Terry McGlynn

    You must be joking. I wrote a >4000 word explanation.

  • http://blog.coquipr.com Luis J. Villanueva

    I hadn’t connected your comments here with your blog in my mind, don’t assume everyone does. For others reading this: http://smallpondscience.com/2014/03/03/i-own-my-data-until-i-dont/

    I’ll only comment on the specifics, the rest seem to be possible consequences or fears you have:

    About the hassle of organizing data, you accept that you could have done a better job from the beginning. I understand this perfectly because it has happened to me too. However, it is a good argument only for old data. Since the policy is now in place, we can prepare to avoid this from the beginning. Plus, it is just good practice to organize the data and the metadata as we go, since we will forget many details after a few months. If we want to use that data again, it will take some time to get back to all the details of it. Furthermore, I would guess most details would have appeared in the methods section of the paper or would be standard for the field.

    Comparing scientific data with a pharmaceutical company is wrong. Most of our research is funded by the government, NGOs, and others, not by ourselves. Companies invest their own money and the protection they are given is to promote this investment, at least on paper.

  • EcoGrad

    Are these data from your own experiments, from organisms you grew or collected over years, in all weather, etc etc – the couple examples I looked at were not, but I don’t have time at the moment to evaluate all those links. I get the sense here that those who have “never worked in the field” tend to devalue that often frustrating or fruitless work. I am not against sharing, just want some priority and ability to choose collaborators after years of (privately funded) work carried out by dozens of undergrads and volunteers on several continents. What kinds of data are you collecting? Our lab already shares gene sequences and accession numbers, has done for many years-that is no problem.

  • http://blog.coquipr.com Luis J. Villanueva

    I’m not going to get into a “my data vs your data” discussion. The issue is larger than that.
