Data Vendetta
03.01.2009 // Bioinformatics // abhishektiwari
Recently Allyson Lister wrote an interesting post, “Do you really mean what you think you mean?”, explaining how a single word or sentence can mean different things to different people. Here is a live example: there is an ongoing discussion on FriendFeed about “Open data is more important than open source”. I am really confused: does it mean what it was intended to convey? If I am not wrong, we are either talking about “Open Data Standards are more important than Open Source”, or we are just babbling about making proprietary data available to the public. These are different issues, and one needs to distinguish carefully between them.
If you are demanding that all proprietary data be made public because only then will your open source software work, you are probably wrong. You may argue that without data there is nothing for software to discover, but have we made the most of the data already available? The answer is “NO”. We never used what we have, yet we are demanding more. Why? How much data do you need to make a discovery? When Watson and Crick discovered the structure of DNA, how much data did they have? And with all her data, why was Rosalind Franklin not convinced earlier that DNA is a double helix? If you think you can discover something new only once you have all the data, you are probably on the wrong side: discovery is not in the data, it is in our minds. Think of how drugs were discovered when there was no bioinformatics and no chemoinformatics. We need scientific wisdom to discover things, not massive amounts of open data.
Organizations spend a lot of money to generate proprietary data that safeguards their competitive edge. Why are you convinced that they need to disclose it? No one is here for charity. Most companies have proprietary data policies, and they release data to the public only when it sufficiently overlaps with publicly available databases. For example, company xyz may have a policy of releasing its protein sequence data if it has 30% or higher sequence similarity to data available in NCBI or SwissProt. So I am not convinced that companies are making much more data available than they did in the past.
Of course, if the data was generated as part of publicly funded research, then there should be strict guidelines guaranteeing everyone access to and re-use of the data without restrictions.
Data provenance is more important than open data
In the life sciences the majority of data is contextual, but because we are often unaware of the background information attached to the data, we feel free to use it anywhere. Take a simple example: someone is creating a mathematical model of the human heart, which requires some parameters. He or she collects these values from open databases or freely available literature, which in turn collected them from somewhere else. On further investigation it turns out that the human heart model is using parameters derived from different organs of a variety of species: one parameter comes from mouse liver, another from monkey brain. Data without contextual information (better called data provenance) is a major cause of a high probability of false discovery, which is nothing short of a total disaster.
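To make the heart model example concrete, here is a minimal sketch of what recording provenance next to each value could look like. Everything in it is an assumption made for illustration: the `Parameter` fields, the parameter names and values, and the source labels are hypothetical and not taken from any real database or model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    """A model parameter that carries its own provenance (hypothetical fields)."""
    name: str
    value: float
    organism: str
    tissue: str
    source: str  # where the value was copied from


def provenance_mismatches(params, organism, tissue):
    """Return the parameters whose recorded origin differs from the model's target."""
    target = (organism.lower(), tissue.lower())
    return [p for p in params if (p.organism.lower(), p.tissue.lower()) != target]


# Illustrative values only; none of these numbers come from real measurements.
params = [
    Parameter("conductance_a", 1.2, "human", "heart", "hypothetical database A"),
    Parameter("rate_b", 0.4, "mouse", "liver", "hypothetical paper B"),
    Parameter("rate_c", 3.1, "monkey", "brain", "hypothetical paper C"),
]

mismatched = provenance_mismatches(params, "human", "heart")
for p in mismatched:
    print(f"{p.name}: measured in {p.organism} {p.tissue} ({p.source})")
```

A check like this is only possible when the organism and tissue travel with the value. Once a number has been copied without them, no amount of openness brings the context back.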
Data standards are more important than open data
Let’s get back to where we started: open data is more important than open source. Well, I will go one step further and say that data standards are ‘more important’ than open data. The big question is not whether data is open, but whether it is well structured and annotated so that one can reproduce the results and re-use the data without taking it out of context. Even when data is freely available, it is often unknown whether the results can be reproduced by independent scientists, mostly because of insufficient data annotation or missing specifications for data processing and analysis. A lack of comparable contextual data, and of standard and correct data formats, restricts the true discoveries that are possible.
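As a rough illustration of what “well structured and annotated” could mean in practice, the sketch below checks a dataset record against a checklist of required annotation fields. The field names in `REQUIRED_FIELDS` and the example record are assumptions invented for this sketch; they are not drawn from any published standard.

```python
# Hypothetical checklist; a real data standard would define its own fields.
REQUIRED_FIELDS = ("organism", "tissue", "units", "method", "format_version")


def missing_annotations(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# An openly available but poorly annotated record (illustrative only).
record = {
    "organism": "human",
    "tissue": "heart",
    "units": "",
    "method": "patch clamp",
}

missing = missing_annotations(record)
print("missing annotations:", missing)
```

The record here is fully open, yet it cannot be re-used safely, because its units and format version are unknown.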
Stress on quality rather than quantity
It is the quality and not the quantity of open data that matters; good quality open data with excellent annotation and provenance is the need of the hour. The lower the quality, the higher the chance that you will get lost in the data jungle.
I may be going a little against the flow on this topic, but I am rarely convinced by this hypocrisy about open data. Making sensational claims about open data is not going to solve any problem. Before we start advocating for open data we need to think about how it is going to be used, and we had better be prepared for the debacles that are going to arise from open data.
Abhishek, I beg to differ. Data standards are important, but they are meaningless without access to the data, especially in science. We are talking data access here, not data transport (for which standards are critical).
Incidentally, the statement which prompted the conversation came from a presentation on the semantic web, which can’t really exist without open data.
I guess you are correct; it is just like asking which came first, the chicken or the egg. Currently we have all kinds of data in the public domain; I cannot think of a single data type that is not available there, including clinical data. What we really need is to concentrate on how this data can be made useful, rather than crying for more data. Pharmaceutical companies use most of the publicly available data, so what makes us think that pharma has hidden gold standard data? Nothing. The Lipinski filter was discovered using simple data available through FDA repositories. So what data are we asking for?
My reply to bits of this blog item can be found at:
http://chem-bla-ics.blogspot.com/2009/03/open-data-versus-capatalism.html