I just found a special report from The Economist on data, “Data, data everywhere.” The report deals, in several articles, with the new trend of massive amounts of data available today. They cover mostly the business implications, but also scientific data. For example, the Large Hadron Collider will:
[G]enerate 40 terabytes every second—orders of magnitude more than can be stored or analysed. So scientists collect what they can and let the rest dissipate into the ether.
Another quote that got my attention was:
Only 5% of the information that is created is “structured”, meaning it comes in a standard format of words or numbers that can be read by computers.
This means that very little of the data available can be easily imported into other computer systems for analysis. It will become very important to make data available in a way that other computers can use it; otherwise, most of the time and cost will go into re-formatting data. It will be kinda like transferring data from paper to a computer all over again. We should make raw data available, but also in structured form. A PDF is great for humans, but it sucks when you try to extract data from it. Even something as simple as a comma-separated file would help this process a lot.
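To see why a comma-separated file is so much friendlier to machines than a PDF, here is a minimal sketch in Python. The file contents and column names are made up for illustration; the point is that a generic parser recovers every field with no format-specific guesswork.

```python
import csv
import io

# Hypothetical measurements in structured, comma-separated form
# (toy data, just for illustration).
raw = """sample,mass_kg,temperature_k
A,1.20,293.1
B,0.85,310.4
"""

# Because the format is structured, a generic CSV parser reads it
# directly into records that any program can work with.
rows = list(csv.DictReader(io.StringIO(raw)))
total_mass = sum(float(r["mass_kg"]) for r in rows)
print(round(total_mass, 2))  # 2.05
```

Had the same numbers been locked inside a PDF table, extracting them would mean scraping text and guessing at the layout; here, two lines of standard-library code do the job.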
Another evident consequence is that scientists, and most notably the next generation, will need to know how to work with large amounts of data. Programming and databases will have to become part of a scientist's education, so you had better start sooner rather than later.