Big data, Big challenges


The New York Times recently published a technology article about big data. According to the article, some of the largest technology companies, such as IBM and Google, think that students at elite American universities are not being trained well enough to work on the Internet-scale problems of tomorrow. You may remember that last month the Wall Street Journal published a similar story highlighting how the data deluge has swamped science historians. For example, the Sloan Digital Sky Survey project has generated 140 terabytes of digital data cataloging 230 million celestial objects, including 930,000 galaxies, 120,000 quasars and 225,000 stars. Another project, the Large Synoptic Survey Telescope, will produce 30 terabytes of data each night (in fact, 15+ petabytes every year). Similarly, the third generation of DNA sequencers will generate many petabytes of information a year.

Sooner or later, researchers working in data-intensive scientific areas such as genomics and astronomy will find themselves overwhelmed by petabyte-scale data outputs that outstrip their ability to maintain them. Jimmy Lin, an associate professor at the University of Maryland, is quoted in the article as saying, “Science these days has basically turned into a data-management problem.”

Although we are talking about petabytes of data, so-called big data (Internet-scale or mega-scale) is, ironically, not a storage problem. The problems associated with big data are primarily problems of analysis: how to capture the predictable characteristics of this data, and how to interact with this kind of data on a regular basis. It is hard to imagine what the petabyte-scale data that Facebook and Google deal with on a daily basis looks like.
The big question is whether the person on the other side of that machine will have the wherewithal to do something interesting with an almost limitless supply of genetic information.

Well, those are not the only problems we have: most of our freshly minted university graduates are not prepared to face this kind of data deluge. Why?

For the most part, university students have used rather modest computing systems to support their studies. They are learning to collect and manipulate information on personal computers or what are known as clusters, where computer servers are cabled together to form a larger computer. But even these machines fail to churn through enough data to really challenge and train a young mind meant to ponder the mega-scale problems of tomorrow.

I guess this is something for which we cannot blame students alone. A lack of resources and of exposure to new technologies is one of the reasons. To tackle this issue, Google and IBM are now promoting Internet-scale research at places like the University of Washington and Purdue by giving students wide access to their powerful computational infrastructure. The idea is to encourage students to churn through the data with the help of open-source tools like Hadoop, which are used for processing Internet-scale data sets. Hadoop is an open-source implementation of MapReduce, a software framework introduced by Google to support distributed computing on mega-scale data sets across clusters of computers. By the start of 2008, Google was processing over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters, which gives a glimpse of Google’s Internet-scale capabilities. In a similar initiative to promote learning about cloud-based distributed computing, Amazon Web Services (AWS) is providing its on-demand infrastructure free of charge for educational purposes.
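
For readers who have not met the MapReduce model mentioned above, here is a minimal single-machine sketch of the idea in Python, using the classic word-count task. It is only an illustration, not Hadoop's or Google's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine.

    In a real cluster each document (or block of a file) would be mapped
    on a different node, and each key would be reduced on a different node.
    """
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical toy input standing in for a very large collection of documents.
docs = ["big data big challenges", "big clusters"]
counts = word_count(docs)
```

The point of the split is that map_phase looks at one document at a time and reduce_phase looks at one key at a time, so neither step needs to see the whole data set, which is what lets the same program scale out over a cluster.
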
So far we have talked about the next generation of data coming out of high-throughput technologies in different scientific disciplines, and we all agree that it will have a great impact on research infrastructure, research funding and beyond (if, and only if, it is managed properly). Furthermore, this data will need to be annotated with metadata, then archived and curated. Each of these is a mammoth task, which means the focus should be not only on one-time analysis but also on future reusability and interoperability.
In the following video, Roger Magoulas (Director of Research at O’Reilly) talks about big data in general, gives a glimpse into future technologies, and offers general advice to organizations interested in improving their proficiency in handling web-scale data.

9 Responses to “Big data, Big challenges”
  1. 10.14.2009

    I have a slide in many of my decks that says “Data Management is NOT Data Storage”

  6. 03.31.2010

    My 2 cents: creativity in science communication was never as relevant as it is now, when we are facing the problem of filter failure. Although I am not a big fan of impact factors (IF), it sounds like the visual impact of journals can be highly correlated with their IF. In terms of illustration appeal, journals like Nature, Cell, PNAS and PLoS are way ahead of their counterparts in other areas.

    This comment was originally posted on Fisheye Perspective
