Big Data: Hadoop is tuned for availability not efficiency

UC Berkeley Professor Joe Hellerstein has a very interesting post on his blog about two very different big data deployments, one on Hadoop and one on Greenplum. Joe contrasts a recent Yahoo implementation on Hadoop, which sorted a petabyte using approximately 3800 nodes, with eBay's hosting of a 6.5 petabyte Greenplum database on 96 nodes. In terms of computational nodes, Yahoo used roughly 40 times as many as eBay, and whatever performance was gained this way overlooks energy and deployment costs.

Greenplum Database is a massively parallel processing (MPP) database built from a modified PostgreSQL, and interestingly it also includes an analytic extension based on MapReduce. Hadoop is an open source implementation of Google's MapReduce framework and is currently deployed at Yahoo, Facebook and on Amazon EC2.

Suppose your computational tasks run on Amazon EC2 infrastructure: which one would you prefer? With Hadoop you may end up with a higher node count, leading to higher bills. Joe argues,
how much hardware should be thrown at these problems? What’s the sweet spot between optimism and pessimism in the software fault tolerance, given the hardware/operational/energy cost to support it? So far all I hear are casual opinions — there’s science to do here

Even in science, no one has unlimited resources. Joe further suggests two directions:

  1. Predictive Snapshots for Dataflows: It sounds wise to only play the Google regurgitation game when the cost of staging to disk is worth the expected benefit of enabling restart. Can’t this be predicted reasonably well, so that the choice of pipelining or snapshotting is done judiciously?
  2. TCO metrics for Analytics hardware in modern datacenters: What is the right way to measure cost for these deployments, including energy consumption, rackspace, management, etc.

I could not agree more. Hadoop or no Hadoop, any computational implementation should be optimized according to energy-centric scalable benchmarks as well.
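
To make the node comparison concrete, here is a rough back-of-the-envelope sketch in Python using only the figures quoted in this post. The two workloads are different (a one-off sort versus a hosted database), and the decimal conversion of 1 PB = 1000 TB and all names in the snippet are my own assumptions, so treat it as an illustration rather than a benchmark.

```python
# Back-of-the-envelope comparison using the figures quoted in the post.
# Assumption: decimal units, 1 PB = 1000 TB.
# Caveat: the workloads differ (a sort run vs. a hosted database), so the
# numbers are only indicative of hardware footprint, not of performance.
PB_IN_TB = 1000

# Hypothetical labels; node counts and data sizes are taken from the post.
deployments = {
    "Yahoo (Hadoop)": {"nodes": 3800, "petabytes": 1.0},
    "eBay (Greenplum)": {"nodes": 96, "petabytes": 6.5},
}


def tb_per_node(nodes, petabytes):
    """Return the amount of data handled per node, in terabytes."""
    return petabytes * PB_IN_TB / nodes


# How many times more nodes the Hadoop deployment used.
node_ratio = (
    deployments["Yahoo (Hadoop)"]["nodes"]
    / deployments["eBay (Greenplum)"]["nodes"]
)

if __name__ == "__main__":
    print(f"Node ratio (Yahoo / eBay): {node_ratio:.1f}")
    for name, d in deployments.items():
        density = tb_per_node(d["nodes"], d["petabytes"])
        print(f"{name}: {density:.2f} TB per node")
```

A fuller comparison along the lines of Joe's second suggestion would add energy, rack space and management cost per node, none of which are given here.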

