Big data beats better algorithms
Anand Rajaraman wrote a series of posts on his blog (1, 2, 3) about how more data usually beats better algorithms. His discussion focused mostly on web-scale data, and in follow-up posts he also posed some serious questions, such as “Why should we have to choose between data and algorithms? Why not have more data and better algorithms?”. After reading his posts, I decided to explore how relevant his argument is from a life science data perspective, and especially whether it holds for data that is not web-scale.

Anand suggests that a simple algorithm with additional independent data usually makes a huge difference and can outperform better-designed algorithms. Several Google services serve as proof of concept for his argument. For example, Google search succeeded by exploiting hyperlinks and anchor text, which are data sets independent of the web page text that first-generation search engines mostly relied on. Rather than giving all the credit to PageRank and other algorithms, it is better to acknowledge the value of these additional independent data sets.

Adding more of the same data may or may not make a difference; that normally depends on data density, distribution and dimensionality. More of the same data helps if the data is very sparse or scattered across a high-dimensional space; otherwise it leads to diminishing returns. Either way, it is the data that decides the outcome. Excess data that is noisy, contradictory and overlapping may ruin the performance of algorithms: in trying to point at everything, it points at nothing and leads to more confusion. That is exactly the case with life science data.
If you are interested in identifying high-confidence protein-protein interaction (PPI) data with your fancy computational algorithm, then, unlike web-scale interactions (hyperlinks), there is no guarantee that these PPIs are reliable or valid. Most PPI data sources are highly error-prone and possibly of lower quality than one would expect. Fortunately, a variety of interaction data sources are available, such as PSIMAP, STRING, InterDom and iPFAM, and these databases contain mutually complementary as well as redundant information. Analyzing PPIs in each of these databases independently gives a different picture from analyzing all of them together after a careful selection of independent data sources; unsurprisingly, algorithms perform better when run on complementary data sets.

When it comes to choosing between data and algorithm, my view is that deciding what data to analyze is more important than choosing what algorithm to use. The reason is clear: given a huge data set, there will be several algorithms that can reach the same level of performance. The other issue is scalability, which depends on the structure and computational complexity of an algorithm. Not all algorithms scale to large data sets; for example, “simple linear algorithms scale to large data sets (hundreds of gigabytes to terabytes), while quadratic and cubic algorithms cannot scale”. In summary, the key is to use as many high-quality data sources as are available rather than sharpening the power of your algorithm.
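
To make the idea of combining several interaction sources concrete, here is a minimal sketch in Python. It is only an illustration, not the method used by any of the databases named above: the source names, protein identifiers and the "supported by at least two sources" threshold are all made-up assumptions. It also assumes the sources are genuinely independent, since redundant sources would inflate the support counts.

```python
from collections import defaultdict


def consensus_interactions(sources, min_sources=2):
    """Return interactions reported by at least `min_sources` distinct sources.

    `sources` maps a (hypothetical) source name to a list of protein pairs.
    """
    support = defaultdict(set)
    for name, pairs in sources.items():
        for a, b in pairs:
            # Interactions are undirected, so (A, B) and (B, A) are the same pair.
            support[tuple(sorted((a, b)))].add(name)
    return {pair: names for pair, names in support.items()
            if len(names) >= min_sources}


# Toy, made-up data; real databases would need identifier mapping first.
toy_sources = {
    "source_A": [("P1", "P2"), ("P2", "P3"), ("P4", "P5")],
    "source_B": [("P2", "P1"), ("P3", "P4")],
    "source_C": [("P1", "P2"), ("P3", "P2"), ("P6", "P7")],
}

high_confidence = consensus_interactions(toy_sources, min_sources=2)
for pair, names in sorted(high_confidence.items()):
    print(pair, sorted(names))
```

The algorithm here is deliberately trivial: a single pass over all records, so it scales linearly with the amount of data. Whatever confidence it provides comes from the additional independent sources.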



















IMO that’s too simplistic a view. More data definitely helps and is critical, but good theoretical representations are essential. Let’s put it this way: given the same data, the better algo will prevail.
I guess same data and independent data are two different things. Big data with independent blocks does make sense.
Good data models are a primary requirement, but they should not be over-engineered. @Deepak, there is nothing wrong with having simplistic views if they provide cleaner and more elegant explanations.
Madhu, I don’t disagree at all. But in the end science is not statistics. You have to have a hypothesis that you can represent mathematically. That is not possible for everything yet, but with more data you get the opportunity to test out hypotheses and build more representative models. I agree with Abhishek about independent blocks, since that gives you better coverage to build appropriate representations to test and then use to answer questions.