We R Parallelizing

R is one of the most widely used programming languages for statistical computing in biology, particularly in bioinformatics, where the Bioconductor framework has been used to solve heterogeneous problems ranging from microarray data analysis to genome annotation. R is an open source project, and over the years it has evolved into a strong alternative to commercial products such as S-Plus and Matlab. Much of the focus now seems to be shifting towards parallelizing R using different approaches.
In recent years many groups in the R community have contributed packages that enable R code to run on multiprocessor or cluster platforms. Most of them fall into two categories: parallel building blocks (e.g. Rmpi, Rpvm) and task farm packages (e.g. biopara, taskPR). Using these packages requires extensive knowledge of parallel programming and significant alterations to the existing code base and algorithms. In 2008 the community introduced two major add-on modules to support parallel computing in R: R/parallel and SPRINT (Simple Parallel R INTerface).
One of the major features of R/parallel is its ability to take full advantage of multi-core processing with minimal alteration of scripts. R/parallel is implemented using threads in C++ and aims to reduce processing time on a single machine with a multi-core processor; processing speed can be increased by up to approximately N-fold, where N is the number of processor cores. Using the R/parallel add-on module, a gene expression job processing 37,685 traits from 73 individuals takes about one hour on a quad-core processor, compared to four hours when run serially. That is a roughly four-fold speedup, close to the best that four cores can deliver. A sample parallelized job using R/parallel:
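The R/parallel listing is not reproduced here. As a rough stand-in for the same idea (spreading independent loop iterations over the cores of one machine), the sketch below uses base R's `parallel` package, not R/parallel, whose own API differs. The data, the `group` covariate and the `trait_pvalue` helper are all made up for illustration.

```r
library(parallel)  # ships with base R (2.14 and later); NOT the R/parallel add-on

# Hypothetical stand-in for expression data: 200 traits (rows) x 73 individuals (columns).
set.seed(1)
expr  <- matrix(rnorm(200 * 73), nrow = 200, ncol = 73)
group <- rep(c(0, 1), length.out = 73)  # made-up two-level covariate

# Work for one trait: p-value of a two-sample t-test between the two groups.
trait_pvalue <- function(i) {
  t.test(expr[i, group == 0], expr[i, group == 1])$p.value
}

# Serial version: one trait after another.
serial <- vapply(seq_len(nrow(expr)), trait_pvalue, numeric(1))

# Parallel version: the same loop body, run in forked worker processes.
# Forking is unavailable on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_res <- unlist(mclapply(seq_len(nrow(expr)), trait_pvalue, mc.cores = n_cores))

# Both versions must agree; only the wall-clock time should differ.
stopifnot(isTRUE(all.equal(serial, par_res)))
```

The loop body is unchanged between the two versions, which is the "minimal alteration of scripts" property in miniature. This only works because each trait is processed independently of the others.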
SPRINT aims to enable the easy exploitation of High Performance Computing (HPC) systems from R without any knowledge of or background in parallel computing. SPRINT uses MPI (Message Passing Interface) rather than threading techniques and can be used on a wide range of systems, from a cluster of PCs to supercomputers. Unlike R/parallel, the SPRINT framework requires functions to be re-implemented, either by rewriting R functions from scratch with built-in parallelism or by wrapping existing R code with MPI, but in return it provides a diverse range of implementation options. An example job parallelized using SPRINT:
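The SPRINT listing is not reproduced here either. To give a feel for the distributed-memory style that MPI-based tools rely on, here is a minimal sketch that again uses base R's `parallel` package with socket workers. It is neither SPRINT nor MPI, and the data and task function are hypothetical. The point it illustrates is that workers are separate processes, so the data has to be shipped to them explicitly.

```r
library(parallel)  # base R; socket workers stand in for MPI ranks in this sketch

# Hypothetical data: 50 genes (rows) x 20 samples (columns).
set.seed(42)
expr <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)

# Start two separate R worker processes. They share no memory with the master.
cl <- makeCluster(2)

# Send the data to every worker explicitly, as a message-passing design requires.
clusterExport(cl, "expr", envir = environment())

# Each task correlates one gene against all genes and returns one row of the result.
rows <- parLapply(cl, seq_len(nrow(expr)), function(i) {
  as.vector(cor(expr[i, ], t(expr)))
})
stopCluster(cl)

# Reassemble the rows on the master and compare with the serial computation.
par_cor    <- do.call(rbind, rows)
serial_cor <- cor(t(expr))
stopifnot(isTRUE(all.equal(par_cor, serial_cor, check.attributes = FALSE)))
```

Compared with the multi-core case, the extra steps (starting workers, exporting data, collecting and reassembling results) are where the re-implementation effort goes.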
Parallelism is all about computational gain. Depending on the problem, a single multi-core desktop may deliver the same performance as a more traditional parallel environment. The MPI-based approach is scalable, but it is not easy to implement. Before deciding which option is better, identify the opportunities for parallelism in your code and measure its current performance. Once you have identified the form of parallelism (task, data, latency hiding, etc.), check whether parallelization can actually improve the performance of your code. Always remember: parallelize only if it is necessary, but be prepared for it.
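One way to do that check before rewriting anything is to time the serial and the parallelizable parts separately and apply Amdahl's law. The sketch below assumes a made-up job (`setup` and `per_item` are hypothetical placeholders that just sleep); only the `amdahl_speedup` formula is standard.

```r
# Amdahl's law: predicted speedup when a fraction p of the runtime
# can be spread over n cores and the remaining (1 - p) stays serial.
amdahl_speedup <- function(p, n) 1 / ((1 - p) + p / n)

# Hypothetical job: a serial setup step followed by a loop of independent items.
setup    <- function() { Sys.sleep(0.05); 1:20 }
per_item <- function(i) { Sys.sleep(0.01); i^2 }

# Measure each part of the existing (serial) code.
t_setup <- system.time(items <- setup())[["elapsed"]]
t_loop  <- system.time(res <- vapply(items, per_item, numeric(1)))[["elapsed"]]

# Fraction of the measured runtime that could run in parallel.
p <- t_loop / (t_setup + t_loop)

# Upper bound on the speedup for 2, 4 and 8 cores (ignores communication overhead).
cores    <- c(2, 4, 8)
speedups <- setNames(sapply(cores, function(n) amdahl_speedup(p, n)),
                     paste(cores, "cores"))
print(round(c(parallel_fraction = p, speedups), 2))
```

If the predicted speedup is small even for many cores, the serial part dominates and parallelizing the loop will not pay off. Real gains will be lower than this bound because starting workers and moving data also cost time.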

2 Responses to “We R Parallelizing”
  1. anilbioma
    01.27.2009

    The R community can learn a lot from other ongoing projects that are coding the whole project from scratch with built-in parallelism. Although it is possible to run MPI on multi-core/multi-threaded architectures, there are certain limitations (see http://blogs.sun.com/jag/entry/mpi_meets_multicore).

  2. 01.27.2009

    I agree with you. For your information, the CMISS (http://www.cmiss.org/) software has been recoded as openCMISS (http://opencmiss.wiki.sourceforge.net/) to support high-performance computing using MPI.