Pair programming and biological curation

Manual curation of peer-reviewed biomedical articles demands intensive human effort. The curation process is highly error-prone, especially when curators have to meet mammoth targets. A recent study published in Nature Methods revealed that manually curated databases are subject to high error rates, which calls the reliability of curated databases into question (see my previous post, Manually Curated Databases- How much reliable they are?).

Biological curation is now considered a well-established practice, and several guidelines have been drafted through different community efforts to make the curator's job easier. For example, consider the MIRIAM (Minimum Information Requested In the Annotation of biochemical Models) standard, which is used for curating the mathematical models held in the BioModels Database and the CellML repository. These guidelines are very abstract and do not provide a clear road map for either the curation workflow or project management. Over time, organizations have designed and developed excellent curation tools that can apply effective version control across vertical as well as horizontal curation workflows. The following figure gives a general idea of a conventional curation workflow, designed with quality in mind to prevent the inclusion of incorrect or imprecise information.
[Figure: outline of a simple biological curation workflow]

The conventional approach to biological curation engages a solo biocurator at each stage of the curation workflow, and that is where the whole problem resides. Even after several checkpoints in the curation process, errors are still unavoidable, particularly those that arise from the inexperience of the biocurators. Learning from pair programming, where all significant programming is done in pairs, leading to an accelerated development process with improved quality, one can improve both the quality and the speed of the curation process. Pair programming is one of the practices of extreme programming; it involves two developers (usually one experienced and the other less skilled) participating in a single development effort at one workstation. The basic idea is that working in a pair is more productive than two people working separately. A pair-programming-based approach to biological curation could be highly productive and less error-prone, contributing to quick knowledge transfer and skill development. Ideally, one curator makes the entries while the other (whom we can call the observer or reviewer) captures details from the literature; at the same time, they can discuss difficult problems. Although it will initially take time to get used to pair curation, in the long run it will be beneficial. Organizations like GVK Biosciences are already looking at this kind of alternative approach to make their databases cleaner and more reliable, at reduced project cost.

13 Responses to “Pair programming and biological curation”
  1. 02.18.2009

    Pair programming and biological curation: Manual curation of peer reviewed biomedical article.. http://tinyurl.com/bekhlp

  2. 02.18.2009

    “Basic idea is that working in pair is counter productive compared to the situation where two people are working separately.”

    Umm, don’t you mean the opposite of that?

    Another technique that I’m surprised isn’t used more is measuring inter-annotator agreement. I used to work in natural-language processing which relies on high-quality manually-annotated corpora for training data. So whenever you publish a new data set, people ask what the inter-annotator agreement was (e.g. using the kappa statistic, which measures agreement over and above what you’d expect from chance alone).

    But this seems much rarer with biological data sets, and even with medical data sets where people’s health or lives may be affected by annotator errors! (I’ve also worked in medical informatics and seen annotations that were out by a factor of ten… Scary)
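
The kappa statistic mentioned in this comment can be illustrated with a minimal Python sketch of Cohen's kappa for two curators who labelled the same entries. The function name `cohen_kappa` and the "interaction" / "no_interaction" labels are hypothetical, chosen only for illustration; they do not come from any actual curation pipeline.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("need at least one item")

    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # labelled independently, estimated from their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical example: 50 entries labelled by two curators.
curator_a = ["interaction"] * 25 + ["no_interaction"] * 25
curator_b = (["interaction"] * 20 + ["no_interaction"] * 5
             + ["interaction"] * 10 + ["no_interaction"] * 15)
print(round(cohen_kappa(curator_a, curator_b), 2))  # prints 0.4
```

In this example the curators agree on 70% of the entries, but half of that agreement would be expected by chance alone, so kappa is only 0.4. That gap is the reason raw percentage agreement can overstate the consistency of a curated data set.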

  3. 02.18.2009

    Hi Andrew,
    Thanks for pointing out the mistake. That was a disaster; read it as “more productive”. I agree with you that measuring inter-annotator agreement can be a good alternative, and it is rarely used, particularly in industry. Maybe the reason is that there is no standard approach: curation of protein-protein interaction (PPI) data is quite different from curation of drug or ligand databases, and we need to optimize them separately.

  4. 02.18.2009

    It is a nice idea.
    I worked as a database annotator for only a month, for a summer school, and I remember how difficult working alone was, and how much I felt the need to know how my colleagues would have annotated the same entry.

    Basically, the original pair programming technique is with two people in front of the same monitor, one dictating the code and the other one writing; but I think your variant would be better in this case.

  5. 02.18.2009

    Although it is only a concept, we can borrow a lot of pair programming (PP) rules for the curation process.

  6. 02.18.2009

    “working in pair is more productive compared to the situation where two people are working separately”

    I think this is a case of a trade-off: although you can minimize the amount of error in your curation process, you will be using two resources instead of one, when you could employ a single person with the required knowledge of the whole curation process.

  7. 02.18.2009

    This is possible, and it is a highly debated topic. In a conventional curation workflow, the curator only curates and the reviewer only makes corrections; it is suggested that if they work together, it may save time and be more efficient. Pair curation is already making its way into big curation companies, and they are trying to standardize the whole practice.

  8. 02.18.2009

    I agree the pair curation idea is a neat one, but as an industrialist I would still have preferred to use some automated procedure, either during the curation step or during the review step, because a skilled workforce costs a lot in the long run.

  9. 02.18.2009

    I think curation and annotation are a little different. In annotation, an existing information base is extended, so it can be automated easily; currently many projects use automated or semi-automated pipelines for annotation. In curation, by comparison, you need to create everything from scratch, so human intervention is unavoidable.

  10. 02.18.2009

    “Agent oriented Data curation in Bioinformatics” by Simon Miles from the University of Southampton discusses one such automated data curation approach. Despite its limitations, you might like to take a look :)

  11. 02.18.2009

    Well, I had a chance to look at the paper. It describes the agent-oriented software development process, not an automated curation process; these are different issues. Even where they discuss some format conversion, it is out of context. For example, if you really want to automate the curation process for a drug database and you are mining PDFs (as most journals use this format), you first need to find the relevant articles; then you need a text-mining tool that can read PDFs, plus tools that can interpret chemical SMILES and SMARTS from the figures themselves. Even then, such a system will not be smart enough to capture the values from tables or figures. Theoretically an agent-based approach may be feasible, but in practice it does not work at all.

  12. 02.19.2009

    Well, in theory two people working on the same problem can go twice as fast, because they have fewer doubts and will need less refinement afterwards.

    If you put only one person on every task, you are underestimating the effects of getting bored and tired over a long period, and this causes a loss of efficiency that is bigger than what you might be afraid of losing by putting two people to work as a pair.
    People can become very tired, especially when they have to carry out intellectual work and have nobody to ask when they have doubts. Pair programming is aimed at reducing this effect.

    Moreover, another rule of extreme programming is to rotate the developers, so the pairs of curators won’t be fixed, and everyone will carry out each role over time. This has various advantages, as it is true teamwork.

    I suggest you have a look at http://extremeprogramming.org/, where there are a lot of good rules like these.
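
The rotation rule described in this comment could be scheduled with a round-robin pairing (the "circle method"), so that every curator works with every other curator once before any pair repeats. This is only a sketch under stated assumptions: the function name `rotation_schedule` and the curator names are hypothetical, and deciding who enters data and who reviews within each pair is left to the pair.

```python
def rotation_schedule(curators):
    """Round-robin pairing: each curator is paired with every other once."""
    people = list(curators)
    if len(people) % 2:
        people.append(None)  # odd head count: one curator sits out per round
    n = len(people)
    rounds = []
    for _ in range(n - 1):
        pairs = []
        for i in range(n // 2):
            first, second = people[i], people[n - 1 - i]
            if first is not None and second is not None:
                pairs.append((first, second))
        rounds.append(pairs)
        # Keep the first person fixed and rotate everyone else one place.
        people = [people[0]] + [people[-1]] + people[1:-1]
    return rounds


# Hypothetical team of four curators: three rounds cover all six pairs.
for number, pairs in enumerate(rotation_schedule(["Asha", "Ben", "Chen", "Dev"]), 1):
    print(number, pairs)
```

With an odd number of curators, the one left without a partner in a given round could work solo or act as a second reviewer; the sketch simply omits them from that round.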

  13. 02.19.2009

    Thanks, that’s a really nice piece of information, particularly about the rotation policy. I will certainly have a look at the web page.