For scientists studying the link between genes and disease, there’s no shortage of information. The genomes of humans and many other animals have been sequenced, and the sequences have been publicly available for several years. The challenge is making sense of the data.
A new algorithm, designed by Eric Siggia’s Rockefeller laboratory in collaboration with Erik van Nimwegen, now a professor at the University of Basel, Switzerland, and announced this month in PLoS Computational Biology, may be an important new tool for scientists seeking to extract answers from sequenced genomes.
When it comes to disease, genes tend to hoard the spotlight. In many cases, however, it is not the gene itself that causes a disease, but when it is expressed and how much. If too much or too little of a particular gene product is made, the cell may not function as it should, and so the molecules that control how much of a gene’s product is made are of great interest to scientists.
The computer program, called PhyloGibbs, which Siggia and van Nimwegen developed with former postdoc Rahul Siddharthan, builds on previous software designed to identify where these regulatory molecules bind to DNA. Like its predecessors, PhyloGibbs compares DNA from multiple species in order to identify areas in which the genetic code is statistically similar and to filter for the segments that are most likely to be of interest to scientists.
This approach, however, has been complicated by several factors. First, such inter-species conservation of DNA is not always indicative of function; in organisms that are closely related evolutionarily, some segments of the sequence may be alike simply because the sequences have not diverged sufficiently. Second, not all functional segments are conserved because many mutations either do not affect function or are compensated for by other mutations.
To address these drawbacks, Siggia and van Nimwegen’s software goes a step further. After regulatory sequences from the same genes in different organisms are aligned, the program searches for the regions that are most likely to function as regulatory sites. The algorithm takes into account that binding sites for the same factor will share significant sequence similarity, and that functional binding sites are evolutionarily constrained to retain their affinity for the regulatory molecules. The program then evaluates each region of the DNA it analyzes for the likelihood that it is a binding site, assigning each a score based on the evolutionary relationships between the different species. Both conserved sequence blocks and sequence segments unique to a single organism are considered in the search and are scored consistently. In this way, the code allows multiple sites of multiple types to be determined simultaneously, so that overlap and competition among different regulatory factors are correctly taken into account.
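The idea of scoring each stretch of aligned DNA for how likely it is to be a binding site can be illustrated with a deliberately simplified sketch. This is not the PhyloGibbs code or its actual statistical model. It assumes a made-up four-base weight matrix, a uniform background base frequency, and an ungapped alignment, and it treats every species as independent rather than weighting them by their evolutionary relationships as the real algorithm does. All names (PWM, score_alignment, best_window) are hypothetical.

```python
import math

# Hypothetical weight matrix for a 4-base binding site: each entry gives
# the assumed probability of A, C, G or T at that position in the site.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform background frequency for every base


def window_log_odds(window):
    """Log-odds of one window under the site model versus background."""
    return sum(math.log(col[base] / BACKGROUND)
               for col, base in zip(PWM, window))


def score_alignment(aligned_seqs):
    """Score every window start, summing log-odds over all species.

    Simplification: species are treated as independent observations.
    A phylogenetic method would instead account for how closely the
    species are related.
    """
    width = len(PWM)
    length = len(next(iter(aligned_seqs.values())))
    scores = []
    for start in range(length - width + 1):
        total = sum(window_log_odds(seq[start:start + width])
                    for seq in aligned_seqs.values())
        scores.append(total)
    return scores


def best_window(aligned_seqs):
    """Return (start, score) of the highest-scoring window."""
    scores = score_alignment(aligned_seqs)
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]


if __name__ == "__main__":
    # Toy ungapped alignment of the same regulatory region in three species.
    aligned = {"sp1": "TTAGCTTT", "sp2": "TTAGCTTA", "sp3": "CTAGCTGG"}
    print(best_window(aligned))
```

In this toy alignment the window starting at position 2 (counting from zero) reads AGCT in all three species, so it receives the highest score. Because each species contributes equally here, the sketch would over-reward sites shared by closely related species, which is the first complication described above and the reason a score based on evolutionary relationships is needed.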
The scientists tested their software’s output against previously analyzed regions of yeast DNA and say that their algorithm outperforms alternative methods.
The software is being made freely available to the scientific community by Siggia and his colleagues, and a Web site where computational biologists can share their PhyloGibbs outputs is also in the works.