PhyloGibbs is an algorithm for discovering regulatory sites in a collection of DNA sequences, including multiple alignments of orthologous sequences from related organisms. Many existing approaches search either for sequence motifs that are overrepresented in the input data or for sequence segments that are more evolutionarily conserved than expected. PhyloGibbs combines these two approaches and identifies significant sequence motifs by taking both overrepresentation and conservation signals into account.
PhyloGibbs runs on arbitrary collections of multiple local sequence alignments of orthologous sequences. The algorithm searches over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to the multiple sequence alignments. These binding site configurations are scored by a Bayesian probabilistic model. For aligned sequences, this model uses an explicit description of the evolution of binding sites and of 'background' intergenic DNA that takes the phylogenetic relationships between the species in the alignment into account. The algorithm uses simulated annealing and Markov chain Monte Carlo sampling to rigorously assign posterior probabilities to all the binding sites that it reports.
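To make the scoring idea concrete, here is a minimal sketch (not PhyloGibbs's actual code) of how a single alignment column could be scored under a motif model versus a background model. It assumes a star phylogeny, in which every species descends directly from one common ancestor, and a Felsenstein-1981-style substitution model in which a base is retained along a branch with probability q (the "proximity") and otherwise redrawn from the model's base distribution. The function name `column_likelihood`, the proximities, and the example distributions are all hypothetical.

```python
BASES = "ACGT"

def column_likelihood(column, dist, proximities):
    """Likelihood of one alignment column under a star phylogeny (assumed model).

    column      -- observed bases, one per species, e.g. "AAA"
    dist        -- dict base -> probability (a weight-matrix column or background)
    proximities -- per-species probability q that the ancestral base is retained
    """
    total = 0.0
    for ancestor in BASES:  # sum over the unobserved ancestral base
        p = dist[ancestor]
        for base, q in zip(column, proximities):
            # retain the ancestral base with probability q, else redraw from dist
            p *= q * (base == ancestor) + (1.0 - q) * dist[base]
        total += p
    return total

# Hypothetical distributions: uniform background and one A-rich motif column.
background = {b: 0.25 for b in BASES}
motif_col = {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05}

ratios = {}
for q in (0.2, 0.9):  # distantly vs. closely related species
    qs = [q, q, q]
    ratios[q] = (column_likelihood("AAA", motif_col, qs)
                 / column_likelihood("AAA", background, qs))
    print(f"proximity {q}: motif/background likelihood ratio = {ratios[q]:.2f}")
```

With these hypothetical numbers, a fully conserved column gives a larger motif-to-background ratio for distantly related species (low proximity) than for closely related ones. This is the sense in which a phylogeny-aware model discounts conservation that is expected from evolutionary proximity alone.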
List of the most important features:
- The algorithm can search for an arbitrary number of sites for an arbitrary number of different regulatory motifs. The user can either specify the total number of sites and motifs that PhyloGibbs needs to search for, or supply PhyloGibbs with a guess for the total number of sites and motifs in the data.
- The algorithm rigorously takes into account the phylogenetic relationships of the species from which the input data derive. This allows PhyloGibbs to distinguish conservation that is due to the occurrence of functional sites from spurious conservation that is due to the evolutionary proximity of the species. Example phylogenetic trees for commonly used species can be downloaded from the download page.
- PhyloGibbs uses an anneal+track strategy that rigorously assigns posterior probabilities to the sites it reports. In the anneal stage, the set of binding sites with the globally maximal posterior probability is identified; in the tracking stage, the posterior probabilities of these sites are calculated.
- The program can also be used to calculate the statistical significance of a pre-specified set of putative binding sites.
- Background probabilities for nonfunctional sequences are implemented as Markov models of arbitrary order (to be specified by the user). Background models can be calibrated from externally supplied files with background sequences.
- Users can specify informative priors for the motifs by supplying an external file with weight matrices. This allows the algorithm to automatically identify new binding sites for motifs for which one or more binding sites are already known.
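As an illustration of the background model mentioned in the feature list above, here is a minimal sketch of a Markov model of user-chosen order, calibrated from a set of background sequences. It is not PhyloGibbs's implementation: the function names, the pseudocount, the example sequences, and the choice to skip the first `order` bases and any window containing a non-ACGT character are all assumptions made for illustration.

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_background(sequences, order, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from background sequences."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            window = seq[i - order:i + 1]  # context followed by the current base
            if all(c in BASES for c in window):
                counts[window[:-1]][window[-1]] += 1.0

    def prob(context, base):
        # Unseen contexts fall back to all-zero counts, i.e. a uniform estimate.
        ctx = counts.get(context, dict.fromkeys(BASES, 0.0))
        total = sum(ctx.values()) + pseudocount * len(BASES)
        return (ctx[base] + pseudocount) / total

    return prob

def log_likelihood(seq, prob, order):
    """Log-probability of seq under the model; the first `order` bases are skipped."""
    seq = seq.upper()
    total = 0.0
    for i in range(order, len(seq)):
        window = seq[i - order:i + 1]
        if all(c in BASES for c in window):
            total += math.log(prob(window[:-1], window[-1]))
    return total

# Hypothetical background: strongly alternating AT repeats.
background_seqs = ["ATATATATATATAT", "TATATATATATA"]
model = train_background(background_seqs, order=1)
print(log_likelihood("ATATAT", model, order=1))  # typical of this background
print(log_likelihood("AAAAAA", model, order=1))  # atypical, so much lower
```

A higher order captures longer-range composition (for example dinucleotide or trinucleotide biases) but needs more background sequence to estimate reliably, which is one reason to calibrate the model from a separate, larger file of background sequences.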
PhyloGibbs should be cited as:
Siddharthan R, Siggia ED, van Nimwegen E. (2005)
PhyloGibbs: A Gibbs sampling motif finder that incorporates phylogeny,
PLoS Comput Biol 1(7): e67.
Siddharthan R, van Nimwegen E, Siggia ED. (2004)
PhyloGibbs: A Gibbs sampler incorporating phylogenetic information,
in Eskin E, Workman C (eds), RECOMB 2004 Satellite Workshop on Regulatory Genomics,
LNBI 3318, (Springer-Verlag Berlin Heidelberg 2005).
NOTE: This is a preliminary report which has been largely superseded by significant changes to the code since then.