NAMEPhyloGibbs v1.2 - Gibbs motif sampler incorporating phylogeny and tracking statistics.
DESCRIPTIONphylogibbs searches for motifs in multiple DNA sequences, each of which may be prealigned with homologous sequence from related species. It uses Gibbs sampling with an anneal+track strategy that finds multiple motifs simultaneously and reports their significance. The user needs to specify the phylogenetic tree of the "proximities" between the species that occur in the input multiple alignmnents (see examples below).
This manual page describes the program's command-line options. For the internal philosophy and description of the algorithm, see the phylogibbs_algorithm(7) manpage. Below is a quick summary of the most important options, followed by more detailed descriptions of all options in alphabetical order.
Input sequences are read from a fasta file specified by the -f input_file option. Sequences that are not aligned with any other sequences may include the bases A, C, G, T, degenerate IUPAC symbols, and the symbol X (blocking sites from occurring at that position). The case of the symbols in unaligned sequences is ignored. If sequences are phylogenetically related, they must be in multi-fasta format, as output by Dialign or other multiple alignment programs; they may contain uppercase or lowercase nucleotide symbols A, C, G, T, dashes (-) to indicate gaps, and 'X' or 'x' to indicate a blocked site. Vertically aligned uppercase letters are treated as aligned and scored taking their phylogenetic relationship into account. Lowercase letters are considered unaligned with other sequences.
To run on multiple sets of multiple alignments, e.g. multiple alignments of orthologous regulatory regions from the same set of species, the different alignments are to be separated by using a double tick ">>" instead of a single tick ">" on the header of the first species in each set. For example:
>>SpeciesA region 1 ACTGACataAGCAGAC... >SpeciesB region 1 AGTCAC--aATCAGAT... >>SpeciesA region 2 CATCACATA--agaat... >SpeciesB region 2 CATCGCAAAaagcata...
Here, the two region 1 sequences from species A and species B are aligned with each other, but are not aligned with the two region 2 sequences from species A and B, because of the separator ">>".
The interpretation of the input sequence is controlled by the -D option; 0 ignores phylogeny and treats all sequences as unaligned, 1 and 2 both assume phylogeny but with different degrees of strictness regarding which alignment blocks can be considered candidate locations for binding sites. The phylogenetics relationships of the species must be specified using the -L option as described below.
By default the best configuration of binding sites (obtained by simulated annealing) is written to the file "output" and the posterior probabilities of binding sites (obtained by tracking) are written to "tracked_output". These filenames can be changed via the -o output_file and -t tracking_file options. The width of the sites is the same for all motifs and is set via -m motifwidth. The number of different motifs is specified through -z nummots and the total number of sites through -y numsites. By default the total number of sites numsites remains fixed. The algorithm will distribute the numsites sites over at most nummots different motifs. For most purposes these options will suffice to provide meaningful results. Fine-tuning is provided by a host of other options described in alphabetic order below; for many of these, proper usage will require an understanding of the internal strategy, as described in phylogibbs_algorithm(7).
ESSENTIAL OPTIONSThese options must be understood in order to use Phylogibbs effectively. Advanced options are described in the next section. All options have short and long versions; use either version.
- -D, --dialign num
- num specifies the "alignment level" of the input sequence and may be 0, 1, 2. If 0, assume input sequences are not aligned: this is the default. If 1, assume they are aligned (Dialign-style aligned fasta, or multi-fasta), and treat inconsistent windows "flexibly". That is, when gaps occurs within a multi-species aligned window, separate it into smaller windows (that is, windows containing fewer species) that each contain mutually gapless subsets of species. If 2, assume aligned sequences and be "strict" about windows containing gaps: reject them (never place sites at their positions).
- -F, --bgfile filename
- File (fasta format) to use for estimating the background distribution. In typical cases, this file would consist of large amounts of intergenic sequence from the species of interest. Default: use the input file to calculate the background distribution.
- -f, --inputfile filename
- Name of the input sequence file, in fasta or aligned-fasta format. Default: none (must be supplied)
- -L,--labeltree string The supplied
string specifies the phylogenetic tree of the species from which
the input sequences derive and must be complete and correct. The tree
format that we use is the standard Newick format. However, the values
that are normally branch lengths are instead "proximities" in our
format. A proximity gives the probability that no mutation has taken
place along the branch. The name labelling each leaf of the tree
should correspond to a substring that occurs in all fasta-headers of
those input sequences that derive from this species (and that does not
occur in the fasta-headers of sequences from other species).
For example, the following phylogeny
0.85 -------- human 0.6 | ----------+ | | 0.9 | -------- chimp Ancestor + | 0.8 | -------- mouse | 0.7 | ----------+ | 0.9 -------- ratwould be written as
(the string is quoted to protect the parentheses from the shell). Note that the proximities are between successive nodes or leaves: eg, 0.9 for chimp indicates that, for a given neutrally evolving base in chimp, the probability of no substitutions in this base since the common ancestor of chimp and human is 0.9. Proximities are multiplicative: for example, if the set of aligned sequences in the input does not actually contain the sequence for human, the human-chimp node will be eliminated and chimp's proximity to the common ancestor of all species in the tree will be 0.6x0.9 = 0.54. Thus, a single tree may be used that includes species not occurring in the input data.
- -m, --motifwidth num
- Width of motifs to search for (integer). Default: 10.
- -N, --ncorrel num
- Order of the Markov model used for the background probabilities. -1: use background probabilities of 0.25 for each base. 0: use single base counts. 1 or more: condition the probability of a background base on the num bases immediately preceeding it. Default: 1. See also the -F and -P options.
- -o, --outputfile filename
- Write output of anneal phasse to filename. This file contains the maximal-a-posterior configuration of binding sites that will be used as a reference state during the tracking phase. If filename is "stdout", output is written to standard output instead. Default: "output".
- -S, --nsteps num
- Total number of steps in the tracking phase. This is the main option controlling the total running time of the algorithm. There are generally four phases to a run of the algorithm: A transient (warmup) phase, an annealing phase, a deep quench phase, and a tracking phase. By default the anneal phase is the same length as the tracking phase, the transient phase is 10% of this length and the deep quench phase is 3% of this length. The length of all these phases can be controlled independently with advanced options described below. Default: 100
- -t, --trackedoutput filename
- File to which posterior probabilities for individual sites and the estimated weight matrices are written at the end of the tracking phase (unless this phase is disabled with -X). Default: "tracked_output".
- -y, --nexpwin num.
- num is the expected total number of sites in all sequences. When no colour-change moves are used (default) this number remains fixed. When colour-change moves are used (see -c in the advanced options) the total number of sites may change and num is used as the a priori expected total number of sites. Note that if a site occurs in a block of S aligned sequences this is counted as 1 site, not S sites.
- -z --nexpcol num.
- num is the expected number of different motifs. When colour-change moves are not used this number acts as an upper bound, i.e. the algorithm will infer up to num motifs. When colour-change moves are used num is interpreted as the a piori expected number of motifs and more motifs could in principle be found.
ADVANCED OPTIONSUse these options to fine-tune the working of PhyloGibbs in various ways.
- -A , --trackfile filename
By default PhyloGibbs uses the maximum-a-posterior binding site
configuration found through simulated annealing as a reference state
for the tracking phase. With option -A the algorithm instead
uses the binding site configuration in the file filename as the
reference state. Each line in filename specifies a window,
i.e. the location of a binding site, by a set of three white-space
separated integers: the sequence number in the input file (starting
from zero), the start position of the window on that sequence
(starting from zero), and the cluster label number (starting from
one). In case of aligned sequences (-D), it is sufficient to
give the seq/start pair for one of the aligned sequences. For example,
a file containing
0 13 1 1 57 1 2 40 1 0 33 2 1 45 2 2 11 2specifies two motifs to be tracked: one with sites starting at positions 13, 57 and 40 on sequences 0, 1, 2, respectively, and the other with windows starting at 33, 45, 11 on the same sequences. Specifying -A skips the simulated anneal and deep quench phases.
- -a, --nanneal num
- Do num steps of the simulated anneal phase (excluding transient and final deep quench). Default: same as number of tracking steps (specified via -S). See also the -u, -g, -S options.
- -B, --blockedfile filename
Block all windows overlapping sequence stretches specified in
filename: each line in the file should contain, as
white-space-separated integers, a sequence number (starting from
zero), the starting position on that sequence (starting from zero),
and the number of bases starting at that position to be blocked. All
windows overlapping those bases will be ignored, i.e. not sampled by
the program. For example, a file containing
0 34 5 2 21 6specifies that bases 34 through 38 on sequence 0, and bases 21 through 26 on sequence 2, are to be blocked. Additionally, any occurrence of a base "x" or "X" in the input sequence is also regarded as a blocked site.
- -b , --beta value
- Sets the initial inverse temperature to value (a positive floating point number). 1 by default. It is unlikely that one would need to change this parameter. See also the -x option.
- -C, --rcsymmetric
- Assumes the motif weight matrices are reverse-complement-symmetric. Effectively, the second half of the weight matrix is assumed to be the reverse-complement of the first half. If the specified motif width is of odd length, the middle column is ignored. Default: turned off. This option is not extensively tested.
- -c, --ncolmoves num
- Do num "colour change moves" per step (ie, adding, removing or recolouring a window). As a special case, the value -1 means an automatic value will be picked. Default: zero. See also the -w, and -s options. Turning on colour-changing moves allows the total number of sites and motifs to change and a prior distribution over the expected number of sites and motifs is then needed. The algorithm uses a maximum entropy prior with expected numbers of sites and motifs given by the values specified through options -y and -z respectively.
- -E,--trackingcutoff float
- A cut-off (between 0 and 1) for printing sites in the tracked-output file: sites with posterior probability less than this value will not be printed. Increase this number to printi only sites with high posterior probability and lower this number for printing sites with low posterior probability. Default: 0.05.
- -g, --ndeepquench num
- Do num steps of the deep quench phase. Default: 3% of the number of simulated anneal steps, but at least two. See also the -u, -a, -S options.
- -h, --help
- Print a brief summary of options and exit.
- -i, --initfile filename
Name of file containing a configuration of binding sites to be used as
an initial state. This initial configuration will be printed to a file
called initial_output. File format is similar to the -A
option, except that a fourth column, specifying the "direction"
(strand) of the window, is required and must be 0 or 1. For example,
0 13 1 0 1 57 1 1specifies that windows starting on sequence 0, position 13 and sequence 1, position 57 should initially be given colour 1, with directions 0 and 1 respectively (ie, on opposite strands). See also -I. Per default a random initial configuration is used.
- -M --motiffile filename
- Specifies an external file with WMs that PhyloGibbs can use as informative priors for the motifs. This option can be used to find additional sites for a set of WMs, and also to specify initial guesses for the motifs to be discovered. filename contains a set of weight matrices in the same format as in the output files that PhyloGibbs produces, which is also the format used by TRANSFAC and MEME. WMs with width less then the motif width given in -m will be padded and wider WMs will have their last columns trimmed. When running PhyloGibbs with -z num, i.e. num motifs, each of the num motifs can either be a `new' motif, using an uninformative prior, or can use one of the WMs from the file for the prior. To control the relative probability of using `new' motifs versus the WMs from the file a line "NUMTFS num" in the file specifies how many additional `new' motifs are expected to exist beyond the motifs in the file.
- -P, --bgpscount value
- Weigh background probabilities with Markov order greater than zero with a "pseudocount" of single-site probabilities, with weight value (useful when the input sequence is not long enough to calculate truly reliable correlated counts but one wants to use them anyway). Default: 0.0. See also the -N and -F options.
- -q, --quiet
- Run quietly (no output to stdout; errors and warnings are still printed on stderr.)
- -R, --reverseprint
- Print starting positions of sites not from the start of the input sequences, but as a negative number from the end of input sequences (ie, the last base on the input sequence is numbered -1). For example, this may be the distance from the start of the gene.
- -r, --norevcomp
- Do not search for reverse-complement matches (ie, do not search on the "other strand"). Essential for searching for motifs in RNA. Default: Search on both strands.
- -s, --nshiftmoves num
- Number of global-shift moves per step. In a global-shift move all windows in a colour are shifted to the left or right by the same amount. Default: twice the number of colours in the initial configuration. See also -c, and -w.
- -T, --pseudocount value
- Value of the pseudocount (parameter of the Dirichlet prior) to be used in the weight-matrix integral (generally between zero and one). Default: 1.0, which corresponds to a uniform prior over the space of WMs. Use smaller values when WMs with significant information scores in each column are expected.
- -u, --ntransient num
- Do num steps of the transient equilibriation phase, before the simulated anneal, and/or before tracking. Default: 10% of the number of steps in the simulated anneal. See also the -a, -g, -S options.
- -v, --verbose
- Print verbose information while running (unlike -q which suppresses all output to stdout). Of -v and -q, the last to be specified applies.
- -W, --write-each-cycle
- Write the output and tracked-output (see -o and -t) after every step, instead of just at the end of the anneal and tracking phases respectively.
- -w, --nwinmoves num
- Number of window-switch moves per step. Default: Equal to the total number of windows in the initial configuration. See also -c, and -s.
- -X, --noautotrack
- Disable tracking: quit after the simulated-anneal and deep quench phase have finished. The default is to do a tracking phase as well.
- -x, --betaincr value
Amount by which to increase fictitious inverse temperature (beta) at
each step in the simulated-annealing phase. Must be positive.
Default: chosen automatically.
BUGSImproperly formatted input fasta files may crash the program or fail silently with mysterious results. In particular, CR-delineated files (Mac-style) will not work; convert them to Unix-style LF-delineated files. DOS/Windows-style CR-LF files will probably work but are not regularly tested.
The program is sprinkled with asserts that (if compiled with -DDEBUG) should catch inconsistencies early and terminate execution rather than fail silently. If you come across such an assert error, or other bugs, or have a feature wishlist, send email to the authors, preferably including the complete commandline, the GSL random number seed (which are both printed in the output file), and the offending input file.
SEE ALSOThe phylogibbs_algorithm(7) manpage for internal details of the algorithm and further references.
AUTHORSThe PhyloGibbs algorithm was developed during 2002-2005 by
Rahul Siddharthan <firstname.lastname@example.org>
Erik van Nimwegen <email@example.com>
Eric D. Siggia <firstname.lastname@example.org>
The code was written by Rahul Siddharthan and Erik van Nimwegen. Please send all questions and bugreports to Erik van Nimwegen <email@example.com>.