Finding Exceptional Motifs in Sequences
Version 3
Recherche de Mots Exceptionnels dans une Séquence
Mark Hoebeke and Sophie Schbath
Aim
The main question R'MES addresses is: "does this motif occur in that biological sequence with an expected frequency?" In other words, could it be observed so many times, or so few times, just by chance? When the answer is no, the motif is a candidate for having a particular biological meaning, but only a candidate: statistical significance is not equivalent to biological significance.
A brief presentation of the statistical method used in R'MES to evaluate the significance of a motif frequency in a sequence can be found in the user guide. For more details about the methodology, please refer to the tutorial (pdf) or to the book DNA, Words and Models by Robin, Rodolphe and Schbath, published by CUP in 2005 (or by BELIN in 2003 for the French version).
System Requirements
R'MES comes as a source distribution only, and needs to be compiled before use.
R'MES is written in C and C++. Our distribution was specifically designed to be compiled with the GNU GCC compiler. It has been tested on a variety of Unix platforms (Linux, Solaris, MacOS X).
Getting and Installing R'MES
R'MES is a free software package available under the GNU General Public License. It can be downloaded from https://forgemia.inra.fr/sophie.schbath/rmes.
R'MES has a companion tool, RMESPlot, which is available at https://forgemia.inra.fr/sophie.schbath/rmesplot and provides a graphical user interface for visualizing the results generated by R'MES. It comes with its own user guide.
The installation procedure follows the GNU package distribution standards. After downloading rmes-<version>.tar.gz (where <version> stands for the version number), perform the following steps to install R'MES in the default location (usually /usr/local):
tar zxvf rmes-<version>.tar.gz
cd rmes-<version>
./configure
make
make check
make install
For more details, refer to the user guide or to the INSTALL file included in the source distribution.
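If you cannot write to /usr/local, the standard `--prefix` option of GNU configure scripts lets you install elsewhere. The sketch below assumes R'MES' configure script honours this option, as packages following the GNU standards normally do; the directory `$HOME/rmes` is only an example location.

```shell
# Assumed install location (hypothetical); any writable directory works.
prefix="$HOME/rmes"

# Run this from inside the unpacked rmes-<version> directory.
if [ -x ./configure ]; then
    ./configure --prefix="$prefix" && make && make check && make install
    # With the usual GNU layout, executables end up under $prefix/bin.
    PATH="$prefix/bin:$PATH"
    export PATH
else
    echo "No ./configure here: change to the unpacked source directory first." >&2
fi
```

Adding the `PATH` line to your shell startup file makes the `rmes` command available in later sessions.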
Running R'MES
For a complete description of all the possibilities offered by R'MES, please refer to the user guide. It starts with the most basic use case of R'MES (calculating the exceptionality scores of all the words of a given length in a given sequence under a given Markov model) and then describes the other possible cases with their associated options (using degenerate words, analyzing coding DNA sequences, using customized alphabets, finding exceptionally skewed motifs and studying clumps of motifs).
R'MES is run from the command line, with a command of the following form:
rmes [options] -s <filename> -o <string>
where
- -s <filename>, --seq <filename>
- sets the sequence file in FASTA or GenBank format,
- -o <string>, --out <string>
- specifies the prefix for output files.
The full list of options can be displayed by typing:
rmes --help
and are described below.
One option is required: the one specifying the approximation of the word count distribution used to evaluate the p-value. It must take one of the following values:
- --gauss
- Use the Gaussian approximation,
- --poisson
- Use the Poisson approximation for the number of clumps,
- --compoundpoisson
- Use the compound Poisson approximation,
- --skew
- Use the Gaussian method and compute the additional scores for the skew.
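As an illustration, a run with the Gaussian approximation might look like the sketch below. The file name `genome.fasta` and the output prefix `genome_l6` are placeholders, and the word length (6) and Markov order (4) are arbitrary choices; `-l` and `-m` are taken from the list of optional settings in this document. The command is printed first and only executed if `rmes` is installed and the sequence file exists.

```shell
# Placeholder names (assumptions): replace with your own sequence file and prefix.
seq_file="genome.fasta"
out_prefix="genome_l6"

# Gaussian approximation, words of length 6, Markov model of order 4.
cmd="rmes --gauss -l 6 -m 4 -s $seq_file -o $out_prefix"
echo "$cmd"

# Execute only when rmes is on the PATH and the input file is present.
# $cmd is left unquoted on purpose so the shell splits it into arguments
# (this relies on the placeholder names containing no spaces).
if command -v rmes >/dev/null 2>&1 && [ -f "$seq_file" ]; then
    $cmd
fi
```

Replacing `--gauss` with `--compoundpoisson`, `--poisson` or `--skew` selects one of the other approximations listed above.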
The following options are optional:
- -l <int> or --length <int>
- (value required) length of the analyzed words,
- -i <int> or --lmin <int>
- (value required) length of the smallest analyzed words,
- -a <int> or --lmax <int>
- (value required) length of the largest analyzed words,
- -m <int> or --markov_order <int>
- (value required) order of the Markov model,
- --max
- Use the maximal Markov order with respect to the word length,
- -f <filename> or --fam <filename>
- (value required) set the family file (see the user guide for the file format),
- --phases <int>
- (value required) number of phases.
- --dna
- Use nucleotide alphabet
- --aa
- Use amino acid alphabet
- --alphabet <character string>
- (value required) Specify a string to be used as alphabet for the sequences
- -z or --compress
- Compress output files.
- -v or --version
- Displays version information and exits.
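The sketch below combines some of these options into two example command lines. It assumes the options can be combined in the way their descriptions suggest, which should be checked against the user guide; the file names and output prefixes are placeholders. The commands are printed, and only executed if `rmes` is installed and both input files exist.

```shell
# Placeholder file names (assumptions); substitute your own.
dna_file="genome.fasta"
prot_file="proteins.fasta"

# Words of length 3 to 6, maximal Markov order for each length, compressed output.
cmd_range="rmes --gauss -i 3 -a 6 --max -z -s $dna_file -o genome_l3-6"

# Amino acid alphabet, words of length 3, Markov model of order 1.
cmd_aa="rmes --gauss --aa -l 3 -m 1 -s $prot_file -o proteins_l3"

for cmd in "$cmd_range" "$cmd_aa"; do
    echo "$cmd"
    # $cmd is unquoted so that it is split into arguments (no spaces in names).
    if command -v rmes >/dev/null 2>&1 && [ -f "$dna_file" ] && [ -f "$prot_file" ]; then
        $cmd
    fi
done
```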
Utilities
Three utilities are provided in the R'MES package:
- rmes.format displays the results contained in an output file generated by the rmes command. It produces a table with the motifs sorted according to their exceptionality scores (see the usage information).
- rmes.gfam generates family files for families of degenerate DNA motifs, that is, motifs that can be written with the bases a, c, g, t and n (see the usage information).
- rmes.composition reports the length of a sequence and its composition (see the usage information).
Citing R'MES
If you have used R'MES or want to refer to it, please cite the following reference:
Schbath, S. and Hoebeke, M. (2011). R'MES: a tool to find motifs with a significantly unexpected frequency in biological sequences. In Advances in genomic sequence analysis and pattern discovery (L. Elnitski, O. Piontkivska, and L. Welch, eds.). Science, Engineering, and Biology Informatics, vol. 7. World Scientific.
References
Gaussian approximation:
- Prum, B., Rodolphe, F., and de Turckheim, É. (1995) Finding words with unexpected frequencies in deoxyribonucleic acid sequences. J. R. Statist. Soc. B 57, 205-220.
- Schbath, S., Prum, B., and de Turckheim, É. (1995) Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J. Comp. Biol. 2, 417-437.
- Schbath, S. (1997) An efficient statistic to detect over- and under-represented words in DNA sequences. J. Comp. Biol. 4, 189-192.
Compound Poisson approximation:
- Schbath, S. (1995) Compound Poisson approximation of word counts in DNA sequences. ESAIM: Probability and Statistics 1, 1-16.
- Roquain, E. and Schbath, S. (2007) Improved compound Poisson approximation for the number of occurrences of multiple words in a stationary Markov chain. To appear in Adv. Appl. Prob.