Here is a brief presentation of the statistical method used in R'MES to evaluate the significance of a motif frequency in a sequence. For more details about the methodology, please refer to the tutorial available here or to the book DNA, Words and Models by Robin, Rodolphe and Schbath published by CUP in 2005 (or by BELIN in 2003 for the French version).
The following sections explain the statistical method applied to DNA sequences.
A model as reference
The key idea is to compare the observed count of the motif with the count expected given some knowledge about the sequence. To decide whether a word count is unexpected, we first need to know what to expect. This is defined by a probabilistic model, i.e. by a description of what "random" means. In practice, Markovian models are used because a Markov chain model of order m fits the observed counts of all oligonucleotides of length 1 up to (m+1) in the observed sequence. Let us denote such a model by Mm.
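As an illustration of what fitting a model Mm involves, here is a minimal Python sketch that estimates order-m transition probabilities from overlapping word counts. The function names are hypothetical and chosen for this example only; this is not the R'MES code, just the usual count-ratio estimate.

```python
from collections import Counter


def count_words(seq, k):
    """Count overlapping occurrences of every word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def transition_probabilities(seq, m):
    """Estimate P(next letter | previous m letters) by count ratios.

    Keys of the returned dict are (m+1)-letter words: the first m letters
    are the context and the last letter is the one being emitted.
    """
    longer = count_words(seq, m + 1)
    # Count each context only where it is followed by a letter,
    # so that the probabilities for a given context sum to 1.
    context = Counter()
    for word, n in longer.items():
        context[word[:-1]] += n
    return {word: n / context[word[:-1]] for word, n in longer.items()}


if __name__ == "__main__":
    probs = transition_probabilities("ACGTACGA", 1)
    for word in sorted(probs):
        print(word[:-1], "->", word[-1], probs[word])
```

With m = 0 the context is empty and the estimates reduce to the base composition of the sequence.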
Choice of the model
Choosing model Mm means taking the base, the dinucleotide, the trinucleotide, ..., the (m+1)-mer compositions of the sequence into account to determine what to expect. However, the sequence should be long enough to correctly estimate the 3 × 4^m parameters of the model (the transition probabilities). Note that a motif of size l can only be analyzed in models M0 up to M(l-2), because higher-order models would fit the motif count itself (the motif would then be expected by definition). Remember that the model determines the reference, so changing the reference may change whether a motif appears exceptional. A word can be exceptionally frequent in one model but expected in another which, for instance, takes more information on the sequence composition into account. Therefore, when claiming that an observation is statistically significant, do not forget to mention your a priori, your reference, your model.
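To make the role of the model order concrete, the sketch below computes the usual plug-in estimate of the expected count of a motif under Mm, namely the product of the counts of its (m+1)-letter subwords divided by the product of the counts of its inner m-letter subwords. The function names are hypothetical, and this is a sketch of the standard estimator rather than a description of how R'MES is implemented.

```python
def count_occurrences(seq, word):
    """Count overlapping occurrences of word in seq (empty word: sequence length)."""
    if word == "":
        return len(seq)
    return sum(1 for i in range(len(seq) - len(word) + 1)
               if seq.startswith(word, i))


def n_parameters(m):
    """Number of free transition probabilities of an order-m model on DNA."""
    return 3 * 4 ** m


def expected_count(seq, motif, m):
    """Plug-in estimate of the expected count of motif under model Mm."""
    l = len(motif)
    if not 0 <= m <= l - 2:
        raise ValueError("a motif of size l can only be analyzed in M0 to M(l-2)")
    numerator = 1.0
    for i in range(l - m):            # the (m+1)-letter subwords of the motif
        numerator *= count_occurrences(seq, motif[i:i + m + 1])
    denominator = 1.0
    for i in range(1, l - m):         # the inner m-letter subwords
        denominator *= count_occurrences(seq, motif[i:i + m])
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    seq = "ACGTACGA"
    for order in (0, 1):
        print(order, n_parameters(order), expected_count(seq, "ACG", order))
```

On this toy sequence the trinucleotide ACG occurs twice; the estimate differs between M0 and M1, which illustrates how the choice of reference changes what counts as expected.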
To evaluate the significance of the difference between observed and expected counts, we need to compute the p-value, which is the probability, under our model, of observing as many (or as few) occurrences of our motif of interest. This requires the statistical distribution of the count of a motif. Several methods exist either to calculate this p-value exactly (not tractable for long sequences) or to approximate it. Two kinds of approximations exist: a direct approximation using large deviation techniques, or an approximation of the motif count distribution. R'MES uses the latter, namely a Gaussian approximation, which is suitable for expectedly frequent motifs, or a compound Poisson approximation, which is adapted to expectedly rare motifs.
Score of exceptionality
R'MES converts the p-values into scores of exceptionality using the standard one-to-one probit transformation: for a given probability p in [0,1], the associated real-valued score is the corresponding quantile of the standard Gaussian distribution N(0,1). Therefore, exceptionally frequent motifs have large positive scores, whereas exceptionally rare motifs have large negative scores. When using the Gaussian approximation, R'MES calculates the scores directly, which is much faster. When using the compound Poisson approximation, the probabilities P(count=x) for x less than the observed count are first calculated and cumulated (some numerical problems may occur if the observed count or the expected count is too large), then converted into a score.
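The probit conversion can be sketched in a few lines of Python using the standard library. The sign convention below (upper-tail p-values for over-represented motifs, lower-tail p-values for under-represented ones) and the function names are assumptions made for this illustration; the variance passed to the Gaussian helper is taken as given and is not derived here.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian N(0,1)


def score_from_pvalue(p_value, over_represented):
    """Map a one-sided p-value to a signed score via the probit transformation.

    Assumed convention: for an over-represented motif the p-value is
    P(count >= observed); for an under-represented one it is
    P(count <= observed).
    """
    if not 0.0 < p_value < 1.0:
        raise ValueError("p_value must lie strictly between 0 and 1")
    quantile = _STD_NORMAL.inv_cdf(p_value)
    # By symmetry, inv_cdf(1 - p) == -inv_cdf(p); using the symmetric form
    # avoids computing 1 - p when p is tiny.
    return -quantile if over_represented else quantile


def gaussian_score(observed, expected, variance):
    """Standardized count: usable directly as a score when the count
    is approximately Gaussian (variance supplied by the caller)."""
    return (observed - expected) / math.sqrt(variance)


if __name__ == "__main__":
    print(score_from_pvalue(0.025, over_represented=True))
    print(score_from_pvalue(0.025, over_represented=False))
    print(gaussian_score(12, 10, 4.0))
```

Under this convention a p-value of 0.5 maps to a score of 0, and smaller p-values map to scores further from 0 in the direction of the deviation.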