next up previous contents
Next: Substitution matrices Up: Algorithms Previous: Notes   Contents


Mutation model

In this section we describe the model for non-coding, coding, and multiply-coding regions. We broadly follow the standard approach of modelling sequence evolution as a Markov process (Goldman & Yang 1994; Hein & Stovlbaek 1995; Ewens & Grant 2001). However we do deviate a little (see below) in order to deal with potentially double-coding or triple-coding regions, without having to resort to time-consuming MCMC simulations to estimate probabilities (Pedersen & Jensen 2001). We use more or less the same approach as in Firth & Brown (2005, supplementary material).

Suppose we have an aligned pair of sequences $S_1$ and $S_2$. For each nucleotide $N_1^k$ in $S_1$ we estimate the probability that $N_1^k$ mutates to each of the nucleotides U, C, A, G, as follows. First we define $b(N_1^k \rightarrow i; t, V)$, $i = \mathrm{U, C, A, G}$, by

\begin{displaymath}
\begin{array}{lll}
b(N_1^k \rightarrow i; t, V) & =
& \mathbf{P}(N_1^k \rightarrow i; t)\\
\end{array}\end{displaymath} (1)

for non-coding regions,
\begin{displaymath}
\begin{array}{lll}
b(N_1^k \rightarrow i; t, V) & =
& \mat...
... \times V^{\mathrm{ns}}
\times \mathbf{C}(X_2)\\
\end{array}\end{displaymath} (2)

for single-coding regions, and
\begin{displaymath}
\begin{array}{lll}
b(N_1^k \rightarrow i;t, V) & = & \mathbf...
...\times V^{\mathrm{ns}}
\times \mathbf{C}(X_2')\\
\end{array}\end{displaymath} (3)

for double-coding regions, with the obvious extension for triple-coding regions. Here $X_1$ and $X_2$ are the original and final amino acids or codons in one read-frame, and $X_1'$ and $X_2'$ are the original and final amino acids or codons in the second read-frame (for double-coding regions), for the nucleotide mutation $N_1^k \rightarrow i$. The mutation is taken in the context of $S_2$, in the sense that the neighbouring nucleotides for the codons $X_2$ and $X_2'$ are taken from $S_2$, while those for $X_1$ and $X_1'$ are taken from $S_1$. Also $\mathbf{P}(t) = \exp(\mathbf{Q}t)$; $\mathbf{Q}$, $\mathbf{C}$, $\mathbf{A}$ are the nucleotide, codon and amino acid matrices described in $\S$12.4; $V^{\mathrm{ns}}$ is 1 if $X_1$ and $X_2$ ($X_1'$ and $X_2'$) are synonymous codons and $V^{\mathrm{ns}}$ is a scaling factor $V$ if $X_1$ and $X_2$ ($X_1'$ and $X_2'$) are nonsynonymous codons.

The probability that $N_1^k$ mutates to $j$, after time $t$, is then given by

\begin{displaymath}
P(N_1^k \rightarrow j; t, V) = \frac{b(N_1^k \rightarrow j; t, V)}
{\sum_{i=\mathrm{U,C,A,G}} b(N_1^k \rightarrow i; t, V)}.
\end{displaymath} (4)

Using this model, we can calculated the expected number of non-null mutations $M$ occurring in $S_1$ after a given time $t$ by

\begin{displaymath}
E(M) = \sum_{k=1}^{\mathrm{length}} E_k(M) = \sum_{k=1}^{\ma...
...{length}} \sum_{j \neq N_1^k}
P(N_1^k \rightarrow j; t, V).
\end{displaymath} (5)

Given the two sequences $S_1$ and $S_2$, we adjust $t$ to fit the expected number of mutations $E(M)$ to the observed number of mutations $O(M)$ between $S_1$ and $S_2$.

The parameter $V$ may be left at the default value of 1 (usually pretty close to the best-fit value) or fitted seperately for each pair of sequences or one value may be fitted for the whole alignment. In the latter two cases, first $t$ is determined by fitting the observed and expected number of mutations only at non-coding positions and four-fold degenerate sites where the codons in $S_1$ and $S_2$ are synonymous. Then $V$ is determined by fitting the observed and expected total number of mutations.

Any transition that is assigned a zero-probability according to the input amino acid and codon matrices (e.g. transitions between stop codons and non-stop codons in the default matrices) is ignored and written to the log file. Any transition involving gaps or ambiguous nucleotide codes is also ignored and written to the log file.

Note that the above methodology involves several approximations. Firstly, in Equation 3, we consider only a single nucleotide and the single codon in each of the primary and secondary read-frames containing that nucleotide. However in double-coding regions, adjacent codons are linked by the overlapping read-frame. Hence mutations within one codon can affect the probabilities for subsequent mutations in neighbouring codons. Secondly, as far as codon and amino acid weightings are concerned, we consider only the start and end points of the unknown mutation pathway connecting an aligned codon pair in $S_1$ and $S_2$. It seems reasonable that these simplifications are justified provided $S_1$ and $S_2$ are not too divergent (so that mutation pathways are short and inter-codon cross-talk is low in multiply-coding regions). In Firth & Brown (2005) it is shown that this model provides useful results over a wide range of circumstances.

The parameter $V$ models the selection pressure that may vary from one gene to another. In fact the selection pressure will be different for every codon and this is often modelled by site-dependent rates models (Yang et al. 2000). Using a site-dependent model is pointless for our purposes, as it will be impossible to distinguish columns that have enhanced conservation. The BLOSUM amino acid matrix that we use represents the average amino acid substitution acceptabilities over all codons. The parameter $V$ allows a global adjustment for selection pressure while still allowing particularly conserved sites to be distinguished from other sites. Such sites may be additional non-coding elements, new coding ORFs, or more-conserved regions within proteins (e.g. motifs). The track for mutations at four-fold degenerate neutral sites may be used to help distinguish between these alternatives, though it will provide information for less than one in three nucleotides within CDSs. The track for 3rd codon positions may also be used and, since it includes 2-fold degenerate sites, will provide information at more positions.


next up previous contents
Next: Substitution matrices Up: Algorithms Previous: Notes   Contents
aef 2007-12-10