Next: Mutation model Up: Algorithms Previous: Overview   Contents

Notes

Knowing the correct reading-frame is essential to mlrgd's statistical mutation model. When estimating the probability that a given nucleotide in a CDS will mutate, mlrgd needs to know which codon it is a member of in both sequences. In a given pairwise comparison $S_1$-$S_2$, all the reading-frames are defined by the CDS file of either $S_1$ (refpos = 0) or the reference sequence (refpos = 1). If refpos = 0, every coding nucleotide in $S_1$ is assigned a codon position (1st, 2nd or 3rd) and read-direction (forward or reverse), and the nucleotides in $S_2$ are identified purely on the basis of the nucleotides that they align with in $S_1$. If refpos = 1, the nucleotide identification in both $S_1$ and $S_2$ comes from their alignment with the reference sequence.

To maintain the correct reading-frame in the different sequences, gaps in the alignment must occur in groups of three within coding sequences; this is why code2aln is used as the alignment programme. If you have purely coding sequences you can of course use e.g. CLUSTALW on the translated (amino acid) sequences and run mlrgd on its own (e.g. with the redo_mlrgd script).
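The refpos = 0 labelling described above can be sketched roughly as follows. This is only an illustration, not mlrgd's own code: the function names `codon_positions` and `transfer_labels`, the 0-based end-exclusive CDS coordinates, and the use of '-' as the gap character are all assumptions made for the example.

```python
def codon_positions(cds_list):
    """Label each coding nucleotide of S1 with (codon position, strand).

    cds_list holds hypothetical (start, end, strand) tuples: 0-based,
    end-exclusive, strand '+' (forward) or '-' (reverse).
    A nucleotide covered by overlapping CDSs gets one label per CDS.
    """
    labels = {}
    for start, end, strand in cds_list:
        for i in range(start, end):
            if strand == '+':
                pos = (i - start) % 3 + 1      # count from the 5' end
            else:
                pos = (end - 1 - i) % 3 + 1    # reverse strand: count from the far end
            labels.setdefault(i, []).append((pos, strand))
    return labels


def transfer_labels(aln_s1, aln_s2, labels_s1):
    """Give each S2 nucleotide the labels of the S1 nucleotide it aligns with."""
    labels_s2 = {}
    i1 = i2 = 0                                # ungapped indices into S1 and S2
    for c1, c2 in zip(aln_s1, aln_s2):
        if c1 != '-' and c2 != '-' and i1 in labels_s1:
            labels_s2[i2] = labels_s1[i1]
        if c1 != '-':
            i1 += 1
        if c2 != '-':
            i2 += 1
    return labels_s2


# S2 lacks one whole codon (a gap of three), so its remaining
# nucleotides stay in frame.
s1_labels = codon_positions([(0, 9, '+')])
s2_labels = transfer_labels("ATGAAATAG", "ATG---TAG", s1_labels)
```

Because the gap in the example is a whole codon, the fourth nucleotide of $S_2$ inherits codon position 1 from the seventh nucleotide of $S_1$; a gap of one or two nucleotides would shift every downstream label.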

The sequences used for the CDS annotation (the reference sequence or the $S_1$'s, depending on refpos) mustn't have sequencing-error indels within CDSs, as these will throw mlrgd out of frame and cause global problems. However, indels are tolerated in the other sequences: mlrgd will make a local incorrect codon identification, but there will be no long-range problems. Bad alignments or local (paired) frameshifts (gaps not in threes) mean that nucleotides within the "mis-aligned" region may be assigned the wrong codon position, and hence the wrong codon, but the problems should be local. It is up to the user to check for alignment problems (see the prefix.aln and prefix.errorlog files). In non-coding regions, of course, gaps not in threes are allowed and don't cause problems.
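As a rough aid when checking an alignment by hand, gap runs that break the frame can be listed with a few lines of Python. This is a sketch under stated assumptions, not part of mlrgd: the function name `frame_breaking_gaps`, the '-' gap character, and the idea of passing in the set of coding alignment columns are all hypothetical.

```python
import re


def frame_breaking_gaps(aligned_seq, coding_columns):
    """Return (start, length) for each gap run that touches a coding
    column and whose length is not a multiple of three.

    aligned_seq    -- one row of the alignment, gaps written as '-'
    coding_columns -- set of 0-based alignment columns inside a CDS
    """
    problems = []
    for match in re.finditer(r'-+', aligned_seq):
        start, end = match.start(), match.end()
        touches_cds = any(col in coding_columns for col in range(start, end))
        if touches_cds and (end - start) % 3 != 0:
            problems.append((start, end - start))
    return problems


# Columns 0-13 are coding; the 2-nt gap is reported, the 3-nt gap and
# the gap in the non-coding tail are not.
row = "ATG--AAATAG---CC-"
bad = frame_breaking_gaps(row, set(range(0, 14)))
```

A gap run that only partly overlaps a CDS is treated as coding here, which is a deliberately cautious simplification.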

In general you should use refpos = 1 (just use the reference sequence CDS annotation) if you only have CDS annotation for one or a few of your sequences, or if you want to ensure that the CDS annotation is identical across the alignment, or if some sequences have less-than-perfect sequencing quality. However, supplying CDS files for all sequences may help code2aln (code2aln automatically finds long ORFs but may miss ORFs under 300 nt or ORFs without a start codon - e.g. at ribosomal frameshift sites, stop-codon read-through sites, or circular genomes).
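To see why short ORFs or ORFs lacking a start codon can be missed by automatic detection, consider a naive forward-strand ORF scan. This is not code2aln's actual algorithm, which is not described here; the function `find_orfs`, its ATG-to-stop rule, and the `min_len` parameter are assumptions for illustration only.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_len=300):
    """Return (start, end) of forward-strand ORFs running from an ATG
    to the next in-frame stop codon (inclusive), at least min_len nt long.
    Coordinates are 0-based, end-exclusive.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open an ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None                   # close the ORF at the stop codon
    return orfs


long_orf = "ATG" + "AAA" * 100 + "TAA"    # 306 nt: found
short_orf = "ATG" + "AAA" * 10 + "TAA"    # 36 nt: below the length cut-off
no_start = "AAA" * 100 + "TAA"            # no ATG: never opened
```

Under these assumptions only `long_orf` is reported; the other two would need explicit CDS annotation, which is the situation where supplying CDS files helps.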

For each sequence pair $S_1$-$S_2$, run_mlrgd runs the forward comparison $S_1 \rightarrow S_2$ and also the backward comparison $S_2 \rightarrow S_1$, so if, for example, there is a CDS annotated in $S_1$ but not in $S_2$, each nucleotide in the CDS will get equal weighting as a non-coding nucleotide and as a coding nucleotide in the final output plots.

The software will handle circular genomes as follows. When calculating the mutation probability for coding nucleotides at the ends of the input alignment, it will take into account codons that span the break in the circular genome. Also the running mean plots will use windows that span the break. However, the alignment programme will not reposition the breaks in the input sequences, so you must make sure that the break is in the same place in all of your input sequences.
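The wrap-around behaviour can be pictured with modular indexing. The sketch below is an assumption about how such windows and codons might be computed, not a description of mlrgd's internals; `circular_running_mean` and `circular_codon` are hypothetical names, and the window is taken to be odd and centred.

```python
def circular_running_mean(values, window):
    """Centred running mean over a circular sequence of per-site values.

    Windows near either end wrap around the break, so every site is
    averaged over exactly `window` values.
    """
    n = len(values)
    if window % 2 == 0 or not 1 <= window <= n:
        raise ValueError("window must be odd and between 1 and len(values)")
    half = window // 2
    return [
        sum(values[(i + k) % n] for k in range(-half, half + 1)) / window
        for i in range(n)
    ]


def circular_codon(seq, first_base):
    """Return the codon starting at 0-based index first_base, wrapping
    past the end of seq if the codon spans the break."""
    n = len(seq)
    return ''.join(seq[(first_base + k) % n] for k in range(3))


means = circular_running_mean([1, 2, 3, 4, 5], 3)
codon = circular_codon("TGCCCA", 5)   # last base plus the first two bases
```

Here the first window averages the values 5, 1 and 2, and the codon starting at the last base of "TGCCCA" reads "ATG". Both only make sense if the break sits at the same place in every input sequence.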


aef 2007-12-10