

Introduction

A powerful technique for locating functional elements in genomes is to look for conserved columns in multiple sequence alignments (Stojanovic et al. 1999; plotcon, EMBOSS package - Rice et al. 2000; MultiPipMaker - Schwartz et al. 2003; Margulies et al. 2003; VISTA - Frazer et al. 2004). However, it is difficult to use this method to detect additional functional elements within protein-coding sequences (CDSs), since many columns in CDSs are conserved simply because of constraints on the encoded protein. It is possible to look for conserved columns at four-fold degenerate sites (some, but not all, third nucleotide positions in codons), but this discards the information in at least two thirds of the columns and is more or less impossible within overlapping genes (which are common in viruses). Conserved RNA secondary structures may be found with programmes such as alidot (Hofacker et al. 2002) and RNA-DECODER (Pedersen et al. 2004), while other features may be detected through database similarity searches. However, novel features without significant RNA secondary structure cannot be detected using these methods.
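To make the notion of a four-fold degenerate site concrete, the following minimal sketch lists the third codon positions in a single in-frame CDS at which any nucleotide encodes the same amino acid. It assumes the standard genetic code; the function name `fourfold_site_indices` is hypothetical and is not part of CDS-plotcon, which works on whole alignments rather than on one sequence.

```python
# Codon prefixes (first two bases) for which every third base encodes the
# same amino acid under the standard genetic code:
# Leu (CT), Val (GT), Ser (TC), Pro (CC), Thr (AC), Ala (GC), Arg (CG), Gly (GG).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def fourfold_site_indices(cds):
    """Return 0-based indices of four-fold degenerate third codon positions.

    `cds` is assumed to start in frame; any trailing partial codon is ignored.
    This is an illustrative helper, not a function from the package.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # drop an incomplete final codon
    sites = []
    for start in range(0, usable, 3):
        if cds[start:start + 2] in FOURFOLD_PREFIXES:
            sites.append(start + 2)
    return sites


example_sites = fourfold_site_indices("ATGGCTAAACTG")  # codons ATG GCT AAA CTG
```

In this example only the GCT (Ala) and CTG (Leu) codons contribute a site, which illustrates why restricting an analysis to such positions discards most alignment columns.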

The software package CDS-plotcon is specifically designed to search for conserved functional elements within CDSs. It uses an average model (§12.3) of the expected mutation patterns within CDSs. The model incorporates a nucleotide mutation matrix, an amino acid substitution matrix, a sequence divergence parameter $t$, a mean synonymous:nonsynonymous substitution ratio $V$ and a phylogenetic tree, and it can handle up to three overlapping CDSs in different reading frames. Using this model, the package calculates the expected number of mutations across the alignment in each column and compares it with the observed number of mutations. The results are plotted along the genome and are optionally passed through a sliding-window (clipped) mean filter (§6).
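The comparison and smoothing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the per-column score is taken to be the ratio of observed to expected mutations, "clipped" is taken to mean capping each score at an upper limit before averaging, and the window is centred and truncated at the ends of a linear genome. The definitions actually used by the package are given in the sections cited above and may differ; the function name and the example numbers are invented.

```python
def clipped_running_mean(values, window, clip_max):
    """Centred sliding-window mean after capping each value at clip_max.

    Assumption: clipping means limiting each value from above. Windows are
    truncated at the sequence ends (linear genome); a circular genome would
    wrap around instead.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    clipped = [min(v, clip_max) for v in values]
    half = window // 2
    smoothed = []
    for i in range(len(clipped)):
        lo = max(0, i - half)
        hi = min(len(clipped), i + half + 1)
        smoothed.append(sum(clipped[lo:hi]) / (hi - lo))
    return smoothed


# Invented per-column counts, for illustration only.
observed = [0, 1, 0, 5, 0]
expected = [1.0, 1.0, 2.0, 1.0, 1.0]

# Assumed score: observed / expected, so low values suggest extra conservation.
ratios = [o / e for o, e in zip(observed, expected)]
smoothed = clipped_running_mean(ratios, window=3, clip_max=2.0)
```

Capping the scores before averaging keeps a single highly variable column (the fourth one here) from dominating its window, which is one plausible motivation for a clipped filter.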

Particularly conserved regions may indicate non-coding functional elements, new coding ORFs, or more-conserved regions within proteins (e.g. motifs). The software also produces conservation plots for four-fold degenerate sites, which may be used to help distinguish between these alternatives. CDS-plotcon can also be used in conjunction with complementary programmes (e.g. RNA structure prediction programmes).

As well as running the core conservation-calculating programme, the master script run_mlrgd also aligns the input sequences, extracts CDS locations from GenBank-format files or user-supplied files, calculates a phylogenetic tree, and produces the plots. In run_mlrgd, the user may alter many parameters, including the parameters for fitting $t$ and $V$, the running-mean window sizes and clipping levels, whether the genome is circular, and the sequence range to analyse (§8).

The package is particularly useful for analysing virus genomes, where CDSs (sometimes several) that overlap conserved non-coding features are common and where many sequenced genomes with a reasonable range of divergences are often available. In general, a set of viral genomes may be downloaded in GenBank format from the NCBI website and fed straight into the package with minimal user input.


aef 2007-12-10