

Output files

run_mlrgd produces the following output files.


prefix.aln

code2aln alignment.

prefix.ORF.ps
prefix.info.ral

CDSs used by code2aln.

prefix.tree

PHYLIP tree file.

prefix.pairs

Sequence pairs used by mlrgd.

prefix.run_mlrgd

Copy of the run_mlrgd script.

prefix.errorlog

Error log file.

prefix.nc.info
prefix.fc.info

Some alignment-wide statistics used in the plots. (Note that `nc' indicates a non-coding model (annotated CDSs are ignored), while `fc' indicates the full-coding model (i.e. up to triple-coding).)

prefix.nc.dat
prefix.fc.dat

Log of whole-sequence/region statistics for each pairwise comparison.

prefix.nc.log
prefix.fc.log

Log of statistics for each nt in each pairwise comparison.

prefix.nc.plot
prefix.fc.plot

Log of statistics for each nt, summed over the phylogenetic tree.

prefix.ncm.R
prefix.ncc.R
prefix.fcm.R
prefix.fcc.R

R plotting scripts.

prefix.ncm.eps
prefix.ncc.eps
prefix.fcm.eps
prefix.fcc.eps

Plots. (Note that `nc' indicates a non-coding model, `fc' indicates the full-coding model, `m' indicates running mean, `c' indicates clipped running mean.)



The files prefix.??.dat contain whole-sequence (or sequence-region) statistics for each pairwise sequence 1 - sequence 2 comparison in prefix.pairs. Columns are as follows:

  1. Best-fitting $t$ (more-or-less evolutionary time).
  2. Mean $\sum_i \left[ \log (P(N^{\mathrm{seq}_1}_i \rightarrow N^{\mathrm{seq}_2}_i)) \right]$ per nt, where $N^{\mathrm{seq}_1}_i$ and $N^{\mathrm{seq}_2}_i$ are aligned nucleotides in sequences 1 and 2.
  3. Number of nt used in the comparison.
  4. Number of nt discarded due to being gaps in one or both sequences.
  5. Number of nt discarded due to being `zero-probability' transitions according to the given substitution matrices aa.dat and codon.dat (e.g. stop $\leftrightarrow$ non-stop).
  6. Number of point mutations (out of the nt used, i.e. Col. 3).
  7. Number of neutral or synonymous point mutations (including all non-coding nt).
  8. Flag = 1 if there were problems with $t$-fitting (best fit outside the given range, or no convergence within the maximum allowed number of iterations).
  9. Number of 4-fold degenerate sites (in sequence 1) with a neutral or null mutation.
  10. Number of 4-fold degenerate sites (in sequence 1) with a neutral non-null mutation.
  11. Best-fitting $V$ (scaling between nonsynonymous and synonymous substitution acceptabilities).
  12. Flag = 1 if there were problems with $V$-fitting (best fit outside the given range, or no convergence within the maximum allowed number of iterations).
Notes:
  1. The mutation rate per nt is Col. 6 / Col. 3.
  2. Col. 3 + Col. 4 + Col. 5 equals the sequence length in alignment coordinates.
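
The column layout and the two notes above can be checked with a short script. The following is a minimal sketch, not part of the mlrgd distribution: it assumes the columns are whitespace-separated with one pairwise comparison per line, and the names PairStats, parse_dat_line, mutation_rate and alignment_length are hypothetical.

```python
from typing import NamedTuple


class PairStats(NamedTuple):
    """One line of prefix.??.dat (field names are illustrative only)."""
    t: float                      # Col. 1: best-fitting t
    mean_log_p: float             # Col. 2: mean log-probability per nt
    n_used: int                   # Col. 3: nt used in the comparison
    n_gap: int                    # Col. 4: nt discarded as gaps
    n_zero_prob: int              # Col. 5: nt discarded as zero-probability
    n_mut: int                    # Col. 6: point mutations
    n_neutral_mut: int            # Col. 7: neutral or synonymous mutations
    t_flag: int                   # Col. 8: problems with t-fitting
    n_4fold_neutral_or_null: int  # Col. 9
    n_4fold_neutral_nonnull: int  # Col. 10
    v: float                      # Col. 11: best-fitting V
    v_flag: int                   # Col. 12: problems with V-fitting


FLOAT_FIELDS = {"t", "mean_log_p", "v"}


def parse_dat_line(line: str) -> PairStats:
    # Assumption: columns are separated by whitespace.
    fields = line.split()
    if len(fields) < len(PairStats._fields):
        raise ValueError("expected at least 12 columns")
    values = [
        float(text) if name in FLOAT_FIELDS else int(float(text))
        for name, text in zip(PairStats._fields, fields)
    ]
    return PairStats(*values)


def mutation_rate(stats: PairStats) -> float:
    # Note 1: mutation rate per nt is Col. 6 / Col. 3.
    return stats.n_mut / stats.n_used if stats.n_used else 0.0


def alignment_length(stats: PairStats) -> int:
    # Note 2: Col. 3 + Col. 4 + Col. 5 is the length in alignment coordinates.
    return stats.n_used + stats.n_gap + stats.n_zero_prob
```

For a made-up line such as "0.12 -0.35 900 80 20 90 60 0 150 30 0.8 0", mutation_rate gives 90 / 900 and alignment_length gives 900 + 80 + 20.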



The files prefix.??.log contain a log of statistics for each nt in each pairwise comparison. Columns are as follows:

  1. Sequence pair number.
  2. Nucleotide number (in alignment coords).
  3. $\log (P(N^{\mathrm{seq}_1}_i \rightarrow N^{\mathrm{seq}_2}_i))$, or a positive code where no log-probability is available (a log-probability is never positive, so the codes are unambiguous): 9 $\Rightarrow$ gap in both sequences, 8 $\Rightarrow$ gap in sequence 1, 7 $\Rightarrow$ gap in sequence 2, 6 $\Rightarrow$ zero-probability transition, 5 $\Rightarrow$ gap only in reference sequence (when refpos = 1).
  4. Expected number of mutations (0 if gap in either sequence).
  5. Expected number of neutral mutations (0 if gap in either sequence).
  6. Observed number of mutations (0 if gap in either sequence).
  7. Observed number of neutral mutations (0 if gap in either sequence).
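
When reading Col. 3 of the per-nucleotide log, the special codes have to be separated from genuine log-probabilities. A minimal sketch follows, assuming the codes are written as plain numbers that parse as floats; the names GAP_CODES and classify_log_p are hypothetical.

```python
# Codes listed for Col. 3 of prefix.??.log.
GAP_CODES = {
    9: "gap in both sequences",
    8: "gap in sequence 1",
    7: "gap in sequence 2",
    6: "zero-probability transition",
    5: "gap only in reference sequence (refpos = 1)",
}


def classify_log_p(value: float):
    """Return (log_p, reason): exactly one of the two is None."""
    # A log-probability is <= 0, so any positive value must be a code.
    if value > 0:
        code = int(round(value))
        return None, GAP_CODES.get(code, f"unknown code {value}")
    return value, None
```

Lines for which the first element is None can then be skipped when summing log-probabilities.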



The files prefix.??.plot contain a log of statistics for each nt, summed over the phylogenetic tree. Running means of these data are used in the plots. Columns are as follows:

  1. Nucleotide number (in alignment coords).
  2. Expected number of mutations across phylogenetic tree.
  3. Observed number of mutations across phylogenetic tree.
  4. Number of pairs in which nt is non-coding (using CDS annotation).
  5. Number of pairs in which nt is a 1st codon position (using CDS annotation).
  6. Number of pairs in which nt is a 2nd codon position (using CDS annotation).
  7. Number of pairs in which nt is a 3rd codon position (using CDS annotation).
  8. Number of pairs in which nt is a gap (in either sequence) or a `zero-probability' transition.
  9. Standard deviation estimated from expected number of mutations.
  10. $\sum \lambda_i$, where $\lambda_i$ is the mean observed number of mutations per nucleotide for sequence pair $i$ and the sum is over those sequence pairs contributing to the score at that nt position (e.g. not gapped).
  11. Expected number of neutral mutations across phylogenetic tree.
  12. Observed number of neutral mutations across phylogenetic tree.
  13. Expected number of neutral mutations at 4-fold degenerate sites, across phylogenetic tree.
  14. Observed number of neutral mutations at 4-fold degenerate sites, across phylogenetic tree.
  15. $\sum \lambda_i$ for pairs with 4-fold degenerate neutral sites at each nt.
  16. Number of pairs with 4-fold degenerate neutral sites, at each nt.
Notes:
  1. Col. 4 + Col. 5 + Col. 6 + Col. 7 + Col. 8 = total number of sequence pairs.
  2. Columns 2, 3, 10, 11, 12, 13, 14 and 15 should be multiplied by 0.5 since forward and backward comparisons are done for each sequence pair, and multiplied by 0.5 again since each branch of the tree is crossed with two pairwise comparisons (Figure 2). Column 9 should be multiplied by $\sqrt{0.25} = 0.5$.
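
The two notes above translate into a simple per-row correction. This is a minimal sketch that assumes each line of prefix.??.plot has already been read into a list of 16 numbers (Col. 1 at index 0); the names scale_plot_row and pair_count are hypothetical.

```python
# Columns to multiply by 0.5 * 0.5 = 0.25 (Note 2), as 1-based column numbers.
QUARTER_COLS = (2, 3, 10, 11, 12, 13, 14, 15)
STDDEV_COL = 9  # multiplied by sqrt(0.25) = 0.5


def scale_plot_row(row):
    """Return a copy of one .plot row with the Note 2 factors applied."""
    out = list(row)
    for col in QUARTER_COLS:
        out[col - 1] *= 0.25
    out[STDDEV_COL - 1] *= 0.5
    return out


def pair_count(row):
    # Note 1: Col. 4 + Col. 5 + Col. 6 + Col. 7 + Col. 8 = number of pairs.
    return sum(row[3:8])
```

Comparing pair_count against the number of lines in prefix.pairs is a quick consistency check on a parsed row.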



The image files prefix.*.eps contain a variety of plots and statistics. The header lists the alignment name and model, the number of sequence pairs, the alignment length, the total number of mutations across the alignment, the mean number of mutations per column, and the mean number of mutations per column at four-fold degenerate neutral sites. Note that the initial list of sequence pairs covers each branch of the phylogenetic tree twice and, in addition, for each pair both forward (sequence 1 $\rightarrow$ sequence 2) and backward (sequence 2 $\rightarrow$ sequence 1) comparisons are made. These scores are therefore divided by four, which is why a fractional number of mutations is sometimes listed.

The ten tracks are as follows:

  1. Conservation in non-coding regions.
  2. Conservation in 1st codon positions.
  3. Conservation in 2nd codon positions.
  4. Conservation in 3rd codon positions.
  5. Conservation at non-coding and 4-fold degenerate neutral sites.
  6. Conservation for all nucleotides.
  7. Significance $p$-values for track 6 - i.e. the probability that conservation of that magnitude or greater would be observed if the null model were correct. Actually the reciprocal $p$-values are given on the $y$-axis scale - e.g. 1000 corresponds to a $p$-value of 0.001. The scores apply to the running mean scores. The standard deviations for each running mean window are calculated analytically using $\sum_{\mathrm{window}} \sum_{\mathrm{pairs}} p(1-p)$, where the $p$ are $E_k(M)$ values from Equation 5 ($\S$12.3). The $p$-values are $P(z \ge \mathrm{score}/\mathrm{stddev})$, calculated assuming a normal distribution (more-or-less OK, by the Central Limit Theorem).
  8. $\sum \lambda_i$, where $\lambda_i$ is the mean observed number of mutations per nucleotide for sequence pair $i$ and the sum is over those sequence pairs contributing to the score at each nt position. Essentially this is the mean number of mutations per alignment column.
  9. The location of CDSs annotated in refseq.fasta.orfs or refseq.gbk.orfs.
  10. The location of known features annotated in refseq.fasta.features or refseq.gbk.features (if any).

The conservation scores in tracks 1-6 are $E_k(M)-O_k(M)$ (expected $-$ observed number of mutations) scores, scaled by $\sum \lambda_i$, where $\lambda_i$ is the mean observed number of mutations per nucleotide for sequence pair $i$ and the sum is over those sequence pairs contributing to the score at that nucleotide (e.g. not including pairs with gaps at that point). This normalizes regions of the alignment where some sequences are gapped to regions of the alignment where no sequences are gapped. If no pairs contribute at some nucleotide (e.g. if only one sequence is ungapped, or if refpos = 1 and the reference sequence is gapped) then the track returns to zero. Tracks 7 and 8 may be used to assess the significance of any observed features.
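
As an illustration of the scaling just described, the following sketch computes a single conservation score. It assumes that "scaled by" means divided by $\sum \lambda_i$ and that all three inputs carry the same overall factor (for example, all taken unscaled from the same line of prefix.??.plot), so that the factor cancels. The function name is hypothetical.

```python
def conservation_score(expected, observed, lambda_sum):
    """(expected - observed) mutations, scaled by the sum of lambda_i
    over the pairs that contribute at this nucleotide."""
    if lambda_sum == 0:
        # No pairs contribute here, so the track returns to zero.
        return 0.0
    return (expected - observed) / lambda_sum
```

Positive values mean fewer mutations were observed than expected, i.e. the position is more conserved than the null model predicts.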

The scores are passed through a running mean filter with (image files prefix.??c.eps) or without (image files prefix.??m.eps) clipping. The window size and clipping thresholds are adjustable by the user (use redo_plots to redo the plots with different values; see $\S$9). Note that the window skips any gaps, so in track 1, for example, the scores at the end of one non-coding region will be windowed along with the scores at the beginning of the next non-coding region.
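
The gap-skipping behaviour of the running mean can be sketched as below. This is only an illustration, not the filter used by the plotting scripts: it assumes that clipping clamps each score to a [low, high] range before averaging, and that windows are simply truncated at the ends of the data. The names running_mean, half_window and clip are hypothetical.

```python
def running_mean(scores, half_window, clip=None):
    """Running mean over the sites present in a track.

    scores: list with a float for each site in the track and None for
            positions that do not belong to it (gaps in the track).
    half_window: the window covers up to 2 * half_window + 1 adjacent
                 non-None sites, regardless of the gaps between them.
    clip: optional (low, high) pair; assumed clamping of each score.
    """
    present = [i for i, s in enumerate(scores) if s is not None]
    values = [scores[i] for i in present]
    if clip is not None:
        low, high = clip
        values = [min(max(v, low), high) for v in values]

    smoothed = [None] * len(scores)
    for k, i in enumerate(present):
        start = max(0, k - half_window)
        stop = min(len(values), k + half_window + 1)
        smoothed[i] = sum(values[start:stop]) / (stop - start)
    return smoothed
```

Because the window is indexed over present sites only, a score at the end of one region is averaged with scores at the start of the next region, as noted above.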

Image files are produced both for the full-coding model (prefix.fc?.eps) specified by all the input CDS files, and for a non-coding model (prefix.nc?.eps) where all nucleotides are assumed to be non-coding.

A variety of not-really-worth-saving files are moved to the directory TIDYUP. Keep these if you might want to use the scripts redo_mlrgd or redo_plots.

The track for synonymous/neutral sites may look somewhat different from the other tracks: many individual bars rather than a continuous line. This is because the synonymous/neutral sites are scattered within CDSs and the track returns to zero at any gap greater than three nucleotides wide. Note that the sliding window covers 2 $\times$ window2 $+$ 1 adjacent synonymous/neutral sites rather than being a window of size 2 $\times$ window2 $+$ 1 in the alignment coordinates. You can obtain a traditional plot of conservation at neutral sites in this track by setting fitwhat = 2 or 3.

Note that a given column may be four-fold degenerate and neutral for some sequence pairs but not for others: four-fold degeneracy depends on the codon in sequence 1. Neutrality depends on the codons in both sequences being synonymous. Hence at each nucleotide, track 5 is scaled by the sum of $\lambda_i$ values just for those pairs $i$ that contribute, rather than the $\sum \lambda_i$ values in track 8. Tracks 1, 2, 3, 4 and 6 are just scaled by the $\sum \lambda_i$ values in track 8. For tracks 1, 2, 3, 4 this only makes sense if the codon position annotation is the same for all sequence pairs (e.g. if refpos = 1).

A combination of track 4 (3rd codon positions) in coding regions - except overlapping CDSs - and track 1 (non-coding positions) in non-coding regions can be useful. This is less susceptible to site-specific variation in the nonsynonymous:synonymous substitution ratios than track 6, but provides denser and more even coverage than track 5. Within track 4, CDS-plotcon provides appropriate scaling between 1-, 2-, 3- and 4-fold degenerate positions.


aef 2007-12-10