PEDEL-AA Rx statistic:
The 'Lx < 0.1 Vx_1' criterion for deciding when to use the 'Cx ~ Lx'
approximation is sometimes inaccurate, and can be refined as
follows.
First consider a single nucleotide substitution in a single codon. There
are 9 possible mutated codons. An amino acid mutation that can only
be coded by a single codon out of the 9, and that requires a
transversion, has a probability of only 1 in 15 given that the codon
mutates (assuming a transition:transversion ratio of 3, i.e. each
transition is 3 times as likely as each transversion). To see this, let
p be the probability of a given transversion, so that 3p is the
probability of a given transition. Each of the 3 codon positions admits
1 transition and 2 transversions, so the total probability of the 9
mutated codons is 6(p) + 3(3p) = 15p.
For example, if the parent codon is GGG (Gly), then the 9
single-nucleotide-substitution codons are

  codon   amino acid   relative probability
  AGG     Arg          3p
  CGG     Arg          p
  TGG     Trp          p
  GAG     Glu          3p
  GCG     Ala          p
  GTG     Val          p
  GGA     Gly          3p
  GGC     Gly          p
  GGT     Gly          p

Total probabilities given the codon mutates:

  Gly     5/15   (wild-type)
  Arg     4/15
  Glu     3/15
  Trp     1/15
  Ala     1/15
  Val     1/15

Total probabilities given the amino acid mutates:

  Arg     4/10
  Glu     3/10
  Trp     1/10
  Ala     1/10
  Val     1/10

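The GGG example can be reproduced with a short script. The following is a minimal Python sketch, not part of PEDEL-AA itself: the function names and the `ts_weight` parameter are hypothetical, and it assumes the standard genetic code and that every transition is `ts_weight` times as likely as every transversion.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def single_substitution_probs(parent, ts_weight=3):
    """Probability of each encoded amino acid, given that exactly one
    nucleotide of `parent` is substituted (hypothetical helper)."""
    weights = defaultdict(int)
    for pos, old in enumerate(parent):
        for new in BASES:
            if new == old:
                continue
            codon = parent[:pos] + new + parent[pos + 1:]
            weight = ts_weight if (old, new) in TRANSITIONS else 1
            weights[CODON_TABLE[codon]] += weight
    total = sum(weights.values())  # 15 when ts_weight is 3
    return {aa: Fraction(w, total) for aa, w in weights.items()}


def nonsynonymous_probs(parent, ts_weight=3):
    """Same probabilities, conditioned on the amino acid changing."""
    probs = single_substitution_probs(parent, ts_weight)
    wild_type = CODON_TABLE[parent]
    mutated = {aa: p for aa, p in probs.items() if aa != wild_type}
    total = sum(mutated.values())
    return {aa: p / total for aa, p in mutated.items()}


if __name__ == "__main__":
    for aa, p in sorted(nonsynonymous_probs("GGG").items()):
        print(aa, p)
```

Amino acids are reported by one-letter code (R = Arg, E = Glu, W = Trp, A = Ala, V = Val), and `Fraction` reduces values automatically, so 3/15 is shown as 1/5.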
The 'Lx < 0.1 Vx_1' criterion assumes that all of the
single-nucleotide-substitution non-synonymous amino acid substitutions
are equiprobable, i.e. 1 in 5 in the above example, and in general
the reciprocal of the 'A' factor described in
the notes on the PEDEL-AA algorithms,
where typically A ~ 5.8. In fact, the most common
single-nucleotide-substitution amino acid substitution (GGG -> Arg) is
4 times as likely as the rarest (GGG -> Trp, Ala or Val). In cases
where some nucleotide substitutions (as defined by the 4 x 4
nucleotide substitution matrix) are particularly rare, the probability
difference between the rarest and the most common
single-nucleotide-substitution amino acid substitutions at a given
site can be much greater.
The 'Lx < 0.1 Vx_1' criterion for being in the 'Cx ~ Lx' region is
intended to ensure that there are enough variants in Vx to
'absorb' all Lx sub-library members, so that (within a small error) at
most one sub-library member is equal to any given variant in Vx. In
practice, the probability of the rarest variants does not matter.
What matters for the 'Cx ~ Lx' approximation is that the
mean frequency in Lx of the most common variant is < 0.1. This
mean frequency, which we denote by
Rx, is easy to calculate for x = 0, 1, 2, ..., 20, ..., and is shown
in the PEDEL-AA output table of sub-library statistics.
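As a rough illustration of why the most common variant matters, the sketch below compares the two criteria on a toy sub-library. It assumes that the mean frequency of the most common variant can be taken as Lx multiplied by that variant's probability within the sub-library. This reading is an assumption made here for illustration, not a statement of how PEDEL-AA computes Rx. The function names and the value of Lx are hypothetical; the five probabilities are those of the GGG example.

```python
from fractions import Fraction


def equiprobable_mean_frequency(l_x, variant_probs):
    """Mean frequency per variant if all variants were equally likely (Lx / Vx)."""
    return l_x / len(variant_probs)


def most_common_mean_frequency(l_x, variant_probs):
    """Assumed form of Rx: Lx times the probability of the most common variant."""
    return l_x * max(variant_probs)


# Toy sub-library: the 5 non-synonymous variants of a single GGG codon
# (Arg, Glu, Trp, Ala, Val), with a made-up expected size Lx = 0.4.
variant_probs = [Fraction(4, 10), Fraction(3, 10),
                 Fraction(1, 10), Fraction(1, 10), Fraction(1, 10)]
l_x = Fraction(2, 5)
threshold = Fraction(1, 10)

uniform = equiprobable_mean_frequency(l_x, variant_probs)   # 2/25 = 0.08
r_x = most_common_mean_frequency(l_x, variant_probs)        # 4/25 = 0.16

if __name__ == "__main__":
    print("equiprobable criterion satisfied:", uniform < threshold)
    print("most-common-variant criterion satisfied:", r_x < threshold)
```

In this toy case the equiprobable test passes (0.08 < 0.1) while the test on the most common variant fails (0.16 > 0.1), which is the situation the Rx statistic is meant to detect.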
Using these Rx values, the 'Lx < 0.1 Vx_1' criterion would be replaced
with the criterion 'Rx < 0.1'. In practice this means that if the
table of sub-library statistics contains Rx values > 0.1 for which
the 'Cx ~ Lx' approximation has been used (i.e. x >= 3 and Lx < 0.1
Vx_1), then the corresponding Cx values may be
overestimates. A warning and HTML link are given in the table of
sub-library statistics whenever this occurs.
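The warning condition described above can be written as a simple check over rows of the sub-library statistics table. This is a sketch only: the `SubLibraryRow` type, its field names, and the example values are hypothetical and are not taken from PEDEL-AA output.

```python
from dataclasses import dataclass


@dataclass
class SubLibraryRow:
    """One row of a sub-library statistics table (hypothetical layout)."""
    x: int        # number of amino acid substitutions
    l_x: float    # Lx, expected sub-library size
    v_x1: float   # Vx_1, as used in the 'Lx < 0.1 Vx_1' criterion
    r_x: float    # Rx, mean frequency of the most common variant


def used_lx_approximation(row):
    """True if the 'Cx ~ Lx' approximation applies under the stated rule."""
    return row.x >= 3 and row.l_x < 0.1 * row.v_x1


def cx_may_be_overestimated(row, threshold=0.1):
    """True if the approximation was used although Rx exceeds the threshold."""
    return used_lx_approximation(row) and row.r_x > threshold


# Made-up rows for illustration only.
rows = [
    SubLibraryRow(x=3, l_x=5.0e4, v_x1=1.0e6, r_x=0.35),   # flagged
    SubLibraryRow(x=4, l_x=2.0e4, v_x1=1.0e8, r_x=0.002),  # approximation is fine
    SubLibraryRow(x=2, l_x=8.0e4, v_x1=1.0e4, r_x=12.0),   # approximation not used
]

flagged = [row.x for row in rows if cx_may_be_overestimated(row)]

if __name__ == "__main__":
    print("x values whose Cx may be overestimated:", flagged)
```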