PEDEL-AA Rx statistic

The 'Lx < 0.1 Vx_1' criterion for deciding when to use the 'Cx ~ Lx' approximation is sometimes inaccurate, and can be refined as follows.

First consider a single nucleotide substitution in a single codon. There are 9 possible mutated codons (3 positions x 3 alternative nucleotides), of which 3 arise by a transition and 6 by a transversion. Assume a transition:transversion ratio of 3, i.e. if p is the probability of a given transversion, then 3p is the probability of a given transition. The total probability of the 9 mutated codons is then 6(p) + 3(3p) = 15p. An amino acid substitution that is coded by only one of the 9 codons, and that requires a transversion, therefore has a probability of only p / 15p = 1 in 15.

For example, if the parent codon is GGG (Gly), then the 9 single-nucleotide-substitution codons are

       amino  relative
codon  acid   probability
 AGG    Arg    3p
 CGG    Arg     p
 TGG    Trp     p
 GAG    Glu    3p
 GCG    Ala     p
 GTG    Val     p
 GGA    Gly    3p
 GGC    Gly     p
 GGT    Gly     p

Summing over the codons for each amino acid gives the following total probabilities:

amino  total probability        total probability
acid   given the codon mutates  given the amino acid mutates
 Gly    5/15 (wild-type)
 Arg    4/15                     4/10
 Glu    3/15                     3/10
 Trp    1/15                     1/10
 Ala    1/15                     1/10
 Val    1/15                     1/10
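
The totals for the GGG example can be reproduced with a few lines of Python. The following is only a minimal sketch and is not taken from PEDEL-AA itself: it assumes the standard genetic code and the 3:1 transition:transversion weighting used above, and the names single_substitution_totals and ts_weight are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the library uses the standard code; '*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def single_substitution_totals(parent, ts_weight=3):
    """Sum the relative weights (in units of p) of the 9 single-nucleotide
    substitutions of `parent`, grouped by the encoded amino acid."""
    totals = {}
    for pos in range(3):
        for base in BASES:
            if base == parent[pos]:
                continue
            codon = parent[:pos] + base + parent[pos + 1:]
            weight = ts_weight if (parent[pos], base) in TRANSITIONS else 1
            aa = CODON_TABLE[codon]
            totals[aa] = totals.get(aa, 0) + weight
    return totals


totals = single_substitution_totals("GGG")
all_weight = sum(totals.values())              # total weight of the 9 codons
wild_type = CODON_TABLE["GGG"]
nonsyn = {aa: w for aa, w in totals.items() if aa != wild_type}
nonsyn_weight = sum(nonsyn.values())           # weight of amino acid changes

# Fractions are printed in lowest terms (so 4/10 appears as 2/5).
for aa, w in sorted(nonsyn.items(), key=lambda kv: -kv[1]):
    print(aa, Fraction(w, all_weight), Fraction(w, nonsyn_weight))
```

Changing ts_weight, or replacing the two-level weighting with a full 4 x 4 substitution matrix, shows how quickly the spread between the most common and the rarest substitutions can grow.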

The 'Lx < 0.1 Vx_1' criterion assumes that all of the non-synonymous amino acid substitutions accessible by a single nucleotide substitution are equiprobable, i.e. 1 in 5 in the above example, and in general the reciprocal of the 'A' factor described in the notes on the PEDEL-AA algorithms, where typically A ~ 5.8. In fact, in the above example, the most common single-nucleotide-substitution amino acid substitution (GGG -> Arg) is 4 x as likely as the rarest (GGG -> Trp, Ala or Val). In cases where some nucleotide substitutions (as defined by the 4 x 4 nucleotide substitution matrix) are particularly rare, the probability difference between the rarest and the most common single-nucleotide-substitution amino acid substitutions at a given site can be much greater.

The purpose of the 'Lx < 0.1 Vx_1' criterion for being in the 'Cx ~ Lx' region is to ensure that there are enough variants in Vx to 'absorb' all Lx sub-library members, so that (within a small error) at most one sub-library member is equal to any given variant in Vx. For this purpose, the probability of the rarest variants does not matter. What matters for the 'Cx ~ Lx' approximation is that the mean frequency in Lx of the most common variant is < 0.1. This mean frequency, which we denote by Rx, is easy to calculate for x = 0, 1, 2, ..., 20, ..., and is shown in the PEDEL-AA output table of sub-library statistics.

Using these Rx values, the 'Lx < 0.1 Vx_1' criterion would be replaced with the criterion 'Rx < 0.1'. In practice this means that if the table of sub-library statistics contains Rx values > 0.1 in rows where the 'Cx ~ Lx' approximation has been used (i.e. x >= 3 and Lx < 0.1 Vx_1), then the corresponding Cx values may be overestimates. A warning and an HTML link are given in the table of sub-library statistics whenever this occurs.
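
The warning condition can be written as a simple check on one row of the sub-library statistics table. This is a hedged sketch only: the function name, argument names and the example values are hypothetical and are not taken from the PEDEL-AA code or output, and the calculation of Rx itself is not shown.

```python
# Thresholds quoted in the text above.
LX_VX1_FACTOR = 0.1   # 'Lx < 0.1 Vx_1' criterion
RX_THRESHOLD = 0.1    # 'Rx < 0.1' criterion


def cx_may_be_overestimate(x, Lx, Vx_1, Rx):
    """Return True if the 'Cx ~ Lx' approximation was used for this row
    (x >= 3 and Lx < 0.1 Vx_1) although Rx exceeds 0.1."""
    used_approximation = x >= 3 and Lx < LX_VX1_FACTOR * Vx_1
    return used_approximation and Rx > RX_THRESHOLD


# Illustrative values only, not real PEDEL-AA output.
print(cx_may_be_overestimate(x=3, Lx=100, Vx_1=10000, Rx=0.25))
print(cx_may_be_overestimate(x=3, Lx=100, Vx_1=10000, Rx=0.01))
```

The first call describes a row where the old criterion is satisfied but Rx > 0.1, which is the situation that triggers the warning; the second describes a row where both criteria agree.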
