The 'Lx < 0.1 Vx_1' criterion for deciding when to use the 'Cx ~ Lx' approximation is sometimes inaccurate, and can be refined as follows.

First consider a single nucleotide substitution in a single codon. There are 9 possible mutated codons: 3 reached by a transition and 6 reached by a transversion. Assume a transition:transversion ratio of 3, in the sense that each individual transition is 3 times as probable as each individual transversion. If p is the probability of a given transversion, then 3p is the probability of a given transition, and the total probability of the 9 mutated codons is 6(p) + 3(3p) = 15p. An amino acid mutation that can only be coded by a single codon out of the 9, and that requires a transversion, therefore has only a 1 in 15 probability.

For example, if the parent codon is GGG (Gly), then the 9 single-nucleotide-substitution codons are

codon    amino acid    relative probability
AGG      Arg           3p
CGG      Arg           p
TGG      Trp           p
GAG      Glu           3p
GCG      Ala           p
GTG      Val           p
GGA      Gly           3p
GGC      Gly           p
GGT      Gly           p

Total probabilities per amino acid:

amino acid         given the codon mutates    given the amino acid mutates
Gly (wild-type)    5/15
Arg                4/15                       4/10
Glu                3/15                       3/10
Trp                1/15                       1/10
Ala                1/15                       1/10
Val                1/15                       1/10

The 'Lx < 0.1 Vx_1' criterion assumes that all of the single-nucleotide-substitution non-synonymous amino acid substitutions are equiprobable - i.e. 1 in 5 in the above example, but in general represented by the reciprocal of the 'A' factor described in the notes on the PEDEL-AA algorithms, where typically A ~ 5.8; whereas, in fact, the most common single-nucleotide-substitution amino acid substitution (GGG -> Arg) is 4 times as likely as the rarest (GGG -> Trp or Ala or Val). In cases where some nucleotide substitutions (as defined by the 4 x 4 nucleotide substitution matrix) are particularly rare, the probability difference between the rarest and the most common single-nucleotide-substitution amino acid substitutions at a given site can be much greater.
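The GGG example above can be reproduced with a short script. This is only an illustrative sketch, not the PEDEL-AA implementation: it assumes the standard genetic code and a simple two-rate model in which each transition has weight 3 and each transversion has weight 1, rather than a full 4 x 4 nucleotide substitution matrix. The function names and the ts_weight parameter are made up for this example.

```python
from collections import defaultdict
from fractions import Fraction

BASES = "ACGT"
PURINES = {"A", "G"}

# Standard genetic code, built from the conventional TCAG codon ordering.
# One-letter amino acid codes; '*' marks a stop codon.
_ORDER = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: _AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(_ORDER)
    for j, b in enumerate(_ORDER)
    for k, c in enumerate(_ORDER)
}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return (a in PURINES) == (b in PURINES)


def single_substitution_probs(codon, ts_weight=3):
    """Probability of each encoded residue, given that the codon mutates
    by exactly one nucleotide substitution (assumed two-rate model)."""
    weights = defaultdict(int)
    for pos, old in enumerate(codon):
        for new in BASES:
            if new == old:
                continue
            mutant = codon[:pos] + new + codon[pos + 1:]
            weights[CODON_TABLE[mutant]] += ts_weight if is_transition(old, new) else 1
    total = sum(weights.values())  # 15 when ts_weight is 3
    return {aa: Fraction(w, total) for aa, w in weights.items()}


def nonsynonymous_probs(codon, ts_weight=3):
    """Same distribution, conditioned on the encoded residue changing."""
    probs = single_substitution_probs(codon, ts_weight)
    wild_type = CODON_TABLE[codon]
    changed = {aa: p for aa, p in probs.items() if aa != wild_type}
    total = sum(changed.values())
    return {aa: p / total for aa, p in changed.items()}


if __name__ == "__main__":
    for aa, p in sorted(nonsynonymous_probs("GGG").items(), key=lambda kv: -kv[1]):
        print(aa, p)
```

Fraction reduces its results, so 3/15 is shown as 1/5 and 4/10 as 2/5. For parent codons that can mutate to a stop codon, '*' appears as one of the outcomes; whether stop codons should be counted in this way is left open here.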

The purpose of the 'Lx < 0.1 Vx_1' criterion for being in the 'Cx ~ Lx' region is to make sure that there are enough variants in Vx to 'absorb' all Lx sub-library members, so that (within a small error) at most one sub-library member is equal to any given variant in Vx. In practice, the probability of the rarest variants does not matter. What matters for the 'Cx ~ Lx' approximation is that the mean frequency in Lx of the most common variant is < 0.1. This mean frequency, which we denote by Rx, is easy to calculate for x = 0, 1, 2, ..., 20, ..., and is shown in the PEDEL-AA output table of sub-library statistics.

Using these Rx values, the 'Lx < 0.1 Vx_1' criterion would be replaced with the criterion 'Rx < 0.1'. In practice this means that if the table of sub-library statistics contains Rx values > 0.1 in rows where the 'Cx ~ Lx' approximation has been used (i.e. x >= 3 and Lx < 0.1 Vx_1), then the corresponding Cx values may be overestimates. A warning and an HTML link are given in the table of sub-library statistics whenever this occurs.
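The warning rule above can be written as a small check over rows of the sub-library statistics table. This is a sketch only: the SubLibraryRow class, its field names and the example numbers are hypothetical, and the values of Lx, Vx_1 and Rx are assumed to be read from the PEDEL-AA output rather than computed here.

```python
from dataclasses import dataclass


@dataclass
class SubLibraryRow:
    """One row of the sub-library statistics table (hypothetical layout)."""
    x: int       # number of amino acid substitutions per variant
    Lx: float    # Lx value from the table
    Vx_1: float  # Vx_1 value used in the original criterion
    Rx: float    # mean frequency in Lx of the most common variant


def used_cx_approx(row):
    """Original rule for applying 'Cx ~ Lx': x >= 3 and Lx < 0.1 Vx_1."""
    return row.x >= 3 and row.Lx < 0.1 * row.Vx_1


def cx_may_be_overestimated(row, threshold=0.1):
    """Flag rows where the approximation was used although Rx > threshold."""
    return used_cx_approx(row) and row.Rx > threshold


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    rows = [
        SubLibraryRow(x=3, Lx=2.0e4, Vx_1=1.0e7, Rx=0.35),
        SubLibraryRow(x=4, Lx=5.0e3, Vx_1=1.0e9, Rx=0.002),
    ]
    for row in rows:
        if cx_may_be_overestimated(row):
            print(f"x = {row.x}: Cx may be an overestimate (Rx = {row.Rx})")
```

Rows with x < 3, or with Lx >= 0.1 Vx_1, are never flagged, because the 'Cx ~ Lx' approximation is not used for them under the original rule.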
