PEDEL-AA is an extension to amino acid sequences of the original nucleotide version of PEDEL (see links to publications and mathematics notes from the statistics home page). Due to the more complex problem of estimating diversity and completeness at the amino acid level (as opposed to nucleotides and codons), there are some major differences in the algorithms and a few extra approximations. A brief description of the procedure follows.

First, to recapitulate the nucleotide version of PEDEL: As discussed in the mathematics notes, for the nucleotide version, if the input library is conceptually divided into sub-libraries Lx (x=0,1,2,...) where the sub-library Lx comprises all variants in the library with exactly x nucleotide substitutions, then the PEDEL scenario divides into two regions:

- where Lx is large compared with Vx (the number of possible variants with exactly x substitutions) then Cx (the expected number of distinct sequences in Lx) is approximately equal to Vx, and
- where Lx is small compared with Vx, then Cx is approximately equal to Lx.

In the more complex scenario presented in PEDEL-AA, the assumption of equiprobable variants breaks down for two reasons: (i) we have introduced a full 4 x 4 nucleotide substitution matrix (in particular the transition:transversion ratio is not assumed to be unity), and (ii) even if nucleotide substitutions were equiprobable, the corresponding amino acid substitutions are not. However we may still borrow some concepts from the equiprobable nucleotide version of PEDEL - namely, (1) when Lx is small compared with Vx, then Cx is approximately equal to Lx, and (2) when Lx is large compared with Vx then Cx is approximately equal to Vx. However these concepts need some refining, as follows.

- Instead of x counting the number of nucleotide substitutions, we
now let x count the number of amino acid substitutions
(i.e. non-synonymous, non-stop codon, substitutions). Note that the
substitution rate on the PEDEL-AA input form is the mean number
of
*nucleotide*substitutions,*nsubst*, per variant but, using this, the programme calculates the expected mean number of nonsynonymous amino acid substitutions per variant. The probability distribution, P(x_nt), of the number of nucleotide substitutions, x_nt, per variant may be assumed to follow the PCR distribution (with mean*nsubst*, and the extra parameters*ncycles*and*eff*, respectively the number of PCR cycles and the PCR efficiency). Separate statistics are also calculated assuming that the number of nucleotide substitutions per variant follows a Poisson distribution (with the single parameter*nsubst*), and these results may be used if the PCR*ncycles*and*eff*parameters are unknown.

The probabilities of variants being truncated (i.e. containing introduced stop codons) are then subtracted from the P(x_nt) distributions. Clearly this is an increasing function of x_nt and, by x_nt = 100, typically less than 0.6% of variants are stop-codon free. Note that the P(x_nt) will no longer sum up to unity; instead (after discarding indel-containing variants) they sum up to*L_eff*/*L_tot*, where*L_eff*is the 'effective' library size (i.e. the number of variants with no indels or stop codons) and*L_tot*is the total library size (albeit again excluding variants with indels).

Next, the Poisson and PCR P(x_nt) distributions are redistributed into amino acid P(x_aa) distributions. First the mean number,*frac*, of nonsynonymous amino acid substitutions per nucleotide substitution (given that the nucleotide substitution doesn't produce a stop codon) is calculated. Typically*frac*~ 2/3. For each x_nt, the number of nonsynonymous amino acid substitutions resulting from exactly x_nt nucleotide substitutions is assumed to follow a binomial distribution, B(x_nt,*frac*) (i.e. x_nt 'trials'; probability of 'success' per 'trial' =*frac*). Summing up the binomial distributions, each weighted by P(x_nt), for x_nt = 0,1,2,...,100 gives the amino acid Poisson and PCR P(x_aa) distributions. Of course, Sum_{x_nt} P(x_nt) = Sum_{x_aa} P(x_aa) =*L_eff*/*L_tot*. The Poisson and PCR amino acid sub-library sizes, Lx, are given by P(x_aa) x*L_tot*.

All these estimates rely on the mean number of nucleotide substitutions per variant,*nsubst*, being relatively small compared with the number of codons in the sequence, so that multiple substitutions in the same codon are not very common. In practice, we limit*nsubst*<= 0.1 x input sequence length (in nucleotides). In fact, for the Poisson case, we can calculate L0, L1 and L2 exactly (a sum over all possible variants with exactly 0, 1 or 2 amino acid substitutions, multiplied by their probabilities given by the input nucleotide substitution matrix and*nsubst*, multiplied by*L_tot*). These calculations agree very well with the 'sum of binomial distributions' method. For example, for the library presented in Volles & Lansbury (2005), we have'exact' 'binomial sum' L0 3.763e+05 3.861e+05 L1 8.174e+05 8.205e+05 L2 8.795e+05 8.717e+05

- Two estimates of Vx are calculated. The first, Vx_1, is an
estimate of the number of 'common' variants with exactly x amino acid
substitutions - namely those variants where each substituted amino
acid is accessible by just a single nucleotide substitution in the
respective codon. The second, Vx_2, is the total number of possible
amino acid variants with exactly x amino acid substitutions. Although
most variants in a sub-library will be of type Vx_1, variants of type
Vx_2 may contribute significantly to the total number of distinct
variants Cx when the sub-library size Lx is large compared with the
number of variants Vx_1. When the sub-library size Lx is small
compared with the number of variants Vx_1, nearly every variant in Lx
is distinct (i.e. Cx ~ Lx) and it doesn't matter whether these
variants are of type Vx_1 or Vx_2.

- Vx_2 is given by C(N,x) 19^x, where C(N,x) is the combinatorial function, i.e. C(N,x) = N!/[x!(N-x)!], and N is the length of the input sequence in codons. (Cf equation (5) of the mathematical notes for the nucleotide version of PEDEL.)
- We estimate Vx_1 with the formula C(N,x) A^x, where A is the mean number of non-synonymous amino acid substitutions accessible by a single nucleotide substitution, for the input sequence (the value of A is given on the results page).

- When Lx << Vx then Cx ~ Lx. For equiprobable variants this
approximation is good to 5% for Lx < 0.1 Vx (see the mathematical
notes on the nucleotide version of PEDEL).
For PEDEL-AA, we use this approximation when Lx < 0.1 Vx_1 (the number
of 'easy-to-reach' variants) or more rigorously, and optionally, when
Rx < 0.1, where Rx is the mean frequency of the most common variant in
the sub-library Lx (see note on Rx
statistic for details). Lx < 0.1 Vx_1 is usually true for x
>= 3 and almost always true for x >= 4. For example for N >= 40
codons, V3_1 >= ~5.8^3 x C(40,3) = 1.9 x 10^6 and V4_1 >= ~5.8^4 x
C(40,4) = 1.0 x 10^8, while for N >= 100 codons, V3_1 >= ~5.8^3 x
C(100,3) = 3.1 x 10^7 and V4_1 >= ~5.8^4 x C(100,4) = 4.4 x 10^9; and
remember these Vx_1 sizes only need to be compared with the relevant
sub-library size L3 or L4, not the whole library size. In the Cx ~ Lx
region, whether a particular variant is of type Vx_1 or type Vx_2 is
irrelevant - either way it will (almost) always be a distinct variant
in the library.

- For x = 0, 1 and 2, we calculate the expected number of distinct
variants, Cx, precisely. This calculation includes variants with
multiple nucleotide substitutions in the same codon (i.e. both Vx_1
and Vx_2).
The total number of each of the 64 codon types in the input sequence
is calculated. The 64 x 20 matrix of probabilities for each codon
type mutating to each amino acid type is calculated using the input
nucleotide substitution matrix and the input substitution rate.
For x = 0, 1 and 2 there are, respectively Vx_2 = 1, 19N and
361N(N-1)/2 total possible variants (i.e. N!/[x!(N-x)!] 19^x, where N
is the length of the input sequence in codons). The probability of
the input sequence mutating to each of these possible variants is
calculated and renormalized by the respective probability sum P0, P1
or P2 (where Px = Sum_{v_i in Vx_2} P(v_i)) to give the normalized
probabilities Pn(v_i) of the different variants within the respective
sub-libraries Lx, rather than within the whole library. The
probability of a particular variant v_i being present in the relevant
sub-library Lx is given by 1 - exp(-Pn(v_i) x Lx). These
probabilities are quickly summed over all possible variants using the
codon counts. Computationally, this is very fast for x = 0, 1 and 2,
but can take a few minutes for x = 3; hence the 'exact' calculation is
not used on the webserver for x >= 3. The sizes of the sub-libraries
Lx are determined separately for the Poisson and PCR distributions as
described above, thus resulting in separate Cx estimates for the
different distributions.

- Ideally, for x >= 3, we will enter the Cx ~ Lx region. In this
case all the individual Cx estimates, and the estimated total number
of distinct variants in the library C = C0 + C1 + C2 + ..., will be
fairly good. A warning is printed in the 'notes' column of the output
table of sub-library statistics if there are any x >= 3 values for
which the Cx ~ Lx approximation may not apply, in which case Cx is
estimated with the formula Cx ~ Vx_1(1-exp(-Lx/Vx_1)) (i.e. ignoring,
in these particular sub-libraries, any variants of type Vx_2). This
formula is not very accurate and may result in an overestimate
(because the Vx_1 are not equiprobable - the higher probability amino
acid substitutions [e.g. those accessible by transitions, or via more
than one possible nucleotide substitution] will be over-represented at
the expense of other amino acid substitutions) or an underestimate
(since the Vx_2 variants are ignored). These effects can be quite a
large (e.g. if some Vx_1 substitutions are 3 times more likely than
others, and the Lx ~ Vx_1 turnover occurs at x ~ 3, then the most
common three-amino acid substitutions will be 3^3 = 27 times more
likely than the rarest three-amino acid substitutions).

Link to PEDEL-AA form filled out with Volles & Lansbury parameters.

Property | Volles & Lansbury | Firth & Patrick | |
---|---|---|---|

- | - | Poisson | PCR |

Truncations (%) | 15 | 15.6 | |

# Full-length clones | 3.1 x 10^6 | 3.18 x 10^6 | |

Protein mutation freq. per aa | 0.016 | 0.0160 | |

Mean # mutations per protein | 2.1 | 2.12 | |

Unmutated sequences (%)* | 14 | 10.1 | 14.0 |

# of unique proteins |
1.3 x 10^6 |
1.32 x 10^6 |
1.29 x 10^6 |

# of unique point mutations | 1990 | 1989 | |

# of unique single point mutations | 1566 | 1618 | 1618 |

* Relative to the total library size (i.e. including truncated variants).

Return to PEDEL-AA server page.