Overview of PEDEL-AA algorithms:

PEDEL-AA is an extension to amino acid sequences of the original nucleotide version of PEDEL (see links to publications and mathematics notes from the statistics home page). Due to the more complex problem of estimating diversity and completeness at the amino acid level (as opposed to nucleotides and codons), there are some major differences in the algorithms and a few extra approximations. A brief description of the procedure follows.

First, to recapitulate the nucleotide version of PEDEL: As discussed in the mathematics notes, for the nucleotide version, if the input library is conceptually divided into sub-libraries Lx (x=0,1,2,...) where the sub-library Lx comprises all variants in the library with exactly x nucleotide substitutions, then the PEDEL scenario divides into two regions:
  1. where Lx is large compared with Vx (the number of possible variants with exactly x substitutions) then Cx (the expected number of distinct sequences in Lx) is approximately equal to Vx, and
  2. where Lx is small compared with Vx, then Cx is approximately equal to Lx.
In the transition region (where Lx ~ Vx) we can calculate Cx with the formula Cx ~ Vx(1-exp(-Lx/Vx)). This is based on the assumption that all variants in Vx are equiprobable, so the mean number of occurrences in the sub-library Lx of each variant in Vx is Lx/Vx and, assuming Poisson statistics, the probability that any given variant is present in the sub-library is 1-exp(-Lx/Vx), so the expected number of distinct variants present in the sub-library is Cx ~ Vx(1-exp(-Lx/Vx)).

In the more complex scenario presented in PEDEL-AA, the assumption of equiprobable variants breaks down for two reasons: (i) we have introduced a full 4 x 4 nucleotide substitution matrix (in particular the transition:transversion ratio is not assumed to be unity), and (ii) even if nucleotide substitutions were equiprobable, the corresponding amino acid substitutions are not. However we may still borrow some concepts from the equiprobable nucleotide version of PEDEL - namely, (1) when Lx is small compared with Vx, then Cx is approximately equal to Lx, and (2) when Lx is large compared with Vx then Cx is approximately equal to Vx. However these concepts need some refining, as follows.

  1. Instead of x counting the number of nucleotide substitutions, we now let x count the number of amino acid substitutions (i.e. non-synonymous, non-stop codon, substitutions). Note that the substitution rate on the PEDEL-AA input form is the mean number of nucleotide substitutions, nsubst, per variant but, using this, the programme calculates the expected mean number of nonsynonymous amino acid substitutions per variant. The probability distribution, P(x_nt), of the number of nucleotide substitutions, x_nt, per variant may be assumed to follow the PCR distribution (with mean nsubst, and the extra parameters ncycles and eff, respectively the number of PCR cycles and the PCR efficiency). Separate statistics are also calculated assuming that the number of nucleotide substitutions per variant follows a Poisson distribution (with the single parameter nsubst), and these results may be used if the PCR ncycles and eff parameters are unknown.

    The probabilities of variants being truncated (i.e. containing introduced stop codons) are then subtracted from the P(x_nt) distributions. Clearly this is an increasing function of x_nt and, by x_nt = 100, typically less than 0.6% of variants are stop-codon free. Note that the P(x_nt) will no longer sum up to unity; instead (after discarding indel-containing variants) they sum up to L_eff / L_tot, where L_eff is the 'effective' library size (i.e. the number of variants with no indels or stop codons) and L_tot is the total library size (albeit again excluding variants with indels).

    Next, the Poisson and PCR P(x_nt) distributions are redistributed into amino acid P(x_aa) distributions. First the mean number, frac, of nonsynonymous amino acid substitutions per nucleotide substitution (given that the nucleotide substitution doesn't produce a stop codon) is calculated. Typically frac ~ 2/3. For each x_nt, the number of nonsynonymous amino acid substitutions resulting from exactly x_nt nucleotide substitutions is assumed to follow a binomial distribution, B(x_nt,frac) (i.e. x_nt 'trials'; probability of 'success' per 'trial' = frac). Summing up the binomial distributions, each weighted by P(x_nt), for x_nt = 0,1,2,...,100 gives the amino acid Poisson and PCR P(x_aa) distributions. Of course, Sum_{x_nt} P(x_nt) = Sum_{x_aa} P(x_aa) = L_eff / L_tot. The Poisson and PCR amino acid sub-library sizes, Lx, are given by P(x_aa) x L_tot.

    All these estimates rely on the mean number of nucleotide substitutions per variant, nsubst, being relatively small compared with the number of codons in the sequence, so that multiple substitutions in the same codon are not very common. In practice, we limit nsubst <= 0.1 x input sequence length (in nucleotides). In fact, for the Poisson case, we can calculate L0, L1 and L2 exactly (a sum over all possible variants with exactly 0, 1 or 2 amino acid substitutions, multiplied by their probabilities given by the input nucleotide substitution matrix and nsubst, multiplied by L_tot). These calculations agree very well with the 'sum of binomial distributions' method. For example, for the library presented in Volles & Lansbury (2005), we have
                  'exact'       'binomial sum'
          L0     3.763e+05         3.861e+05
          L1     8.174e+05         8.205e+05
          L2     8.795e+05         8.717e+05
    

  2. Two estimates of Vx are calculated. The first, Vx_1, is an estimate of the number of 'common' variants with exactly x amino acid substitutions - namely those variants where each substituted amino acid is accessible by just a single nucleotide substitution in the respective codon. The second, Vx_2, is the total number of possible amino acid variants with exactly x amino acid substitutions. Although most variants in a sub-library will be of type Vx_1, variants of type Vx_2 may contribute significantly to the total number of distinct variants Cx when the sub-library size Lx is large compared with the number of variants Vx_1. When the sub-library size Lx is small compared with the number of variants Vx_1, nearly every variant in Lx is distinct (i.e. Cx ~ Lx) and it doesn't matter whether these variants are of type Vx_1 or Vx_2.
    Calculation of A: For each codon, i, in the parent sequence, the number of non-synonymous, non-stop codon, amino acid substitutions Ai, i=1,...,N (N = parent sequence length in codons) accessible by a single nucleotide substitution is calculated. Such substitutions are accessible with similar probability (albeit varying by a factor of ~3 for a typical transition:transversion ratio) and much higher probability than amino acid substitutions only accessible via 2 or 3 nucleotide substitutions in the same codon (except in the case of a short parent sequence and/or high mutation rate). The Ai are averaged (geometric mean) over the sequence to give the parameter A = (A1 x A2 x ... x An)^(1/N). A is typically around 5.8.

  3. When Lx << Vx then Cx ~ Lx. For equiprobable variants this approximation is good to 5% for Lx < 0.1 Vx (see the mathematical notes on the nucleotide version of PEDEL). For PEDEL-AA, we use this approximation when Lx < 0.1 Vx_1 (the number of 'easy-to-reach' variants) or more rigorously, and optionally, when Rx < 0.1, where Rx is the mean frequency of the most common variant in the sub-library Lx (see note on Rx statistic for details). Lx < 0.1 Vx_1 is usually true for x >= 3 and almost always true for x >= 4. For example for N >= 40 codons, V3_1 >= ~5.8^3 x C(40,3) = 1.9 x 10^6 and V4_1 >= ~5.8^4 x C(40,4) = 1.0 x 10^8, while for N >= 100 codons, V3_1 >= ~5.8^3 x C(100,3) = 3.1 x 10^7 and V4_1 >= ~5.8^4 x C(100,4) = 4.4 x 10^9; and remember these Vx_1 sizes only need to be compared with the relevant sub-library size L3 or L4, not the whole library size. In the Cx ~ Lx region, whether a particular variant is of type Vx_1 or type Vx_2 is irrelevant - either way it will (almost) always be a distinct variant in the library.

  4. For x = 0, 1 and 2, we calculate the expected number of distinct variants, Cx, precisely. This calculation includes variants with multiple nucleotide substitutions in the same codon (i.e. both Vx_1 and Vx_2). The total number of each of the 64 codon types in the input sequence is calculated. The 64 x 20 matrix of probabilities for each codon type mutating to each amino acid type is calculated using the input nucleotide substitution matrix and the input substitution rate. For x = 0, 1 and 2 there are, respectively Vx_2 = 1, 19N and 361N(N-1)/2 total possible variants (i.e. N!/[x!(N-x)!] 19^x, where N is the length of the input sequence in codons). The probability of the input sequence mutating to each of these possible variants is calculated and renormalized by the respective probability sum P0, P1 or P2 (where Px = Sum_{v_i in Vx_2} P(v_i)) to give the normalized probabilities Pn(v_i) of the different variants within the respective sub-libraries Lx, rather than within the whole library. The probability of a particular variant v_i being present in the relevant sub-library Lx is given by 1 - exp(-Pn(v_i) x Lx). These probabilities are quickly summed over all possible variants using the codon counts. Computationally, this is very fast for x = 0, 1 and 2, but can take a few minutes for x = 3; hence the 'exact' calculation is not used on the webserver for x >= 3. The sizes of the sub-libraries Lx are determined separately for the Poisson and PCR distributions as described above, thus resulting in separate Cx estimates for the different distributions.

  5. Ideally, for x >= 3, we will enter the Cx ~ Lx region. In this case all the individual Cx estimates, and the estimated total number of distinct variants in the library C = C0 + C1 + C2 + ..., will be fairly good. A warning is printed in the 'notes' column of the output table of sub-library statistics if there are any x >= 3 values for which the Cx ~ Lx approximation may not apply, in which case Cx is estimated with the formula Cx ~ Vx_1(1-exp(-Lx/Vx_1)) (i.e. ignoring, in these particular sub-libraries, any variants of type Vx_2). This formula is not very accurate and may result in an overestimate (because the Vx_1 are not equiprobable - the higher probability amino acid substitutions [e.g. those accessible by transitions, or via more than one possible nucleotide substitution] will be over-represented at the expense of other amino acid substitutions) or an underestimate (since the Vx_2 variants are ignored). These effects can be quite a large (e.g. if some Vx_1 substitutions are 3 times more likely than others, and the Lx ~ Vx_1 turnover occurs at x ~ 3, then the most common three-amino acid substitutions will be 3^3 = 27 times more likely than the rarest three-amino acid substitutions).

Note that the introduced stop codon and indel statistics and graphs are exact calculations (based on the input substitution, indel and nucleotide matrix parameters) and do not use any of the above approximations (except Poisson statistics). The above approximations are only used for the library completeness statistics.


Comparison with Volles & Lansbury (2005) Monte Carlo simulation.

Link to PEDEL-AA form filled out with Volles & Lansbury parameters.

PropertyVolles & Lansbury Firth & Patrick
-- PoissonPCR
Truncations (%) 15 15.6
# Full-length clones 3.1 x 10^6 3.18 x 10^6
Protein mutation freq. per aa 0.016 0.0160
Mean # mutations per protein 2.1 2.12
Unmutated sequences (%)* 14 10.1 14.0
# of unique proteins 1.3 x 10^6 1.32 x 10^6 1.29 x 10^6
# of unique point mutations 1990 1989
# of unique single point mutations 1566 1618 1618

* Relative to the total library size (i.e. including truncated variants).
References

  • Volles M.J., Lansbury P.T. Jr. (2005). A computer program for the estimation of protein and nucleic acid sequence diversity in random point mutagenesis libraries, Nucleic Acids Res. 33, 3667-3677.

  • Return to PEDEL-AA server page.