PEDEL: Programme for Estimating Diversity in Error-prone PCR Libraries

PEDEL

Programme for Estimating Diversity in Error-prone PCR Libraries.

Download pedel.cxx and pedel.batch.cxx (for calculating the expected number of distinct sequences in an epPCR library).
Download stats.batch.cxx (for calculating sub-library composition).
Download the Monte Carlo simulation programmes pedel_mc_run and pedel_mc.cxx.

The *.cxx programmes are written in C++ and should run under LINUX, MacOS-X or MS-Windows, provided you have a C++ compiler. The Monte Carlo programme pedel_mc_run is a C-shell script and should run under any LINUX/UNIX-like environment (including MacOS-X). Most users will not need to download the software, as the web server provides a more convenient interface.

Return to library statistics home.
Click here for some warnings.

Problem: Given a library of L sequences, comprising variants of a sequence of N nucleotides, into which random point mutations have been introduced, we wish to calculate the expected number of distinct sequences in the library. (Typically assuming L > 10, N > 5, and the mean number of mutations per sequence m < 0.1 x N).

1) pedel.cxx and pedel.batch.cxx

These two programmes calculate the expected number of distinct sequences in an epPCR library. In pedel.cxx, the user inputs the sequence length N, library size L, and mutation rate m (or lambda). It is assumed that m << N (e.g. m < 0.1 x N), N > 5 and L > 10. In pedel.batch.cxx, the user specifies a (logarithmically spaced) range of m, N or L values.

Compile the programmes as follows (replace 'gcc' by an appropriate alternative, e.g. 'c++' or 'g++', if you're using a different C++ compiler):

g++ -o pedel pedel.cxx
g++ -o pedel.batch pedel.batch.cxx

Run the programmes as follows:

./pedel L N m
./pedel.batch 1 L N m_0 m_1 nsteps outfile
./pedel.batch 2 m N L_0 L_1 nsteps outfile
./pedel.batch 3 L m N_0 N_1 nsteps outfile

where L = library size, N = sequence length, m = mean number of point mutations per sequence, ?_0 and ?_1 give a range covered with nsteps steps, and outfile is the output data file (e.g. use pedel.dat). pedel.batch.cxx writes the results to screen (html format) and to outfile (plain text).

Currently nsteps is limited to 100. You can change this by editing the

#define maxnsteps 100

line in pedel.batch.cxx, and recompiling.

Links to download programmes: pedel.cxx , pedel.batch.cxx.

2) stats.batch.cxx

This programme outputs to screen (html format) and to outfile (plain text) some of the intermediary statistics used by pedel.cxx. The user inputs L, N and m, and the programme calculates the statistics Px, Lx, Vx, Cx, Cx/Vx and Lx - Cx (see below) for x = 0,1,2,...,min(N,20). I.e. it gives the composition of each sub-library comprising all those sequences with exactly x mutations.

Statistics:

x = exact number of mutations per sequence.
Px = Poisson probability of x mutations, given m.
Lx = expected number of sequences in library with exactly x mutations.
Vx = number of possible sequences with exactly x mutations.
Cx = expected number of distinct sequences in the sub-library comprising sequences with exactly x mutations.
Cx/Vx = completeness of sub-library.
Lx - Cx = number of redundant sequences in sub-library.

Compile the programme as follows (replace 'gcc' by an appropriate alternative, e.g. 'c++' or 'g++', if you're using a different C++ compiler):

g++ -o stats.batch stats.batch.cxx

Run the programme as follows:

./stats.batch L N m outfile

where L = library size, N = sequence length, m = mean number of point mutations per sequence, and outfile is the output data file (e.g. use stats.dat).

Link to download programme: stats.batch.cxx.

3) pedel_mc_run and pedel_mc.cxx

The programmes pedel_mc_run and pedel_mc.cxx run Monte Carlo simulations for the PEDEL scenario. They simulate a number of libraries and calculate the mean and standard deviation of the number of distinct sequences per library, plus some other statistics.

The programmes are much slower than pedel.cxx, and are completely impractical for large library sizes or long sequence lengths (e.g. on my Pentium 4 LINUX laptop it takes about 3 minutes for 200 simulations x L = 1000 x N = 100, and 30 minutes for 200 simulations x L = 1000 x N = 1000, with m = 2). Also the statistics are generally less accurate than those generated by pedel.cxx - especially the statistics for smaller sub-libraries - unless you do lots of iterations. However you do get an estimate of the standard deviations of the various library statistics.

The programmes are mainly useful as a sanity check on pedel.cxx, especially for small sequence lengths or library sizes (e.g. < ~20) or relatively large mutation rates (e.g. > ~0.1 x sequence length), where some of the assumptions used in pedel.cxx are not met so well. In addition these programmes may be useful if you wish to simulate a library for other reasons (e.g. to calculate more detailed library statistics).

There are two programmes: pedel_mc.cxx and pedel_mc_run. Compile pedel_mc.cxx as follows (replace 'gcc' by an appropriate alternative, e.g. 'c++' or 'g++', if you're using a different C++ compiler):

g++ -o pedel_mc pedel_mc.cxx

pedel_mc_run is a C-shell script. You may need to type

chmod 700 pedel_mc_run

to make it into an executable. pedel_mc.cxx is called automatically by pedel_mc_run - you won't need to run pedel_mc.cxx directly.

Run the programme as follows:

./pedel_mc_run nsims L N m seed approx

where nsims = requested number of simulated libraries, L = library size (max 100000), N = sequence length (max 1000), m = mean number of point mutations per sequence, seed = random seed (positive integer), and approx = 1 or 0 tells the programme which method to use (see below; in general, use approx = 0).

The programme outputs to screen the mean and standard deviation of the number of distinct sequences per library, and similar statistics for the sub-libraries comprising those sequences with exactly x mutations (x = 0,1,2,3,...,20). The following output files are also produced:

hist.dat:
Statistics for the first simulated library. Columns:
1) x,
2) number of sequences in the library with exactly x mutations,
3) expected number for Poisson distribution.

library.dat:
Simulated sequences in the first simulated library. Columns:
1) number of mutations in the sequence,
2) the sequence (0 = unmutated nucleotide; 1,2,3 = mutated nucleotide).

ndistinct.dat:
List of the number of distinct sequences in each of the simulated libraries.

Sometimes just running pedel_mc_run with nsims = 1 may be sufficient for your purposes.

pedel_mc_run may use one of two methods for generating simulated mutated sequences:
approx = 1: For each sequence in the library, a random Poisson variable with mean m is used to select the number of point mutations. These are applied at random places in the sequence, with each mutation randomly selected from 0 -> 1, 2 or 3.
approx = 0: Every nucleotide in each simulated sequence is tested using a random number to decide whether it mutates to 1, 2 or 3 or remains 0.

I thought that the approx = 1 method would be quicker for large N, though a bit less accurate. In fact both seem to take pretty much the same run-time and give essentially the same results. So, in general, you should use approx = 0.

Current limits are maximum sequence length = 1000 and maximum library size = 100000. You can change these by editing the

#define maxn 1000
#define maxl 100000

lines in pedel_mc.cxx, and recompiling. Beware of increasing the maximum sequence length above about 10^9 as this may be too large for the random number generator to resolve (typically the random numbers have 9-10 random digits). Note also that maxn and maxl are integers. In general the compiler will limit the maximum size of integers to 2^31 ~= 2.1 x 10^9. Some compilers may limit the maximum size of integers to 2^15 ~= 32000. If maxn or maxl exceeds the relevant limit, then you will get nonsense results when you run the programme.

Link to download programmes: pedel_mc.cxx, pedel_mc_run.

Notes:

You must agree to the Terms of Usage before using any of this software.
If you use this software for publications, please cite Wayne M. Patrick, Andrew E. Firth and Jonathan M. Blackburn, 2003, User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries, Protein Engineering, 16, 451-457 or Andrew E. Firth and Wayne M. Patrick, 2005, Statistics of protein library construction, Bioinformatics, 21, 3314-3315.
If you seem to be getting bizarre results, check that none of the limitations on L, N, m etc. have been violated (see the maths notes).
All corrections and notifications of bugs are gratefully received.
Queries or comments to Andrew Firth (aef24cam.ac.uk).
AEF gratefully acknowledges funding from the Foundation for Research, Science and Technology, grant number UOOX0304.