PEDEL
Programme for Estimating
Diversity in Error-prone PCR Libraries.
The *.cxx programmes are written in C++ and should run under
LINUX, MacOS-X or MS-Windows, provided you have a C++ compiler. The
Monte Carlo programme pedel_mc_run is a C-shell script and
should run under any LINUX/UNIX-like environment (including MacOS-X).
Most users will not need to download the software, as the web server
provides a more convenient interface.
Return to library statistics home.
Click here for some warnings.
Problem: Given a library of L sequences, comprising
variants of a sequence of N nucleotides, into which random
point mutations have been introduced, we wish to calculate the
expected number of distinct sequences in the library. (Typically
assuming L > 10, N > 5, and the mean number of
mutations per sequence m < 0.1 x N).
1) pedel.cxx and pedel-batch.cxx
These two programmes calculate the expected number of distinct
sequences in an epPCR library. In pedel.cxx, the user inputs
the sequence length N, library size L, and mutation rate
m (or lambda). It is assumed that m << N
(e.g. m < 0.1 x N), N > 5 and
L > 10. In pedel-batch.cxx, the user specifies a
(logarithmically spaced) range of m, N or L
values.
Compile the programmes as follows (replace 'gcc' by an appropriate alternative,
e.g. 'c++' or 'g++', if you're using a different C++ compiler):
g++ -o pedel pedel.cxx
g++ -o pedel-batch pedel-batch.cxx
Run the programmes as follows:
./pedel L N m
./pedel-batch 1 L N m_0 m_1 nsteps outfile
./pedel-batch 2 m N L_0 L_1 nsteps outfile
./pedel-batch 3 L m N_0 N_1 nsteps outfile
where
L = library size,
N = sequence length,
m = mean number of point mutations per sequence,
?_0 and ?_1 give a range covered with nsteps steps,
and outfile is the output data file (e.g. use pedel.dat).
pedel-batch.cxx writes the results to screen (html format) and to
outfile (plain text).
Currently nsteps is limited to 100. You can change this by editing
the
#define maxnsteps 100
line in pedel-batch.cxx, and recompiling.
Links to download programmes: pedel.cxx
, pedel-batch.cxx.
2) stats-batch.cxx
This programme outputs to screen (html format) and to outfile
(plain text) some of the intermediary statistics used by
pedel.cxx. The user inputs L, N and m,
and the programme calculates the statistics Px, Lx,
Vx, Cx, Cx/Vx and Lx - Cx (see below) for x =
0,1,2,...,min(N,20). I.e. it gives the composition of each
sub-library comprising all those sequences with exactly x
mutations.
Statistics:
- x = exact number of mutations per sequence.
- Px = Poisson probability of x mutations, given m.
- Lx = expected number of sequences in library with exactly
x mutations.
- Vx = number of possible sequences with exactly x mutations.
- Cx = expected number of distinct sequences in the sub-library
comprising sequences with exactly x mutations.
- Cx/Vx = completeness of sub-library.
- Lx - Cx = number of redundant sequences in sub-library.
Compile the programme as follows (replace 'gcc' by an appropriate alternative,
e.g. 'c++' or 'g++', if you're using a different C++ compiler):
g++ -o stats-batch stats-batch.cxx
Run the programme as follows:
./stats-batch L N m outfile
where
L = library size,
N = sequence length,
m = mean number of point mutations per sequence,
and outfile is the output data file (e.g. use stats.dat).
Link to download programme:
stats-batch.cxx.
3) pedel_mc_run and
pedel_mc.cxx
The programmes pedel_mc_run and pedel_mc.cxx run
Monte Carlo simulations for the PEDEL scenario. They simulate a
number of libraries and calculate the mean and standard deviation of
the number of distinct sequences per library, plus some other
statistics.
The programmes are much slower than pedel.cxx, and are
completely impractical for large library sizes or long sequence
lengths (e.g. on my Pentium 4 LINUX laptop it takes about 3 minutes
for 200 simulations x
L = 1000 x N = 100, and 30 minutes for 200 simulations x
L = 1000 x N = 1000, with m = 2). Also the
statistics are generally less accurate than those generated by
pedel.cxx - especially the statistics for smaller
sub-libraries - unless you do lots of iterations. However you do get
an estimate of the standard deviations of the various library
statistics.
The programmes are mainly useful as a sanity check on
pedel.cxx, especially for small sequence lengths or library
sizes (e.g. < ~20) or relatively large mutation rates (e.g. > ~0.1
x sequence length), where some of the assumptions used in
pedel.cxx are not met so well. In addition these programmes
may be useful if you wish to simulate a library for other reasons
(e.g. to calculate more detailed library statistics).
There are two programmes: pedel_mc.cxx and pedel_mc_run.
Compile pedel_mc.cxx as follows (replace 'gcc' by an appropriate
alternative, e.g. 'c++' or 'g++', if you're using a different C++ compiler):
g++ -o pedel_mc pedel_mc.cxx
pedel_mc_run is a C-shell script. You may need to type
chmod 700 pedel_mc_run
to make it into an executable. pedel_mc.cxx is called
automatically by pedel_mc_run - you won't need to run
pedel_mc.cxx directly.
Run the programme as follows:
./pedel_mc_run nsims L N m seed approx
where
nsims = requested number of simulated libraries,
L = library size (max 100000),
N = sequence length (max 1000),
m = mean number of point mutations per sequence,
seed = random seed (positive integer), and
approx = 1 or 0 tells the programme which method to use (see below; in
general, use approx = 0).
The programme outputs to screen the mean and standard deviation of the
number of distinct sequences per library, and similar statistics for the
sub-libraries comprising those sequences with exactly x mutations
(x = 0,1,2,3,...,20). The following output files are also produced:
hist.dat:
Statistics for the first simulated library. Columns:
1) x,
2) number of sequences in the library with exactly x mutations,
3) expected number for Poisson distribution.
library.dat:
Simulated sequences in the first simulated library. Columns:
1) number of mutations in the sequence,
2) the sequence (0 = unmutated nucleotide; 1,2,3 = mutated nucleotide).
ndistinct.dat:
List of the number of distinct sequences in each of the simulated libraries.
Sometimes just running pedel_mc_run with nsims = 1
may be sufficient for your purposes.
pedel_mc_run may use one of two methods for generating
simulated mutated sequences:
approx = 1: For each sequence in the library, a random
Poisson variable with mean m is used to select the number of
point mutations. These are applied at random places in the sequence,
with each mutation randomly selected from 0 -> 1, 2 or 3.
approx = 0: Every nucleotide in each simulated sequence is
tested using a random number to decide whether it mutates to 1, 2 or 3
or remains 0.
I thought that the approx = 1 method would be quicker for large
N, though a bit less accurate. In fact both seem to take
pretty much the same run-time and give essentially the same results.
So, in general, you should use approx = 0.
Current limits are maximum sequence length = 1000 and maximum library size =
100000. You can change these by editing the
#define maxn 1000
#define maxl 100000
lines in pedel_mc.cxx, and recompiling. Beware of increasing
the maximum sequence length above about 10^9 as this may be too large
for the random number generator to resolve (typically the random
numbers have 9-10 random digits). Note also that maxn and
maxl are integers. In general the compiler will limit the
maximum size of integers to 2^31 ~= 2.1 x 10^9. Some compilers may
limit the maximum size of integers to 2^15 ~= 32000. If maxn
or maxl exceeds the relevant limit, then you will get nonsense
results when you run the programme.
Link to download programmes:
pedel_mc.cxx,
pedel_mc_run.
Notes:
- You must agree to the Terms of Usage
before using any of this software.
- If you use this software for publications, please cite Wayne M. Patrick,
Andrew E. Firth and Jonathan M. Blackburn, 2003, User-friendly algorithms
for estimating completeness and diversity in randomized protein-encoding
libraries, Protein Engineering, 16, 451-457 or Andrew E.
Firth and Wayne M. Patrick, 2005, Statistics of protein library
construction, Bioinformatics, 21, 3314-3315.
- If you seem to be getting bizarre results, check that none of the
limitations on L, N, m etc. have been violated (see
the maths notes).
- All corrections and notifications of bugs are gratefully received.
- Queries or comments to Andrew Firth (aef24cam.ac.uk).
- AEF gratefully acknowledges funding from the Foundation for Research,
Science and Technology, grant number UOOX0304.