GLUE
Programmes for libraries comprising a random
sampling of equally probable variants.
The programmes are written in C++ and should run under
LINUX, MacOS-X or MS-Windows, provided you have a C++ compiler.
Most users will not need to download the software, as the web server
provides a more convenient interface.
Problem: Given a library of L sequences, where each sequence
is chosen at random from a set of V equiprobable variants, we
wish to calculate the expected number of distinct sequences in the
library. Alternatively, given a set of V equiprobable variants,
we wish to calculate the library size L necessary to obtain a
given percentage completeness, or to have a given probability of being
100% complete. (Typically assuming V >> 1, e.g. V >
10.)
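As a rough illustration of the first question, the sketch below computes the expected number of distinct sequences in two ways: exactly, and under a Poisson approximation that is reasonable when V >> 1. The function names are hypothetical and are not taken from glue.cxx; whether glue.cxx uses exactly these expressions is an assumption.

```cpp
#include <cmath>

// Hypothetical helper (not from glue.cxx).
// Exact expectation: a given variant is missed by all L independent draws
// with probability (1 - 1/V)^L, so V * (1 - (1 - 1/V)^L) variants are
// expected to be present.
double expected_distinct_exact(double V, double L) {
    // log1p keeps precision when 1/V is very small.
    return V * (1.0 - std::exp(L * std::log1p(-1.0 / V)));
}

// Hypothetical helper (not from glue.cxx).
// Poisson approximation for V >> 1: (1 - 1/V)^L is replaced by exp(-L/V).
double expected_distinct_poisson(double V, double L) {
    // -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x.
    return V * (-std::expm1(-L / V));
}
```

Dividing either result by V gives the expected completeness as a fraction. For large V the two functions agree closely; for very small V the exact form is the safer choice.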
1) glue.cxx
Programme for calculating any of the following:
the expected completeness of a given library,
the required library size for a given expected completeness,
the required library size for a given probability of being 100%
complete,
where the sequences in the library are chosen at random from a set of
equally probable variants.
Compile the programme as follows (replace 'g++' by an appropriate alternative,
e.g. 'c++' or 'clang++', if you're using a different C++ compiler):
g++ -o glue glue.cxx
Run the programme as follows (three alternative modes):
./glue 1 nvariants library_size
./glue 2 nvariants completeness
./glue 3 nvariants prob_100%_complete
where
nvariants = number of equally probable variants,
library_size = library size,
completeness = required library completeness,
and prob_100%_complete = required probability that the library
is 100% complete.
For the first mode, glue.cxx will return the expected library
completeness. For the other two modes, glue.cxx will return the
required library size.
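The two inverse modes can be sketched as follows, assuming the Poisson approximation (completeness = 1 - exp(-L/V)) and treating the V variants as independent when asking for a 100% complete library. The function names, and the choice to pass completeness and probability as fractions between 0 and 1, are assumptions made for this illustration; glue.cxx itself may use different formulae or input units.

```cpp
#include <cmath>

// Hypothetical helper (not from glue.cxx).
// Mode 2 sketch: invert F = 1 - exp(-L/V) to get L = -V ln(1 - F).
// F is a fraction with 0 <= F < 1 (an assumption about units).
double size_for_completeness(double V, double F) {
    return -V * std::log1p(-F);
}

// Hypothetical helper (not from glue.cxx).
// Mode 3 sketch: assume P = (1 - exp(-L/V))^V, i.e. each variant is
// present independently, and invert to L = -V ln(1 - P^(1/V)).
// P is a probability with 0 < P < 1.
double size_for_full_coverage(double V, double P) {
    // 1 - P^(1/V), computed via expm1 to avoid cancellation for large V.
    double miss = -std::expm1(std::log(P) / V);
    return -V * std::log(miss);
}
```

Under these assumptions, a 95% complete library needs about three times as many sequences as there are variants (ln 20 is roughly 3), whereas a library that is 100% complete with 95% probability needs many more.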
Link to download programme: glue.cxx.
2) glue_mc.cxx
A Monte Carlo simulation for finding the mean (and standard deviation)
completeness of a library, and the proportion of libraries that are 100%
complete, for a given library size and number of equally probable variants.
This programme is slower than glue.cxx, especially for large
library sizes and large numbers of simulated libraries. However, it
also gives an estimate of the standard deviation of the library
completeness statistic. In general the statistics agree very well
with glue.cxx. The programme is mainly useful as a sanity
check for glue.cxx, especially for small numbers of variants
(e.g. fewer than about 10), where some of the assumptions used in
glue.cxx are not met so well.
Compile the programme as follows (replace 'g++' by an appropriate alternative,
e.g. 'c++' or 'clang++', if you're using a different C++ compiler):
g++ -o glue_mc glue_mc.cxx
Run the programme as follows:
./glue_mc nvariants library_size nsims seed
where
nvariants = number of equally probable variants,
library_size = library size,
nsims = requested number of simulated libraries,
and seed = random seed (positive integer).
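For orientation, a Monte Carlo run of this kind can be sketched as below. This is a minimal illustration rather than the code in glue_mc.cxx: the struct, the function name and the choice of the standard-library generator std::mt19937_64 are all assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical result type (not from glue_mc.cxx).
struct McResult {
    double mean_completeness;  // mean fraction of variants present
    double sd_completeness;    // sample standard deviation of that fraction
    double frac_complete;      // proportion of libraries with every variant
};

// Hypothetical function: simulate nsims libraries, each made of
// library_size uniform draws from nvariants equally probable variants.
McResult simulate_libraries(std::uint32_t nvariants, std::uint64_t library_size,
                            int nsims, std::uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::uint32_t> pick(0, nvariants - 1);
    std::vector<char> seen(nvariants);
    double sum = 0.0, sum_sq = 0.0;
    int n_complete = 0;
    for (int s = 0; s < nsims; ++s) {
        std::fill(seen.begin(), seen.end(), 0);
        std::uint32_t distinct = 0;
        for (std::uint64_t i = 0; i < library_size; ++i) {
            std::uint32_t v = pick(rng);
            if (!seen[v]) { seen[v] = 1; ++distinct; }
        }
        double c = static_cast<double>(distinct) / nvariants;
        sum += c;
        sum_sq += c * c;
        if (distinct == nvariants) ++n_complete;
    }
    double mean = sum / nsims;
    // Unbiased sample variance; clamp tiny negative rounding errors to zero.
    double var = nsims > 1 ? (sum_sq - nsims * mean * mean) / (nsims - 1) : 0.0;
    if (var < 0.0) var = 0.0;
    return {mean, std::sqrt(var), static_cast<double>(n_complete) / nsims};
}
```

A call such as simulate_libraries(1000, 1000, 200, 1) should give a mean completeness close to 1 - exp(-1), i.e. roughly 63%. The run time grows with library_size times nsims.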
The programme outputs to screen the mean and standard deviation of the
library completeness statistics, plus the proportion of the simulated
libraries that are 100% complete. The following output file is also produced:
histogram.dat:
Statistics averaged over all libraries. Columns:
1) x (0 <= x < 100)
2) fraction of variants that occur exactly x times in the library,
3) expected number for Poisson distribution.
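The Poisson column can be reproduced independently with a short sketch like the one below, assuming it holds the Poisson probability of a variant occurring exactly x times when the mean number of occurrences per variant is library_size / nvariants. That reading of column 3, and the function name, are assumptions.

```cpp
#include <cmath>

// Hypothetical helper (not from glue_mc.cxx).
// Poisson probability that a given variant occurs exactly x times,
// with mean lambda = L / V. Assumes L > 0, V > 0 and x >= 0.
double poisson_fraction(double V, double L, int x) {
    double lambda = L / V;
    // Work in logs so that large x does not overflow the factorial.
    return std::exp(-lambda + x * std::log(lambda) - std::lgamma(x + 1.0));
}
```

Comparing column 2 of histogram.dat against these values is one way to see how well the Poisson assumption holds for a given nvariants and library_size.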
Current limits are maximum number of variants = 10^7, maximum number
of simulations = 10^5 and maximum library size = 10^8. You can change
these by editing the
#define maxvar 10000000 // 10^7
#define maxnsims 100000 // 10^5
#define maxnlibrary 100000000 // 10^8
lines in glue_mc.cxx, and recompiling. Beware of increasing
the maximum number of variants above about 10^9 as this may be too
large for the random number generator to resolve (typically the random
numbers have 9-10 random digits). Note also that these values are
integers. In general the compiler will limit the maximum size of
integers to 2^31 - 1 ~= 2.1 x 10^9. Some older compilers may limit the
maximum size of integers to 2^15 - 1 = 32767. If any of these values
exceeds the relevant limit, then you will get nonsense results when you
run the programme.
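One way to check a proposed limit against your compiler before editing the #define lines is sketched below. The helper name is hypothetical and does not appear in glue_mc.cxx; std::numeric_limits is part of the standard C++ library.

```cpp
#include <limits>

// Hypothetical helper (not from glue_mc.cxx).
// Returns true if a proposed limit can be stored in a plain int
// on the compiler being used.
bool fits_in_int(long long value) {
    return value >= 0 && value <= std::numeric_limits<int>::max();
}
```

For example, fits_in_int(100000000) tells you whether the default maximum library size of 10^8 is safe on your platform; it is safe wherever int is at least 32 bits wide.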
Link to download programme: glue_mc.cxx.
Notes:
- You must agree to the Terms of Usage
before using any of this software.
- If you use this software for publications, please cite Wayne M. Patrick,
Andrew E. Firth and Jonathan M. Blackburn, 2003, User-friendly algorithms
for estimating completeness and diversity in randomized protein-encoding
libraries, Protein Engineering, 16, 451-457 or Andrew E.
Firth and Wayne M. Patrick, 2005, Statistics of protein library
construction, Bioinformatics, 21, 3314-3315.
- If you seem to be getting bizarre results, check that none of the
limitations on L, N, m etc. have been violated (see
the maths notes).
- All corrections and notifications of bugs are gratefully received.
- Queries or comments to Andrew Firth (aef24cam.ac.uk).
- AEF gratefully acknowledges funding from the Foundation for Research,
Science and Technology, grant number UOOX0304.