## GLUE

Programmes for analysing libraries that comprise a random sample of equally probable variants.
The programmes are written in C++ and should run under LINUX, MacOS-X or MS-Windows, provided you have a C++ compiler. Most users will not need to download the software, as the web server provides a more convenient interface.

Problem: Given a library of L sequences, where each sequence is chosen at random from a set of V equiprobable variants, we wish to calculate the expected number of distinct sequences in the library. Alternatively, given a set of V equiprobable variants, we wish to calculate the library size L necessary to obtain a given percentage completeness, or to have a given probability of being 100% complete. The calculations typically assume V >> 1 (e.g. V > 10).
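For orientation, the standard results for sampling with replacement can be written as follows. The first expression is exact; the others rely on V >> 1 and on treating the number of missing variants as Poisson distributed. This is a sketch of the textbook approximations, not necessarily the exact expressions coded in glue.cxx. The symbols F (expected fractional completeness) and P (probability of 100% completeness) are introduced here for illustration.

```latex
% Expected number of distinct sequences in a library of size L
E[\text{distinct}] = V\left[1 - \left(1 - \tfrac{1}{V}\right)^{L}\right]
                   \approx V\left(1 - e^{-L/V}\right)

% Library size for an expected fractional completeness F
F \approx 1 - e^{-L/V}
\quad\Longrightarrow\quad
L \approx -V \ln(1 - F)

% Library size for a probability P of being 100% complete
P \approx \exp\!\left(-V e^{-L/V}\right)
\quad\Longrightarrow\quad
L \approx -V \ln\!\left(\frac{-\ln P}{V}\right)
```

For example, an expected completeness of 95% requires L of about 3.0 V under these approximations.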

1) glue.cxx

Programme for calculating any of the following:

• the expected completeness of a given library,
• the required library size for a given expected completeness,
• the required library size for a given probability of being 100% complete,

where the sequences in the library are chosen at random from a set of equally probable variants.

Compile the programme as follows (replace 'g++' by an appropriate alternative, e.g. 'c++' or 'clang++', if you're using a different C++ compiler):

g++ -o glue glue.cxx

Run the programme as follows (three alternative modes):

./glue 1 nvariants library_size
./glue 2 nvariants completeness
./glue 3 nvariants prob_100%_complete

where nvariants = number of equally probable variants, library_size = library size, completeness = required library completeness, and prob_100%_complete = required probability that the library is 100% complete. In the first mode, glue reports the expected library completeness. In the other two modes, glue reports the required library size.
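The three modes can be sketched with the usual large-V approximations. This is a minimal illustration, not the code in glue.cxx: the function names are hypothetical, completeness and probability are taken as fractions between 0 and 1 (an assumption), and the third function treats the number of missing variants as Poisson distributed.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper names; none of these are taken from glue.cxx.

// Mode 1: expected fractional completeness of a library of L sequences
// drawn from V variants, using (1 - 1/V)^L ~ exp(-L/V) for V >> 1.
double expected_completeness(double nvariants, double library_size) {
    assert(nvariants > 0.0 && library_size >= 0.0);
    return -std::expm1(-library_size / nvariants);  // 1 - exp(-L/V)
}

// Mode 2: library size needed for an expected fractional completeness F.
double size_for_completeness(double nvariants, double completeness) {
    assert(nvariants > 0.0 && completeness >= 0.0 && completeness < 1.0);
    return -nvariants * std::log1p(-completeness);  // -V ln(1 - F)
}

// Mode 3: library size needed for a probability P of being 100% complete,
// assuming the number of missing variants is Poisson with mean V exp(-L/V).
double size_for_full_coverage(double nvariants, double prob) {
    assert(nvariants > 0.0 && prob > 0.0 && prob < 1.0);
    return -nvariants * std::log(-std::log(prob) / nvariants);
}
```

Under these assumptions, size_for_completeness(1000.0, 0.95) gives a library size of about 3000, i.e. roughly three times the number of variants.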

2) glue_mc.cxx

A Monte Carlo simulation for finding the mean (and standard deviation) completeness of a library, and the proportion of libraries that are 100% complete, for a given library size and number of equally probable variants.

This programme is slower than glue.cxx, especially for large library sizes and large numbers of simulated libraries. However, it also gives an estimate of the standard deviation of the library completeness statistic. In general the statistics agree very well with glue.cxx. The programme is mainly useful as a sanity check for glue.cxx, especially for small numbers of variants (e.g. fewer than about 10), where some of the assumptions used in glue.cxx are not met so well.

Compile the programme as follows (replace 'g++' by an appropriate alternative, e.g. 'c++' or 'clang++', if you're using a different C++ compiler):

g++ -o glue_mc glue_mc.cxx

Run the programme as follows:

./glue_mc nvariants library_size nsims seed

where nvariants = number of equally probable variants, library_size = library size, nsims = requested number of simulated libraries, and seed = random seed (positive integer).
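The idea behind the simulation can be sketched as follows. This is an independent illustration, not the code in glue_mc.cxx: the struct and function names are hypothetical, the standard library's std::mt19937_64 generator stands in for whatever generator the programme uses, and the standard deviation is the population form (dividing by nsims).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical result type; not taken from glue_mc.cxx.
struct McResult {
    double mean_completeness;  // mean fraction of variants present
    double sd_completeness;    // standard deviation of that fraction
    double fraction_complete;  // proportion of libraries that are 100% complete
};

McResult simulate_libraries(std::size_t nvariants, std::size_t library_size,
                            std::size_t nsims, std::uint64_t seed) {
    assert(nvariants > 0 && nsims > 0);
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, nvariants - 1);
    std::vector<char> seen(nvariants);
    double sum = 0.0, sum_sq = 0.0;
    std::size_t n_complete = 0;

    for (std::size_t s = 0; s < nsims; ++s) {
        std::fill(seen.begin(), seen.end(), 0);  // start an empty library
        std::size_t distinct = 0;
        for (std::size_t i = 0; i < library_size; ++i) {
            std::size_t v = pick(rng);           // draw one variant uniformly
            if (!seen[v]) { seen[v] = 1; ++distinct; }
        }
        double c = static_cast<double>(distinct) / static_cast<double>(nvariants);
        sum += c;
        sum_sq += c * c;
        if (distinct == nvariants) ++n_complete;
    }

    double n = static_cast<double>(nsims);
    double mean = sum / n;
    double var = std::max(0.0, sum_sq / n - mean * mean);  // guard rounding
    return {mean, std::sqrt(var), static_cast<double>(n_complete) / n};
}
```

With 100 variants and a library size of 300, the simulated mean completeness should be close to the exact expectation 1 - 0.99^300, which is about 0.951.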

The programme outputs to screen the mean and standard deviation of the library completeness statistics, plus the proportion of the simulated libraries that are 100% complete. The following output file is also produced:

histogram.dat:
Statistics averaged over all libraries. Columns:
1) x (0 <= x < 100)
2) fraction of variants that occur exactly x times in the library,
3) expected number for Poisson distribution.
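The third column can be checked independently with a few lines of code. The sketch below assumes that column 3 is the Poisson probability of a variant occurring exactly x times, with mean library_size / nvariants. That normalisation is an assumption, since the file format is not specified further here, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Poisson probability of exactly x occurrences for a given mean,
// computed in log space so that large x does not overflow the factorial.
double poisson_pmf(int x, double mean) {
    assert(x >= 0 && mean > 0.0);
    return std::exp(x * std::log(mean) - mean - std::lgamma(x + 1.0));
}
```

For a library of 3000 sequences drawn from 1000 variants the mean is 3, so poisson_pmf(0, 3.0) gives the expected fraction of variants that are missing from the library, exp(-3) or about 5%.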

Current limits are maximum number of variants = 10^7, maximum number of simulations = 10^5 and maximum library size = 10^8. You can change these by editing the

#define maxvar 10000000 // 10^7
#define maxnsims 100000 // 10^5
#define maxnlibrary 100000000 // 10^8

lines in glue_mc.cxx, and recompiling. Beware of increasing the maximum number of variants above about 10^9, as this may be too large for the random number generator to resolve (typically the random numbers have 9-10 random digits). Note also that these values are integers. In general the compiler will limit integers to a maximum of 2^31 - 1 ~= 2.1 x 10^9, but some compilers may limit them to 2^15 - 1 ~= 32000. If any of these values exceeds the relevant limit, you will get nonsense results when you run the programme.
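One way to catch an out-of-range limit at compile time rather than at run time is a static_assert, available from C++11. The sketch below assumes the values are stored in plain int variables, which is an assumption about glue_mc.cxx. The helper fits_in_int is hypothetical, and the three #define lines are repeated only so that the fragment compiles on its own.

```cpp
#include <cassert>
#include <limits>

// The documented limits, repeated here so the sketch is self-contained.
#define maxvar 10000000        // 10^7
#define maxnsims 100000        // 10^5
#define maxnlibrary 100000000  // 10^8

// Hypothetical helper: true if `value` can be stored in a plain int
// on the compiler being used.
constexpr bool fits_in_int(long long value) {
    return value >= std::numeric_limits<int>::min() &&
           value <= std::numeric_limits<int>::max();
}

// Compilation stops with a readable message if a limit is too large.
static_assert(fits_in_int(maxvar), "maxvar does not fit in int");
static_assert(fits_in_int(maxnsims), "maxnsims does not fit in int");
static_assert(fits_in_int(maxnlibrary), "maxnlibrary does not fit in int");
```

On a compiler with 16-bit integers all three checks would fail, which is the intended warning.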

• Queries or comments to Andrew Firth (aef24 cam.ac.uk).