DRIVeR: Diversity Resulting from In Vitro Recombination

Download driver.cxx and driver.batch.cxx (for calculating the expected number of distinct sequences in a library constructed by in vitro recombination of two highly homologous sequences).
Download the Monte Carlo simulation programme driver_mc.cxx.

The programmes are written in C++ and should run under LINUX, MacOS-X or MS-Windows, provided you have a C++ compiler. Most users will not need to download the software, as the web server provides a more convenient interface.

Problem: Given a library of L sequences generated by random recombination of two near-identical genes differing at only a small number of known nucleotide (or codon) positions, we wish to calculate the expected number of distinct sequences in the library. We typically assume that the mean number of crossovers per sequence, m, is less than 0.1 x N, where N is the sequence length.

1) driver.cxx and driver.batch.cxx

These programmes are for calculating the expected number of distinct sequences in a library generated by random crossovers between two near-identical sequences. In driver.cxx the user inputs the library size L, sequence length N, mean number of crossovers per sequence m (or lambda), and a list of the variable positions. It then calculates the probabilities of there being an even or odd number of crossovers between each pair of consecutive variable positions. Multiplying these probabilities gives the relative probabilities of each of the 2^M possible daughter sequences (where M is the total number of variable positions). From these it calculates the probability that each daughter sequence will be present in the library and hence the expected number of distinct sequences in the library. driver.batch.cxx is similar, but calculates the expected number of distinct sequences in the library for a range of L and m values centred on the input L and m.
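The calculation described above can be sketched in a few lines of C++. This is an illustration only, not the code in driver.cxx. It assumes that crossovers fall independently on the N-1 junctions between adjacent positions, so that the number of crossovers between two consecutive variable positions is approximately Poisson distributed, and that either parent is equally likely at the first variable position. The function name expected_distinct is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Expected number of distinct daughter sequences in a library of L clones.
// Assumption (not taken from driver.cxx): the number of crossovers between
// variable positions a < b is Poisson with mean m * (b - a) / (N - 1).
double expected_distinct(double L, int N, double m, const std::vector<int>& pos) {
    assert(N >= 2 && !pos.empty() && pos.size() <= 20);
    assert(std::is_sorted(pos.begin(), pos.end()));

    const std::size_t M = pos.size();
    std::vector<double> p_even(M - 1), p_odd(M - 1);
    for (std::size_t i = 0; i + 1 < M; ++i) {
        const double mu = m * (pos[i + 1] - pos[i]) / (N - 1);
        // For a Poisson count with mean mu, P(odd) = (1 - exp(-2 mu)) / 2.
        p_odd[i] = 0.5 * (1.0 - std::exp(-2.0 * mu));
        p_even[i] = 1.0 - p_odd[i];
    }

    double expected = 0.0;
    const unsigned long n_daughters = 1UL << M;
    for (unsigned long code = 0; code < n_daughters; ++code) {
        // Bit i of code says which parent supplied variable position i.
        double p = 0.5;  // assumed: either parent equally likely at the start
        for (std::size_t i = 0; i + 1 < M; ++i) {
            const bool same = ((code >> i) & 1UL) == ((code >> (i + 1)) & 1UL);
            p *= same ? p_even[i] : p_odd[i];
        }
        // P(daughter occurs at least once in L independent draws) = 1 - (1 - p)^L.
        expected += 1.0 - std::exp(L * std::log1p(-p));
    }
    return expected;
}
```

With m = 0 only the two parental sequences can occur, so the result tends to 2 for large L. With a single clone (L = 1) the result is always 1.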

Note that you may, more or less, consider your sequence either as a sequence of nucleotides with a few variable nucleotides or as a sequence of codons with a few variable codons.

Compile the programmes as follows (replace 'g++' by an appropriate alternative, e.g. 'c++' or 'clang++', if you're using a different C++ compiler):

g++ -o driver driver.cxx
g++ -o driver.batch driver.batch.cxx

Before running the programmes, you will need to make a file listing the variable positions. The first line lists the number of variable positions. The remaining lines list the positions. These must be in numerical order. Click here for an example position file.
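As an illustration of the file layout, a hypothetical position file for three variable positions would contain the four lines 3, 100, 300 and 700. The sketch below shows one way such a file could be read and checked. It is not the parser used by the programmes. The function names are hypothetical, and "numerical order" is assumed here to mean strictly increasing.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a position file: the first value is the number of variable positions,
// followed by that many positions in increasing order.
std::vector<int> read_positions(std::istream& in) {
    int count = 0;
    if (!(in >> count) || count < 1)
        throw std::runtime_error("first line must give the number of variable positions");

    std::vector<int> pos;
    pos.reserve(static_cast<std::size_t>(count));
    for (int i = 0; i < count; ++i) {
        int p = 0;
        if (!(in >> p))
            throw std::runtime_error("fewer positions than declared on the first line");
        if (!pos.empty() && p <= pos.back())
            throw std::runtime_error("positions must be in increasing numerical order");
        pos.push_back(p);
    }
    assert(pos.size() == static_cast<std::size_t>(count));
    return pos;
}

// Convenience wrapper for in-memory text, handy for quick checks.
std::vector<int> read_positions_from_text(const std::string& text) {
    std::istringstream in(text);
    return read_positions(in);
}
```

To read from disk, pass a std::ifstream opened on the position file to read_positions.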

Run the programmes as follows:

./driver L N m posfile outfile xtrue
./driver.batch L N m posfile outfile xtrue

where L = library size, N = sequence length, m = mean number of crossovers per sequence, posfile is the list of variable positions (e.g. use driver.in), outfile is the output data file (e.g. use driver.dat), and xtrue is 1 if m is the true mean number of crossovers per sequence and 0 if m is the mean number of observable crossovers per sequence (click here for details on counting crossovers).

driver.cxx outputs to screen the total number of possible sequences, the expected number of distinct sequences in the library, the true mean number of crossovers per sequence, and the mean number of observable crossovers per sequence (click here for details on counting crossovers). It also produces an output file outfile (html format) with columns:

1) coordinates of each interval between variable positions,
2) length of the interval,
3) the mean expected number of crossovers in the interval,
4) the probability for an even number of crossovers in the interval,
5) the probability for an odd number of crossovers in the interval.

driver.batch.cxx produces two output files - outfile (html format) and outfile2 (plain text format), with columns:

1) true mean number of crossovers per sequence,
2) observed mean number of crossovers per sequence,
3-12) expected number of distinct sequences for different library sizes.

The library sizes (columns) range from L / 32 to L x 16, while the crossover rates (rows) range from about m / 30 to m x 30.
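Ten columns spanning L / 32 to L x 16 are consistent with successive library sizes differing by a factor of 2. The sketch below lists the column values under that assumption, which is inferred from the column count rather than taken from driver.batch.cxx. The function name batch_library_sizes is hypothetical. The spacing of the crossover rates (rows) is not stated, so it is not reproduced here.

```cpp
#include <cassert>
#include <vector>

// Library sizes for columns 3-12, assuming factor-of-2 steps from L/32 to 16L.
std::vector<double> batch_library_sizes(double L) {
    assert(L > 0.0);
    std::vector<double> sizes;
    double factor = 1.0 / 32.0;
    for (int column = 0; column < 10; ++column) {
        sizes.push_back(L * factor);
        factor *= 2.0;  // assumed step between neighbouring columns
    }
    return sizes;
}
```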

Currently the maximum number of variable positions is limited to 20 (in driver.cxx) and 15 (in driver.batch.cxx). Also the maximum sequence length is 10^8 and the maximum library size is 10^12. You can change these by editing the

#define maxpos 20
#define maxndaugh 524288 // pow(2,maxpos-1)
#define maxn 100000000 // 10^8
#define maxl 1000000000000. // 10^12

lines in the programmes, and recompiling. Note that maxn, maxpos and maxndaugh are integers. In general the compiler will limit the maximum size of integers to 2^31 - 1 ~= 2.1 x 10^9. Some compilers may limit the maximum size of integers to 2^15 - 1 = 32767. If any of maxn, maxpos or maxndaugh exceeds the relevant limit, then you will get nonsense results when you run the programmes.
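One way to catch an over-large limit before running is to compare it with the compiler's own integer range, available as std::numeric_limits<int>::max(). The sketch below is a suggestion and is not part of the distributed programmes. The helper names fits_in_int and check_fits_in_int are hypothetical, and the values checked are the defaults quoted above.

```cpp
#include <cassert>
#include <limits>

// True if value can be stored in a plain int on this compiler.
constexpr bool fits_in_int(long long value) {
    return value >= std::numeric_limits<int>::min() &&
           value <= std::numeric_limits<int>::max();
}

// Compile-time guards: compilation stops with the message shown if an
// edited limit no longer fits in an int.
static_assert(fits_in_int(20LL), "maxpos does not fit in an int");
static_assert(fits_in_int(524288LL), "maxndaugh does not fit in an int");
static_assert(fits_in_int(100000000LL), "maxn does not fit in an int");

// Runtime counterpart for values only known when the programme runs.
inline void check_fits_in_int(long long value) {
    assert(fits_in_int(value) && "value exceeds the range of int");
}
```

If a limit is raised beyond the range of int, the corresponding variables in the programme would also need a wider type such as long long.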

If you get a segmentation fault error it probably means you need to increase your stacksize - use the ulimit or limit command.

Links to download programmes: driver.cxx, driver.batch.cxx.

2) driver_mc.cxx

This programme does a full Monte Carlo simulation for the DRIVeR scenario. It may be useful for checking the analytic calculations used in driver.cxx, but is relatively slow, especially for large numbers of variable positions or large library sizes.

Compile the programme as follows (replace 'g++' by an appropriate alternative, e.g. 'c++' or 'clang++', if you're using a different C++ compiler):

g++ -o driver_mc driver_mc.cxx

Before running the programme, you will need to make a file listing the variable positions. The first line lists the number of variable positions. The remaining lines list the positions. These must be in numerical order. Click here for an example position file.

Run the programme as follows:

./driver_mc L N m posfile seed nsims approx

where L = library size, N = sequence length, m = (true) mean number of crossovers per sequence (click for details), posfile is the list of variable positions (e.g. use driver.in), seed = random seed (positive integer), nsims = requested number of simulated libraries, and approx = 1 or 0 tells the programme which method to use (see below).

The programme outputs to screen the mean and standard deviation of the number of distinct daughter sequences per simulated library. For the final simulated library only, the programme outputs to the file mc.dat the number of times each of the possible daughter sequences (encoded by 0,1,2,...,(2^M)-1) occurs in the library.

driver_mc.cxx may use one of two methods for generating the simulated sequences:
approx = 1: For each sequence in the library, a random Poisson variable with mean m is used to select the number of crossovers. These are then applied at random places in the sequence.
approx = 0: Every position in each simulated sequence is tested using a random number to decide whether a crossover occurs at that site or not.
The approx = 1 method is quicker, but a bit less accurate.
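The two methods can be sketched as follows using the C++11 <random> library. This is an illustration and not the code in driver_mc.cxx: the generator (std::mt19937), the function names, and the convention that crossovers fall on the N-1 junctions between adjacent positions, each with probability m / (N - 1), are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <set>
#include <vector>

// Simulates one daughter sequence and returns its code in 0 .. 2^M - 1, where
// bit i records which parent supplied variable position pos[i].
// Assumed convention: junction j lies between positions j and j + 1.
unsigned simulate_daughter(std::mt19937& rng, int N, double m,
                           const std::vector<int>& pos, bool approx) {
    assert(N >= 2 && m >= 0.0 && m <= N - 1);
    std::vector<int> cuts;  // junctions at which a crossover occurs
    if (approx) {
        // approx = 1: draw the number of crossovers, then scatter them.
        int k = 0;
        if (m > 0.0) k = std::poisson_distribution<int>(m)(rng);
        std::uniform_int_distribution<int> junction(1, N - 1);
        for (int c = 0; c < k; ++c) cuts.push_back(junction(rng));
    } else {
        // approx = 0: test every junction independently.
        std::bernoulli_distribution crossover(m / (N - 1));
        for (int j = 1; j < N; ++j)
            if (crossover(rng)) cuts.push_back(j);
    }

    const bool start = std::bernoulli_distribution(0.5)(rng);  // parent at position 1
    unsigned code = 0;
    for (std::size_t i = 0; i < pos.size(); ++i) {
        int before = 0;  // crossovers upstream of this variable position
        for (int j : cuts)
            if (j < pos[i]) ++before;
        const bool parent = start != (before % 2 == 1);  // odd count swaps parent
        if (parent) code |= 1u << i;
    }
    return code;
}

// Number of distinct daughter sequences in one simulated library of L clones.
int count_distinct(std::mt19937& rng, long L, int N, double m,
                   const std::vector<int>& pos, bool approx) {
    std::set<unsigned> seen;
    for (long s = 0; s < L; ++s)
        seen.insert(simulate_daughter(rng, N, m, pos, approx));
    return static_cast<int>(seen.size());
}
```

Averaging count_distinct over many simulated libraries gives a mean and standard deviation comparable to those the programme prints. In the approx = 1 branch two crossovers can land on the same junction, which is one reason it is only approximate.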

Current limits are maximum number of simulated libraries = 100000, maximum sequence length = 2000, maximum library size = 1000000, and maximum number of variable positions = 12. You can change these by editing the

#define maxniter 100000
#define maxn 2000
#define maxl 1000000
#define maxpos 12
#define maxndaugh 4096 // pow(2,maxpos)

lines in driver_mc.cxx, and recompiling. Beware of increasing the maximum sequence length above about 10^9, or decreasing the crossover rate m / N below about 10^-9, as these may be too extreme for the random number generator to resolve (typically the random numbers have 9-10 random digits). Note also that all these numbers are integers. In general the compiler will limit the maximum size of integers to 2^31 - 1 ~= 2.1 x 10^9. Some compilers may limit the maximum size of integers to 2^15 - 1 = 32767. If any of these numbers exceeds the relevant limit, then you will get nonsense results when you run the programme.

Link to download programme: driver_mc.cxx.

Notes:

You must agree to the Terms of Usage before using any of this software.
If you use this software for publications, please cite Wayne M. Patrick, Andrew E. Firth and Jonathan M. Blackburn, 2003, User-friendly algorithms for estimating completeness and diversity in randomized protein-encoding libraries, Protein Engineering, 16, 451-457 or Andrew E. Firth and Wayne M. Patrick, 2005, Statistics of protein library construction, Bioinformatics, 21, 3314-3315.
If you seem to be getting bizarre results, check that none of the limitations on L, N, m etc. have been violated (see the maths notes).
All corrections and notifications of bugs are gratefully received.
Queries or comments to Andrew Firth (aef24@cam.ac.uk).
AEF gratefully acknowledges funding from the Foundation for Research, Science and Technology, grant number UOOX0304.