A RADseq pipeline based on a read-cluster approach
- VSearch (https://github.com/torognes/vsearch) [Version 1.1.3 was used when testing]
- Parallel (http://www.gnu.org/software/parallel/) [Version 20140722 was used when testing]
- Python (https://www.python.org/) [Version 2.7.6 was used when testing]
- Numerous shell tools (awk, basename, bc, bzcat*, cat, cd, grep, ln, mkdir, paste, pwd, rm, sed, sort, tail, uniq, wc, zcat*) *optional commands
NOTE: the bin directory contains a command called 'check-dependencies' that checks whether all necessary commands can be found in your path
- Download the latest release
- Extract archive
- Add bin directory to your path
e.g.
# Download the source (and unpack)
wget https://github.com/molecularbiodiversity/rad-loci/archive/v0.4.tar.gz
tar xf v0.4.tar.gz
# Optional: you may want to move the source somewhere else. If you do you
# will need to cd to the directory containing it too.
#mv rad-loci-0.4 /usr/local
#cd /usr/local
# Add to path (you may want to put this in your ~/.bashrc file so you don't
# need to do it each time).
# Use the 'pwd' command to find your working directory (you will need to
# hard-code it if you use the ~/.bashrc option)
export PATH=$PATH:$PWD/rad-loci-0.4/bin
# Check dependencies were ok
check-dependencies
# this should output "Success: everything found in your path" at end
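The bundled 'check-dependencies' command is the tool to use for this step. Purely as an illustration of what such a check involves, the sketch below looks up each required command on the PATH. It is not how 'check-dependencies' itself is implemented. The tool names are copied from the dependency list above; the executable names 'vsearch', 'parallel' and 'python', the function name missing_commands, and the use of Python 3's shutil.which are assumptions.

```python
import shutil

# Commands taken from the dependency list. bzcat and zcat are optional and
# 'cd' is a shell builtin, so they are left out of this illustrative list.
# The executable names for VSearch, Parallel and Python are assumed.
REQUIRED = ["awk", "basename", "bc", "cat", "grep", "ln", "mkdir", "paste",
            "pwd", "rm", "sed", "sort", "tail", "uniq", "wc",
            "vsearch", "parallel", "python"]


def missing_commands(commands):
    """Return the commands that cannot be found on the current PATH."""
    return [cmd for cmd in commands if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands(REQUIRED)
    if missing:
        print("Not found in your path: " + ", ".join(missing))
    else:
        print("All listed commands were found")
```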
- Make a directory called 'samples' in your current directory and copy/symlink all your sample fastq or fasta files into it
- Run 'rad-loci-settings settings.conf' to create a new settings file in your current working directory. Optionally change any of the settings as required
- Run 'rad-loci settings.conf'. It prints a lot of progress information to the terminal. Processing time depends on how much input data is provided, so it is best to run it with nohup or as an HPC job.
e.g.
# create a directory for this experiment
mkdir experiment
cd experiment
# create a directory for the samples (input)
mkdir samples
cd samples
# copy or symlink the sample fastq/fasta files
#ln -s /some/path/to/sample/files*.fq .
#cp /some/path/to/sample/files*.fq .
cd ..
# create settings
rad-loci-settings settings.conf
# edit the settings as necessary
#nano settings.conf
#vim settings.conf
# run the pipeline (with nohup as it might take a while and
# we don't want it to terminate if our connection drops)
nohup rad-loci settings.conf &
# watch the output
tail -f nohup.out
# press Ctrl+C to exit this view, as it will otherwise run forever
- Create the catalog of potential loci and prepare samples for analysis
- Merge (cluster) similar sequences from catalog
- Filter clusters to those that have 2 to 16 different members (<2 = non-informative, >16 = multi-mapping)
- Align sequences to filtered catalog (to find which alleles map to each locus)
- Refilter catalog to those that contain 2 to 16 alleles
- Make database of all allele sequences (FastA)
- Map sequences from each sample (100% identity) to allele database and count copies
- Merge counts from each sample into single TSV file
- Filter loci based on:
  - Max 2 alleles per sample per locus
  - Min 2 significant alleles, and
  - Min proportion of samples with data for the locus
- Call genotype for each sample and output as a structure (.stru) file
- Make random selection of loci
- Call genotype into migrate format
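The cluster size filter and the locus filter described in the steps above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code. The data layout (a dict mapping each sample to its per-allele read counts for one locus), the function names, and the parameters min_count and min_sample_fraction are all hypothetical. In particular, the steps do not define what makes an allele "significant"; here it is assumed to mean a read count of at least min_count, and the same threshold is assumed when counting alleles per sample.

```python
def keep_cluster(members, min_size=2, max_size=16):
    """Keep clusters with 2 to 16 distinct members.

    Fewer than 2 is non-informative; more than 16 suggests multi-mapping.
    """
    return min_size <= len(set(members)) <= max_size


def keep_locus(counts, n_samples, min_count=2, min_sample_fraction=0.5):
    """Apply the three locus filters to a single locus.

    counts: {sample: {allele: read_count}} for one locus (assumed layout).
    n_samples: total number of samples in the experiment.
    min_count, min_sample_fraction: hypothetical thresholds.
    """
    significant_alleles = set()
    samples_with_data = 0
    for allele_counts in counts.values():
        # Alleles in this sample that pass the assumed significance threshold.
        present = [a for a, c in allele_counts.items() if c >= min_count]
        if len(present) > 2:
            return False  # more than 2 alleles in one sample
        if present:
            samples_with_data += 1
            significant_alleles.update(present)
    if len(significant_alleles) < 2:
        return False  # fewer than 2 significant alleles across samples
    # Require a minimum proportion of samples with data for this locus.
    return samples_with_data / float(n_samples) >= min_sample_fraction


if __name__ == "__main__":
    locus = {
        "s1": {"A1": 10, "A2": 8},
        "s2": {"A1": 12},
        "s3": {},
    }
    # Two alleles in total, at most two per sample, 2 of 3 samples with data.
    print(keep_locus(locus, n_samples=3))
```

Returning early on the per-sample check means a single sample with three or more alleles rejects the whole locus, which is one reading of "Max 2 alleles per sample per locus"; a pipeline could instead drop only that sample's data.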