About oclust

oclust is a pipeline for clustering long 16S rRNA sequencing reads, or any sequences, into Operational Taxonomic Units (OTUs).

Requirements

  • Linux v.2.6.x
  • Perl v.5.10.1
  • R (must be available in the PATH)
    • the seqinr package must be installed:
      > install.packages("seqinr")
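
A quick way to confirm that R is on the PATH and that seqinr loads (a minimal, optional check; not part of the pipeline itself):

   $ # prints the installed seqinr version, or an error if the package is missing
   $ Rscript -e 'library(seqinr); cat(as.character(packageVersion("seqinr")), "\n")'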

Note on data

The pipeline is designed for PacBio CCS (circular consensus sequencing) reads; it will not work on raw PacBio reads.

Input files

The only input file to oclust is a file in FASTA format containing the sequencing reads to be clustered.

FASTQ files can be converted to FASTA with the bundled script:

   $ cd utils
   $ chmod +x fastq_to_fasta.pl
   $ ./fastq_to_fasta.pl file.fastq > file.fasta
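
As a quick sanity check (using standard command-line tools, and assuming standard four-line FASTQ records), the number of FASTA records should equal the number of FASTQ reads:

   $ # a four-line FASTQ record per read; a FASTA record starts with '>'
   $ echo $(( $(wc -l < file.fastq) / 4 ))
   $ grep -c '^>' file.fasta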

Installation

  1. Get the repository:

    $ git clone https://github.com/oscar-franzen/oclust.git oclust

  2. Make the scripts executable (may not be necessary):

    $ cd oclust
    $ chmod +x *.pl
    
  3. Decide whether distances should be computed with Needleman-Wunsch pairwise alignments (PW) or an Infernal-based multiple sequence alignment (MSA). The latter will be substantially faster.

    The first time it is executed, oclust_pipeline.pl will download the human genome sequence and format it (used for the contamination screen). Example invocations are shown after the option list below.

   $ ./oclust_pipeline.pl -x <method> -f <input file> -o <output directory> -p <number of CPUs>

   General settings:
   -x PW or MSA               Can be PW for pairwise alignments (based on Needleman-Wunsch)
                               or MSA for multiple sequence alignment (based on
                               Infernal). [MSA]
   -t local or cluster        If -x is PW, should it be parallelized by running it locally
                               on multiple cores or by submitting jobs to a cluster
                               (requires a system with the LSF scheduler). [local]
   -a complete, average,      The desired clustering (linkage) method. [complete]
       or single
   -f [string]                Input FASTA file.
   -o [string]                Name of the output directory (must not already exist);
                               give the full path.
   -R HMM, BLAST, or none     Method to use for reverse complementing sequences. [HMM]
   -p [integer]               Number of processor cores to use for BLAST. [4]
   -minl [integer]            Minimum sequence length. [optional]
   -maxl [integer]            Maximum sequence length. [optional]
   -rand [integer]            Randomly sample a specified number of sequences. [optional]
   -human Y or N              If 'Y', run a BLAST-based contamination screen
                               against the human genome. [Y]
   -chimera Y or N            Run the chimera check. [Y]

  LSF settings (only valid for -x PW when -t cluster):
   -lsf_queue [string]       Name of the LSF queue to use. [scavenger]
   -lsf_account [string]     Name of the account to use. [optional]
   -lsf_time [integer]       Requested runtime per job, in hours. [1]
   -lsf_memory [integer]     Requested amount of RAM in MB. [3000]
   -lsf_nb_jobs [integer]    Number of jobs. [20]
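
  Two example invocations, for illustration only (the file names, output paths, and LSF queue name are placeholders, and the parameter values are arbitrary choices, not recommendations):

   $ # Infernal-based (MSA) run on 8 local cores, keeping reads between 1,200 and 1,600 bp
   $ ./oclust_pipeline.pl -x MSA -f reads.fasta -o /home/user/oclust_out -p 8 -minl 1200 -maxl 1600

   $ # Needleman-Wunsch (PW) run distributed over an LSF cluster as 50 jobs
   $ ./oclust_pipeline.pl -x PW -t cluster -f reads.fasta -o /home/user/oclust_out_pw \
        -lsf_queue normal -lsf_nb_jobs 50 -lsf_memory 4000 -lsf_time 2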

Dependencies

The oclust pipeline bundles together the following open source/public domain software:

Reference

Contact

  • p.oscar.franzen at gmail.com

oclust's Issues

time complexity

We are running oclust on ~7,000 PacBio CCS 16S reads. How long will it take to finish computing the similarity matrix? Can you give us some hints about the runtime complexity?

How should I use the final results?

Dear Brother,
Thank you for your pipeline.
I have run into the problem that PacBio CCS data produces far more OTUs than expected. I saw your pipeline and ran a test on a small set of CCS reads (about 354 reads). I am writing to ask a few questions:

First, how should I use the final results listed below?
PW.complete.0.01.hclust PW.complete.0.03.hclust dist.mat
These *.hclust files contain two columns: column 1 is the read ID and column 2 is the cluster number (or ID, or order?). Is that correct? I checked 10 reads that belong to the same cluster ID (although they belong to different OTUs according to usearch); almost all of them get the same annotation, which looks perfect. I would also like to know the meaning of the values in the file 'dist.mat' and a brief explanation of how they are calculated.

Second, I wonder whether PacBio sequencing errors give rise to the many artificial (false) OTUs, even when using CCS reads (limited to 3 passes). I annotated all reads (28,000 CCS reads) and found fewer than 100 species, yet usearch generated almost 10,000 OTUs.

Finally, what do you think about building a phylogenetic tree for calculating UniFrac distances from PacBio CCS reads? A large number of OTUs makes the analysis inefficient, and I do not want to filter out many reads because of the low sequencing depth of the PacBio data.

Looking forward to your reply.
Best.
