About oclust

oclust is a pipeline for clustering long 16S rRNA sequencing reads, or any sequences, into Operational Taxonomic Units (OTUs).

Requirements

  • Linux v.2.6.x
  • Perl v.5.10.1
  • R (must be available in the PATH)
    • the seqinr package must be installed:
      > install.packages("seqinr")
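
A quick way to confirm that R is on the PATH and that seqinr loads (a minimal, optional check; not part of the pipeline itself):

   $ # prints the installed seqinr version, or an error if the package is missing
   $ Rscript -e 'library(seqinr); cat(as.character(packageVersion("seqinr")), "\n")'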

Note on data

The pipeline is designed for PacBio CCS (circular consensus sequencing) reads; it will not work on raw PacBio reads.

Input files

The only input file to oclust is a file in FASTA format containing the sequencing reads to be clustered.

FASTQ files can be converted to FASTA with the bundled script:

   $ cd utils
   $ chmod +x fastq_to_fasta.pl
   $ ./fastq_to_fasta.pl file.fastq > file.fasta
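
As a quick sanity check (using standard command-line tools, and assuming standard four-line FASTQ records), the number of FASTA records should equal the number of FASTQ reads:

   $ # a four-line FASTQ record per read; a FASTA record starts with '>'
   $ echo $(( $(wc -l < file.fastq) / 4 ))
   $ grep -c '^>' file.fasta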

Installation

  1. Get the repository:

    $ git clone https://github.com/oscar-franzen/oclust.git oclust

  2. Make the scripts executable (may not be necessary):

    $ cd oclust
    $ chmod +x *.pl
    
  3. Decide whether distances should be computed with Needleman-Wunsch pairwise alignments (PW) or an Infernal-based multiple sequence alignment (MSA). The latter will be substantially faster.

    The first time it is executed, oclust_pipeline.pl will download the human genome sequence and format it (used for the contamination screen). Example invocations are shown after the option list below.

   $ ./oclust_pipeline.pl -x <method> -f <input file> -o <output directory> -p <number of CPUs>

   General settings:
   -x PW or MSA               Can be PW for pairwise alignments (based on Needleman-Wunsch)
                               or MSA for multiple sequence alignment (based on
                               Infernal). [MSA]
   -t local or cluster        If -x is PW, should it be parallelized by running it locally
                               on multiple cores or by submitting jobs to a cluster
                               (requires a system with the LSF scheduler). [local]
   -a complete, average,      The desired clustering (linkage) method. [complete]
       or single
   -f [string]                Input FASTA file.
   -o [string]                Name of the output directory (must not already exist);
                               give the full path.
   -R HMM, BLAST, or none     Method to use for reverse complementing sequences. [HMM]
   -p [integer]               Number of processor cores to use for BLAST. [4]
   -minl [integer]            Minimum sequence length. [optional]
   -maxl [integer]            Maximum sequence length. [optional]
   -rand [integer]            Randomly sample a specified number of sequences. [optional]
   -human Y or N              If 'Y', run a BLAST-based contamination screen
                               against the human genome. [Y]
   -chimera Y or N            Run the chimera check. [Y]

  LSF settings (only valid for -x PW when -t cluster):
   -lsf_queue [string]       Name of the LSF queue to use. [scavenger]
   -lsf_account [string]     Name of the account to use. [optional]
   -lsf_time [integer]       Requested runtime per job, in hours. [1]
   -lsf_memory [integer]     Requested amount of RAM in MB. [3000]
   -lsf_nb_jobs [integer]    Number of jobs. [20]
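
  Two example invocations, for illustration only (the file names, output paths, and LSF queue name are placeholders, and the parameter values are arbitrary choices, not recommendations):

   $ # Infernal-based (MSA) run on 8 local cores, keeping reads between 1,200 and 1,600 bp
   $ ./oclust_pipeline.pl -x MSA -f reads.fasta -o /home/user/oclust_out -p 8 -minl 1200 -maxl 1600

   $ # Needleman-Wunsch (PW) run distributed over an LSF cluster as 50 jobs
   $ ./oclust_pipeline.pl -x PW -t cluster -f reads.fasta -o /home/user/oclust_out_pw \
        -lsf_queue normal -lsf_nb_jobs 50 -lsf_memory 4000 -lsf_time 2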

Dependencies

The oclust pipeline bundles together the following open source/public domain software:

Reference

Contact

  • p.oscar.franzen at gmail.com

oclust's Issues

time complexity

We are running oclust on ~7,000 PacBio CCS 16S reads. How long will it take to finish computing the similarity matrix? Can you give us some hints about the runtime complexity?

How should I use the final results?

Dear Brother,
Thank you for your pipeline.
I have run into the problem that PacBio CCS data produces far more OTUs than expected. I saw your pipeline and ran a test on a small set of CCS reads (about 354 reads). I am writing to ask a few questions:

First, how should I use the final results listed below?
PW.complete.0.01.hclust PW.complete.0.03.hclust dist.mat
These *.hclust files contain two columns: column 1 is the read ID and column 2 is the cluster number (or ID, or order?). Is that correct? I checked 10 reads that belong to the same cluster ID (although they belong to different OTUs according to usearch); almost all of them get the same annotation, which looks perfect. I would also like to know the meaning of the values in the file 'dist.mat' and a brief explanation of how they are calculated.

Second, I wonder whether PacBio sequencing errors give rise to the many artificial (false) OTUs, even when using CCS reads (limited to 3 passes). I annotated all reads (28,000 CCS reads) and found fewer than 100 species, yet usearch generated almost 10,000 OTUs.

Finally, what do you think about building a phylogenetic tree for calculating UniFrac distances from PacBio CCS reads? A large number of OTUs makes the analysis inefficient, and I do not want to filter out many reads because of the low sequencing depth of the PacBio data.

Looking forward to your reply.
Best.
