widdowquinn / thapbi-pycits

Repository for ITS diagnostics development as part of Phytothreats project

License: MIT License

Python 28.45% HTML 67.00% UnrealScript 4.55%
bioinformatics computational-biology classification its1 metabarcoding

thapbi-pycits's Introduction

README.py - THAPBI-pycits

This repository is for development of ITS1-based diagnostic/profiling tools for the THAPBI Phyto-Threats project, funded by BBSRC.

DEVELOPER NOTES

Python style conventions

In this repository, we're trying to keep to the Python PEP8 style convention, the PEP257 docstring conventions, and the Zen of Python. To help in this, a pre-commit hook script is provided in the git_hooks subdirectory that, if deployed in the Git repository, checks Python code for PEP8 correctness before permitting a git commit command to go to completion.

If the pep8 module is not already present, it can be installed using pip install pep8

Whether you choose to use this or not, the THAPBI-pycits repository is registered with landscape.io, and the "health" of the code is assessed and reported for every repository push.

Installing the git hook

To install the pre-commit hook:

  1. clone the repository with git clone https://github.com/widdowquinn/THAPBI-pycits (you may already have done this)
  2. change directory to the root of the repository with cd THAPBI-pycits
  3. copy the pre-commit script to the .git/hooks directory with cp git_hooks/pre-commit .git/hooks/
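
For orientation, a hook of this kind can be sketched in Python as below. This is a minimal illustration, not the script shipped in git_hooks, which may work differently. The function names are hypothetical, and the sketch assumes the pep8 executable is on your PATH.

```python
#!/usr/bin/env python3
"""Hypothetical pre-commit hook: block a commit if staged Python files fail pep8."""
import subprocess


def staged_python_files(names):
    """Return only the staged paths that look like Python source files."""
    return [name for name in names if name.endswith(".py")]


def main():
    """Run pep8 on staged Python files and return its exit status."""
    # Ask git for files that are added, copied or modified in the index.
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        stdout=subprocess.PIPE, universal_newlines=True, check=True,
    ).stdout.splitlines()
    targets = staged_python_files(staged)
    if not targets:
        return 0
    # pep8 exits non-zero when it reports any style violation,
    # and git aborts the commit when the hook exits non-zero.
    return subprocess.run(["pep8"] + targets).returncode

# A real hook would finish with: sys.exit(main())
```

Git only runs the hook if the file in .git/hooks is named pre-commit and is executable.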

Using a virtual environment with the repository

In the root directory of the repository:

$ virtualenv -p python3.5 venv-THAPBI-pycits
$ source venv-THAPBI-pycits/bin/activate
<activity>
$ deactivate

INSTALLATION

Dependencies: Python modules

All Python module dependencies are described in requirements.txt and can be installed using

pip install -r requirements.txt

There may be issues with biom-format and biopython installations due to ordering of module installation. If this is the case for you, then it might be solved by installing numpy at the command-line first, with:

pip install numpy
pip install -r requirements.txt

Dependencies: Third-party applications

pear

pear is a paired-end read merger, used by the pipeline to merge ITS paired-end reads into a single ITS sequence. It is available from the pear home page as a precompiled executable that can be placed in your $PATH, and it can be installed on the Mac with Homebrew and homebrew-science, using:

brew install pear

If you are having trouble on a Linux system due to bzip+zlib issues, a precompiled version is available: https://github.com/xflouris/PEAR/blob/master/bin/pear-0.9.5-bin-64

If you download a precompiled binary, please rename it to pear:

wget http://sco.h-its.org/exelixis/web/software/pear/files/pear-0.9.10-bin-64.tar.gz
tar -zxvf pear-0.9.10-bin-64.tar.gz
cp pear-0.9.10-bin-64/pear-0.9.10-bin-64 pear-0.9.10-bin-64/pear
put this in your PATH

Trimmomatic

Trimmomatic is used to trim and quality-control the input reads. pycits expects Trimmomatic to be available at the command-line as trimmomatic. You can check if the tool is installed this way with the command:

which trimmomatic

To obtain Trimmomatic with this installation type on Linux systems, you can use:

apt-get install trimmomatic

and on the Mac (with Homebrew and homebrew-science):

brew install trimmomatic

If you have downloaded the Java .jar file from trimmomatic's home page, you can wrap the .jar file with a Bash script called trimmomatic in your $PATH, such as

#!/bin/bash
exec java -jar "$TRIMMOMATIC" "$@"

where $TRIMMOMATIC is the path to your trimmomatic .jar file.

muscle

MUSCLE is a program for creating multiple alignments of amino acid or nucleotide sequences. It is available from the muscle download page (http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux32.tar.gz) as a precompiled executable that can be placed in your $PATH when decompressed. Please rename the executable to muscle:

wget http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux32.tar.gz
tar -zxvf muscle3.8.31_i86linux32.tar.gz
cp muscle3.8.31_i86linux32 muscle
put this in your PATH

fastqc

fastqc is a quality control tool for high-throughput sequence data. It is available from the fastqc download page (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip) as a precompiled executable that can be placed in your $PATH when decompressed.

wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip
unzip fastqc_v0.11.5.zip
cd FastQC
chmod 755 fastqc
put this in your PATH

spades

spades is an assembly program which we use for error correction. It is available from the spades download page (http://spades.bioinf.spbau.ru/release3.9.1/SPAdes-3.9.1-Linux.tar.gz) as a precompiled executable that can be placed in your $PATH when decompressed.

wget http://spades.bioinf.spbau.ru/release3.9.1/SPAdes-3.9.1-Linux.tar.gz
tar -zxvf SPAdes-3.9.1-Linux.tar.gz
cd ./SPAdes-3.9.1-Linux/bin/
put this in your PATH

flash

flash is a paired-end read assembly program. It is available from the flash download page (https://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz) as a precompiled executable that can be placed in your $PATH when decompressed.

wget https://sourceforge.net/projects/flashpage/files/FLASH-1.2.11.tar.gz
tar -zxvf FLASH-1.2.11.tar.gz
cd FLASH-1.2.11
put this in your PATH

swarm

swarm is a clustering program, available from the swarm download page (https://github.com/torognes/swarm). It can be built from source:

git clone https://github.com/torognes/swarm.git
cd swarm/src
make
put PATH_TO/swarm/bin in your PATH

blastclust

blastclust is a clustering program. It is optional, so you do not need to download it. It is available from the blastclust download page (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy/2.2.26/blast-2.2.26-x64-linux.tar.gz) as a precompiled executable that can be placed in your $PATH when decompressed.

wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/legacy/2.2.26/blast-2.2.26-x64-linux.tar.gz
tar -zxvf blast-2.2.26-x64-linux.tar.gz
put this in your PATH

cd-hit

cd-hit is a clustering program, available from the cd-hit download page (https://github.com/weizhongli/cdhit). It can be built from source:

git clone git@github.com:weizhongli/cdhit.git
cd cdhit
make
put PATH_TO/cdhit in your PATH

bowtie_2.2.5

bowtie_2.2.5 is a read mapping program, available from the bowtie_2.2.5 download page (https://depot.galaxyproject.org/package/linux/x86_64/bowtie2/bowtie2-2.2.5-Linux-x86_64.tar.gz). We use the precompiled binary to reduce the difficulty of getting this to work.

mkdir bowtie_2.2.5
cd bowtie_2.2.5
wget https://depot.galaxyproject.org/package/linux/x86_64/bowtie2/bowtie2-2.2.5-Linux-x86_64.tar.gz
tar -zxvf bowtie2-2.2.5-Linux-x86_64.tar.gz
put this in your path
export PATH=$HOME/bowtie_2.2.5/bin/:$PATH

samtools_1.2

samtools_1.2 is a suite of tools for working with SAM/BAM files, available from the samtools_1.2 download page (https://depot.galaxyproject.org/package/linux/x86_64/samtools/samtools-1.2-Linux-x86_64.tgz). We use the precompiled binary to reduce the difficulty of getting this to work.

mkdir samtools_1.2
cd samtools_1.2
wget https://depot.galaxyproject.org/package/linux/x86_64/samtools/samtools-1.2-Linux-x86_64.tgz
tar -zxvf samtools-1.2-Linux-x86_64.tgz
put this in your path
export PATH=$HOME/samtools_1.2/bin/:$PATH

vsearch

vsearch is a versatile open-source tool for metagenomics, available from the vsearch download page (https://github.com/torognes/vsearch).

wget https://github.com/torognes/vsearch/releases/download/v2.4.0/vsearch-2.4.0-linux-x86_64.tar.gz
tar xzf vsearch-2.4.0-linux-x86_64.tar.gz
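
Once the tools are installed, it can be useful to confirm that each one is visible on your PATH, in the same spirit as the which trimmomatic check above. The sketch below uses Python's shutil.which for this. The executable names in REQUIRED_TOOLS are assumptions based on the tool names in this README (for example spades.py, bowtie2 and cd-hit), so adjust them to match your installation. blastclust is left out because it is optional.

```python
import shutil

# Assumed executable names; these are not taken from the pycits source code.
REQUIRED_TOOLS = [
    "pear", "trimmomatic", "muscle", "fastqc", "spades.py", "flash",
    "swarm", "cd-hit", "bowtie2", "samtools", "vsearch",
]


def find_missing(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on the PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]
```

Calling find_missing(REQUIRED_TOOLS) returns a list of the tools that still need to be installed or added to your PATH; an empty list means all of them were found.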

thapbi-pycits's Issues

Make Python3 explicit

On systems where both Python2 and Python3 are available, python does not always point to python3, and installation is problematic on those systems. Since we explicitly require Python3, we should change the shebangs to #!/usr/bin/env python3.
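
One way to apply this change across many scripts is sketched below. The helper name make_python3_explicit is hypothetical, and the sketch only rewrites shebangs that call an unversioned python with no arguments; anything else is left untouched.

```python
import re

PY3_SHEBANG = "#!/usr/bin/env python3"
# Matches a first line that invokes an unversioned `python` with no
# arguments, either through env or by an absolute path.
_UNVERSIONED = re.compile(r"^#!\s*(/usr/bin/env\s+python|\S*/python)\s*$")


def make_python3_explicit(source):
    """Return `source` with an unversioned python shebang replaced."""
    first, sep, rest = source.partition("\n")
    if _UNVERSIONED.match(first):
        return PY3_SHEBANG + sep + rest
    return source
```

The function works on the text of a script, so it can be tried on strings before being applied to files in the repository.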

`samtools` test fails on OSX

$ nosetests tests/test_wrapper_samtools.py
........F..
======================================================================
FAIL: Run samtools_idxstats on test data and compare output
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/lpritc/Development/GitHub/THAPBI-pycits/venv-THAPBI-pycits/lib/python3.5/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/Users/lpritc/Development/GitHub/THAPBI-pycits/tests/test_wrapper_samtools.py", line 143, in test_samtools_idxstats_exec
    assert_equal(target_fh.read(), test_fh.read())
AssertionError: '7Phy[625 chars]t0\nP._taxon_walnut_P281_1\t215\t0\t0\nP._taxo[6441 chars]t0\n' != '7Phy[625 chars]t0\nPhytophthora_thermophila_CBS127954_EU30115[6441 chars]t0\n'
Diff is 7302 characters long. Set self.maxDiff to None to see it.

----------------------------------------------------------------------
Ran 11 tests in 0.100s

The root cause of this is the existence of samtools_idxstats.py and, in particular, the unnecessary and fragile combinations of shell pipes and redirection in a command-line call to samtools in the Samtools_Idxstats class.

The key problem that causes the test to fail is that we can't rely on the behaviour of sort being consistent across platforms, and over time. In general, if you find yourself writing pipes or redirects into a shell command to be run from a Python script, you're heading in the wrong direction. Here, running the raw command:

samtools idxstats tests/test_data/samtools/Aligned.sortedByCoord.out.bam > test.bam

and running a diff as in the test fails, as you'd expect:

diff test.bam tests/test_targets/samtools/MisMat_0_star_mappings 

But so does running the same series of pipes and shell commands:

$ samtools idxstats tests/test_data/samtools/Aligned.sortedByCoord.out.bam | grep -v '*' | sort --reverse -n -k3 > test.bam
$ diff test.bam tests/test_targets/samtools/MisMat_0_star_mappings
14a15,23
> P._taxon_walnut_P281_1	215	0	0
> P._taxon_sedge__VHS25675_R1C_1	210	0	0
> P._taxon_personii_P11555_1	213	0	0
> P._taxon_kwongon_TCH009_1	214	0	0
> P._taxon_kwongonlike_CLJO100_1	214	0	0
> P._rosacearum_P292_1	214	0	0
> P._riparia_VI_3100B9F_HM004225_1	213	0	0
> PPhytophthora_fluvialis_CBS129424_JF701436__1	215	0	0
> P.infestans_ITS1_1	160	0	0
16a26
> Phytophthora_taxon_PgChlamydo_VHS3753_EU301160_1	215	0	0
19d28
< Phytophthora_taxon_PgChlamydo_VHS3753_EU301160_1	215	0	0
32,40d40
< PPhytophthora_fluvialis_CBS129424_JF701436__1	215	0	0
< P.infestans_ITS1_1	160	0	0
< P._taxon_walnut_P281_1	215	0	0
< P._taxon_sedge__VHS25675_R1C_1	210	0	0
< P._taxon_personii_P11555_1	213	0	0
< P._taxon_kwongonlike_CLJO100_1	214	0	0
< P._taxon_kwongon_TCH009_1	214	0	0
< P._rosacearum_P292_1	214	0	0
< P._riparia_VI_3100B9F_HM004225_1	213	0	0
112d111
< 2_Phytophthora_emanzi_CMW35510_1	163	0	0
113a113
> 2Phytophthora_sp._UK92615_1	163	0	0
117d116
< 2Phytophthora_sp._UK92615_1	163	0	0
130a130
> 2_Phytophthora_emanzi_CMW35510_1	163	0	0

and this is why the test fails.

Sorting the repo's target file in the same way as the samtools output shows that the data is the same:

$ cat tests/test_targets/samtools/MisMat_0_star_mappings | grep -v '*' | sort --reverse -n -k3 > target.bam
$ diff test.bam target.bam
<no output>

The appropriate fix for this is to use pysam instead of wrapping samtools.
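
Whichever way the idxstats text is obtained, the filtering and sorting can be done in Python so that the result does not depend on the platform's sort. The sketch below is an illustration, not the repository's code: the function names are hypothetical, and it drops only the row whose reference name is exactly *, which is slightly narrower than grep -v '*'.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` text into (name, length, mapped, unmapped) tuples.

    The '*' row, which holds reads with no reference, is skipped.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        if name == "*":
            continue
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def sort_by_mapped(rows):
    """Sort rows by mapped-read count, highest first, breaking ties by name."""
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

The explicit tie-break on the reference name is the important design choice. The diff above consists of rows with identical counts appearing in a different order, and a fixed secondary key removes that ambiguity. A test could also compare sorted(parse_idxstats(...)) for both files, which ignores row order entirely.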

`samtools` compatibility problem: sorting

The samtools wrapper fails with

$ samtools --version
samtools 1.3
Using htslib 1.3.1
Copyright (C) 2015 Genome Research Ltd.

as with this version, the -o flag is necessary to indicate the output file.

nosetests -v reports:

samtools_sort instantiates, runs and returns ... ERROR
Run samtools_index on test data and compare output ... ERROR
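
A wrapper that targets samtools 1.3 and later could build the sort command as an argument list with an explicit -o, as in the sketch below. The function name and parameters are hypothetical and are not taken from the pycits wrapper. The command is not suitable for samtools 1.2, where sort took an output prefix instead.

```python
def samtools_sort_cmd(in_bam, out_bam, exe="samtools"):
    """Build an argument list for `samtools sort`, naming the output with -o."""
    return [exe, "sort", "-o", out_bam, in_bam]
```

The list can be passed directly to subprocess.run(cmd, check=True), which avoids shell quoting and redirection.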

Split out requirements files

The single requirements.txt file contains requirements for the package, and for Travis-CI, but not for the Santi pipeline (which I've not got working on Travis). We could split requirements into more than one file for different installation purposes, not least because the Santi pipeline dependencies are a pain to install, often fail, and we don't want to promote its use or have it as a dealbreaker for the module.

Fix `trim_seq()`

The modified trim_seq() in a19c6ec restores Santi's pipeline functionality, but borks the logic.

We really want to restore the modified trim_seq(), which allows for zero right-clip, and we can do this maintaining Santi's pipeline function by either:

  1. punting the lclip/rclip choices out to the script itself
  2. using a local trim_seq() in the script itself

I prefer (1) as a choice.
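
As an illustration of option (1), a trim_seq() that accepts a zero right-clip might look like the sketch below, with the calling script choosing lclip and rclip. The signature is an assumption for the purpose of the example; the function in the repository may differ.

```python
def trim_seq(seq, lclip=0, rclip=0):
    """Return `seq` with `lclip` characters removed from the left and `rclip` from the right."""
    if lclip < 0 or rclip < 0:
        raise ValueError("clip lengths must be non-negative")
    # Compute the end index explicitly: seq[lclip:-rclip] would return an
    # empty sequence when rclip is 0.
    return seq[lclip:len(seq) - rclip]
```

With this form, the pipeline script passes its own clip values, and a call such as trim_seq(seq, lclip=2) leaves the right-hand end intact.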
