knights-lab / shi7 Goto Github PK

Auto-parameterizing QC Pipeline for Illumina sequencing runs

License: GNU Affero General Public License v3.0

Python 82.24% C 17.41% Shell 0.18% Makefile 0.17%

shi7's Introduction

New!

Now the shi7_trimmer binary in bin/ is capable of standalone paired-end QC on a per-sample basis without any dependencies, prerequisites, or installation.

It has the following features:

Native gzip support built in; just feed it fastq.gz file(s)
Adaptor autodetection and trimming
New automatic adapter support including poly-G repeats on newer Illumina 2-color instruments
Detection and removal of any novel adapters or suspected technical sequence via PALINCUT mode (PE only)
Support for stripping/anonymizing all headers (sanitizing)
Support for outputting a set maximum depth (rarefaction) of QC'd reads
Support for fasta as well as fastq output formats directly
Support for discarding reads upon encountering an Illumina machine discard signal (an ultra-low QC code embedded in the read)
A variety of base-pair quality trimming modes and windows, including FLOOR and blind cut as well as different rolling average varieties
Support for different trimming quality thresholds at front and end of reads
(Almost) all other features in python-wrapped shi7
Over an order of magnitude faster in many cases, with no intermediates!

Prerequisites (classic shi7 and shi7_learning pipeline)

Python 2.7+
Java

Installation

The CONDA way (personal install), recommended for Linux or Windows WSL

Follow steps 1 and 2 of https://bioconda.github.io/ (including installing MiniConda 3.6 if you don't have miniconda)
Do this in a terminal:

conda create -n shi7 -c knights-lab shi7
source activate shi7

Alternative portable/server install:

Grab the latest release package (not source), extract, then add to PATH. You should be able to execute shi7.py on the commandline.

Confused or new to UNIX? Or something not working right? Try following the super-specific directions below:

Here are some super specific installation instructions for the portable install:

For our purposes, we will install and use SHI7 on an interactive shell on our supercomputer, MSI, like so: isub -n nodes=1:ppn=16 -m 22GB -w 12:00:00 (skip this if installing on your own computer, just open a (bash) terminal and optionally enter a directory where you want to install SHI7 with cd). You'll want to update the URLs below with the latest version on the release page above!

Have python (you will most likely have this already!) and importantly (at least on MacOS) the Java SDK: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html (If you are installing java for the first time, you MUST REBOOT to have it be recognized).
Download and unpack the latest release: (if on MacOS, replace the word "linux" with "mac" near the end of the wget link!)

wget https://github.com/knights-lab/shi7/releases/download/v0.9.9/shi7_0.9.9_linux_release.zip
unzip shi7_*_release.zip
chmod +x shi7_*_linux_release/*

Add SHI7 binaries to your PATH so they can be called on the commandline anywhere: On Linux:

echo "PATH=$PWD/shi7_0.9.9_linux_release:$PATH" >> ~/.bashrc

On Mac:

echo "PATH=$PWD/shi7_0.9.9_mac_release:$PATH" >> ~/.bash_profile

Reload your terminal environment and test shi7.py:

. ~/.bash_profile
. ~/.bashrc
shi7.py -h

At this point you should see the help screen printed out and SHI7 should be installed.

Using SHI7 (simplest method):

Note: if you installed via the CONDA method, omit the ".py" extension at the end of shi7 and shi7_learning

To run interactively on a supercomputer like MSI (skip this step otherwise):

isub -n nodes=1:ppn=16 -m 22GB -w 12:00:00`
module load python

Learn the appropriate shi7 parameters from the raw (unzipped) fastq data (replace 'myFastqFolder' with your actual data directory):

shi7_learning.py -i myFastqFolder -o learnt

Run shi7.py with the output of 2b: chmod +x learnt/shi7_cmd.sh && ./learnt/shi7_cmd.sh (or just copy the command that shi7_learning prints out when it finishes and run that)

Other usage examples:

Assuming you have a bunch of fastq files, of forward and reverse reads, split up by sample, that have Nextera adaptors:

shi7.py -i MyFastQFolder -o MyOutputFolder --adaptor Nextera

Assuming you only have R1 reads (no paired end):

shi7.py -i MyFastQFolder -o MyOutputFolder --adaptor Nextera -SE

This sets the minimum read length to 285 and the maximum to 300 when stitching, which is the canonical HMP V4 16S primer coverage region. This can be a powerful QC step in and of itself. Note: if using the EMP V4 protocol, omit these arguments.

If you have shotgun sequences, you might want to try not stitching (we recommend trying first and seeing how many stitch -- "percent combined" in the shi7.log file):

--flash False

Including --drop_r2 True in this case returns only R1 reads.

We recommend the following format for sequence file names:

sampleID_other_information_R1.fastq
sampleID_other_information_R2.fastq

Then, using strip_underscore True will return processed reads with just the sampleID, simplifying downstream processing. For example, an efficient command for non-stitching shotgun sequences with the sample names in the filenames before the first underscore character:

shi7.py -i MyFastQFolder -o MyOutputFolder --adaptor Nextera --flash False --strip_delim _,1 --drop_r2 True

Cite

To cite SHI7: Please cite the article published here: http://msystems.asm.org/content/3/3/e00202-17

Al-Ghalith, Gabriel A., Benjamin Hillmann, Kaiwei Ang, Robin Shields-Cutler, and Dan Knights. 2018. “SHI7 Is a Self-Learning Pipeline for Multipurpose Short-Read DNA Quality Control.” mSystems 3 (3). American Society for Microbiology Journals: e00202–17.

shi7's People

Contributors

Stargazers

Watchers

Forkers

gabeal rrshieldscutler transcript derasaria irshadulhaq1 wigasper alexpersa7

shi7's Issues

Print full shi7 command in log file

It would be good to print the commandline that was used to run the program directly in the log file.

Remove dependence on string "R1" occurring in reads

Code in question:
format_basename(re.sub('R1', '', input_path_R1))
This needs to change (or add a check if R1 is not present, just to swap "1"). I don't know whether this would cause downstream problems after stitching if the reads didn't contain "R1" or "R2" but instead had .1 and .2 or F and R or some other convention that is now caught properly with the new trimmo paired end even-odd selector.

Bug: filenames mutated containing R1 and R2 in sample names

'ERR1352369', 'ERR1352368', 'ERR1352367', 'ERR1352366', 'ERR1352365', 'ERR1352364', 'ERR1352363', 'ERR1352362', 'ERR1352361', 'ERR1352360', 'ER368486', 'ER368487', 'ER368484', 'ER368485', 'ERR989659', 'ERR989658', 'ER368480', 'ER368481', 'ERR989657', 'ER368488', 'ER368489', 'ER368507', 'ER368506', 'ER368505', 'ER368504', 'ER368503', 'ER368502', 'ER368501', 'ER368500', 'ER368509', 'ER368508', 'ER368491', 'ERR688806', 'ER368499', 'ER368498', 'ERR1352370', 'ER368490', 'ER368493', 'ER368492', 'ER368495', 'ER368494', 'ER368497', 'ER368496', 'ERR989664', 'ERR989665', 'ERR989666', 'ERR989667', 'ERR989660', 'ERR989661', 'ERR989662', 'ERR989663', 'ERR989668', 'ERR1352359', 'ER368514', 'ER368515', 'ER368516', 'ER368517', 'ER368510', 'ER368511', 'ER368512', 'ER368513', 'ERR576948', 'ER368518', 'ER368479', 'ER368482', 'ER368483']

The last file, for instance, was ERR1368483_1. (Notice it has an embedded "R1" by chance) I think we need a smarter way of parsing these guys... or a way to explicitly add R1/R2 to files that are detected as pairs (if they were detected as such without "R1/R2" being the determinant.

The way it is now, regardless of how files are paired, "R1" is naively searched for and stripped in all files regardless if they're R1's, or R2's, or if they even had R1/R2 in the name to begin with, and in this case it's indiscriminate -- whether the R1 is in the beginning or end of the file etc.

The proposed solution is:

Determine whether prospective pairs contain R1 and R2 in the same position in their filenames and matches what we expect.
1b. If not, add R1/R2 to the ends of the filenames.
In subsequent processing steps, strip only the final R1/R2 in the names at the very end. Don't strip all occurrences of R1/R2 strings in all files indiscriminately at every step.

Threads are not respected -- wrong number of threads used

Hi,

Bryan tried running it specifying -t 46 but trimmomatic is using 256 threads.

Possibilities:

passing it as a string instead of an integer?
formatting problems
trimmomatic is buggy

Detect if paired end

#35 Learning module update must be able to detect if reads are paired end or single end.

Rename repo from shi7en to shi7

It is in the repository settings. Only the owner can do this.

Rename all 'ninja_shi7' to 'shi7en'

We are removing the 'ninja' prefix from this project.

debug mode + flash false = error

Error when running debug mode with --flash False. Looks like it is searching for trimmer files (which presumably don't exist with flash == False). Doesn't kill the process or affect the file output, that I can tell.

Traceback (most recent call last):
  File "/export/scratch/robin/miniconda3/lib/python3.5/shutil.py", line 538, in move
    os.rename(src, real_dst)
FileNotFoundError: [Errno 2] No such file or directory: '/project/flatiron2/data/public_shotgun/qin2014/adapator_test/truseq2/temp/trimmer/HD.10.fastq' -> '/project/fl
atiron2/data/public_shotgun/qin2014/adapator_test/truseq2/HD.10.fastq'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/project/flatiron2/sop/shi7.py", line 383, in <module>
    main()
  File "/project/flatiron2/sop/shi7.py", line 376, in main
    shutil.move(file, args.output)
  File "/export/scratch/robin/miniconda3/lib/python3.5/shutil.py", line 552, in move
    copy_function(src, real_dst)
  File "/export/scratch/robin/miniconda3/lib/python3.5/shutil.py", line 251, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/export/scratch/robin/miniconda3/lib/python3.5/shutil.py", line 114, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/project/flatiron2/data/public_shotgun/qin2014/adapator_test/truseq2/temp/trimmer/HD.10.fastq'

output dir needs to be created prior to writing log file

Change absolute paths to os.path.join

In the code there are a lot of '/temp' if we want to be system agnostic should be using os.path.joins.

Verifying the stitching with shotgun reads

We need to verify the output of FLASH when running shotgun reads. Methods for this are in the works.

Bug: Failure with R1/R2 in the format blah_1.fastq blah_2.fastq

`08/30/2017 06:47:48 PM ['/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/ERR688806_2.fastq', '/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/ERR688806_1.fastq']

08/30/2017 06:47:48 PM trimmomatic PE -threads 4 /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/ERR688806_1.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/ERR688806_2.fastq

/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq

/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/unpaired.ERR688806_1.fastq

/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq

/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/unpaired.ERR688806_1.fastq ILLUMINACLIP:/home/knightsd/algh0022/sop/adapters/NexteraPE-PE.fa:2:30:5:1:true
08/30/2017 06:48:04 PM Picked up _JAVA_OPTIONS: -Xmx24576000k
TrimmomaticPE: Started with arguments:
-threads 4 /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/ERR688806_1.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/ERR688806_2.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/unpaired.ERR688806_1.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/unpaired.ERR688806_1.fastq ILLUMINACLIP:/home/knightsd/algh0022/sop/adapters/NexteraPE-PE.fa:2:30:5:1:true
Using PrefixPair: 'AGATGTGTATAAGAGACAG' and 'AGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTCCGAGCCCACGAGAC'
Using Long Clipping Sequence: 'CTGTCTCTTATACACATCTGACGCTGCCGACGA'
ILLUMINACLIP: Using 1 prefix pairs, 4 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 768321 Both Surviving: 768101 (99.97%) Forward Only Surviving: 119 (0.02%) Reverse Only Surviving: 101 (0.01%) Dropped: 0 (0.00%)
TrimmomaticPE: Completed successfully
08/30/2017 06:48:04 PM AXE_ADAPTORS done!

08/30/2017 06:48:04 PM flash /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq -o ERR688806 -d /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash -M 700 -m 20 -t 4

08/30/2017 06:48:04 PM [FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq
[FLASH]
[FLASH] Output files:
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.extendedFrags.fastq
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.notCombined_1.fastq
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.notCombined_2.fastq
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.hist
[FLASH] /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.histogram
[FLASH]
[FLASH] Parameters:
[FLASH] Min overlap: 20
[FLASH] Max overlap: 700
[FLASH] Max mismatch density: 0.250000
[FLASH] Allow "outie" pairs: false
[FLASH] Cap mismatch quals: false
[FLASH] Combiner threads: 4
[FLASH] Input format: FASTQ, phred_offset=33
[FLASH] Output format: FASTQ, phred_offset=33
[FLASH]
[FLASH] Starting reader and writer threads
[FLASH] Starting 4 combiner threads
[FLASH] ERROR: Qual string length (251) not the same as sequence length (259) (file "/panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq", near line 497)

[FLASH] FLASH did not complete successfully; exiting with failure status (1)
08/30/2017 06:48:04 PM FLASH done!
08/30/2017 06:48:04 PM /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.fastq
08/30/2017 06:48:04 PM shi7_trimmer /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/flash/ERR688806.fastq /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/trimmer/ERR688806.fastq 80 20 FLOOR 5 ASS_QUALITY 30
08/30/2017 06:48:04 PM Chucking reads of ass quality (30).
Using FLOOR of 5 bases.
The minimum length a read can be is 80 bases.
Trimming while quality < 20 on left, < 20 on right.
Done.
0 sequences were utter shi7, quality-wise.
And 0 sequences were decent.
The average trimmed length was -nan.
On average, the cut bases were: -nan (left), -nan (right).
The average read quality was -nan.

08/30/2017 06:48:04 PM CREATE_TRIMMER_GENERAL done!
08/30/2017 06:48:04 PM Convert and combine FASTQs done!
08/30/2017 06:48:04 PM Execution time: 0:00:15.680135`

Notice I flagged this line: /panfs/roc/groups/8/knightsd/clayt092/projects/douc_belly_bugs/public_datasets/stomach/embl_prjeb7843/shi7_output/temp/axe/ERR688806.fastq
The issue occurs when coming up with an output file name for the R1 and R2 reads post-axing. They're named the same thing and overwrite one another during write, leading to a mess.

Remove numpy dependency, revisit stdev logic

The learning module introduced a dependency on numpy for the calculation of subsampled means and standard deviations of stitching overlaps.

Because we want to keep the software unencumbered of external dependencies where possible for portable release, and because these calculations are fairly trivial, I think they should be replaced.

Support input gzipped fastq files

This is fairly standard among other genomics tools.

Limit default threads to 16

On most machines with hard drives, I/O starts to become the bottleneck and the software actually slows down with more threads. The user can still set whatever they like, but adding something like:

defaults = max(mp.getNumThreads(),16) for the default would be helpful :)

Update the gotta_split src to newest version

shi7 doesn't recognize *.fq files.

fastq files are only recognized if file extension = ".fastq". Files such as "Sample1_R1.fq" will throw the following error:

Traceback (most recent call last): File "/project/flatiron2/sop/shi7.py", line 383, in <module> main() File "/project/flatiron2/sop/shi7.py", line 329, in main path_fastqs = axe_adaptors_paired_end(path_fastqs, axe_output, args.adaptor, threads=args.threads, shell=args.shell) File "/project/flatiron2/sop/shi7.py", line 174, in axe_adaptors_paired_end path_R1_fastqs, path_R2_fastqs = split_fwd_rev(input_fastqs) File "/project/flatiron2/sop/shi7.py", line 146, in split_fwd_rev raise ValueError('Error: The input directory %s must contain at least one pair of R1 & R2 fastq file!' % os.path.dirname(paths[0])) IndexError: list index out of range

Change the temp folder to be written to the output dir

The temp folder should be written to the output dir so that we can guarantee the user has write permission to the directory.

readme typos

In the readme examples you call the script with $ shi7en not shi7.py

Understanding SHI7EN parameters

- What are we supposed to put for --r1 and --r2? Are these the expected identifiers for the two? So if my file is named "CS-137_S84_L001_R1_001.fastq", should I put in something like --r1 R1_001?
- Is --filter_length equivalent to create_trimmer_general_L150.sh? So if I used this script before, should I be changing this parameter to 150?
- Was 20 always the default for the trim quality? I thought it was 25 - but unfortunately I don't recall specifying this anywhere when I ran SHI7EN when it didn't have a wrapper.

Speed up the python parse/write (Convert/combine fasta steps)

It seems for the test HMP samples on multi-core systems (16 threads) the trimmomatic and flash steps run in less than 5 minutes but the python script that converts and/or writes to QIIME formatted FASTA takes another 10. Is there some way to speed this up (yield?).

Make the "overall quality" in shi7en_trimmer editable by commandline

There is a parameter in the shi7 binary trimmer that may be very useful to users. This is a knob that controls overall quality of reads retained; so it is very much like a "filter" in the way our length filter is.

This is the "30" number in the commandline for shi7en_trimmer. If this is settable by users (default 30), with the name "filter_avg_qual" or something, that would be great.

Thanks!
@pvangay : I think this should address the concern you had before about the overall quality knob.

Work towards conda-forge release

Which command in python

Switch the 'no' flags to boolean flags

The flags

--no_<more>

Should be set to be:

--<more> {True/False}

To avoid confusion.

Also the user should not be able to make a 'combined FASTQ'. Explicitly fail at arg checking if this is the case and tell the user to change their arguments.

Debug the "strip_underscore" feature

The purpose of this feature is for convenience so users don't have to rename their fastq files (only the text up until the underscore is kept, which is sufficient for sample names in most cases for the sequencing centers we've dealt with). This feature is important when there are hundreds of files and renaming them is burdensome.

When the files contain R1 or R2 designators after the first underscore--which most do, by design!--this can cause some tricky problems internally when passing around the current working file prefix between the various modules. Currently, there are some problems with the current implementation.

With SE mode, the R1 and R2 suffices added internally after the strip point may remain in the final file names.
The implementation does not play nicely with files for which "R1" and "R2" don't explicitly appear in the file names (such as with the .1./.2. or F/R conventions which are usually handled fine by the "sort and pair" parser logic in the absence of --strip_underscore).

Although one workaround is to detect common cases ("R1 and R2 not present in file name, SE mode used, etc") and disable this option in these cases, the software should be smart enough to append the R1 and R2 labels properly. One internal workaround: Shi7 should add the R1 and R2 designators to the first and second file (replacing everything after the underscore with R1 and R2 for the correct reads) and have the pipeline proceed, correctly identifying the pairs and removing the R1/R2 when appropriate before it completes.

test whether this works with spaces in the directory names

(Ensure there are quotes around every directory/file location passed to the programs)

Change explicit file opens to with open.

With open is the pythonic way rather than explicitly opening and closing files.

SHI7EN on MSI

(shi7en) vanga015@labc03 [~/imp] % pip install git+https://github.com/knights-lab/shi7en --upgrade --no-cache-dir
Could not find platform independent libraries
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to [:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ImportError: No module named 'encodings'

Current thread 0x00007f1b16ffe700 (most recent call first):
Aborted (core dumped)
(shi7en) vanga015@labc03 [~/imp] %
qsub: job 478001.nokomis0015.msi.umn.edu completed
Connection to lab closed.

Self-learning methods

A primary feature that can set this software apart from other pipelines could be its ability to learn what parameters to use based on the fastq data it is provided. To do this, the user should specify a "learning" mode, and the software should figure out the parameters, and explain to the user what selections have been made.

Look at the names of the files and determine whether R1 and R2 are likely to be present. If unlikely (when sorted, files have different numbers of lines, or there's no "1" changing to "2" in the next alphabetical file), disable paired end mode.

Presumably we can accomplish a lot of the more intricate "self-learning" methods easily if we do this one thing: Make a "testbed" function that grabs 10,000 reads (or a single sample, whatever comes first) of a single input pair (if paired-end) or R1 fastq (if not paired-end) and do the following on it:

Try all adapters; read the resulting filesizes; choose the adapters that made the smallest files
Try stitching: if fewer than 20% stitch at the default minimum overlap of 15, disable flash
If stitching works: read the histogram file and calculate median and standard deviation. If standard deviation is less than a fourth of the mean, clamp the stitching min and max to [mode - 10, mode + 10].
If stitching is disabled, increase the stringency of shi7_trimmer trimming to get rid of the bad-quality tails (the defaults expect stitching and are designed to just lop off straggling adapters, not chew off large stretches of the read).

Yes, we really can distill this art into a science, and all automatically while reporting to the user what we did, and only if the user elects to use the "learning" mode. :)

Add testing files for the pipeline

Currently, we have no defined way to test the software. This is bad practice.

Need new flag: "--discard_R2" (should work with and without -SE)

This would be useful with things like '-SE', which by default still keeps all the R2 reads (it merely disables checking whether reads are paired, but doesn't do anything about it if they are and the user just wants R1). This may be desirable sometimes, other times not.

Similarly, if the user has paired end reads that don't stitch and uses --flash false, the "--discard_R2" flag would simply chuck the R2's after using them for adapter removal.

Not correctly pairing reverse and forward reads

12/14/2016 01:50:38 PM flash Bioko_Table/temp/axe/ML04.S305.R1.fastq Bioko_Table/temp/axe/MLK01.S299.R2.fastq -o ML04.S305 -d Bioko_Table/temp/flash -M 700 -m 20 -t 48 -O
12/14/2016 01:50:39 PM [FLASH] Starting FLASH v1.2.11
[FLASH] Fast Length Adjustment of SHort reads
[FLASH]
[FLASH] Input files:
[FLASH] Bioko_Table/temp/axe/ML04.S305.R1.fastq
[FLASH] Bioko_Table/temp/axe/MLK01.S299.R2.fastq

Add shi7en src. Include shi7en binaries in release.

This is the file that should go in the 'src' folder of the root of the project. https://github.com/GabeAl/NINJA-SHI7/blob/master/ninja_shi7.c

Should be renamed to 'shi7en_trimmer'.

generator terminates one line before eof

When trying a function such as this, on a file of known 500 sequences, I always get 499 back (499 headers, seqs, and quality strings):


def check_sequence_name(path):
    ids = []
    print(path)
    vari = 1
    with open(path) as path_inf:
        fastq_gen = read_fastq(path_inf)
        for gen in fastq_gen:
            print(vari,"...",gen[0],"\n",gen[1],"\n",gen[2],"\n")
            vari += 1
            ids.append(gen[0][-9]) # we can't always assume it's this position
    return ids, len(ids) #why the output has size of 499 (Ben: skipping last line!)

Switch the adaptors to be included with the repository.

This is how to include data files in setuptools.
http://setuptools.readthedocs.io/en/latest/setuptools.html#including-data-files

This is how to access those included data files from Python at run time.
http://peak.telecommunity.com/DevCenter/PythonEggs#accessing-package-resources

The argument from the user will be a string like 'NextEra' or 'TruSeq'. Need to convert that string argument to a path based on the included data files in the repository.

>>> 'NextEra'
'NextEra'
>>> dict(zip('NextEra') , (foo_config = open(os.path.join(os.path.dirname(__file__),'foo.conf').read())
  File "<stdin>", line 1
    dict(zip('NextEra') , (foo_config = open(os.path.join(os.path.dirname(__file__),'foo.conf').read())
                                      ^
SyntaxError: invalid syntax
>>> dict(zip('NextEra') , (foo_config = open(os.path.join(os.path.dirname(__file__),'foo.conf').read())
  File "<stdin>", line 1
    dict(zip('NextEra') , (foo_config = open(os.path.join(os.path.dirname(__file__),'foo.conf').read())
                                      ^
SyntaxError: invalid syntax
>>> adapter_path = dict[args.adapter]

Include DUST

HMP protocol was to 'dust' reads. We should include this feature.

Uppercase T and F doesn't work now

Complains about the dictionary when I try to use uppercase T/F values. Main branch.

I imagine there need to be a few more calls to "toLower" sprinkled in there, or maybe a wrapper function around the dict which calls toLower on the string being passed in before looking up and returning the value.

sample name editing

Files from UMGC are oddly named like this: "T-CS-147_S118_L001_R1_001.fastq"

But my actual sample name is only "T.CS.147"

At the moment SHI7EN turns this into T.CS.147.S118

Is there any way to specify where the sample name starts/stops?

Redo installation link

'shi7en_trimmer' is silently failing.

Output to the log file is not being properly recorded for the shi7en_trimmer causing a silent failure. This causes the pipeline to write blank output files.

Add tutorial for adding binary to path

Add this to the README.md

echo 'PATH=$PATH: <path_to_binary>' > ~/.bashrc

Also, rename binaries so that they are in sync with the shizen call in the python wrapper. Python expects them to all be called 'shi7en_trimmer'.

Add 'split_libraries' to the pipeline

This is similar to the QIIME split_libraries_fastq.py

This source is included here.

This should be the first step in the pipeline. It should default to False. User must provide TXT file of oligos. Should be a command line argument. Will split combined FASTQ file into individual FASTQs per sample.

rename "trim" to "filter"

"trim" usually means cut sequence length.

CRITICAL: Paired ends are NOT correctly paired

Seeing the files not paired correctly:

12/16/2016 10:14:53 AM Debug Mode is Enabled. Retaining intermediate files.
12/16/2016 10:14:53 AM ['raw/CZ-MB4_S4_L001_R2_001.fastq', 'raw/CZ-MB124_S124_L001_R1_001.fastq', 'raw/CZ-MB19_S19_L001_R2_001.fastq', 'raw/CZ-MB24_S24_L001_R2_001.fastq', 'raw/CZ-MB87_S87_L001_R2_001.fastq', 'raw/CZ-MB124_S124_L001_R2_001.fastq', 'raw/CZ-MB114_S114_L001_R1_001.fastq', 'raw/CZ-MB101_S101_L001_R2_001.fastq', 'raw/CZ-MB32_S32_L001_R2_001.fastq', 'raw/CZ-MB108_S108_L001_R1_001.fastq', 'raw/CZ-MB121_S121_L001_R2_001.fastq', 'raw/CZ-MB128_S128_L001_R2_001.fastq', 'raw/CZ-MB75_S75_L001_R2_001.fastq',

etc.

The reads are not actually paired correctly, causing everything in PE processing to fail. This is a critical flaw and needs to be patched ASAP.

[edit] Seems to be working now, but awaiting confirmation (reply to this thread)