The ngs from safisher

Species names are confusing in PIPELINE

PIPELINE uses the same species name argument for BLAST, STAR, RUM, and HTSEQ. This causes problems if the user hasn't named their libraries identically or if the species name doesn't match the names hard-coded in the BLAST module.

Module versioning

Need to track version information for each module. Currently we just track the version of ngs.sh (incremented when modules change) and the external programs used.

STATS

Add a flag to STATS that will run stats on all subdirectories (i.e., all samples in the current directory) and output the stats to a specified xls file.
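
A rough sketch of the loop such a flag might run, assuming every sample is an immediate subdirectory of the current directory. The names run_stats_all and the per-sample stats command are hypothetical, and the output here is tab-delimited text (which spreadsheet programs can open) rather than a true xls file.

```shell
# Hypothetical helper, not part of ngs.sh: run a per-sample stats
# command on every subdirectory and collect the results in one file.
# $1 = output file, $2 = command that prints one stats line for a sample
run_stats_all() {
    out_file=$1
    stats_cmd=$2
    : > "$out_file"                      # start with an empty output file
    for dir in */; do
        [ -d "$dir" ] || continue        # no subdirectories: the glob stays literal
        sample=${dir%/}                  # strip the trailing slash added by the glob
        printf '%s\t%s\n' "$sample" "$("$stats_cmd" "$sample")" >> "$out_file"
    done
}
```

The output file should be given as a path outside the sample directories (or as an absolute path) so it is not mistaken for a sample.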

Update STATS to include program version numbers and genome library names.

Update args processing in all modules

Argument processing should use the loop below rather than "while true". This forces processing to stop when only one argument is left, since the last argument must be the sample name. Alternatively, we should add a test after the "while true" loop to make sure exactly one argument remains.

# Process options; stop when only the sample name remains.
while [ "$#" -gt 1 ]
do
    echo "$1"
    shift
done
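
The alternative mentioned above, a check that runs after the existing "while true" loop, might look something like this. The function name require_sample_arg and the SAMPLE variable are hypothetical and only illustrate the idea.

```shell
# Hypothetical check to call after option parsing: exactly one
# argument (the sample name) must remain.
require_sample_arg() {
    if [ "$#" -ne 1 ]; then
        echo "Error: expected exactly one sample name, got $# arguments" >&2
        return 1
    fi
    SAMPLE=$1
}
```

A module would call it as `require_sample_arg "$@" || exit 1` once its options have been shifted off.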

HTSEQ needs a stats function.

Python version / location

Change Python scripts, removing the hard-coded location of Python. Default to the version of Python that is in the user's path.

Trim location output

Need a way to label the columns in the output files that contain the list of trim locations.

PIPELINE should pull modules from file

PIPELINE should pull the list of modules and module args from a user-specified file. The file should be copied into a 'sampleID/pipeline' directory with a time stamp.
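
One possible shape for this, as a sketch only. The file format is an assumption (one module per line, the module name followed by its args, with blank lines and '#' comments ignored), and load_pipeline is a hypothetical name.

```shell
# Hypothetical: copy the pipeline file into SAMPLE/pipeline with a
# time stamp, then print each module line (name followed by its args).
# Assumed format: one module per line; blank lines and '#' comments are skipped.
load_pipeline() {
    pipeline_file=$1
    sample_dir=$2
    stamp=$(date +%Y-%m-%d_%H-%M-%S)
    mkdir -p "$sample_dir/pipeline" || return 1
    cp "$pipeline_file" "$sample_dir/pipeline/$stamp.$(basename "$pipeline_file")" || return 1
    # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            ''|'#'*) continue ;;         # skip blank lines and comments
        esac
        printf '%s\n' "$line"
    done < "$pipeline_file"
}
```

PIPELINE could then read the printed lines one at a time and dispatch each to the named module.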

SPAdes version number

Need to parse the SPAdes log file to capture the version number.

Sample Name trailing slash

Check the sample name for a trailing "/" and remove it if present. Removing the slash is a cosmetic fix.
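
A minimal sketch of the check using POSIX parameter expansion. The function name strip_trailing_slash is hypothetical.

```shell
# Remove any trailing "/" characters from a sample name.
strip_trailing_slash() {
    name=$1
    while [ "${name%/}" != "$name" ]; do
        name=${name%/}                   # drop one trailing slash per pass
    done
    printf '%s\n' "$name"
}
```

For example, `SAMPLE=$(strip_trailing_slash "$1")` leaves names without a slash unchanged.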

HTSeq version information

Need to output version information for HTSeq.

Better control over trimming

TRIM should allow for pass through of more parameters to trimReads.py.

Add error checking for SNP and SPAdes

Trimming way too slow

Need to optimize the trimming script.

PIPELINE error testing

Currently there is no testing for failed runs, so PIPELINE will continue to run even if a module fails. At the most basic level, we should test for empty files in BLAST (raw.fq), TRIM (unaligned_1.fq), and HTSEQ (*.cnts.txt).
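
The basic empty-file test could be sketched as below. The function name require_nonempty is hypothetical, and the directory layout around the file names is not specified here, so callers would supply the full paths.

```shell
# Hypothetical check: fail if any listed output file is missing or empty.
require_nonempty() {
    for f in "$@"; do
        if [ ! -s "$f" ]; then           # -s is true only for files with size > 0
            echo "Error: $f is missing or empty" >&2
            return 1
        fi
    done
}
```

PIPELINE could call it after each module and stop on failure. When a pattern such as *.cnts.txt matches nothing, the shell passes it through unexpanded, so the check also fails in that case.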

Generate module-specific version numbers

Each module should return a version number, with the pipeline being just another module. Users should be able to track which version of which module was used in each pipeline run.

STAR version number

Need to parse the STAR log file to capture the version number.

Journal output

Journal output should be optional and based on parameter value.

Update analysis.log

Create a log subdirectory in sampleID to store log files. Break up analysis.log so that every time a module runs a new log file is saved in this directory using (time-stamp + module name) for file name.

When running PIPELINE, the logs for all modules within the PIPELINE run would be stored in a single PIPELINE log.

Rework the log output so that the timestamps are comments and the log file is effectively a bash file and can be run without modification.
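
A sketch of what a runnable log could look like. The helper name log_cmd is hypothetical. It joins its arguments with spaces, so a command whose arguments contain spaces or quotes would need proper re-quoting before the log could be replayed faithfully.

```shell
# Hypothetical logger: write the time stamp as a comment and the
# command as-is, so the log file remains a valid shell script.
# $1 = log file, remaining args = the command line to record
log_cmd() {
    log_file=$1
    shift
    printf '# %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$log_file"
    printf '%s\n' "$*" >> "$log_file"
}
```

Because every time stamp line starts with "#", the resulting file can be passed directly to the shell to repeat the run.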

RUM output directory

Save RUM output in rum.$SPECIES instead of rum.trim. This would have implications for rumalign, rumstatus, htseq, stats, and post.

Integrate prnVersion() into existing modules

FastQC
Trim
RUM
STAR
HTSeq
SNP
SPAdes
BOWTIE

Decouple directories and module names

Each directory should include a module file that contains the name of the module used to create that directory. This information could be added to the SAMPLE_ID.versions file. The file containing the module information can be used by STATS to determine which module is used to generate the stats. In this case STATS would be provided with a list of directories rather than a list of modules.

This will allow us to decouple the directory name from the module name and allow for more flexibility in running modules repeatedly. For example, STAR could be run twice on two different genome versions, or HTSEQ could be run repeatedly on different transcriptomes. This will also allow for meta-modules and finer overall granularity in modules. For example, we could run HTSEQ on exons, then on introns, and use another (meta-)module to combine the exon and intron counts.
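
As an illustration, the per-directory module file could be as simple as a one-line text file. The file name ".module" and both function names are assumptions for this sketch.

```shell
# Hypothetical: record which module created a directory, and read it back.
# The file name ".module" is an assumption, not an existing convention.
write_module_file() {      # $1 = directory, $2 = module name
    mkdir -p "$1" && printf '%s\n' "$2" > "$1/.module"
}
module_for_dir() {         # $1 = directory; prints the module name
    [ -f "$1/.module" ] || return 1
    cat "$1/.module"
}
```

STATS could then loop over the directories it is given and call module_for_dir on each to choose the matching stats function, regardless of how the directory is named.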

BLAST hits not counted

BLAST should output a file that lists all hits not counted and the species mapped (one line per read). This would make it easier for users to determine if there was a specific contaminant.

DEBUG flag

Make DEBUG flag an argument.

Continuing PIPELINE run

Allow PIPELINE to pick up where it left off. We need some way to flag when a module completes, so we know where to begin. We could look for the last created directory and rerun that module. For example, if INIT, FASTQC, and BLAST directories exist, then we should rerun BLAST and go from there (we cannot assume the BLAST module completed).
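
One way to flag completion is a marker file written as a module's last step. The sketch below assumes each module writes into a directory named after the module and uses a hypothetical ".done" marker; neither is an existing ngs convention.

```shell
# Hypothetical completion markers: each module creates DIR/.done when
# it finishes, and PIPELINE resumes at the first module without one.
mark_done() {              # $1 = sample dir, $2 = module name
    mkdir -p "$1/$2" && : > "$1/$2/.done"
}
first_incomplete() {       # $1 = sample dir, remaining args = modules in run order
    sample_dir=$1
    shift
    for module in "$@"; do
        if [ ! -f "$sample_dir/$module/.done" ]; then
            printf '%s\n' "$module"
            return 0
        fi
    done
    return 1               # every module has completed
}
```

With INIT and FASTQC marked done and a BLAST directory that has no marker, first_incomplete prints BLAST, which matches the rerun behavior described above.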

Document API for modules

init function
inclusion of an error checking function
stats function
printing of version information
