safisher / ngs
PennSCAP-T Pipeline
PIPELINE uses the same species name argument for BLAST, STAR, RUM, and HTSEQ. This causes problems if the user hasn't named their libraries identically or if the species name doesn't match the expected names hard-coded in the BLAST module.
Need to track version information for each module. Currently we just track the version of ngs.sh (incremented when modules change) and the external programs used.
Add a flag to STATS that will run stats on all subdirectories (i.e., all samples in the current directory) and output stats to a specified xls file.
Should use the following loop to process args rather than "while true". It stops processing when only one argument is left, and that last argument must be the sample name. Alternatively, we should add a test after the "while true" loop to make sure exactly one argument remains.
while test $# -gt 1
do
    echo "$1"
    shift
done
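Both ideas above can be combined, as in the following minimal sketch. The function name `parse_args` and the `-s` and `-d` options are hypothetical and only serve the illustration; the real modules may use different options.

```shell
# Hypothetical helper: parse options, then require exactly one trailing
# argument, which is taken to be the sample name.
# The -s and -d options are illustrative assumptions, not ngs.sh options.
parse_args() {
    SPECIES=""
    DEBUG=false
    SAMPLE=""
    # Stop as soon as one argument is left; it must be the sample name.
    while test $# -gt 1
    do
        case "$1" in
            -s) SPECIES="$2"; shift ;;   # option that consumes a value
            -d) DEBUG=true ;;            # plain flag
            *)  echo "unknown option: $1" >&2; return 1 ;;
        esac
        shift
    done
    # Guard: exactly one argument must remain after option parsing.
    if test $# -ne 1; then
        echo "expected exactly one sample name" >&2
        return 1
    fi
    SAMPLE="$1"
}
```

With this shape, `parse_args -s hg19 sample1` sets `SAMPLE` to `sample1`, while `parse_args -s hg19` fails because the option value consumed the last argument.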
Change Python scripts, removing the hard-coded location of Python. Default to the version of Python that is in the user's path.
Need a way to label the columns in the output files that contain the list of trim locations.
PIPELINE should pull list of modules and module args from a user-specified file. The file should be copied into a 'sampleID/pipeline' directory with a time stamp.
Need to parse the SPAdes log file to capture the version number.
Check sample name for trailing "/" and remove it if present. Removing the slash at the end of the name is a cosmetic fix.
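One way to do this is with shell parameter expansion. This is a minimal sketch; the function name `strip_trailing_slash` is hypothetical.

```shell
# Hypothetical helper: remove a single trailing "/" from a sample name.
# "${1%/}" deletes the shortest suffix matching "/", so names without a
# trailing slash pass through unchanged.
strip_trailing_slash() {
    printf '%s\n' "${1%/}"
}
```

A caller could then write `SAMPLE=$(strip_trailing_slash "$SAMPLE")` before using the name.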
Need to output version information for HTSeq.
TRIM should allow for pass through of more parameters to trimReads.py.
Need to optimize the trimming script.
Currently there is no testing for failed runs, meaning PIPELINE will continue to run even if a module fails. At the most basic level, we should test for empty files in BLAST (raw.fq), TRIM (unaligned_1.fq), and HTSEQ (*.cnts.txt).
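A basic check could rely on `test -s`, which is true only for files that exist and are non-empty. The sketch below assumes a hypothetical helper name, `require_nonempty`.

```shell
# Hypothetical helper: fail if any expected output file is missing or empty.
require_nonempty() {
    for file in "$@"; do
        # -s is true only when the file exists and has a size greater than zero.
        if [ ! -s "$file" ]; then
            echo "ERROR: expected output is missing or empty: $file" >&2
            return 1
        fi
    done
    return 0
}
```

PIPELINE could then stop after a module with something like `require_nonempty "$SAMPLE/blast/raw.fq" || exit 1`. The directory layout in that path is an assumption.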
Each module should return a version number, with the pipeline being just another module. Users should be able to track which version of which module was used in each pipeline run.
Need to parse the STAR log file to capture the version number.
Journal output should be optional and based on parameter value.
Create a log subdirectory in sampleID to store log files. Break up analysis.log so that every time a module runs a new log file is saved in this directory using (time-stamp + module name) for file name.
When running PIPELINE, the logs for all modules within the PIPELINE run would be stored in a single PIPELINE log.
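The naming scheme described above might look like the following sketch. The function name `module_log_path`, the `log` directory name, and the exact timestamp format are assumptions.

```shell
# Hypothetical helper: create sampleID/log if needed and print a log path of
# the form sampleID/log/<time-stamp>.<module>.log
module_log_path() {
    sample=$1
    module=$2
    # Sortable timestamp without spaces or colons (format is an assumption).
    stamp=$(date +%Y-%m-%d_%H-%M-%S)
    mkdir -p "$sample/log" || return 1
    printf '%s\n' "$sample/log/$stamp.$module.log"
}
```

A module would call it once per run, for example `LOG=$(module_log_path "$SAMPLE" BLAST)`. A PIPELINE run would call it once with `PIPELINE` as the module name and pass the same file to every module it starts.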
Rework the log output so that the timestamps are comments and the log file is effectively a bash file and can be run without modification.
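As a rough sketch of that idea, each logged command could be preceded by a `#` comment line holding the timestamp. The helper name `log_cmd` is hypothetical, and joining the arguments with `$*` does not preserve their original quoting, so a real implementation would need more care.

```shell
# Hypothetical helper: append a command to a log so that the log itself is a
# runnable shell script. The timestamp goes in a "#" comment line and the
# command follows on its own line.
log_cmd() {
    logfile=$1
    shift
    printf '# %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" >> "$logfile"
    # Note: "$*" joins the arguments with spaces and loses any quoting.
    printf '%s\n' "$*" >> "$logfile"
}
```

After `log_cmd analysis.log echo hello`, running `sh analysis.log` replays the command and skips the timestamp line as a comment.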
Save RUM output in rum.$SPECIES instead of rum.trim. This would have implications for rumalign, rumstatus, htseq, stats, and post.
Each directory should include a module file that contains the name of the module used to create that directory. This information could be added to the SAMPLE_ID.versions file. The file containing the module information can be used by STATS to determine which module is used to generate the stats. In this case STATS would be provided with a list of directories rather than a list of modules.
This will allow us to decouple the directory name from the module name and allow for more flexibility in running modules repeatedly. For example, STAR could be run twice on two different genome versions or HTSEQ could be run repeatedly on different transcriptomes. This will also allow for meta-modules and more overall granularity in modules. For example, we could run HTSEQ on exons then introns and use another (meta-)module to combine the exon and intron counts.
BLAST should output a file that lists all hits not counted and the species mapped (one line per read). This would make it easier for users to determine if there was a specific contaminant.
Make DEBUG flag an argument.
Allow PIPELINE to pick up where it left off. Need some way to flag when a module completes, so we know where to begin. We could look for the last created directory and rerun that module. For example, if INIT, FASTQC, and BLAST directories exist, then we should rerun BLAST and go from there (not assuming the BLAST module completed).
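An alternative to inspecting the last created directory is an explicit completion marker. The sketch below assumes a hypothetical convention in which each module writes an empty `.done` file into its output directory as its final step. The names `mark_done` and `first_incomplete`, and the lowercase directory names in the usage note, are illustrative.

```shell
# Hypothetical convention: a module creates "<dir>/.done" as its last step.
mark_done() {
    : > "$1/.done"
}

# Print the first module, in pipeline order, whose directory has no marker.
# Returns 1 (and prints nothing) when every listed module has completed.
first_incomplete() {
    sample=$1
    shift
    for module in "$@"; do
        if [ ! -e "$sample/$module/.done" ]; then
            printf '%s\n' "$module"
            return 0
        fi
    done
    return 1
}
```

With markers present in `init` and `fastqc` but not in `blast`, `first_incomplete "$SAMPLE" init fastqc blast trim` prints `blast`, which matches the restart point described in the example above.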
Need to capture poly-A/T trimming parameters and other trimReads.py command-line options somewhere.
Should we also update TRIM to include trimming parameters in versions files: each of the contaminants trimming options and the contaminants string?
Allow for setting of ngs.sh parameters via environment variables: debug, journal file, repo resource directory, and executable locations.
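Environment overrides of this kind are commonly written with the `${VAR:-default}` expansion. In the sketch below, all variable names (`NGS_DEBUG`, `NGS_JOURNAL`, `NGS_REPO_DIR`) and default values are hypothetical; ngs.sh may use different names.

```shell
# Hypothetical helper: take each setting from the environment when it is set
# and non-empty, otherwise fall back to a built-in default.
load_settings() {
    DEBUG=${NGS_DEBUG:-false}
    JOURNAL=${NGS_JOURNAL:-}             # empty means "no journal output"
    REPO_DIR=${NGS_REPO_DIR:-./repo}     # placeholder default location
}
```

A user could then run `NGS_DEBUG=true ngs.sh ...` to enable debugging for a single invocation without editing the script.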
This should be tracked by the BLAST module.
Should this be optional?
Adjust version output to use only the file name of the library file, rather than the library path.
Possibly include the library path as a separate output.
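A minimal sketch of that change using `basename` follows. The function name `print_library_version` and the tab-separated output format are assumptions, not the actual format of the versions file.

```shell
# Hypothetical helper: report the library by file name only, with the full
# path on a separate line. The output layout is an assumption.
print_library_version() {
    library_path=$1
    printf 'library\t%s\n' "$(basename "$library_path")"
    printf 'library_path\t%s\n' "$library_path"
}
```

Dropping the second `printf` gives the name-only variant.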
This should conform to how we process arguments in other modules.