nyu-molecular-pathology / snsxt
bioinformatics pipeline framework for data analysis
License: GNU General Public License v3.0
Currently, files for a given sample for a given step in the analysis pipeline are retrieved through filename pattern matching, as shown here:
# file matching pattern based on the sample's id
self.search_pattern = '{0}*'.format(self.id)
This may not be specific enough to prevent matching the wrong file(s) if two or more samples in an analysis have similar names, such as
Sample1
Sample11
Need to revise this to do a more exact search to prevent the possibility of mismatches. Consider creating an 'expected' exact filename and searching for an exact match. Alternatively, consider using the samples.*.csv files output by sns, which contain paths to the expected files, or record the output paths of files in snsxt analysis steps for later retrieval.
Also for reference: the file retrieval class method and the find module.
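A minimal sketch of the exact-match idea, using a hypothetical helper with made-up parameter names (`search_dir`, `sample_id`, `suffix`) and assuming a delimiter character follows the sample ID in sns output filenames. Requiring that delimiter prevents 'Sample1' from matching 'Sample11' files:

```python
import os

def find_sample_files(search_dir, sample_id, suffix):
    # Require the exact sample ID followed by a '.' delimiter, instead of
    # the glob prefix '{id}*', so 'Sample1' no longer matches 'Sample11'.
    expected_prefix = '{0}.'.format(sample_id)
    matches = []
    for filename in os.listdir(search_dir):
        if filename.startswith(expected_prefix) and filename.endswith(suffix):
            matches.append(os.path.join(search_dir, filename))
    return matches
```

The same idea works for a fully 'expected' filename: build `'{0}{1}'.format(sample_id, suffix)` and check for that exact name.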
Use qacct for compute jobs to verify exit status and check for jobs that finished with errors.

If you have duplicate entries in a task list, it appears that only the first one gets executed. Maybe because the task list is using a dictionary, which can only have unique keys?
Need to port over the methods for emailing analysis results.
Looks like there is a problem with Travis cloning into this submodule:
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:stevekm/annotate-peaks.git' into submodule path '/home/travis/build/NYU-Molecular-Pathology/snsxt/snsxt/sns_tasks/scripts/annotate-peaks' failed
Failed to clone 'snsxt/sns_tasks/scripts/annotate-peaks' a second time, aborting
The command "eval git submodule update --init --recursive " failed. Retrying, 2 of 3.
Maybe the remote URL needs to be changed.
The program logs should be easier to access by moving them to the parent directory
If an exception is raised and the program ends, the qsub jobs it was monitoring should all be killed as well, so they are not left running.
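One possible pattern, sketched with hypothetical names (`kill_jobs` and `monitor_jobs` are not the actual snsxt functions): wrap the job-monitoring step so that any exception triggers a best-effort `qdel` of every tracked job before re-raising.

```python
import subprocess

def kill_jobs(job_ids, qdel_cmd='qdel'):
    # Best-effort cleanup: qdel each tracked job; ignore failures so
    # cleanup continues even if a job has already finished on its own.
    for job_id in job_ids:
        try:
            subprocess.call([qdel_cmd, str(job_id)])
        except OSError:
            pass

def monitor_jobs(job_ids, wait_func):
    # Wait on the jobs; if the wait raises, kill everything and re-raise
    # so no qsub jobs are left running after the program dies.
    try:
        wait_func(job_ids)
    except Exception:
        kill_jobs(job_ids)
        raise
```

This could also be registered via `atexit` so cleanup runs on any program exit, not just exceptions caught by the monitoring loop.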
Instead of copying over all the base sns wes analysis report files when setting up the analysis report, these should instead be associated with the individual sns analysis tasks that produce the output used by the report fragments. This way, as other sns pipelines are added, they can bring over only the report fragments that are relevant to their output.
Need to document the program framework with a visualization of submodule dependencies, etc., like with this:
http://furius.ca/snakefood/
Need to work on the implementation of an analysis step that submits qsub jobs running MuTect2 per-sample, per-chromosome. This will probably need a new multi-qsub type of task class, along with the internal task job management.
Right now, when Python subprocess is used to execute a shell command, nothing is typically returned. It might be good to always return the error code, or the SubprocessCmd / process object completely. This would be good for error checking within the script, to make sure the commands finished successfully.
Reference: https://github.com/NYU-Molecular-Pathology/util/blob/master/tools.py#L22
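A sketch of the idea of always returning the full result (`run_cmd` is a hypothetical name, not the actual SubprocessCmd API), so callers can check the exit status:

```python
import subprocess

def run_cmd(command):
    # Run a shell command and return the exit code plus captured output,
    # instead of returning nothing, so callers can verify success.
    process = subprocess.Popen(command, shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    stdout, stderr = process.communicate()
    return {'returncode': process.returncode,
            'stdout': stdout,
            'stderr': stderr}
```

A caller can then check `run_cmd('some command')['returncode'] != 0` to detect failures within the script.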
The shell command mutt does not work on cluster nodes, only on the head node. Need to modify mutt commands to ssh back into the head node (implemented elsewhere, in reportIT), or consider another email-sending program or library.
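A sketch of wrapping a mutt command so it runs on the head node over ssh (`headnode` is a placeholder hostname, and `wrap_for_head_node` is a hypothetical helper; the actual reportIT implementation may differ):

```python
import shlex  # shlex.quote is Python 3; use pipes.quote on Python 2

def wrap_for_head_node(command, head_node='headnode'):
    # Quote the whole command and run it over ssh on the head node,
    # where mutt is available; compute nodes only need ssh access back.
    return 'ssh {0} {1}'.format(head_node, shlex.quote(command))
```

The wrapped string can then be passed to the existing shell-command runner in place of the bare mutt command.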
example log message:
[2017-11-15 21:00:22] DEBUG (mail:mail:sns_start_email:74) /bin/sh: line 3: mutt: command not found
Need to find where this message is being generated and reduce it to DEBUG level
WARNING ] No report files are set for analysis task MuTect2Split
[INFO ] Submitted jobs: ['4127275', '4127276', '4127277', '4127278', '4127279', '4127280', '4127281', '4127282', '4127283', '4127284', '4127285', '4127286', '4127287', '4127288', '4127289', '4127290', '4127291', '4127292', '4127293', '4127294', ......
Since the task list is a YAML dictionary, the order of keys is not saved. See this for a solution that loads it as an ordered dict
https://stackoverflow.com/questions/5121931/in-python-how-can-you-load-yaml-mappings-as-ordereddicts
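A sketch of the ordered-loading pattern from that Stack Overflow answer, assuming PyYAML is available (note that on Python 3.7+ plain dicts already preserve insertion order, so this mainly matters for older interpreters):

```python
import yaml
from collections import OrderedDict

def ordered_load(stream, Loader=yaml.SafeLoader):
    # Subclass the loader and register a mapping constructor that builds
    # an OrderedDict, preserving the task order written in the YAML file.
    class OrderedLoader(Loader):
        pass

    def construct_mapping(loader, node):
        loader.flatten_mapping(node)
        return OrderedDict(loader.construct_pairs(node))

    OrderedLoader.add_constructor(
        yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
        construct_mapping)
    return yaml.load(stream, OrderedLoader)
```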
Need a way to integrate some QC flags that should appear in the reports based on criteria for the data at each step, and maybe also in the output from the program itself somehow.
See https://github.com/NYU-Molecular-Pathology/demultiplexing-stats for an example of a report with a QC check included
The run email is not going to the user who actually runs it.
Looks like Job objects are being created a little too soon; need to add a pause or sleep to give the jobs some time to start running and to generate log files before initializing. Alternatively, reconfigure the initialization & log file checking.
[INFO ] Creating new sns analysis in dir /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05
[WARNING ] Log file does not appear to exist: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/logs-qsub/sns.wes.SC-SERACARE.o4127208
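One way to handle this, sketched as a hypothetical polling helper (`wait_for_log` is not an actual snsxt function): wait for the qsub log file to appear, up to a timeout, before initializing the Job object.

```python
import os
import time

def wait_for_log(log_path, timeout=60, interval=5):
    # Poll for the job's log file instead of assuming it exists
    # immediately after qsub submission; returns False on timeout.
    elapsed = 0
    while elapsed < timeout:
        if os.path.exists(log_path):
            return True
        time.sleep(interval)
        elapsed += interval
    return False
```

Job initialization could then warn (or retry) only when this returns False, rather than on the first immediate check.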
Can test pipeline components faster by using a smaller targets.bed file which only has targets for smaller chromosomes, such as for MuTect2 testing
If a task is creating qsub jobs for every sample in an analysis, and one sample causes an exception to be raised, the remaining qsub jobs might not be killed if they are not in the background jobs list yet. Need to look into a method for handling this.
Tried to implement the sns RNA-Seq pipeline, but the SnsWESAnalysisOutput class (from here) has compatibility issues, due to validation and methods which rely on sns wes-specific files.
Need to find a way to refactor so that we can use other types of pipelines.
Need to fork the project, leave snsxt as-is, and develop a new framework that is not tied as tightly to sns wes?
Maybe deprecate many of the methods of analysis objects which are no longer used or needed, and instead implement them as methods of SnsTask objects. This way, sns attributes will be tied to the individual sns pipelines and not to the analysis object as a whole.
In case the default email command fails for some reason, a backup simple notification should be attempted.
If an analysis task is run with qsub_wait = False, the task will not wait for qsub jobs to finish before moving on to the next task. However, the snsxt program as a whole should still wait for these jobs to finish before completing and moving on to tasks such as report setup & generation. Need to instead put these un-waited jobs in a background queue within the program, so that the program waits for all of them to finish before completing.
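A sketch of such a background queue (`BackgroundJobs` and `is_finished` are hypothetical names; the real job objects would presumably expose their own completion check):

```python
import time

class BackgroundJobs(object):
    # Collect un-waited qsub jobs from tasks run with qsub_wait = False,
    # so the program can drain them all before report setup & generation.
    def __init__(self):
        self.jobs = []

    def add(self, job):
        self.jobs.append(job)

    def wait_all(self, is_finished, poll_interval=0.01):
        # Block until every tracked job satisfies is_finished(job).
        while self.jobs:
            self.jobs = [j for j in self.jobs if not is_finished(j)]
            if self.jobs:
                time.sleep(poll_interval)
```

Tasks would call `add()` instead of waiting, and the main program would call `wait_all()` once before its final steps.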
A lot of the requirements for accommodating more generic frameworks make the NGS580 analysis tasks harder to implement; need to have a more generic version of snsxt for that. Fork into the SkidSteer framework: https://github.com/stevekm/SkidSteer
These aspects need to have INFO level log messages.
These aspects need to be set to DEBUG level log messages.
Right now, builtin task class modules are named like AnalysisTask.py, and custom tasks are named like _Delly2.py. Need to reverse this naming convention, so that the builtin class module names start with '_'.
Need to implement a method that will cause tasks to not run if they find that the expected output file(s) already exist
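A minimal sketch of that skip check, with hypothetical names (`run_task`, `expected_outputs`):

```python
import os

def run_task(task_func, expected_outputs):
    # Skip the task when every expected output file already exists,
    # so re-running an analysis does not redo completed steps.
    if expected_outputs and all(os.path.exists(p) for p in expected_outputs):
        return 'skipped'
    task_func()
    return 'ran'
```

This also pairs naturally with the earlier idea of recording each task's expected output paths for later retrieval.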
Need to include the CNV analysis
Example:
https://travis-ci.org/NYU-Molecular-Pathology/snsxt/builds/296001755
Tests are failing but Travis is still reporting a successful build, because the exit status of the test script is 0.
See the answer here:
https://stackoverflow.com/a/43266549/5359531
Need to incorporate here:
https://github.com/NYU-Molecular-Pathology/snsxt/blob/master/snsxt/test.py#L16
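The fix suggested by that answer is to make the test script fail loudly, e.g. with bash's `set -e` and `set -o pipefail`. A small demonstration of why pipelines otherwise hide failures (generic example, not the actual Travis config):

```shell
#!/bin/bash
# Without pipefail, a pipeline's exit status is that of its LAST command,
# so a failing test command piped through e.g. `tee` still exits 0 and
# Travis reports success. With pipefail, the failure propagates.
false | true
echo "without pipefail: $?"   # prints 0

set -o pipefail
false | true
echo "with pipefail: $?"      # prints 1
```

In the Travis test script, `set -e` on top of this makes the script abort (with a non-zero exit status) at the first failing command.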
Add the base sns pipeline as an embedded part of snsxt, so you can run the entire pipeline from just snsxt in a single step.
Each of these should have its own discrete log file, so you can more easily check just the logs for certain aspects of the analysis. I think this functionality might already be baked into the LoggedObject class, but it needs a way to direct the handlers to the correct file locations. Will have to decide how to put all the different log files in common or separate log directories.
This will make it easier for end users to quickly find and configure binary file paths, so they don't have to dig around in the program internals to find the right config file.
Routes used by snsxt should be separated into their own analysis tasks
Going to need to include more analysis config information in the task_lists.yml files, need to formally refactor this into a 'pipeline' to encompass these extra configs. Example:
Need to check the printing of log filehandler paths; it's getting printed repeatedly at the end of analysis tasks, it seems:
[INFO ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log
[INFO ] "main" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.log
[INFO ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log
[INFO ] "main" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.log
[INFO ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log