nyu-molecular-pathology / snsxt
bioinformatics pipeline framework for data analysis
License: GNU General Public License v3.0
Currently, files for a given sample for a given step in the analysis pipeline are retrieved through filename pattern matching, as shown here:
# file matching pattern based on the sample's id
self.search_pattern = '{0}*'.format(self.id)
This may not be specific enough to prevent matching the wrong file(s) if two or more samples in an analysis have similar names, such as
Sample1
Sample11
Need to revise this to do a more exact search to prevent the possibility of mismatches. Consider creating an 'expected' exact filename and searching for an exact match. Alternatively, consider using the samples.*.csv files output by sns, which contain paths to the expected files, or record the output paths of files in snsxt analysis steps for later retrieval.
Also for reference: the file retrieval class method and the find module.
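A minimal sketch of the exact-match idea, using a hypothetical helper with made-up parameter names (`search_dir`, `sample_id`, `suffix`) and assuming a delimiter character follows the sample ID in sns output filenames. Requiring that delimiter prevents 'Sample1' from matching 'Sample11' files:

```python
import os

def find_sample_files(search_dir, sample_id, suffix):
    # Require the exact sample ID followed by a '.' delimiter, instead of
    # the glob prefix '{id}*', so 'Sample1' no longer matches 'Sample11'.
    expected_prefix = '{0}.'.format(sample_id)
    matches = []
    for filename in os.listdir(search_dir):
        if filename.startswith(expected_prefix) and filename.endswith(suffix):
            matches.append(os.path.join(search_dir, filename))
    return matches
```

The same idea works for a fully 'expected' filename: build `'{0}{1}'.format(sample_id, suffix)` and check for that exact name.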
Use qacct for compute jobs to verify exit status and check for jobs that finished with errors.

If you have duplicate entries in a task list, it appears that only the first one gets executed. Maybe because the task list is using a dictionary, which can only have unique keys?
Need to port over the methods for emailing analysis results.
Looks like there is a problem with Travis cloning into this submodule:
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'git@github.com:stevekm/annotate-peaks.git' into submodule path '/home/travis/build/NYU-Molecular-Pathology/snsxt/snsxt/sns_tasks/scripts/annotate-peaks' failed
Failed to clone 'snsxt/sns_tasks/scripts/annotate-peaks' a second time, aborting
The command "eval git submodule update --init --recursive " failed. Retrying, 2 of 3.
Maybe the remote URL needs to be changed.
The program logs should be easier to access by moving them to the parent directory
If an exception is raised and the program ends, the qsub jobs it was monitoring should all be killed as well, so they are not left running.
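One possible pattern, sketched with hypothetical names (`kill_jobs` and `monitor_jobs` are not the actual snsxt functions): wrap the job-monitoring step so that any exception triggers a best-effort `qdel` of every tracked job before re-raising.

```python
import subprocess

def kill_jobs(job_ids, qdel_cmd='qdel'):
    # Best-effort cleanup: qdel each tracked job; ignore failures so
    # cleanup continues even if a job has already finished on its own.
    for job_id in job_ids:
        try:
            subprocess.call([qdel_cmd, str(job_id)])
        except OSError:
            pass

def monitor_jobs(job_ids, wait_func):
    # Wait on the jobs; if the wait raises, kill everything and re-raise
    # so no qsub jobs are left running after the program dies.
    try:
        wait_func(job_ids)
    except Exception:
        kill_jobs(job_ids)
        raise
```

This could also be registered via `atexit` so cleanup runs on any program exit, not just exceptions caught by the monitoring loop.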
Instead of copying over all the base sns wes analysis report files when setting up the analysis report, these should instead be associated with the individual sns analysis tasks that produce the output used by the report fragments. This way, as other sns pipelines are added, they can bring over only the report fragments that are relevant to their output.
Need to document the program framework with a visualization of submodule dependencies, etc., like with this:
http://furius.ca/snakefood/
Need to work on the implementation of an analysis step that submits qsub jobs running MuTect2 per-sample, per-chromosome. This will probably need a new multi-qsub type of task class, along with the internal task job management.
Right now, when Python subprocess is used to execute a shell command, nothing is typically returned. It might be good to always return the error code, or the SubprocessCmd / process object completely. This would be good for error checking within the script, to make sure the commands finished successfully.
Reference: https://github.com/NYU-Molecular-Pathology/util/blob/master/tools.py#L22
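A sketch of the idea of always returning the full result (`run_cmd` is a hypothetical name, not the actual SubprocessCmd API), so callers can check the exit status:

```python
import subprocess

def run_cmd(command):
    # Run a shell command and return the exit code plus captured output,
    # instead of returning nothing, so callers can verify success.
    process = subprocess.Popen(command, shell=True,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    stdout, stderr = process.communicate()
    return {'returncode': process.returncode,
            'stdout': stdout,
            'stderr': stderr}
```

A caller can then check `run_cmd('some command')['returncode'] != 0` to detect failures within the script.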
The shell command mutt does not work on cluster nodes, only on the head node. Need to modify mutt commands to ssh back into the head node (implemented elsewhere, in reportIT), or consider another email-sending program or library.
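A sketch of wrapping a mutt command so it runs on the head node over ssh (`headnode` is a placeholder hostname, and `wrap_for_head_node` is a hypothetical helper; the actual reportIT implementation may differ):

```python
import shlex  # shlex.quote is Python 3; use pipes.quote on Python 2

def wrap_for_head_node(command, head_node='headnode'):
    # Quote the whole command and run it over ssh on the head node,
    # where mutt is available; compute nodes only need ssh access back.
    return 'ssh {0} {1}'.format(head_node, shlex.quote(command))
```

The wrapped string can then be passed to the existing shell-command runner in place of the bare mutt command.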
example log message:
[2017-11-15 21:00:22] DEBUG (mail:mail:sns_start_email:74) /bin/sh: line 3: mutt: command not found
Need to find where this message is being generated and reduce it to DEBUG level
WARNING ] No report files are set for analysis task MuTect2Split
[INFO ] Submitted jobs: ['4127275', '4127276', '4127277', '4127278', '4127279', '4127280', '4127281', '4127282', '4127283', '4127284', '4127285', '4127286', '4127287', '4127288', '4127289', '4127290', '4127291', '4127292', '4127293', '4127294', ......
Since the task list is a YAML dictionary, the order of keys is not saved. See this for a solution that loads it as an ordered dict
https://stackoverflow.com/questions/5121931/in-python-how-can-you-load-yaml-mappings-as-ordereddicts
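A sketch of the ordered-loading pattern from that Stack Overflow answer, assuming PyYAML is available (note that on Python 3.7+ plain dicts already preserve insertion order, so this mainly matters for older interpreters):

```python
import yaml
from collections import OrderedDict

def ordered_load(stream, Loader=yaml.SafeLoader):
    # Subclass the loader and register a mapping constructor that builds
    # an OrderedDict, preserving the task order written in the YAML file.
    class OrderedLoader(Loader):
        pass

    def construct_mapping(loader, node):
        loader.flatten_mapping(node)
        return OrderedDict(loader.construct_pairs(node))

    OrderedLoader.add_constructor(
        yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
        construct_mapping)
    return yaml.load(stream, OrderedLoader)
```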
Need a way to integrate some QC flags that should appear in the reports based on criteria for the data at each step, and maybe also in the output from the program itself somehow.
See https://github.com/NYU-Molecular-Pathology/demultiplexing-stats for an example of a report with a QC check included
The run email is not going to the user who actually runs it.
Looks like Job objects are being created a little too soon; need to add a pause or sleep to give the jobs some time to start running and to generate log files before initializing. Alternatively, reconfigure the initialization & log file checking.
[INFO ] Creating new sns analysis in dir /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05
[WARNING ] Log file does not appear to exist: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/logs-qsub/sns.wes.SC-SERACARE.o4127208
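One way to handle this, sketched as a hypothetical polling helper (`wait_for_log` is not an actual snsxt function): wait for the qsub log file to appear, up to a timeout, before initializing the Job object.

```python
import os
import time

def wait_for_log(log_path, timeout=60, interval=5):
    # Poll for the job's log file instead of assuming it exists
    # immediately after qsub submission; returns False on timeout.
    elapsed = 0
    while elapsed < timeout:
        if os.path.exists(log_path):
            return True
        time.sleep(interval)
        elapsed += interval
    return False
```

Job initialization could then warn (or retry) only when this returns False, rather than on the first immediate check.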
Can test pipeline components faster by using a smaller targets.bed file which only has targets for smaller chromosomes, such as for MuTect2 testing
If a task is creating qsub jobs for every sample in an analysis, and one sample causes an exception to be raised, the remaining qsub jobs might not be killed if they are not in the background jobs list yet. Need to look into a method for handling this.
Tried to implement the sns RNA-Seq pipeline, but the SnsWESAnalysisOutput class (from here) has compatibility issues, due to validation and methods which rely on sns wes-specific files.
Need to find a way to refactor so that we can use other types of pipelines.
Need to fork the project, leave snsxt as-is, and develop a new framework that is not tied as tightly to sns wes?
Maybe deprecate many of the methods of analysis objects which are no longer used or needed, and instead implement them as methods of SnsTask objects. This way, sns attributes will be tied to the individual sns pipelines and not to the analysis object as a whole.
In case the default email command fails for some reason, a backup simple notification should be attempted.
If an analysis task is run with qsub_wait = False, the task will not wait for qsub jobs to finish before moving on to the next task. However, the snsxt program as a whole should still wait for these jobs to finish before completing and moving on to tasks such as report setup & generation. Need to instead put these un-waited jobs in a background queue within the program, so that the program waits for all of them to finish before completing.
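A sketch of such a background queue (`BackgroundJobs` and `is_finished` are hypothetical names; the real job objects would presumably expose their own completion check):

```python
import time

class BackgroundJobs(object):
    # Collect un-waited qsub jobs from tasks run with qsub_wait = False,
    # so the program can drain them all before report setup & generation.
    def __init__(self):
        self.jobs = []

    def add(self, job):
        self.jobs.append(job)

    def wait_all(self, is_finished, poll_interval=0.01):
        # Block until every tracked job satisfies is_finished(job).
        while self.jobs:
            self.jobs = [j for j in self.jobs if not is_finished(j)]
            if self.jobs:
                time.sleep(poll_interval)
```

Tasks would call `add()` instead of waiting, and the main program would call `wait_all()` once before its final steps.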
A lot of the requirements for accommodating more generic frameworks make the NGS580 analysis tasks harder to implement; need to have a more generic version of snsxt for that. Fork into the SkidSteer framework: https://github.com/stevekm/SkidSteer
These aspects need to have INFO level log messages.
These aspects need to be set to DEBUG level log messages.
Right now, builtin task class modules are named like AnalysisTask.py, and custom tasks are named like _Delly2.py. Need to reverse this naming convention, so that the builtin class module names start with '_'.
Need to implement a method that will cause tasks to not run if they find that the expected output file(s) already exist
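A minimal sketch of that skip check, with hypothetical names (`run_task`, `expected_outputs`):

```python
import os

def run_task(task_func, expected_outputs):
    # Skip the task when every expected output file already exists,
    # so re-running an analysis does not redo completed steps.
    if expected_outputs and all(os.path.exists(p) for p in expected_outputs):
        return 'skipped'
    task_func()
    return 'ran'
```

This also pairs naturally with the earlier idea of recording each task's expected output paths for later retrieval.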
Need to include the CNV analysis
Example:
https://travis-ci.org/NYU-Molecular-Pathology/snsxt/builds/296001755
Tests are failing but Travis is still reporting a successful build, because the exit status of the test script is 0.
See the answer here:
https://stackoverflow.com/a/43266549/5359531
Need to incorporate here:
https://github.com/NYU-Molecular-Pathology/snsxt/blob/master/snsxt/test.py#L16
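The fix suggested by that answer is to make the test script fail loudly, e.g. with bash's `set -e` and `set -o pipefail`. A small demonstration of why pipelines otherwise hide failures (generic example, not the actual Travis config):

```shell
#!/bin/bash
# Without pipefail, a pipeline's exit status is that of its LAST command,
# so a failing test command piped through e.g. `tee` still exits 0 and
# Travis reports success. With pipefail, the failure propagates.
false | true
echo "without pipefail: $?"   # prints 0

set -o pipefail
false | true
echo "with pipefail: $?"      # prints 1
```

In the Travis test script, `set -e` on top of this makes the script abort (with a non-zero exit status) at the first failing command.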
Add the base sns pipeline as an embedded part of snsxt, so you can run the entire pipeline from just snsxt in a single step.
Each of these should have its own discrete log file, so you can more easily check just the logs for certain aspects of the analysis. I think this functionality might already be baked into the LoggedObject class, but it needs a way to direct the handlers to the correct file locations. Will have to decide how to put all the different log files in common or separate log directories.
This will make it easier for end users to quickly find and configure binary file paths, so they don't have to dig around in the program internals to find the right config file.
Routes used by snsxt should be separated into their own analysis tasks
Going to need to include more analysis config information in the task_lists.yml files, need to formally refactor this into a 'pipeline' to encompass these extra configs. Example:
Need to check the printing of log filehandler paths; it's getting printed repeatedly at the end of analysis tasks, it seems:
[INFO ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log
[INFO ] "main" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.log
[INFO ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log
[INFO ] "main" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.log
[INFO ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log