snsxt's People

Contributors

stevekm, varshini712


snsxt's Issues

Revise analysis sample output file path retrieval

Currently, the output files for a given sample at a given step in the analysis pipeline are retrieved by filename pattern matching:

# file matching pattern based on the sample's id
self.search_pattern = '{0}*'.format(self.id)

This may not be specific enough to prevent matching the wrong file(s) if two or more samples in an analysis have similar names, such as

Sample1
Sample11

Need to revise this to do a more exact search and prevent the possibility of mis-matches. Consider constructing the 'expected' exact filename and searching for an exact match (sketch below). Alternatively, consider using the samples.*.csv files output by sns, which contain the paths to expected files, or recording the output paths of files produced by snsxt analysis steps for later retrieval.

Also for reference: the file retrieval class method and the find module.
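A minimal sketch of the exact-match approach, assuming the expected file suffix for a given step is known (the find_exact name and suffix argument are hypothetical, not part of the current code):

import os

def find_exact(search_dir, sample_id, suffix):
    """
    Return files whose names exactly match '<sample_id><suffix>', instead of
    globbing on '<sample_id>*', which also matches Sample11 when looking for Sample1.
    """
    expected = '{0}{1}'.format(sample_id, suffix)  # e.g. 'Sample1' + '.bam'
    matches = []
    for dirpath, dirnames, filenames in os.walk(search_dir):
        for filename in filenames:
            if filename == expected:
                matches.append(os.path.join(dirpath, filename))
    return matches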

Need more error checking of analysis task output

  • check qacct for compute jobs to verify exit status and check for jobs that finished with errors
  • if it's a qsub job, check that the job completed successfully
  • check that expected task input files exist
  • include the expected output file(s) for each analysis step in its config, and check that those files exist upon completion of the analysis step (see the sketch after this list)
  • raise exceptions inside analysis tasks if task completion validations do not pass
  • have separate emails for errors and successful results; email only the devs & users in case of an error
  • custom actions when custom exceptions are raised
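A minimal sketch of the expected-output check, assuming each analysis step's config lists its expected output files; the AnalysisTaskOutputMissing exception and function name are hypothetical, for illustration only:

import os

class AnalysisTaskOutputMissing(Exception):
    """Raised when an analysis task finishes without producing its expected output."""
    pass

def validate_task_output(task_name, output_dir, expected_files):
    """
    Check that every expected output file exists after the task completes, and
    raise an exception listing the missing files so the run can stop (or trigger
    a custom action / error email) instead of silently continuing.
    """
    missing = [f for f in expected_files
               if not os.path.exists(os.path.join(output_dir, f))]
    if missing:
        raise AnalysisTaskOutputMissing(
            'Task {0} is missing expected output: {1}'.format(task_name, missing))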

annotate-peaks submodule breaks Travis build

Looks like there is a problem with Travis cloning into this submodule:

Please make sure you have the correct access rights

and the repository exists.

fatal: clone of 'git@github.com:stevekm/annotate-peaks.git' into submodule path '/home/travis/build/NYU-Molecular-Pathology/snsxt/snsxt/sns_tasks/scripts/annotate-peaks' failed

Failed to clone 'snsxt/sns_tasks/scripts/annotate-peaks' a second time, aborting

The command "eval git submodule update --init --recursive " failed. Retrying, 2 of 3.

The submodule remote URL probably needs to be changed (e.g. from the SSH git@github.com:... form to the HTTPS https://github.com/... form in .gitmodules) so that Travis can clone it without SSH access rights.

refactor analysis report to utilize sns analysis steps

Instead of copying over all the base sns wes analysis report files when setting up the analysis report, the report files should be associated with the individual sns analysis tasks that produce the output used by the report fragments. This way, as other sns pipelines are added, they bring over only the report fragments that are relevant to their output.

MuTect2 split jobs task

Need to work on the implementation of an analysis step that submits qsub jobs running MuTect2 per-sample, per-chromosome. This will probably need a new multi-qsub type of task class, along with the internal task job management.
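A rough sketch of the per-sample, per-chromosome submission loop, assuming jobs are submitted with a plain qsub call and the returned job IDs are collected for the task's internal job management (the command template, script arguments, and chromosome list are illustrative only):

import subprocess

CHROMOSOMES = ['chr{0}'.format(c) for c in list(range(1, 23)) + ['X', 'Y']]

def submit_mutect2_jobs(sample_ids, qsub_script):
    """
    Submit one qsub job per sample per chromosome and return the job IDs,
    so the task can wait on (or kill) the whole batch.
    """
    job_ids = []
    for sample_id in sample_ids:
        for chrom in CHROMOSOMES:
            # '-terse' makes qsub print only the job ID
            command = ['qsub', '-terse',
                       '-N', 'mutect2.{0}.{1}'.format(sample_id, chrom),
                       qsub_script, sample_id, chrom]
            job_id = subprocess.check_output(command).decode().strip()
            job_ids.append(job_id)
    return job_ids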

mutt does not work on cluster nodes

The shell command mutt does not work on cluster nodes, only the head node. Need to

  • modify mutt commands to ssh back into the head node (implemented elsewhere, in reportIT; sketch below)

  • consider another email-sending program or library

example log message:

[2017-11-15 21:00:22] DEBUG (mail:mail:sns_start_email:74) /bin/sh: line 3: mutt: command not found
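A minimal sketch of wrapping the mutt command in ssh back to the head node, assuming password-less ssh from the compute nodes; the head node hostname is a placeholder, not the real value:

import subprocess
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2

HEAD_NODE = 'headnode'  # placeholder hostname

def send_mail_via_head_node(recipient, subject, body):
    """
    Run mutt on the head node over ssh, since mutt is only available there
    and not on the compute nodes.
    """
    mutt_command = 'echo {0} | mutt -s {1} -- {2}'.format(
        quote(body), quote(subject), quote(recipient))
    subprocess.check_call(['ssh', HEAD_NODE, mutt_command])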

MuTect2_split printing tons of job IDs to email log

Need to find where this message is being generated and reduce it to DEBUG level

WARNING ] No report files are set for analysis task MuTect2Split
[INFO    ] Submitted jobs: ['4127275', '4127276', '4127277', '4127278', '4127279', '4127280', '4127281', '4127282', '4127283', '4127284', '4127285', '4127286', '4127287', '4127288', '4127289', '4127290', '4127291', '4127292', '4127293', '4127294', ......
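A small example of the intended change, assuming the standard logging module is used here: keep a short summary at INFO and move the full job ID list down to DEBUG so it stays out of the email log.

import logging

logger = logging.getLogger(__name__)

def log_submitted_jobs(job_ids):
    # summary at INFO, full list only at DEBUG
    logger.info('Submitted %d jobs', len(job_ids))
    logger.debug('Submitted job IDs: %s', job_ids)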

sns jobs being caught and Job objects generated before log file is created

Looks like Job objects are being created a little too soon; need to add a pause or sleep to give the jobs time to start running and generate their log files before initializing (sketch below). Alternatively, reconfigure the initialization & log file checking.

[INFO    ] Creating new sns analysis in dir /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05
[WARNING ] Log file does not appear to exist: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/logs-qsub/sns.wes.SC-SERACARE.o4127208
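A minimal sketch of a wait-and-retry check that could run before the Job object is initialized (the function name and timeout values are illustrative):

import os
import time

def wait_for_log_file(log_file, timeout=60, interval=5):
    """
    Poll for the qsub log file for up to `timeout` seconds, so the Job object
    is only initialized after the scheduler has actually started the job and
    created its log file.
    """
    elapsed = 0
    while elapsed < timeout:
        if os.path.exists(log_file):
            return True
        time.sleep(interval)
        elapsed += interval
    return False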

Need running tasks to kill their jobs on error

If a task is creating qsub jobs for every sample in an analysis and one sample causes an exception to be raised, the remaining qsub jobs might not be killed if they are not in the background jobs list yet. Need to look into a method for handling this (sketch below).
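A rough sketch of one way to handle this, assuming job IDs are tracked as they are submitted so that everything submitted so far can be killed with qdel on error (the submit_sample_job helper is hypothetical):

import subprocess

def run_task_for_samples(sample_ids, submit_sample_job):
    """
    Submit one job per sample, tracking job IDs as they are created; if any
    sample raises an exception, kill every job submitted so far before
    re-raising, so nothing is left running on the cluster.
    """
    submitted_job_ids = []
    try:
        for sample_id in sample_ids:
            job_id = submit_sample_job(sample_id)  # hypothetical helper returning a qsub job ID
            submitted_job_ids.append(job_id)
    except Exception:
        if submitted_job_ids:
            subprocess.call(['qdel'] + submitted_job_ids)
        raise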

SnsWESAnalysisOutput class not compatible with other sns pipeline outputs

Tried to implement the sns RNA-Seq pipeline, but the SnsWESAnalysisOutput class has compatibility issues due to validation and methods which rely on sns wes-specific files.

Need to find a way to refactor so that we can use other types of pipelines

Need to fork the project, leave snsxt as-is, and develop a new framework that is not tied so tightly to sns wes?

Maybe deprecate many of the methods of the analysis objects which are no longer used or needed, and instead implement them as methods of SnsTask objects. In this way, sns attributes would be tied to the individual sns pipelines and not to the analysis object as a whole (rough sketch below).
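A rough sketch of that direction, assuming a minimal pipeline-agnostic base class with the wes-specific validation pushed down into a subclass (all class names, methods, and file checks here are illustrative, not the current API):

import os

class SnsAnalysisOutput(object):
    """Pipeline-agnostic wrapper around an sns analysis output directory."""
    def __init__(self, analysis_dir, analysis_id):
        self.dir = analysis_dir
        self.id = analysis_id

    def validate(self):
        # only the checks that apply to every sns pipeline live here
        return os.path.isdir(self.dir)

class SnsWESAnalysisOutput(SnsAnalysisOutput):
    """wes-specific validation lives in the subclass instead of the shared base."""
    def validate(self):
        if not super(SnsWESAnalysisOutput, self).validate():
            return False
        # illustrative wes-specific file check
        return os.path.exists(os.path.join(self.dir, 'samples.fastq-raw.csv'))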

Need to have background qsub job queue for the program

If an analysis task is run with qsub_wait = False, the task will not wait for its qsub jobs to finish before moving on to the next task. However, the snsxt program as a whole should still wait for these jobs to finish before completing and moving on to tasks such as report setup & generation. Need to put these un-waited jobs in a background queue within the program so that the program waits for all of them to finish before completing (sketch below).
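A minimal sketch of the idea, assuming submitted jobs expose some way to check whether they have finished (the is_finished() method and module-level list are hypothetical):

import time

# module-level queue of jobs submitted with qsub_wait = False
background_jobs = []

def wait_for_background_jobs(poll_interval=30):
    """
    Block until every background job has finished; called once near the end of
    the run, before report setup & generation.
    """
    while background_jobs:
        still_running = [job for job in background_jobs if not job.is_finished()]
        background_jobs[:] = still_running
        if still_running:
            time.sleep(poll_interval)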

need to adjust log message levels

these aspects need to have INFO level log messages

  • list the name of each task as it is being started
  • make sure that qsub log files containing error messages are printed to log; they might be masked if an exception is raised first

these aspects need to be set to DEBUG level log messages

  • number of qsub jobs submitted & monitored

refactor sns_task submodule names

Right now, builtin task class modules are named like AnalysisTask.py and custom tasks are named like _Delly2.py. Need to reverse this naming convention so that the builtin class module names start with '_'.

Separate log file for each analysis, sample, and task

Each of these should have its own discrete log file, so it is easier to check just the logs for certain aspects of the analysis. This functionality might already be baked into the LoggedObject class, but it needs a way to direct the handlers to the correct file locations. Will have to decide whether to put all the different log files in common or separate log directories.
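A minimal sketch of directing a per-task handler to its own file with the standard logging module (how this maps onto LoggedObject is still an open question; the names here are illustrative):

import logging
import os

def get_task_logger(task_name, log_dir):
    """
    Return a logger with its own file handler, so each task (or sample, or
    analysis) writes to a discrete log file under log_dir.
    """
    logger = logging.getLogger('snsxt.task.{0}'.format(task_name))
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(os.path.join(log_dir, '{0}.log'.format(task_name)))
    handler.setFormatter(logging.Formatter('[%(asctime)s] %(levelname)s (%(name)s) %(message)s'))
    logger.addHandler(handler)
    return logger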

Move binary configs to parent directory

This will make it easier for the end user to quickly find and configure binary file paths, so they don't have to dig around in the program internals to find the right config file.

refactor 'task_lists' into 'pipelines'

Going to need to include more analysis config information in the task_lists.yml files; need to formally refactor this into a 'pipeline' config to encompass these extra settings (see the sketch after this list). Example:

  • (qsub) log dir
  • cluster scheduler method (e.g. qsub, SLURM, etc)
  • file to get sample IDs from, method for sample ID import
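Roughly what an expanded pipeline config might parse into once loaded (every key and value here is hypothetical, for illustration only):

# hypothetical result of loading a pipelines/<name>.yml file with yaml.safe_load()
pipeline_config = {
    'name': 'NGS580_WES',                      # pipeline name
    'tasks': ['Delly2', 'MuTect2Split'],       # ordered analysis tasks
    'qsub_log_dir': 'logs-qsub',               # (qsub) log dir
    'scheduler': 'qsub',                       # cluster scheduler method (qsub, SLURM, etc.)
    'samples_file': 'samples.fastq-raw.csv',   # file to get sample IDs from
    'samples_import_method': 'sns_csv',        # method for sample ID import
}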

logger filehandler paths getting printed repeatedly at end of analysis

Need to check the printing of log filehandler paths; it seems to get printed repeatedly at the end of analysis tasks.

[INFO    ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log
[INFO    ] "main" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.log
[INFO    ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log
[INFO    ] "main" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.log
[INFO    ] "email" log filepath: /ifs/data/molecpathlab/NGS580_WES/171116_NB501073_0027_AHT5M2BGX3/results_2017-11-24_14-55-05/snsxt/logs/run.py.2017-11-24-16-28-06.email.log

need small datasets for testing

  • need mini target.bed and probes.bed files so the program can be tested faster
  • need set of fastq.gz files that include reads in the corresponding mini targets and probes regions
