norwegianveterinaryinstitute / demultiplexrawsequencedata

A workflow automation script: demultiplex the library sequence, run quality checks, deliver to archiving and processing afterwards

License: GNU General Public License v3.0

Python 99.80% Shell 0.20%
bioinformatics bioinformatics-analysis bioinformatics-pipeline bioinformatics-scripts bioinformatics-tool fastqc multiqc python3

demultiplexrawsequencedata's Introduction

demultiplex_script.py

Demultiplex a MiSeq or NextSeq run, perform QC using FastQC and MultiQC, and deliver the files either to VIGASP for analysis or to NIRD for archiving.

Replace <RunID> with the relevant run ID. Example: "190912_M06578_0001_000000000-CNNTP". The RunID breaks down like this: yymmdd_MACHINE-SERIAL-NUMBER_AUTOINCREASING-NUMBER-OF-RUN_000000000-FlowcellID-used-for-this-run, where yymmdd is the run date as printed by date +%y%m%d.

Note: don't bother with enforcing ISO dates for the directory name. It is an Illumina standard and they do not care.
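The RunID layout described above can be split into its fields with a small parser. This is only a sketch under assumptions: the names RunID, RUN_ID_RE and parse_run_id are hypothetical and not taken from demultiplex_script.py, and the pattern is modelled on the single MiSeq example shown here, so other instruments may need a looser expression.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class RunID(NamedTuple):
    """Hypothetical container for the four underscore-separated fields."""
    run_date: date
    machine: str
    run_number: int
    flowcell_field: str  # kept whole, e.g. "000000000-CNNTP"


# Assumption: exactly four fields separated by underscores, date as yymmdd.
RUN_ID_RE = re.compile(
    r"^(?P<date>\d{6})_(?P<machine>[A-Za-z0-9]+)_(?P<run>\d+)_(?P<flowcell>\S+)$"
)


def parse_run_id(run_id: str) -> RunID:
    match = RUN_ID_RE.match(run_id)
    if match is None:
        raise ValueError(f"not a recognisable RunID: {run_id!r}")
    return RunID(
        run_date=datetime.strptime(match["date"], "%y%m%d").date(),
        machine=match["machine"],
        run_number=int(match["run"]),
        flowcell_field=match["flowcell"],
    )
```

Keeping the last field whole avoids guessing how the flowcell ID is delimited on machines other than the one in the example.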

Software requirements

Python > v3.9
bcl2fastq ( from https://emea.support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software/downloads.html )
FastQC    ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ )
MultiQC   ( pip3 install multiqc )

Directory structure on seqtech00

├── bin                                                  binaries and symlinks of binaries live here
├── clarity                                              exported Illumina Clarity directory
│   ├── gls_events
│   ├── logs                                             clarity logs go here
│   ├── miseq                                            miseq-clarity stopover directory
│   │   └── M06578                                       per serial number
│   │       └── samplesheets                             samplesheets for this serial number go here
│   └── nextseq                                          nextseq-clarity stopover directory
│       └── NB552450                                     per serial number
│           └── samplesheets                             samplesheets for this serial number go here
├── demultiplex                                          demultiplexed data directory
├── for_transfer                                         data ready to be transferred over to NIRD or VIGASP
├── logs                                                 all demultiplexing logs go here
├── rawdata                                              raw data directory, sequencers write here
│   ├── bad_runs                                         runs which are bad, or rejected
│   └── control_runs                                     water/other control runs
└── samplesheets                                         cumulative backups of all samplesheets
    ├── M06578 -> /data/clarity/miseq/M06578/samplesheets/          symlink to sample sheets for convenience
    └── NB552450 -> /data/clarity/nextseq/NB552450/samplesheets/    symlink to sample sheets for convenience

Procedure

  • MiSeq writes as MiSEQ- to /data/scratch; shared folder Z:\ (alias rawdata) in MiSeq
  • Lab members modify an existing SampleSheet.csv file to include the new project data, then save the new file to the <RunId> folder in Z:\ and a copy within Z:\SampleSheets\ as <RunId>\SampleSheet.csv

Example:

Z:\190912_M06578_0001_000000000-CNNTP
        ├── SampleSheet.csv
Z:\SampleSheets
        ├── 190912_M06578_0001_000000000-CNNTP.csv
  • A cron job runs every 30 minutes. If it finds a new run with both the RTAComplete.txt and SampleSheet.csv files inside the run directory, it starts the demultiplexing script
  • It can also be started manually, as below
/usr/bin/python3 /data/bin/demultiplex_script.py <RunID>

as the relevant user.
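The check the cron job performs could look roughly like the sketch below. It is an illustration, not the code in demultiplex_script.py: the function name find_ready_runs is hypothetical, and "new" is assumed here to mean that no directory with the same name exists yet under the demultiplex directory.

```python
from pathlib import Path


def find_ready_runs(rawdata_dir, demultiplex_dir):
    """Return run directories that are complete but not yet demultiplexed."""
    rawdata_dir = Path(rawdata_dir)
    demultiplex_dir = Path(demultiplex_dir)
    ready = []
    for run_dir in sorted(rawdata_dir.iterdir()):
        if not run_dir.is_dir():
            continue
        # Both marker files must be present inside the run directory.
        complete = (run_dir / "RTAComplete.txt").is_file()
        has_sheet = (run_dir / "SampleSheet.csv").is_file()
        # Assumption: an existing output directory means the run was handled.
        already_done = (demultiplex_dir / run_dir.name).exists()
        if complete and has_sheet and not already_done:
            ready.append(run_dir)
    return ready
```

Each returned directory name would then be passed as the RunID argument to the script.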

demultiplexrawsequencedata's People

Contributors

arvindsundaram, georgemarselis-nvi, karinlag, magnulei

demultiplexrawsequencedata's Issues

feature request: configuration file for the script

There are a lot of constants in the script, such as:

  • path names
  • file names
  • file extensions

which cannot be easily changed.

A configuration file would be beneficial, especially if we want to use ansible to configure those parameters
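One way this could look, using only the standard library configparser module. The section and option names below (paths, files, rawdata, samplesheet and so on) are hypothetical and chosen for illustration; the default values are taken from the directory layout and file names mentioned in the README.

```python
import configparser

# Hypothetical defaults; a deployed config file would override them.
DEFAULTS = {
    "paths": {
        "rawdata": "/data/rawdata",
        "demultiplex": "/data/demultiplex",
    },
    "files": {
        "samplesheet": "SampleSheet.csv",
        "rta_complete": "RTAComplete.txt",
    },
}


def load_config(text):
    """Parse INI-style text on top of the built-in defaults."""
    parser = configparser.ConfigParser()
    parser.read_dict(DEFAULTS)
    parser.read_string(text)
    return parser


config = load_config("[paths]\nrawdata = /mnt/sequencers/rawdata\n")
print(config["paths"]["rawdata"])      # value from the config text
print(config["paths"]["demultiplex"])  # value from DEFAULTS
```

An INI file is easy to template from ansible, which is the use case mentioned above.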

feature request: SampleSheet.csv validation

Regardless of LIMS or not, there can always be typos.

It will pay off if we devote some time to either incorporating an existing SampleSheet.csv validator or building one ourselves, in order to preemptively detect semantic errors or typos. The pay-off will be a more automated and seamless run/demultiplexing.

According to @CathrineAB, the common errors found in SampleSheet.csv are:

  1. Space in sample name or project name. Especially hard to see if it occurs at the end of the name. I replace the space with a "-" if it is in the middle of the name. I erase the space if it is at the end.
  2. Æ, Ø or Å in sample names or project names.
  3. Extra lines in SampleSheet with no sample info in them. Each empty line will appear as a bunch of commas. They need to be deleted or demuxing fails.
  4. Forgetting to add the extra column called "Analysis" and set an "x" in that column for all samples (I don't know if we will keep this feature for the future)
  5. '.' in sample names
  6. my own note: Check that the number of commas equals a specific number ( ex: There are too many commas between 'A1' and 'RRBS-NMBU' )
  7. Check for missing commas: state machine and report if state N is missing a comma after transitioning to state N+1 ( ex: a comma was missing between 'Sample1' and 'LPRSSBASNMBU1' )
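Several of the checks listed above (1, 2, 3, 5 and 6) can be sketched with the standard csv module. This is a rough outline under assumptions: validate_data_rows is a hypothetical name, the input is assumed to be only the data table of the sample sheet with its header row first, and the column names Sample_ID and Sample_Project are defaults that may need adjusting to the sheets actually in use. The "Analysis" column check and the state machine for missing commas are left out.

```python
import csv

FORBIDDEN_LETTERS = set("ÆØÅæøå")


def validate_data_rows(lines, sample_column="Sample_ID",
                       project_column="Sample_Project"):
    """Return (line_number, message) tuples for suspicious rows."""
    rows = list(csv.reader(lines))
    if not rows:
        return [(1, "missing header row")]
    header = rows[0]
    problems = []
    for line_no, row in enumerate(rows[1:], start=2):
        # Check 3: lines that hold nothing but commas.
        if not any(cell.strip() for cell in row):
            problems.append((line_no, "empty row"))
            continue
        # Check 6: too many or too few commas.
        if len(row) != len(header):
            problems.append(
                (line_no, f"expected {len(header)} fields, found {len(row)}"))
            continue
        record = dict(zip(header, row))
        for column in (sample_column, project_column):
            value = record.get(column, "")
            # Check 1: whitespace anywhere, including at the end.
            if any(ch.isspace() for ch in value):
                problems.append((line_no, f"whitespace in {column}"))
            # Check 2: Norwegian letters.
            if FORBIDDEN_LETTERS & set(value):
                problems.append((line_no, f"Æ, Ø or Å in {column}"))
        # Check 5: dots in the sample name.
        if "." in record.get(sample_column, ""):
            problems.append((line_no, f"'.' in {sample_column}"))
    return problems
```

Reporting line numbers rather than fixing the sheet silently keeps the lab in control of the corrections.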

feature request: Turn script into a daemon

The script should watch the filesystem via dbus/inotify to see if new SampleSheet.csv files are created.

Advantages:
- Not dependent on time to start a demultiplexing run. Instead, the demultiplexing is controlled by other parameters, such as current resource usage.
- Gets started by systemd
- Allows us to run multiple runs at the same time, based on resource usage and not on time

Having the script as a daemon allows querying the run status via an API. Is the run:
- demultiplexed
- rawdata
- completed
- how many in queue
- RecentNewOnes

all appearing in a custom web page, probably even on LIMS, if Clarity can show random pages.

Automatic transfer of data to NIRD

We should figure out how to get the data automagically transferred to NIRD, so we can have the data available for those that will analyze it.

feature request: logging

What has to be done

- [ ] log to syslog, so later on we can ship the logs
- [ ] log to file; the log file hierarchy is roughly outlined below

THOUGHTS ABOUT LOGGING

  • Logging should be done on every step

  • if you set demux.debug on the object, you will get debug information
    this needs to be expanded and worked on in a separate ticket

  • Logging happens on stdout, not stderr

  • Errors are logged on stderr, only

  • Log to syslog

  • stdout as INFO

  • stderr as WARNING or ERROR

  • sys.exit as CRITICAL

    Files created:

  • RunIDDir/demultiplex_log

  • RunIDDir/demultiplex_log/00_script.log

  • RunIDDir/demultiplex_log/01_bcl2fastq.log

  • RunIDDir/demultiplex_log/02_FastQC.log

  • RunIDDir/demultiplex_log/03_MultiQC.log

  • RunIDDir/demultiplex_log/04_NIRD.log

  • RunIDDir/demultiplex_log/05_VIGASP.log
    /data/log/RunID.log ( copy of RunIDDir/Logs/00_script.log / symlink to RunID/Logs/00_script.log ? )

  • /data/log/demultiplex.log (cumulative log)
    /data/Logs/script.log ( copy of RunIDDir/Logs/00_script.log - gets rotated/overwritten with each run )
    /data/Logs/current_run_00.log
    /data/Logs/current_run_01.log if _00 exists and so on

          ~~Only for current run(s), file(s) do not get saved~~
    
  • overwrite when starting a new run

  • /data/log/script_cumulative.log ( cumulative - does not get rotated, append )

    Actions taken

  • Email on parsing configuration file error

  • Email sent on completion

  • Email sent on error

  • run did not complete (specific step)
    disk full (no, belongs to nagios)

  • Email sent on run looping detected

  • Email sent on parsing error of SampleSheet.csv

  • Email on completion of NIRD upload

  • Email on completion of VIGASP upload

  • Email on completion of LIMS upload

      ######################################
      If debug run, indicate so in email
      ######################################
    

    if you run the script interactively:

  • Color:

  • Console: console has color

  • Files: files do not have color

      Timestamps:
    

- [ ] Console: console has "Time since started" and "time since started this step"
      save each time stamp to SQL database?
- [ ] Files: files get regular current timestamp on each step

if you run the script non-interactively
  • Color:

  • Console: no console

  • Files: files do not have color

  • Timestamps:

  • Console: no console

  • Files: files get regular current timestamp on each step

    End of logging:

  • Time:

  • how much time the whole procedure took

  • how much each step

  • save times to SQL database?
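A minimal sketch of the stdout/stderr/file split described above, using the standard logging module. The names MaxLevelFilter and build_logger are hypothetical, only the per-run 00_script.log is created, and syslog, email and colour handling are deliberately left out.

```python
import logging
import sys
from pathlib import Path


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below a given level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_logger(run_id_dir, name="demux"):
    log_dir = Path(run_id_dir) / "demultiplex_log"
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    # INFO goes to stdout only.
    out = logging.StreamHandler(sys.stdout)
    out.setLevel(logging.INFO)
    out.addFilter(MaxLevelFilter(logging.INFO))

    # WARNING, ERROR and CRITICAL go to stderr only.
    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)

    # The per-run file receives everything, with timestamps and no colour.
    to_file = logging.FileHandler(log_dir / "00_script.log")
    to_file.setLevel(logging.DEBUG)

    for handler in (out, err, to_file):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A logging.handlers.SysLogHandler could be added in the same loop once the syslog destination is decided.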

feature request: Detect failed runs for MiSeq/NextSeq

No idea how, I have to ask Cathrine, but here is an idea

Take the time of the first file in the run directory + 35 hours: if no RTAComplete.txt has shown up by then, it is a failed run. Email the person responsible for failed runs.

Points to improve:

  1. See if the 35 hour window can be minimized
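The idea above can be sketched as follows. Assumptions are stated loudly: looks_failed is a hypothetical name, "first file" is taken to mean the oldest modification time among the entries directly inside the run directory, and the emailing step is left out.

```python
import time
from pathlib import Path

FAILED_AFTER_SECONDS = 35 * 3600  # the 35 hour window from this ticket


def looks_failed(run_dir, now=None, window=FAILED_AFTER_SECONDS):
    """True if the run is older than the window and never completed."""
    run_dir = Path(run_dir)
    if (run_dir / "RTAComplete.txt").exists():
        return False
    # Assumption: oldest mtime in the directory marks the start of the run.
    mtimes = [entry.stat().st_mtime for entry in run_dir.iterdir()]
    if not mtimes:
        return False
    now = time.time() if now is None else now
    return now - min(mtimes) > window
```

Passing a smaller window makes it easy to experiment with shortening the 35 hours.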

Modify demultiplexing script to satisfy data delivery requirements

Script needs to:

  • Detect new SampleSheet.csv in /data/clarity , delivered from Clarity
  • Paranoid mode: check Clarity SampleSheet for common human errors
  • Demultiplex and create individual sequence files
  • Package files for delivery
  • Deliver tar files to NIRD
  • Deliver files to appropriate IRIDA_Project
  • Deliver rundata files to ClarityLims directory

feature request: Control Projects QC

From time to time, the lab includes a Control sample in the list of projects, if the library space is available.

The script should

  • demultiplex it
  • not do regular QC
  • ignore it for tar files
  • develop a standard for naming the different Control Projects
  • develop a standard for what is a wanted Control Project result
  • develop a standard for what is an unwanted Control Project result
  • develop a standard for what is false-positive
  • develop a standard for what is false-negative
  • create a separate method that QCs the Control Project

Delineate between demultiplexing and delivery scripts

Demultiplexing sorts the read data into fastq files. It requires the samplesheet from the Clarity LIMS system. The transfer script will require similar information to take the fastq files and put them into the VIGASP system. Thus we need to delineate between the two scripts.

bug: ignore Control Projects

From time to time, the lab includes a Control sample in the list of projects, if the library space is available.

The script should

  • demultiplex it
  • do regular QC
  • ignore it for tar files
  • ignore for delivery to VIGASP
  • ignore for delivery to NIRD
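Skipping control projects at delivery time could be as small as the sketch below. Since no naming standard for Control Projects exists yet, the set of control project names is simply passed in; projects_for_delivery and the example names are hypothetical.

```python
def projects_for_delivery(projects, control_projects):
    """Drop control projects from the tar/VIGASP/NIRD delivery list."""
    # Case-insensitive match, assuming control names are known up front.
    controls = {name.casefold() for name in control_projects}
    return [name for name in projects if name.casefold() not in controls]


# Example with made-up project names:
print(projects_for_delivery(["RRBS-NMBU", "Control-Water"], {"control-water"}))
```

Demultiplexing and QC would still run on the full project list; only the delivery steps use the filtered one.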

bug: fix output formatting

Stop using tabs to tabulate. Find a library or a PEP that shows how to tabulate properly and accurately, 100% of the time.
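One library-free option is to pad every column to its widest cell with str.ljust instead of relying on tab stops. The sketch below assumes every row has the same number of cells; format_table is a hypothetical helper, not part of the script.

```python
def format_table(rows, gap=2):
    """Align columns by padding each cell to the widest cell in its column."""
    rows = [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    separator = " " * gap
    return "\n".join(
        separator.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


print(format_table([["Sample", "Project"], ["A1", "RRBS-NMBU"]]))
```

Unlike tabs, the output lines up regardless of the terminal's tab width.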

feature request: Nanopore data handling AT SOME POINT IN THE FAR FUTURE

at some point IN THE FAR FUTURE, this data management script will incorporate nanopore data handling

AT THAT POINT IN THE FAR FUTURE, we will need to see

  1. how to get status updates from nanopore
  2. handle the nanopore data in the same way as the illumina data

We will see about THE DETAILS IN THE FAR FUTURE, WHEN WE GET THERE, but it is a good idea to just scribble the idea down.
