norwegianveterinaryinstitute / demultiplexrawsequencedata

A workflow automation script: demultiplex the library sequence, run quality checks, deliver to archiving and processing afterwards

License: GNU General Public License v3.0

Python 99.80% Shell 0.20%

bioinformatics bioinformatics-analysis bioinformatics-pipeline bioinformatics-scripts bioinformatics-tool fastqc multiqc python3

demultiplexrawsequencedata's Introduction

demultiplex_script.py

Demultiplex a MiSeq or NextSeq run, perform QC using FastQC and MultiQC, and deliver the files either to VIGASP for analysis or to NIRD for archiving.

Replace <RunID> with the relevant run ID. Example: "190912_M06578_0001_000000000-CNNTP". The RunID breaks down as yymmdd_MACHINE-SERIAL-NUMBER_AUTOINCREASING-NUMBER-OF-RUN_000000000-FlowcellID-used-for-this-run, where yymmdd is the output of date +%y%m%d.

Note: don't bother with enforcing ISO dates for the directory name. It is an Illumina standard and they do not care.
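
The RunID breakdown above can be sketched as a small parser. This is only an illustration: the names `RunID`, `RUN_ID_PATTERN` and `parse_run_id` are hypothetical and are not taken from demultiplex_script.py, and the pattern assumes four underscore-separated fields as in the example RunID.

```python
import re
from datetime import datetime
from typing import NamedTuple


class RunID(NamedTuple):
    """Hypothetical container for the four parts of a run ID."""
    date: datetime
    machine: str
    run_number: int
    flowcell: str


# Assumed layout: yymmdd_MACHINE_RUNNUMBER_FLOWCELL
RUN_ID_PATTERN = re.compile(r"^(\d{6})_([A-Za-z0-9]+)_(\d+)_(.+)$")


def parse_run_id(run_id: str) -> RunID:
    """Split a run ID into its parts; raise ValueError if it does not fit."""
    match = RUN_ID_PATTERN.match(run_id)
    if match is None:
        raise ValueError(f"not a valid run ID: {run_id!r}")
    date_text, machine, run_number, flowcell = match.groups()
    # %y%m%d matches the Illumina directory naming, not ISO dates
    run_date = datetime.strptime(date_text, "%y%m%d")
    return RunID(run_date, machine, int(run_number), flowcell)
```

For the example above, `parse_run_id("190912_M06578_0001_000000000-CNNTP")` yields the machine serial number `M06578` and run number `1`.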

Software requirements

Python > v3.9
bcl2fastq ( from https://emea.support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software/downloads.html )
FastQC    ( https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ )
MultiQC   ( pip3 install multiqc )

Directory structure on seqtech00

├── bin															binaries and symlinks of binaries live here
├── clarity 													exported Illumina Clarity directory
│   ├── gls_events	
│   ├── logs 													clarity logs go here
│   ├── miseq													miseq-clarity stopover directory
│   │   └── M06578												per serial number
│   │   	└── samplesheets									samplesheets for this serial number go here
│   └── nextseq													nextseq-clarity stopover directory
│   	└── NB552450											per serial number
│   		└── samplesheets									samplesheets for this serial number go here
├── demultiplex													demultiplexed data directory
├── for_transfer												data ready to be transferred over to NIRD or VIGASP
├── logs														all demultiplexing logs go here
├── rawdata														raw data directory, sequencers write here
│   ├── bad_runs												runs which are bad, or rejected
│   └── control_runs											water/other control runs
└── samplesheets												cumulative backups of all samplesheets
    ├── M06578 -> /data/clarity/miseq/M06578/samplesheets/		symlink to sample sheets for convenience
    └── NB552450 -> /data/clarity/nextseq/NB552450/samplesheets/	symlink to sample sheets for convenience

Procedure

MiSeq writes as MiSEQ- to /data/scratch; shared folder Z:\ (alias rawdata) in MiSeq
Lab members modify an existing SampleSheet.csv file to include the new project data, then save the new file to the <RunId> folder in Z:\ and a copy within Z:\SampleSheets\ as <RunId>\SampleSheet.csv

Example:

Z:\190912_M06578_0001_000000000-CNNTP
        ├── SampleSheet.csv
Z:\SampleSheets
        ├── 190912_M06578_0001_000000000-CNNTP.csv

A cron job runs every 30 minutes. If it finds a new run with both RTAComplete.txt and SampleSheet.csv inside the run directory, it starts the demultiplexing script.
The script can also be started manually:

/usr/bin/python3 /data/bin/demultiplex_script.py <RunID>

as the relevant user.
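
The check the cron job performs could look roughly like the sketch below. The function name `find_ready_runs` is hypothetical, and the rule for "new" (no directory with the same name under the demultiplex directory yet) is an assumption, not the script's documented behaviour.

```python
from pathlib import Path


def find_ready_runs(raw_data_dir, demultiplex_dir):
    """Return names of run directories that look ready for demultiplexing."""
    raw = Path(raw_data_dir)
    demux = Path(demultiplex_dir)
    ready = []
    for run_dir in sorted(p for p in raw.iterdir() if p.is_dir()):
        # Both marker files must be present inside the run directory
        has_markers = (
            (run_dir / "RTAComplete.txt").is_file()
            and (run_dir / "SampleSheet.csv").is_file()
        )
        # ASSUMPTION: a run counts as "new" when nothing with the same
        # name exists under the demultiplex directory yet
        already_done = (demux / run_dir.name).exists()
        if has_markers and not already_done:
            ready.append(run_dir.name)
    return ready
```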

demultiplexrawsequencedata's Issues

feature request: Ansible script to install MultiQC from pip3

feature request: configuration file for the script

There are a lot of constants in the script, such as:

path names
file names
file extensions

which cannot be easily changed.

A configuration file would be beneficial, especially if we want to use ansible to configure those parameters

feature request: Turn demux script into an object

I don't think we need more than one file for everything, though if we eventually turn the object into a module.... we will revisit things then.

feature request: SampleSheet.csv validation

Regardless of LIMS or not, there can always be typos.

It will pay off if we devote some time to either incorporating an existing SampleSheet.csv validator or building one ourselves, in order to preemptively detect semantic errors or typos. The pay-off will be a more automated and seamless run/demultiplexing.

According to @CathrineAB, the common errors found in SampleSheet.csv are:

Space in sample name or project name. Especially hard to see if they occur at the end of the name. I replace the spaces with a “-“ if in middle of name. I erase the space if it is at the end.
Æ, Ø or Å in sample name or project names.
Extra lines in SampleSheet with no sample info in them. Will appear as a bunch of commas for each line which is empty. They need to be deleted or demuxing fails.
Forgetting to add the extra column called “Analysis” and set an “x” in that column for all samples (I don’t know if we will keep this feature for the future)
'.' in sample names
My own note: check that the number of commas equals a specific number (ex: there are too many commas between ‘A1’ and ‘RRBS-NMBU’)
Check for missing commas: use a state machine and report if state N is missing a comma before transitioning to state N+1 (ex: a comma was missing between ‘Sample1’ and ‘LPRSSBASNMBU1’)
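
Several of the checks listed above can be sketched in a few lines. This is a minimal illustration and not an existing validator: the function name `check_sample_sheet` is hypothetical, the column names `Sample_ID` and `Sample_Project` are assumed to be the ones that matter, and the comma check is simplified to comparing each row's field count with the header row of the `[Data]` section. The “Analysis” column check is left out because its future is undecided.

```python
import csv

FORBIDDEN_LETTERS = set("ÆØÅæøå")


def check_sample_sheet(text, columns=("Sample_ID", "Sample_Project")):
    """Return a list of human-readable problems found in the [Data] section."""
    lines = text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.strip().startswith("[Data]")),
        None,
    )
    if start is None or start + 1 >= len(lines):
        return ["no [Data] section with a header row found"]

    header = next(csv.reader([lines[start + 1]]))
    problems = []
    for index in range(start + 2, len(lines)):
        line, number = lines[index], index + 1  # number is 1-based
        # Empty rows show up as nothing but commas
        if line.strip(", \t") == "":
            problems.append(f"line {number}: empty line (only commas or whitespace)")
            continue
        fields = next(csv.reader([line]))
        # Too many or too few commas change the field count
        if len(fields) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} fields, found {len(fields)}"
            )
            continue
        row = dict(zip(header, fields))
        for column in columns:
            value = row.get(column, "")
            if " " in value:
                problems.append(f"line {number}: space in {column} {value!r}")
            if FORBIDDEN_LETTERS & set(value):
                problems.append(f"line {number}: Æ, Ø or Å in {column} {value!r}")
            if "." in value:
                problems.append(f"line {number}: '.' in {column} {value!r}")
    return problems
```

Returning a list of problems rather than raising on the first one lets the lab fix every typo in a single pass.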

feature request: WARN() about diff RawDataDir and DemultiplexDir

If there are any dir differences between RawDataDir and DemultiplexDir, issue a regular WARN()ing

syslog
email
workplace

feature request: Email notifications

We need email notifications for the following events

Errors (detailed)
Demux completion

feature request: Detect if the same run has been offered for demultiplexing, notify if stuck

Loop detection: if the script has gone over the same run three times, it means it is stuck due to errors. Someone (or a lot of people) should receive an email notification, so this can be looked at.
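
One possible way to count attempts per run is a small state file. This is a sketch under stated assumptions: the JSON state file, the names `record_attempt` and `is_stuck`, and the constant `MAX_ATTEMPTS` are all hypothetical, and the email notification itself is left out.

```python
import json
from pathlib import Path

MAX_ATTEMPTS = 3  # from the issue text: three passes over the same run


def record_attempt(state_file, run_id):
    """Increase and return the number of attempts recorded for run_id."""
    path = Path(state_file)
    counts = json.loads(path.read_text()) if path.exists() else {}
    counts[run_id] = counts.get(run_id, 0) + 1
    path.write_text(json.dumps(counts))
    return counts[run_id]


def is_stuck(attempts, max_attempts=MAX_ATTEMPTS):
    """A run is considered stuck once it has been attempted max_attempts times."""
    return attempts >= max_attempts
```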

Move paths to reflect saga?

I am wondering if we should use paths that reflect Saga instead of NIRD. Most people operate from Saga and never log into NIRD. I am uncertain, but this would at least affect the email sent to people.

I am soliciting comments from @Camilsek, @CathrineAB, @magnulei, @georgemarselis-nvi and @ajkarloss here.

feature request: If the SampleSheet has changed, archive the old demux directory and tar files and then demultiplex the run again

make it clear that we are re-demultiplexing the run in the log file
copy new samplesheet in /data/SampleSheets with filename RunID-ISODATE.csv

Check if tar files can be unpacked correctly and with no errors

I think I noticed that the tar files produced, do not get unpacked error-free.

This is an issue as we need data integrity.

To investigate after logging.
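
A first check could read every member of the archive with Python's standard `tarfile` module. The function name `tar_is_readable` is hypothetical. Note the limit of this sketch: reading an archive end to end catches truncated or structurally damaged files, but a plain tar file carries no checksum of the file contents, so comparing against separately stored hashes is still needed for full data integrity.

```python
import tarfile


def tar_is_readable(tar_path):
    """Return True if every regular file in the archive can be read to the end."""
    try:
        with tarfile.open(tar_path) as archive:
            for member in archive:
                if member.isfile():
                    handle = archive.extractfile(member)
                    # Read in 1 MiB chunks so large fastq files do not fill memory
                    while handle.read(1024 * 1024):
                        pass
    except (tarfile.TarError, OSError, EOFError):
        return False
    return True
```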

if a project directory does not contain any CompressedFastQFiles, notify via email, but go on executing.

Writing it down as a bug, but I have already dealt with it.

The issue here will be to figure out how to either create the final report with a huge highlight "HEY SOMETHING MAY BE WRONG HERE" or find out another way to deal with such exceptions.

feature request: Ensure that all steps have been executed

There should be a mechanism that ensures that all the steps have been executed.

The reason behind this is to ensure that someone such as myself has not forgotten a random sys.exit( ) inside a method.

feature request: Notifications on Workplace

https://www.google.com/search?q=facebook+workplace++api

https://developers.facebook.com/docs/workplace/introduction

https://developers.facebook.com/docs/workplace/

for_transfer/RunID : put tar files in the same directory

Naming convention for projects and samples in MiSeq

Deadline: Week 11

feature request: investigate if multiqc can use multiple CPUs

feature request: Turn script into a daemon

Should watch the filesystem via dbus/inotify to see if new SampleSheet.csv files are created

Advantages:
- Not dependent on time to start a demultiplexing run. Instead, the demultiplexing is controlled by other parameters, such as current resource usage.
- Gets started by systemd
- Allows us to run multiple runs at the same time, based on resource usage and not on time

Having the script as a daemon allows us to query the status via an API. Is the run:
demultiplexed
rawdata
completed
how many in queue
RecentNewOnes
all appearing in a custom web page, probably even on LIMS, if Clarity can show random pages.

Figure out how fields from samplesheet gets connected to sampleids/fastq filenames

demultiplex_script.py can be only invoked by sambauser01 user

demultiplex_script.py script can be invoked only as sambauser01 user.
This makes it difficult for other users while running the script in manual mode.
It should be changed to run by any user within the group 'sambagroup'

feature request: Check if demux script has been invoked more than one time

and have the ability to limit parallel runs
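
On Linux, one common way to detect a second invocation is an advisory lock on a lock file. The sketch below uses the standard `fcntl` module; the function name `acquire_run_lock` and the lock file location are hypothetical. Limiting to N parallel runs could be built on the same idea with N lock files, which is not shown here.

```python
import fcntl


def acquire_run_lock(lock_path):
    """Try to take an exclusive lock; return the open file, or None if busy.

    Keep the returned file object open for as long as the script runs.
    The lock is released when the file is closed or the process exits.
    """
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```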

Automatic transfer of data to NIRD

We should figure out how to get the data automagically transferred to NIRD, so we can have the data available for those that will analyze it.

feature request: Add command-line arguments to demultiplex_script.py

A small number of command-line arguments should be incorporated into the demultiplex script, such as

--help
--version
--queue
--most-recent
--list-raw
--list-demultiplexed
--print-statistics
--longest-run
--shortest-run
--debug
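
With the standard `argparse` module, the options listed above could be declared as below (`--help` is provided by argparse automatically). This is a sketch only: the function name `build_parser`, the help texts and the placeholder version string are assumptions, and none of the flags do anything yet.

```python
import argparse

# Flags from the list above that simply switch a behaviour on
BOOLEAN_FLAGS = (
    "--queue",
    "--most-recent",
    "--list-raw",
    "--list-demultiplexed",
    "--print-statistics",
    "--longest-run",
    "--shortest-run",
    "--debug",
)


def build_parser():
    """Build the command-line parser for the demultiplex script."""
    parser = argparse.ArgumentParser(
        prog="demultiplex_script.py",
        description="Demultiplex a MiSeq or NextSeq run and run QC.",
    )
    # The run ID stays optional so that the listing options work without it
    parser.add_argument("run_id", nargs="?", help="run ID to demultiplex")
    # Placeholder version string, not the real version of the script
    parser.add_argument("--version", action="version", version="%(prog)s 0.0")
    for flag in BOOLEAN_FLAGS:
        parser.add_argument(flag, action="store_true", help="(not implemented yet)")
    return parser
```

Usage: `args = build_parser().parse_args()`, after which `args.debug` or `args.most_recent` can be tested.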

Not relevant here but equally important: have a Nagios (?) notification to Cathrine that it is ok for NextSeq/MiSeq to write to seqtech01

how? good question, but we do not have a general queue for tickets like these so, it goes here to be seen later.

feature request: Ensure that /usr/bin/java exists before starting execution

notify before running script if not.
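
A pre-flight check along these lines could run before anything else. The function name `find_java` is hypothetical, and falling back to a search on `PATH` is an assumption that goes beyond the issue text.

```python
import os
import shutil


def find_java(default="/usr/bin/java"):
    """Return the path of a usable java binary, or None if there is none."""
    if os.path.isfile(default) and os.access(default, os.X_OK):
        return default
    # ASSUMPTION: any java found on PATH is an acceptable fallback
    return shutil.which("java")
```

The caller would notify and stop early when `find_java()` returns `None`, since FastQC cannot start without Java.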

Sort out the architecture for the new features for the demultiplexing script and steps (Details in comments)

feature request: email QC report for either the specific project or the control project

feature request: Have Workplace/Teams chatbot report status of demultiplexing/sequencing

It would be nice if we could ask "@DemultiplexBot: what is the status of sequencing?"

idea for the future

feature request: logging

What has to be done

- [ ] log to syslog
so later on we can ship the logs
- [ ] log to file
log file hierarchy is roughly outlined below

THOUGHTS ABOUT LOGGING

Logging should be done on every step
if you set demux.debug on the object, you will get debug information
this needs to be expanded and worked on in a separate ticket
Logging happens on stdout, not stderr
Errors are logged on stderr, only
Log to syslog
stdout as INFO
stderr as WARNING or ERROR
sys.exit as CRITICAL
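
The stdout/stderr split described above can be expressed with the standard `logging` module. This is a sketch: the names `MaxLevelFilter` and `build_logger` are hypothetical, and syslog is left out so the example runs anywhere. On Linux, a `logging.handlers.SysLogHandler(address="/dev/log")` could be added as a third handler.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below a given level."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def build_logger(name="demultiplex", debug=False):
    """Send DEBUG/INFO to stdout and WARNING and above to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if debug else logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    out = logging.StreamHandler(sys.stdout)
    out.addFilter(MaxLevelFilter(logging.INFO))  # nothing above INFO on stdout
    out.setFormatter(formatter)

    err = logging.StreamHandler(sys.stderr)
    err.setLevel(logging.WARNING)  # WARNING, ERROR and CRITICAL only
    err.setFormatter(formatter)

    logger.addHandler(out)
    logger.addHandler(err)
    return logger
```

A call to `logger.critical(...)` just before each `sys.exit()` would cover the last item in the list.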

Files created:
RunIDDir/demultiplex_log
RunIDDir/demultiplex_log/00_script.log
RunIDDir/demultiplex_log/01_bcl2fastq.log
RunIDDir/demultiplex_log/02_FastQC.log
RunIDDir/demultiplex_log/03_MultiQC.log
RunIDDir/demultiplex_log/04_NIRD.log
RunIDDir/demultiplex_log/05_VIGASP.log
/data/log/RunID.log ( copy of RunIDDir/Logs/00_script.log / symlink to RunID/Logs/00_script.log ? )
/data/log/demultiplex.log (cumulative log)
~~/data/Logs/script.log ( copy of RunIDDir/Logs/00_script.log - gets rotated/overwritten with each run )~~
~~/data/Logs/current_run_00.log~~
~~/data/Logs/current_run_01.log if _00 exists and so on~~
~~Only for current run(s), file(s) do not get saved~~
overwrite when starting a new run
/data/log/script_cumulative.log ( cumulative - does not get rotated, append )

Actions taken
Email on parsing configuration file error
Email sent on completion
Email sent on error
run did not complete (specific step)
~~disk full (no, belongs to nagios)~~
Email sent on run looping detected
Email sent on parsing error of SampleSheet.csv
Email on completion of NIRD upload
Email on completion of VIGASP upload

Email on completion of LIMS upload

  ######################################
  If debug run, indicate so in email
  ######################################

if you run the script interactively:

Color:
Console: console has color
Files: files do not have color
Timestamps:

- [ ] Console: console has "Time since started" and "time since started this step"
save to SQL database each time stamp?
- [ ] Files: files get regular current timestamp on each step

if you run the script non-interactively

feature request: Detect failed runs for MiSeq/NextSeq

No idea how, I have to ask Cathrine, but here is an idea

First file in the directory + 35 hours: if no RTAComplete.txt has shown up by then, it is a failed run. Email the person responsible for failed runs.

Points to improve:

See if the 35 hour window can be minimized
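
The 35-hour idea can be sketched as follows. The function name `run_looks_failed` is hypothetical, and using the oldest file modification time as the moment the run started is an assumption, since the issue does not say how "first file" should be measured.

```python
import time
from pathlib import Path

FAILED_RUN_WINDOW_HOURS = 35  # from the issue text; may be shortened later


def run_looks_failed(run_dir, now=None, window_hours=FAILED_RUN_WINDOW_HOURS):
    """Return True if the run is older than the window and has no RTAComplete.txt."""
    run_dir = Path(run_dir)
    if (run_dir / "RTAComplete.txt").exists():
        return False
    # ASSUMPTION: the oldest file modification time marks the start of the run
    mtimes = [p.stat().st_mtime for p in run_dir.rglob("*") if p.is_file()]
    if not mtimes:
        return False  # nothing written yet, too early to judge
    now = time.time() if now is None else now
    return now - min(mtimes) > window_hours * 3600
```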

Modify demultiplexing script to satisfy data delivery requirements

Script needs to:

Detect new SampleSheet.csv in /data/clarity , delivered from Clarity
Paranoid mode: check Clarity SampleSheet for common human errors
Demultiplex and create individual sequence files
Package files for delivery
Deliver tar files to NIRD
Deliver files to appropriate IRIDA_Project
Deliver rundata files to ClarityLims directory

feature request: Control Projects QC

From time to time, the lab includes a Control sample in the list of projects, if the library space is available.

the script should

demultiplex it
not run the regular QC on it
ignore it for tar files
develop a standard for naming the different Control Projects
develop a standard for what is a wanted Control Project result
develop a standard for what is an unwanted Control Project result
develop a standard for what is false-positive
develop a standard for what is false-negative
create a separate method that QCs the Control Project

Lane splitting for NextSeq?

bcl2fastq uses the "--no-lane-splitting" option. We need to check if we need it for NextSeq

feature request: cryptosign the outgoing tar files using openssl.

Delineate between demultiplexing and delivery scripts

Demultiplexing sorts out the read data into fastq files. It requires the samplesheet from the Clarity LIMS system. The transfer script will require similar information to take the fastq files and put them into the VIGASP system. Thus we need to delineate between the two scripts.

QC for control projects

standardize names for control projects

bug: ignore Control Projects

From time to time, lab includes a Control sample in the list of projects, if the library space is available.

the script should

Can fastq.gz files be uploaded into IRIDA, without unzipping? how much additional space do we need? what directory are they being saved to?

feature request: SampleSheet.csv should contain the genus name

in order to help with the automatic uploading

Define SOP

We should have a meeting with Bjørn, Rachid and us to decide on the SOP for MiSeq data flow.
More information can be found here: https://github.com/NorwegianVeterinaryInstitute/prod_bioinf/blob/master/SOP.md

bug: if a project has no *fastq.gz files, do not upload it.

DemultiplexComplete.txt does not have the right file permissions after demux script finishes

def calcFileHash( DemultiplexRunID ) : do not hash any existing .md5 or .sha512 files present, but do report them

        if any( var in file for var in [ demux.md5Suffix, demux.sha512Suffix ] ): # if you find any .md5 or .sha512 files, skip them
            continue

that way we can ignore any .md5.md5.md5.sha512.md5.sha512.md5 silliness.
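
A self-contained sketch of that behaviour is shown below. It is not the script's actual `calcFileHash`: the name `calc_file_hashes` and the constant `HASH_SUFFIXES` are hypothetical stand-ins for `demux.md5Suffix` and `demux.sha512Suffix`, and it matches on the end of the file name rather than anywhere in the name as the snippet above does.

```python
import hashlib
from pathlib import Path

# Hypothetical stand-ins for demux.md5Suffix and demux.sha512Suffix
HASH_SUFFIXES = (".md5", ".sha512")


def calc_file_hashes(directory):
    """Hash every file except existing hash files; report the skipped ones."""
    hashes, skipped = {}, []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        if path.name.endswith(HASH_SUFFIXES):
            skipped.append(path)  # do not hash, but do report
            continue
        md5, sha512 = hashlib.md5(), hashlib.sha512()
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large files do not fill memory
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                md5.update(chunk)
                sha512.update(chunk)
        hashes[path] = (md5.hexdigest(), sha512.hexdigest())
    return hashes, skipped
```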

Please update the changelog with new changes

Please update the changelog before the reasons for the changes are forgotten.

feature request: Ansible script that installs FastQC and changes the "processes" variable in the startup script from 4 to nCPUs

FastQC is mainly a Java program, but it includes a start-up Perl script.

Look at /usr/local/bin/fastqc for source

That script initializes the memory, threads and CPUs used for the program.

We should customize it to utilize all the available CPUs to speed up demultiplexing

How to get status updates from Nanopore
Handle the Nanopore data in the same way as the Illumina data

The details are in the far future and we will see about them when we get there, but it is a good idea to just scribble the idea down.