
pathogen-genomics-cymru / lodestone


Mycobacterial pipeline

License: GNU Affero General Public License v3.0

Python 52.91% Nextflow 32.90% Roff 14.19%
bioinformatics bioinformatics-pipeline genomics global-health infectious-diseases next-generation-sequencing nextflow pathogen sequencing tuberculosis


Lodestone


This pipeline takes as input reads presumed to be from one of 10 mycobacterial genomes: abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis. Input should be in the form of one directory containing pairs of fastq(.gz) or bam files.

The pipeline cleans and QCs reads with fastp and FastQC, classifies them with Kraken2 and Afanc, removes non-bacterial content, and, by alignment to any minority genomes, disambiguates mixtures of bacterial reads. Cleaned reads are aligned to one of the 10 supported genomes and variants are called. The pipeline produces as output one directory per sample, containing cleaned fastqs; a sorted, indexed BAM; a VCF; F2 and F47 statistics; an antibiogram; and summary reports.

Note that while Mykrobe is included within this pipeline, it runs as an independent process and is not used for any downstream reporting.

WARNING: There are currently known errors with vcfmix; as such, errorStrategy 'ignore' has been added to the process vcfpredict:vcfmix to stop the pipeline from crashing. Please check the stdout from Nextflow to see whether this process has run successfully.

Quick Start

This is a Nextflow DSL2 pipeline, so it requires a version of Nextflow that supports DSL2 and the stub-run feature. It is recommended to run the pipeline with NXF_VER=20.11.0-edge, as this is the version the pipeline has been tested against. E.g. to download this version:

export NXF_VER="20.11.0-edge"
curl -fsSL https://get.nextflow.io | bash

The workflow is designed to run with either Docker (-profile docker) or Singularity (-profile singularity). The container images are pulled from quay.io, and a Singularity cache directory is set in nextflow.config.

E.g. to run the workflow:

NXF_VER=20.11.0-edge nextflow run main.nf -profile singularity --filetype fastq --input_dir fq_dir --pattern "*_R{1,2}.fastq.gz" --unmix_myco yes \
--output_dir . --kraken_db /path/to/database --bowtie2_index /path/to/index --bowtie_index_name hg19_1kgmaj

NXF_VER=20.11.0-edge nextflow run main.nf -profile docker --filetype bam --input_dir bam_dir --unmix_myco no \
--output_dir . --kraken_db /path/to/database --bowtie2_index /path/to/index --bowtie_index_name hg19_1kgmaj

There is also a pre-configured climb profile for running Lodestone on a CLIMB Jupyter Notebook Server: add -profile climb to your command invocation. The input directory can point natively to an S3 bucket (e.g. --input_dir s3://my-team/bucket). By default this profile runs the workflow in Docker containers and takes advantage of Kubernetes pods, and the Kraken2, Bowtie2 and Afanc databases point to the pluspf16, hg19_1kgmaj_bt2 and Mycobacteriaciae_DB_7.0 directories respectively, which are mounted from a public S3 bucket hosted on CLIMB.
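For example, a minimal CLIMB invocation might look like the following (the S3 paths are placeholders for your own team's storage):

NXF_VER=20.11.0-edge nextflow run main.nf -profile climb --filetype fastq --input_dir s3://my-team/fastq-bucket \
--pattern "*_R{1,2}.fastq.gz" --unmix_myco no --output_dir s3://my-team/results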

Executors

By default, the pipeline runs on the local machine. To run on a cluster, modify nextflow.config to add the appropriate executor, e.g. for a SLURM cluster add process.executor = 'slurm'. For more information on executor options see the Nextflow docs: https://www.nextflow.io/docs/latest/executor.html
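For example, a minimal sketch of the addition to nextflow.config for a SLURM cluster (the queue name is a placeholder):

  process {
      executor = 'slurm'
      queue    = 'compute'  // placeholder queue name
  }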

System Requirements

Minimum recommended requirements: 32 GB RAM, 8 CPUs

Params

The following parameters should be set in nextflow.config or specified on the command line (a sketch of a complete params block follows this list):

  • input_dir
    Directory containing fastq OR bam files
  • filetype
    File type in input_dir. Either "fastq" or "bam"
  • pattern
    Regex to match fastq files in input_dir, e.g. "*_R{1,2}.fq.gz". Only mandatory if --filetype is "fastq"
  • output_dir
    Output directory for results
  • unmix_myco
    Do you want to disambiguate mixed-mycobacterial samples by read alignment? Either "yes" or "no":
    • If "yes" workflow will remove reads mapping to any minority mycobacterial genomes but in doing so WILL ALMOST CERTAINLY ALSO reduce coverage of the principal species
    • If "no" then mixed-mycobacterial samples will be left alone. Mixtures of mycobacteria + non-mycobacteria will still be disambiguated
  • species
    Principal species in each sample, assuming genus Mycobacterium. Defaults to 'null'. If used, takes one of 10 values: abscessus, africanum, avium, bovis, chelonae, chimaera, fortuitum, intracellulare, kansasii, tuberculosis. Using this parameter applies an additional sanity test to your sample:
    • If you DO NOT use this parameter (default option), pipeline will determine principal species from the reads and consider any other species a contaminant
    • If you DO use this parameter, pipeline will expect this to be the principal species. It will fail the sample if reads from this species are not actually the majority
  • kraken_db
    Directory containing *.k2d Kraken2 database files (k2_pluspf_16gb recommended, obtain from https://benlangmead.github.io/aws-indexes/k2)
  • bowtie2_index
    Directory containing Bowtie2 index (obtain from ftp://ftp.ccb.jhu.edu/pub/data/bowtie2_indexes/hg19_1kgmaj_bt2.zip). The specified path should NOT include the index name
  • bowtie_index_name
    Name of the bowtie index, e.g. hg19_1kgmaj
  • vcfmix
    Run vcfmix, yes or no. Set to no for synthetic samples
  • resistance_profiler
    Run resistance profiling for Mycobacterium tuberculosis. Either "tb-profiler" or "none".
  • afanc_myco_db
    Path to the afanc database used for speciation. Obtain from https://s3.climb.ac.uk/microbial-bioin-sp3/Mycobacteriaciae_DB_7.0.tar.gz
  • update_tbprofiler
    Update tb-profiler. Either "yes" or "no". "yes" may be useful when running outside of a container for the first time, as a tb-profiler database matching our reference will not yet have been constructed. This is not needed with the climb, docker and singularity profiles, as the reference has already been added. Alternatively you can run tb-profiler update_tbdb --match_ref <lodestone_dir>/resources/tuberculosis.fasta.
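As a starting point, a minimal params block for nextflow.config might look like the sketch below; all paths are placeholders, and the remaining parameters are shown with their documented defaults or example values:

  params {
      input_dir           = "/path/to/fq_dir"         // placeholder
      filetype            = "fastq"
      pattern             = "*_R{1,2}.fastq.gz"
      output_dir          = "results"
      unmix_myco          = "no"
      species             = "null"
      kraken_db           = "/path/to/kraken_db"      // directory containing *.k2d files
      bowtie2_index       = "/path/to/bt2_index_dir"  // path only, index name excluded
      bowtie_index_name   = "hg19_1kgmaj"
      vcfmix              = "yes"
      resistance_profiler = "tb-profiler"
      afanc_myco_db       = "/path/to/Mycobacteriaciae_DB_7.0"
      update_tbprofiler   = "no"
  }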

For more information on the parameters, run nextflow run main.nf --help.

The path to the Singularity images can also be changed in the singularity profile in nextflow.config. The default value is ${baseDir}/singularity.
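For example, a sketch of the relevant section of nextflow.config (the cache path shown is the documented default):

  profiles {
      singularity {
          singularity.enabled  = true
          singularity.cacheDir = "${baseDir}/singularity"  // path to singularity images
      }
  }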

Stub-run

To test the stub run:

NXF_VER=20.11.0-edge nextflow run main.nf -stub -config testing.config

Checkpoints

Checkpoints are used throughout this workflow to fail a sample or issue warnings:

processes preprocessing:checkFqValidity or preprocessing:checkBamValidity

  1. (Fail) If sample does not pass fqtools 'validate' or samtools 'quickcheck', as appropriate.

process preprocessing:countReads

  1. (Fail) If sample contains < 100k pairs of raw reads.

process preprocessing:fastp

  1. (Fail) If the sample contains < 100k pairs of cleaned reads, all of which are required to be > 50bp (cleaning is performed by fastp with --length_required 50 --average_qual 10 --low_complexity_filter --correction --cut_right --cut_tail --cut_tail_window_size 1 --cut_tail_mean_quality 20; an equivalent standalone command is shown below).
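For reference, a standalone fastp command equivalent to the cleaning step described above would look roughly as follows (file names are placeholders):

fastp -i sample_R1.fastq.gz -I sample_R2.fastq.gz -o sample_R1.clean.fastq.gz -O sample_R2.clean.fastq.gz \
--length_required 50 --average_qual 10 --low_complexity_filter --correction \
--cut_right --cut_tail --cut_tail_window_size 1 --cut_tail_mean_quality 20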

process preprocessing:kraken2

  1. (Fail) If the top family hit is not Mycobacteriaceae
  2. (Fail) If there are fewer than 100k reads classified as Mycobacteriaceae
  3. (Warn) If the top family classification is mycobacterial, but this is not consistent with top genus and species classifications
  4. (Warn) If the top family is Mycobacteriaceae but no G1 (species complex) classifications meet the minimum thresholds of > 5000 reads or > 0.5% of the total reads (see the sketch after this list; this is not necessarily a concern, as not all mycobacteria have a taxonomic classification at this rank)
  5. (Warn) If sample is mixed or contaminated - defined as containing reads > the 5000/0.5% thresholds from multiple non-human species
  6. (Warn) If sample contains multiple classifications to mycobacterial species complexes, each meeting the > 5000/0.5% thresholds
  7. (Warn) If no species classification meets the 5000/0.5% thresholds
  8. (Warn) If no genus classification meets the 5000/0.5% thresholds
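A minimal sketch of the "5000 reads / 0.5%" threshold test referenced in warnings 4-8 above (the function and argument names are illustrative, not the pipeline's actual code):

  def meets_thresholds(clade_reads: int, total_reads: int) -> bool:
      # A classification meets the minimum thresholds if it accounts for
      # more than 5000 reads or more than 0.5% of the total reads.
      return clade_reads > 5000 or clade_reads > 0.005 * total_reads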

process preprocessing:identifyBacterialContaminants

  1. (Fail) If regardless of what Kraken reports, Afanc does not make a species-level mycobacterial classification (note that we do not use Kraken mycobacterial classifications other than to determine whether 100k reads are family Mycobacteriaceae; for higher-resolution classification, we defer to Afanc)
  2. (Fail) If the sample is not contaminated and the top species hit is not one of the 10 supported Mycobacteria: abscessus|africanum|avium|bovis|chelonae|chimaera|fortuitum|intracellulare|kansasii|tuberculosis
  3. (Fail) If the sample is not contaminated and the top species hit is contrary to the species expected (e.g. "avium" rather than "tuberculosis" - only tested if you provide that expectation)
  4. (Warn) If the top Afanc species hit, on the basis of highest % coverage, does not also have the highest median depth
  5. (Warn) If we are unable to associate an NCBI taxon ID to any given contaminant species, which means we will not be able to locate its genome, and thereby remove it as a contaminant
  6. (Warn) If we are unable to determine a URL for the latest RefSeq genome associated with a contaminant species' taxon ID
  7. (Warn) If no complete genome could be found for a contaminant species. The workflow will proceed with alignment-based contaminant removal, but you're warned that there's reduced confidence in detecting reads from this species

process preprocessing:downloadContamGenomes

  1. (Fail) If a contaminant is detected but we are unable to download a representative genome, and thereby remove it

process preprocessing:summarise

  1. (Fail) If after having taken an alignment-based approach to decontamination, Kraken still detects a contaminant species
  2. (Fail) If after having taken an alignment-based approach to decontamination, the top species hit is not one of the 10 supported Mycobacteria
  3. (Fail) If, after successfully removing contaminants, the top species hit is contrary to the species expected (e.g. "avium" rather than "tuberculosis" - only tested if you provide that expectation)

process clockwork:alignToRef

  1. (Fail) If < 100k reads could be aligned to the reference genome
  2. (Fail) If, after aligning to the reference genome, the average read mapping quality < 10
  3. (Fail) If < 50% of the reference genome was covered at 10-fold depth

process clockwork:minos

  1. (Warn) If the sample is not TB, it is not passed to a resistance profiler

Acknowledgements

For a list of direct authors of this pipeline, please see the contributors list. All of the software dependencies of this pipeline are recorded in version.json.

The preprocessing sub-workflow is based on the preprocessing Nextflow DSL1 pipeline written by Stephen Bush, University of Oxford. The clockwork sub-workflow uses aspects of the variant calling workflow from https://github.com/iqbal-lab-org/clockwork (lead author Martin Hunt, Iqbal Lab at EMBL-EBI).

Contributors

annacprice, arthurvm, jezsw, maximfilimonovgh, whalleyt


lodestone's Issues

Update identify_tophit_and_contaminants2.py to reflect the new NCBI taxonomy

The NCBI taxonomy for Mycobacteriaceae has been expanded in recent years to include the following genera: Mycobacterium, Mycobacteroides, Mycolicibacter, Mycolicibacterium, and Mycolicibacillus. Afanc and recent versions of Kraken2 databases use this taxonomy. Ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8376243/

Older versions of the NCBI taxonomy just use the genus Mycobacterium. Mykrobe and older (3 years+) Kraken2 databases use this taxonomy.

The script identify_tophit_and_contaminants2.py identifies the top hit from a Mykrobe/Afanc report and the contaminant genomes from a Kraken2 report (and the Mykrobe/Afanc report if unmix_myco=yes).

However, the script was written to recognise only the old Mycobacterium taxonomy, and doesn't recognise Mycolicibacterium etc. as part of Mycobacteriaceae. This means the script only works with Mykrobe and old Kraken2 databases. When run on Afanc and recent Kraken2 reports, Mycolicibacterium etc. are incorrectly identified as contaminants (when unmix_myco=no) and the workflow tries to remove them.

Suggested fix:
Update identify_tophit_and_contaminants2.py to reflect the new taxonomy and drop support for Mykrobe, which uses the old taxonomy. Mykrobe will still run as an independent process, but will NOT be used in any downstream reporting

species hits from Afanc missing dot abbreviations

Problems with parsing of assembly_summary_refseq.txt: matches aren't found for some of the species hits from Afanc due to missing dot abbreviations. E.g. Afanc reports Mycobacterium avium subsp paratuberculosis (after the underscore is removed), but in assembly_summary_refseq.txt it's reported as Mycobacterium avium subsp. paratuberculosis.
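A sketch of one possible fix: normalise the rank abbreviations before matching (the helper name is hypothetical, not the pipeline's actual code):

  import re

  def normalise_species_name(name: str) -> str:
      # Add the trailing dot to rank abbreviations (subsp, var, str) so that
      # Afanc species hits match the assembly_summary_refseq.txt spelling.
      return re.sub(r"\b(subsp|var|str)\b(?!\.)", r"\1.", name)

  # "Mycobacterium avium subsp paratuberculosis"
  #   -> "Mycobacterium avium subsp. paratuberculosis"
  print(normalise_species_name("Mycobacterium avium subsp paratuberculosis"))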

AttributeError: 'list' object attribute 'append' is read-only in bin/create_final_json.py

The create_final_json.py script fails when trying to append to the list used to build the warnings field of the final JSON in process clockwork:alignToRef:

Traceback (most recent call last):
  File "/scratch/c.c1656075/sp3_testing_2/tb-pipeline/bin/create_final_json.py", line 128, in <module>
    out = read_and_parse_input_files(stats_file, report_file)
  File "/scratch/c.c1656075/sp3_testing_2/tb-pipeline/bin/create_final_json.py", line 96, in read_and_parse_input_files
    warnings.append = "there was %d error but no warnings" %num_errors
AttributeError: 'list' object attribute 'append' is read-only

It appears this is because the script assigns to the append attribute rather than calling append on the warnings list.
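The likely fix is a one-line change: call the method instead of assigning to it.

  # buggy: assigns to the bound method, raising AttributeError
  warnings.append = "there was %d error but no warnings" % num_errors

  # fixed: appends the message to the warnings list
  warnings.append("there was %d error but no warnings" % num_errors)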

Samples not passing to clockwork

The logic for passing samples to clockwork is broken: when unmix_myco=no and contaminants are found, samples are not passed to clockwork.

samtools sort error in clockwork:alignToRef

The following error has been recorded:

[ERROR] failed to open file 'null'
[bam_mating_core] ERROR: Couldn't read header
samtools sort: failed to read header from "-"
[markdup] error reading header

Issue with preprocessing:bowtie2 (knock-on effect on clockwork workflow)

The channel going into preprocessing:bowtie2 is incorrectly defined. This causes preprocessing:bowtie2 to run for the first processed sample only, with subsequent samples incorrectly skipping the process. This has a knock-on effect on the clockwork sub-workflow, meaning clockwork:alignToRef will also only run for the first sample.

Singularity permissions error


Caused by:
  Process `vcfpredict:tbprofiler (1)` terminated with an error exit status (1)

Command executed:

  bgzip SAMPLE_ID_allelic_depth.minos.vcf
  tb-profiler profile --vcf SAMPLE_ID_allelic_depth.minos.vcf.gz --threads 1
  mv results/tbprofiler.results.json SAMPLE_ID.tbprofiler-out.json
  
  cp SAMPLE_ID_report.json SAMPLE_ID_report_previous.json
  
  echo '{"complete":"workflow complete without error"}' | jq '.' > SAMPLE_ID_err.json
  
  jq -s ".[0] * .[1] * .[2]" SAMPLE_ID_err.json SAMPLE_ID_report_previous.json  SAMPLE_ID.tbprofiler-out.json > SAMPLE_ID_report.json

Command exit status:
  1

Command output:
  [00:05:44] INFO     Using ref file:                                    db.py:594
                      /opt/conda/share/tbprofiler//tbdb.fasta                     
             INFO     Using gff file:                                    db.py:594
                      /opt/conda/share/tbprofiler//tbdb.gff                       
             INFO     Using bed file:                                    db.py:594
                      /opt/conda/share/tbprofiler//tbdb.bed                       
             INFO     Using json_db file:                                db.py:594
                      /opt/conda/share/tbprofiler//tbdb.dr.json                   
             INFO     Using variables file:                              db.py:594
                      /opt/conda/share/tbprofiler//tbdb.variables.json            
             INFO     Using spoligotype_spacers file:                    db.py:594
                      /opt/conda/share/tbprofiler//tbdb.spoligotype_spac          
                      ers.txt                                                     
             INFO     Using spoligotype_annotations file:                db.py:594
                      /opt/conda/share/tbprofiler//tbdb.spoligotype_list          
                      .csv                                                        
             INFO     Using bedmask file:                                db.py:594
                      /opt/conda/share/tbprofiler//tbdb.mask.bed                  
             INFO     Using barcode file:                                db.py:594
                      /opt/conda/share/tbprofiler//tbdb.barcode.bed               
  [00:05:45] INFO     Running snpEff                                    vcf.py:119
  [00:05:47] ERROR    mkdtemp(/bcftools.p6luoZ) failed: Read-only     utils.py:391
                      file system                                                 
                                                                                  
             ERROR                                                  tb-profiler:58
                                                                                  
                      ################################# ERROR                     
                      #######################################                     
                                                                                  
                      This run has failed. Please check all                       
                      arguments and make sure all input files                     
                      exist. If no solution is found, please open                 
                      up an issue at                                              
                      https://github.com/jodyphelan/TBProfiler/issu               
                      es/new and paste or attach the                              
                      contents of the error log                                   
                      (tbprofiler.errlog.txt)                                     
                                                                                  
                      #############################################               
                      ##################################                          
                                                                                  

Command error:
  Traceback (most recent call last):
    File "/opt/conda/bin/tb-profiler", line 562, in <module>
      args.func(args)
    File "/opt/conda/bin/tb-profiler", line 110, in main_profile
      results.update(pp.run_profiler(args))
    File "/opt/conda/lib/python3.10/site-packages/pathogenprofiler/cli.py", line 74, in run_profiler
      results = vcf_profiler(conf=args.conf,prefix=args.files_prefix,sample_name=args.prefix,vcf_file=args.vcf,delly_vcf_file=args.delly_vcf)
    File "/opt/conda/lib/python3.10/site-packages/pathogenprofiler/profiler.py", line 121, in vcf_profiler
      vcf_obj = vcf_obj.run_snpeff(conf["snpEff_db"],conf["ref"],conf["gff"],rename_chroms= conf.get("chromosome_conversion",None))
    File "/opt/conda/lib/python3.10/site-packages/pathogenprofiler/vcf.py", line 134, in run_snpeff
      run_cmd("bcftools view -c 1 -a %(filename)s | bcftools view -v snps | combine_vcf_variants.py --ref %(ref_file)s --gff %(gff_file)s | %(rename_cmd)s snpEff ann %(snpeff_data_dir_opt)s -noLog -noStats %(db)s - %(re_rename_cmd)s | bcftools sort -Oz -o %(tmp_file1)s && bcftools index %(tmp_file1)s" % vars(self))
    File "/opt/conda/lib/python3.10/site-packages/pathogenprofiler/utils.py", line 392, in run_cmd
      raise ValueError("Command Failed:\n%s\nstderr:\n%s" % (cmd,result.stderr.decode()))
  ValueError: Command Failed:
  /bin/bash -c set -o pipefail; bcftools view -c 1 -a bec971b8-a6c5-4d7b-8fc0-f4321e950049.targets.vcf.gz | bcftools view -v snps | combine_vcf_variants.py --ref /opt/conda/share/tbprofiler//tbdb.fasta --gff /opt/conda/share/tbprofiler//tbdb.gff | rename_vcf_chrom.py --source NC_000962.3 --target Chromosome | snpEff ann -dataDir /opt/conda/share/snpeff-5.2-0/data -noLog -noStats Mycobacterium_tuberculosis_h37rv - | rename_vcf_chrom.py --source Chromosome --target NC_000962.3 | bcftools sort -Oz -o 2e834baf-bbb5-41be-ba37-b6192ea6df35.vcf.gz && bcftools index 2e834baf-bbb5-41be-ba37-b6192ea6df35.vcf.gz
  stderr:
  mkdtemp(/bcftools.p6luoZ) failed: Read-only file system
  
  Cleaning up after failed run

Work dir:
  /home/ubuntu/data2/lodestone/work/63/cc1294a65c49c2fb547641964a5c48

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line



===========================================
Finished with errors

Need to pin version of clockwork in container

In the container, the dependencies for clockwork are installed manually and are pinned by version/git commit according to pre-v0.11.0 clockwork, but the git clone uses the main branch of clockwork, which is now at v0.11.0.

Samples not passing to gnomonicus

Due to differences in how Mycobacterium tuberculosis is reported in the Afanc report compared to the Mykrobe report, TB samples are failing to pass to gnomonicus.

E.g.
Mykrobe top hit: Mycobacterium tuberculosis
Afanc top hit: Mycobacterium tuberculosis H37Rv

The if-statement in clockwork:minos needs to be updated to reflect Afanc reporting, i.e. to test that the top hit "starts with" Mycobacterium tuberculosis.
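In Python terms the change would look roughly like this (the actual condition lives in the Nextflow process, so this is illustrative only):

  top_hit = "Mycobacterium tuberculosis H37Rv"  # Afanc-style species report

  # old exact-match check fails on Afanc output:
  is_tb = top_hit == "Mycobacterium tuberculosis"

  # updated prefix check handles both reporting styles:
  is_tb = top_hit.startswith("Mycobacterium tuberculosis")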

Bowtie2 database

Thank you for the excellent workflows. I'm trying to run the pipeline on my PC, but I'm confused about exactly which Bowtie2 database to download; I found at least three links. Please let me know which link is correct.

https://genome-idx.s3.amazonaws.com/bt/hg19.zip (from https://bowtie-bio.sourceforge.net/bowtie2/index.shtml)
https://genome-idx.s3.amazonaws.com/bt/hg19_1kgmaj_snvs_bt2.zip
https://genome-idx.s3.amazonaws.com/bt/hg19_1kgmaj_snvindels_bt2.zip
(both link 2&3 from github https://github.com/BenLangmead/bowtie-majref)

Best,
Trung

Update: please ignore this, I just had to remind myself how to download from an FTP link.

intracellulare when statement bug


Caused by:
  Process `vcfpredict:add_allelic_depth (1)` terminated with an error exit status (2)

Command executed:

  samtools faidx intracellulare.fasta
  samtools dict intracellulare.fasta -o intracellulare.dict
  gatk VariantAnnotator -R intracellulare.fasta -I SAMPLE_ID.bam -V SAMPLE_ID.minos.vcf -A DepthPerAlleleBySample -O SAMPLE_ID_allelic_depth.minos.vcf

Command exit status:
  2

Command output:
  (empty)

Command error:
  Using GATK jar /opt/conda/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /opt/conda/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar VariantAnnotator -R intracellulare.fasta -I SAMPLE_ID.bam -V SAMPLE_ID.minos.vcf -A DepthPerAlleleBySample -O SAMPLE_ID_allelic_depth.minos.vcf
  04:09:45.030 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/conda/share/gatk4-4.4.0.0-0/gatk-package-4.4.0.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
  04:09:45.135 INFO  VariantAnnotator - ------------------------------------------------------------
  04:09:45.194 INFO  VariantAnnotator - The Genome Analysis Toolkit (GATK) v4.4.0.0
  04:09:45.194 INFO  VariantAnnotator - For support and documentation go to https://software.broadinstitute.org/gatk/
  04:09:45.194 INFO  VariantAnnotator - Executing as mambauser@parallelrunningtb on Linux v5.4.0-170-generic amd64
  04:09:45.194 INFO  VariantAnnotator - Java runtime: OpenJDK 64-Bit Server VM v17.0.8-internal+0-adhoc..src
  04:09:45.195 INFO  VariantAnnotator - Start Date/Time: February 13, 2024 at 4:09:44 AM UTC
  04:09:45.195 INFO  VariantAnnotator - ------------------------------------------------------------
  04:09:45.195 INFO  VariantAnnotator - ------------------------------------------------------------
  04:09:45.196 INFO  VariantAnnotator - HTSJDK Version: 3.0.5
  04:09:45.196 INFO  VariantAnnotator - Picard Version: 3.0.0
  04:09:45.197 INFO  VariantAnnotator - Built for Spark Version: 3.3.1
  04:09:45.197 INFO  VariantAnnotator - HTSJDK Defaults.COMPRESSION_LEVEL : 2
  04:09:45.197 INFO  VariantAnnotator - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
  04:09:45.198 INFO  VariantAnnotator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
  04:09:45.199 INFO  VariantAnnotator - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
  04:09:45.200 INFO  VariantAnnotator - Deflater: IntelDeflater
  04:09:45.200 INFO  VariantAnnotator - Inflater: IntelInflater
  04:09:45.200 INFO  VariantAnnotator - GCS max retries/reopens: 20
  04:09:45.200 INFO  VariantAnnotator - Requester pays: disabled
  04:09:45.200 INFO  VariantAnnotator - Initializing engine
  WARNING	2024-02-13 04:09:45	SamFiles	The index file /home/ubuntu/data2/lodestone/work/97/14dacc900afcd5200dc703457c0ba7/SAMPLE_ID.bam.bai was found by resolving the canonical path of a symlink: SAMPLE_ID.bam -> /home/ubuntu/data2/lodestone/work/97/14dacc900afcd5200dc703457c0ba7/SAMPLE_ID.bam
  04:09:45.329 INFO  FeatureManager - Using codec VCFCodec to read file file://SAMPLE_ID.minos.vcf
  04:09:45.341 INFO  VariantAnnotator - Shutting down engine
  [February 13, 2024 at 4:09:45 AM UTC] org.broadinstitute.hellbender.tools.walkers.annotator.VariantAnnotator done. Elapsed time: 0.01 minutes.
  Runtime.totalMemory()=201326592
  ***********************************************************************
  
  A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
    reference contigs = [NC_016946.1]
    features contigs = [NC_000962.3]
  
  ***********************************************************************
  Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

Work dir:
  /home/ubuntu/data2/lodestone/work/d7/5b3bfc56c15af6202ac715590fa01e

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`



===========================================
Finished with errors

Additions for v0.9.8

  • Add compatibility for CLIMB Jupyter Notebooks (uses K8s/S3)
  • Replace gnomonicus with tb-profiler
  • Add csv output from mykrobe
  • Minor reporting bug fix (add missing publishDir statement to reAfanc and reKraken)
  • Publish full afanc report

Update TB profiler manually

tb-profiler update_db doesn't work directly in Docker recipes without specifying the version and commits, for reasons unclear, so the databases need to be updated manually.

Empty contaminated.fa and failed in preprocessing:mapToContamFa process

Hello, I found that tb-pipeline would be very useful to our lab. However, the pipeline terminated at the mapToContamFa step. A contam_dir was created in the work directory, and GCF_000001405.39_GRCh38.p13_genomic.fna.gz (~920 MB) was downloaded at the downloadContamGenomes step.

After this step, a contaminants.fa file was created and the contam_dir was removed. However, contaminants.fa is empty and the pipeline terminated.

I am not sure whether the pipeline failure was due to the empty contaminants.fa file. Is there any reason why the file would be empty? Is it possible to skip the contaminant-mapping step?

Also, I assumed the human reads were removed using Bowtie2 against hg19_1kgmaj, so why was GCF_000001405.39_GRCh38.p13_genomic.fna.gz downloaded again?

Thanks in advance for any reply.

Output json of parse_samtools_stats.py is in incorrect format

The JSON produced by parse_samtools_stats.py is not in the format expected by create_final_json.py. This causes the clockwork sub-workflow to exit prematurely after clockwork:alignToRef. In addition, an error message is incorrectly recorded in the error log and the final report JSON is incomplete.

Minos fails when trying to compare against an empty VCF file

Running Minos with an empty VCF file gives an error in the minos process on the command:

minos adjudicate --force --reads sample minos ref.fa A19U007635_1561353218R1_M04557_164.bcftools.vcf sample.cortex.vcf

In this case the BCFtools VCF is empty and the Cortex VCF has a low number of variants. I have not seen any other instances of this happening. I'll have a proper look and update accordingly.

Read pair counting bug in preprocessing:fastp

There is a bug in preprocessing:fastp where the read-pair count is checked as > 100k. The count is pulled from the fastp output JSON; however, this is the total number of reads rather than the number of pairs. Using the fqtools approach from preprocessing:countReads resolves the issue.
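A sketch of the distinction, assuming the standard layout of fastp's JSON report (halving the read total, or counting one mate file with fqtools as preprocessing:countReads does, both yield the pair count):

  import json

  with open("fastp.json") as fh:
      report = json.load(fh)

  # fastp reports the total number of reads, i.e. twice the number of pairs
  total_reads = report["summary"]["after_filtering"]["total_reads"]
  read_pairs = total_reads // 2

  if read_pairs < 100_000:
      raise SystemExit("error: fewer than 100k read pairs after cleaning")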

Unable to analyze single FASTQ reads

Hi.

I am trying to analyze single-end reads but I am getting this error message:

WARN: Input tuple does not match input set cardinality declared by process preprocessing:checkFqValidity -- offending value: [ERR11243647, /home/olawoyei/projects/def-guthriej/olawoyei/mtb/fastq/ERR11243647.fastq.gz, /project/6083771/olawoyei/work/39/36be7d9f46f13a9a5a5d62ea719385/version.json]

Does lodestone only work with paired FASTQ reads?
