
NCBI Prokaryotic Genome Annotation Pipeline

License: Other

Common Workflow Language 85.46% XSLT 1.08% Shell 1.31% Perl 0.61% Python 11.54%

pgap's Introduction

PGAP

NCBI Prokaryotic Genome Annotation Pipeline

The NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs and pseudogenes.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Pipeline was developed in 2001 and is regularly upgraded to improve structural and functional annotation quality (Li W, O'Neill KR, et al. 2021). Recent improvements include the use of curated protein profile hidden Markov models (HMMs) and curated complex domain architectures for functional annotation of proteins, and the annotation of Enzyme Commission numbers and Gene Ontology terms. After annotation, the completeness of the annotated gene set is estimated with CheckM.

The workflow provided here also offers the option to confirm or correct the organism associated with the genome assembly prior to starting the annotation, using the Average Nucleotide Identity tool.

Get started by watching this webinar!

Need to assemble the genome too? Use RAPT for producing an annotated genome starting from short reads

Instructions

To run the PGAP pipeline you will need Linux, or some compatible container technology, CWL (Common Workflow Language), and about 30GB of supplemental data. We provide instructions here for running under the CWL reference implementation, cwltool. Full instructions for installing, running, and interpreting the results may be found in our wiki.
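
Before downloading anything, the requirements listed above can be checked with a few lines of Python. This is a minimal sketch, not part of PGAP: the function names `free_gigabytes` and `preflight` are made up for illustration, and the 30 GB threshold is simply the figure quoted above.

```python
import platform
import shutil

REQUIRED_GB = 30  # approximate size of the supplemental data quoted above


def free_gigabytes(path="."):
    """Return the free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path=".", required_gb=REQUIRED_GB):
    """Collect human-readable problems; an empty list means the basic checks passed."""
    problems = []
    if platform.system() != "Linux":
        problems.append("not running on Linux; a compatible container setup is needed")
    if free_gigabytes(path) < required_gb:
        problems.append(f"less than {required_gb} GB free under {path!r}")
    return problems
```

Calling `preflight("/path/to/pgap")` (a hypothetical directory) returns a list of messages that can be printed before starting the download.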

References

NCBI

Expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.
Li W, O'Neill KR, Haft DH, DiCuccio M, Chetvernin V, Badretdin A, Coulouris G, Chitsaz F, Derbyshire MK, Durkin AS, Gonzales NR, Gwadz M, Lanczycki CJ, Song JS, Thanki N, Wang J, Yamashita RA, Yang M, Zheng C, Marchler-Bauer A, Thibaud-Nissen F.
Nucleic Acids Res. 2021 Jan 8;49(D1):D1020-D1028.

RefSeq: an update on prokaryotic genome annotation and curation.
Haft DH, DiCuccio M, Badretdin A, Brover V, Chetvernin V, O'Neill K, Li W, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu F, Marchler GH, Song JS, Thanki N, Yamashita RA, Zheng C, Thibaud-Nissen F, Geer LY, Marchler-Bauer A, Pruitt KD.
Nucleic Acids Res. 2018 Jan 4;46(D1):D851-D860.

NCBI prokaryotic genome annotation pipeline.
Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP, Zaslavsky L, Lomsadze A, Pruitt KD, Borodovsky M, Ostell J.
Nucleic Acids Res. 2016 Aug 19;44(14):6614-24. Epub 2016 Jun 24.

Using average nucleotide identity to improve taxonomic assignments in prokaryotic genomes at the NCBI.
Ciufo S, Kannan S, Sharma S, Badretdin A, Clark K, Turner S, Brover S, Schoch CL, Kimchi A, DiCuccio M.
Int J Syst Evol Microbiol. 2018 Jul;68(7):2386-2392.

GeneMarkS-2+

Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes
Lomsadze A, Gemayel K, Tang S, Borodovsky M.
Genome Research. 2018; 28(7):1079-1089.

CheckM

CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes
Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW.
Genome Research. 2015; 25(7):1043-1055.

TIGRFAMs

TIGRFAMs: a protein family resource for the functional identification of proteins.
Haft DH, Loftus BJ, Richardson DL, Yang F, Eisen JA, Paulsen IT, White O.
Nucleic Acids Res. 2001 Jan 1;29(1):41-3.

The TIGRFAMs database of protein families.
Haft DH, Selengut JD, White O.
Nucleic Acids Res. 2003 Jan 1;31(1):371-3.

TIGRFAMs and Genome Properties: tools for the assignment of molecular function and biological process in prokaryotic genomes.
Selengut JD, Haft DH, Davidsen T, Ganapathy A, Gwinn-Giglio M, Nelson WC, Richter AR, White O.
Nucleic Acids Res. 2007 Jan;35(Database issue):D260-4. Epub 2006 Dec 6.

TIGRFAMs and Genome Properties in 2013.
Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E.
Nucleic Acids Res. 2013 Jan;41(Database issue):D387-95. doi: 10.1093/nar/gks1234. Epub 2012 Nov 28.

LICENSING TERMS

NCBI PGAP CWL

The NCBI PGAP CWL and other code authored by NCBI is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as United States Government employees and thus cannot be copyrighted. This software is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using this software or data. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite NCBI in any work or product based on this material.

Third-party tools

The Docker image contains third-party tools distributed under the licensing terms of the respective license holders.

GeneMarkS-2+

GeneMarkS-2+ is distributed as part of PGAP with limited rights of use and redistribution from the Georgia Tech Research Corporation. See the full text of the license.

CheckM

GNU General Public License v3.0

Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights. See the full text of the license.

TIGRFAMs

The original TIGRFAMs database was a research project of the J. Craig Venter Institute (JCVI). TIGRFAMs, short for The Institute for Genomic Research's database of protein families, is a collection of manually curated protein families focusing primarily on prokaryotic sequences. It consists of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) terminology, Enzyme Commission (EC) numbers, gene symbols, protein family names, descriptive text, cross-references to related models in TIGRFAMs and other databases, and pointers to the literature. The work has been described in the articles listed in the References section above and use of the TIGRFAMs database must grant proper attribution by citing those four articles.

As of April 2018, rights were transferred to the National Center for Biotechnology Information (NCBI), National Library of Medicine, NIH, for the data to be made available for distribution under a Creative Commons Attribution-ShareAlike 4.0 license. Please see (https://creativecommons.org/licenses/by-sa/4.0/) for a brief summary of the license and (https://creativecommons.org/licenses/by-sa/4.0/legalcode) to see the full text.


pgap's Issues

Unique genes "not" working

I'm trying to run an analysis with 70% identity and 80% coverage on 4 strains. The problem is that some unique genes appear in the output as shared genes. What can I do?

Fail test annotation

I get this error when running test genome:
[2019-06-03 18:50:07] [workflow ] start
[2019-06-03 18:50:07] [workflow ] starting step passdata
[2019-06-03 18:50:07] [step passdata] start
[2019-06-03 18:50:07] [step passdata] completed success
[2019-06-03 18:50:07] [workflow ] starting step fastaval
[2019-06-03 18:50:07] [step fastaval] start
[2019-06-03 18:50:07] [step fastaval] completed success
[2019-06-03 18:50:07] [workflow ] starting step prepare_input_template
[2019-06-03 18:50:07] [step prepare_input_template] start
[2019-06-03 18:50:07] [workflow prepare_input_template] start
[2019-06-03 18:50:07] [workflow prepare_input_template] starting step yaml2json
[2019-06-03 18:50:07] [step yaml2json] start
[2019-06-03 18:50:07] [step yaml2json] completed success
[2019-06-03 18:50:07] [workflow prepare_input_template] starting step pgapx_yaml_ctl
[2019-06-03 18:50:07] [step pgapx_yaml_ctl] start
[2019-06-03 18:50:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_annotation
[2019-06-03 18:50:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_ltp
[2019-06-03 18:50:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/input_asn_type
[2019-06-03 18:50:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/taxid
[2019-06-03 18:50:08] [step pgapx_yaml_ctl] completed permanentFail
[2019-06-03 18:50:08] [workflow prepare_input_template] completed permanentFail
[2019-06-03 18:50:08] [step prepare_input_template] completed permanentFail
[2019-06-03 18:50:08] [workflow ] completed permanentFail
docker exited with rc = 1
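
In a log like the one above, the first `completed permanentFail` line usually names the innermost step that failed; the later ones are the enclosing workflows failing in turn. The following is a small sketch for pulling that first line out of a saved log. The helper name `first_failure` is hypothetical, and the regular expression only assumes the line shapes visible in the logs quoted on this page.

```python
import re

# Matches lines such as
#   [2019-06-03 18:50:08] [step pgapx_yaml_ctl] completed permanentFail
# with or without a log level (INFO/WARNING/ERROR) after the timestamp.
FAIL_RE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+(?:[A-Z]+\s+)?"
    r"\[(?P<kind>job|step|workflow)\s*(?P<name>[^\]]*)\]"
    r"\s+completed permanentFail"
)


def first_failure(log_lines):
    """Return (kind, name, timestamp) for the first permanentFail line, or None."""
    for line in log_lines:
        match = FAIL_RE.match(line.strip())
        if match:
            return match.group("kind"), match.group("name").strip(), match.group("time")
    return None
```

Applied to the log above, this would point at the `pgapx_yaml_ctl` step rather than at the top-level workflow.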

No annot-gb.ent file?

Is the latest version of the pipeline supposed to produce an "annot-gb.ent" file?

I tried running the pipeline with the test dataset and it looks like everything ran fine -- had a "docker exited with rc = 0" message on the terminal and the cwltool.log file ends with:

[2019-10-15 21:30:55] INFO [job ping_stop] completed success
[2019-10-15 21:30:55] INFO [step ping_stop] completed success
[2019-10-15 21:30:55] INFO [workflow standard_pgap] completed success
[2019-10-15 21:30:55] INFO [step standard_pgap] completed success
[2019-10-15 21:30:55] INFO [workflow ] completed success
{
    "gbk": {
        "location": "file:///pgap/output/annot.gbk",
        "basename": "annot.gbk",
        "class": "File",
        "checksum": "sha1$df434a18e57084dc7406cbcc6aa6889c7a20fd7c",
        "size": 1362616,
        "path": "/pgap/output/annot.gbk"
    },
    "gff": {
        "location": "file:///pgap/output/annot.gff",
        "basename": "annot.gff",
        "class": "File",
        "checksum": "sha1$a18bbdc2812f741517442e6e78a0987acb0c003c",
        "size": 246471,
        "path": "/pgap/output/annot.gff"
    },
    "input_fasta": {
        "class": "File",
        "location": "file:///pgap/output/ASM2732v1.annotation.nucleotide.1.fasta",
        "size": 588482,
        "basename": "ASM2732v1.annotation.nucleotide.1.fasta",
        "checksum": "sha1$f6129783cc8562db7bca3c87310d57d8dd07ce2c",
        "path": "/pgap/output/ASM2732v1.annotation.nucleotide.1.fasta"
    },
    "input_submol": {
        "class": "File",
        "location": "file:///pgap/output/submol.yaml",
        "size": 1678,
        "basename": "submol.yaml",
        "checksum": "sha1$7e3cb9f882a8c55b2856aae6fb877ce1c2bf8718",
        "path": "/pgap/output/submol.yaml"
    },
    "nucleotide_fasta": {
        "location": "file:///pgap/output/annot.fna",
        "basename": "annot.fna",
        "class": "File",
        "checksum": "sha1$52dd979e0d9771cb8d9fbf102d2b8f0fcdd8a91d",
        "size": 588444,
        "path": "/pgap/output/annot.fna"
    },
    "protein_fasta": {
        "location": "file:///pgap/output/annot.faa",
        "basename": "annot.faa",
        "class": "File",
        "checksum": "sha1$f8d7f04c681d19c39ec334c83701f0e6c60f4699",
        "size": 222998,
        "path": "/pgap/output/annot.faa"
    },
    "sqn": {
        "location": "file:///pgap/output/annot.sqn",
        "basename": "annot.sqn",
        "class": "File",
        "checksum": "sha1$589971c275c173a02588f0e7236439a94258bbb5",
        "size": 3295930,
        "path": "/pgap/output/annot.sqn"
    }
}
[2019-10-15 21:30:55] INFO Final process status is success

As per the Output Files wiki, however, there should be an "annot-gb.ent" output file, but no such file is produced after running the pipeline. So I don't think this is an error...unless I am missing something in the internal portion of the file...

Any help in understanding is greatly appreciated!

Thank you,
Conrad
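
The JSON summary printed at the end of the log lists each output file with its size and SHA-1 checksum, so the files actually present on disk can be compared against it. This is a sketch under assumptions: `verify_outputs` is a made-up helper, and because the recorded paths are container paths (`/pgap/output/...`), files are looked up by `basename` in whatever host directory the caller passes in.

```python
import hashlib
import os


def verify_outputs(summary, output_dir):
    """Compare files in `output_dir` with the sizes and sha1 values in `summary`.

    `summary` is the parsed JSON object from the end of the log; null entries
    are skipped. Returns {output_name: status}.
    """
    report = {}
    for name, entry in summary.items():
        if not entry:
            continue
        path = os.path.join(output_dir, entry["basename"])
        if not os.path.isfile(path):
            report[name] = "missing"
            continue
        if os.path.getsize(path) != entry["size"]:
            report[name] = "size mismatch"
            continue
        digest = hashlib.sha1()
        with open(path, "rb") as handle:
            # Read in 1 MiB blocks so large outputs are not loaded at once
            for block in iter(lambda: handle.read(1 << 20), b""):
                digest.update(block)
        expected = entry["checksum"].split("$", 1)[1]  # drop the "sha1$" prefix
        report[name] = "ok" if digest.hexdigest() == expected else "checksum mismatch"
    return report
```

The summary itself can be obtained by saving the JSON block to a file and loading it with `json.load`. A file that the wiki mentions but the summary does not list, such as annot-gb.ent, would simply not appear in the report.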

Any success under MacOS?

Before I waste any additional time trying what is known not to work ... a quick try under MacOS 10.13.6 with Anaconda fails before anything is actually installed (see below). Is it even worth pursuing?

guy@thermopylae/usr/local/ncbi: sudo ./pgap.py --update
Password:
Updating PGAP to version 2019-02-11.build3477 (previous version was None)
Downloading (as needed) PGAP Docker image version 2019-02-11.build3477
Traceback (most recent call last):
  File "./pgap.py", line 275, in <module>
    main()
  File "./pgap.py", line 261, in main
    version = setup(args.update, args.local_runner)
  File "./pgap.py", line 174, in setup
    install_docker(latest)
  File "./pgap.py", line 88, in install_docker
    subprocess.check_call([docker, 'pull', get_docker_image(version)])
  File "/Users/guy/anaconda2/lib/python2.7/subprocess.py", line 181, in check_call
    retcode = call(*popenargs, **kwargs)
  File "/Users/guy/anaconda2/lib/python2.7/subprocess.py", line 168, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/Users/guy/anaconda2/lib/python2.7/subprocess.py", line 390, in __init__
    errread, errwrite)
  File "/Users/guy/anaconda2/lib/python2.7/subprocess.py", line 1024, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
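
An `OSError: [Errno 2]` raised from inside `subprocess` generally means the program being launched could not be found, which here would be the `docker` executable. One possible cause, stated as an assumption, is that Docker is not installed or that `sudo` runs with a different `PATH`. A hedged sketch of a wrapper that checks for the executable first (the name `run_tool` is hypothetical and is not how pgap.py is written):

```python
import shutil
import subprocess


def run_tool(argv):
    """Run argv, but fail with a readable message if the executable is not on PATH."""
    exe = shutil.which(argv[0])
    if exe is None:
        raise RuntimeError(
            f"{argv[0]!r} was not found on PATH; install it or fix PATH first"
        )
    return subprocess.check_call([exe] + list(argv[1:]))
```

Running `run_tool(["docker", "--version"])` from the same shell, and again under `sudo`, would show whether Docker is visible in both environments.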

Failed to run the script Error: (302.26)

Dear All,
I have a problem running the script: sudo ./pgap.py -r -o ~/Desktop/P-mult-10_results-test3 P-mult-10-test-1-input.yaml
An error occurred in the step called [job Prepare_Unannotated_Sequences_asnvalidate].
We also see the following errors:
Error: (302.26) SSOCK#13000[5]@130.14.29.110:443: [SOCK::Connect] Failed pending connect(): Timeout[10.000000] Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=TaxService4&address=7deed985b88c(172.17.0.2)&platform=x86_64-unknown-linux-gnu] Failed to connect: Timeout[10.000000]

At the end of the run it creates 3 files: .fasta file, .yaml file and log file. Please help to solve the issue or run the script correctly. Thank you!

P.S. Log file is attached.
cwltool.log
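
The timeouts above are reported against port 443, so a first diagnostic is to test whether a plain TCP connection to that port can be opened from the same machine or container. This is a minimal sketch using only the standard library; `can_connect` is a made-up helper and it says nothing about what NCBI's client library does internally.

```python
import socket


def can_connect(host, port, timeout=10.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures
        return False
```

For example, `can_connect("www.ncbi.nlm.nih.gov", 443)` uses the host name from the URL in the error message. A `False` result from inside the container but `True` on the host would suggest a container networking or firewall problem rather than a PGAP one.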

Docker permissions after update

Installed pgap using the instructions on the wiki
on ubuntu 16.04 with Docker installed
Ran on local genome - brilliant
Came back a few days later to run and pgap said not up to date and updated
Now into Docker permissions quagmire (added user to docker group with no effect)
running pgap.py exits from Docker with rc=1 (permission denied)
running sudo pgap.py exits with rc=126
(fails to read pgap_input.yaml - no fasta. The pgap_input.yaml file gets corrupted with all lines but 1 deleted)
Would like to remove Docker and pgap and start again but looks like nightmare
Looks like have to back up data and re-install Ubuntu!!!!
Thanks for any help

I can't download reference data

Hi,
When I run ./pgap.py --update, I need to download the reference data (almost 26 GB), but the download never completes. I don't know whether the problem is with the website or with my network connection (I think it may be due to the Great Firewall in China). Is there another way to download the reference data? Thank you!
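
When a large download keeps breaking off, one general approach is to resume it with an HTTP `Range` request instead of starting over. The sketch below is not part of pgap.py: the helper names are hypothetical, the URL must be supplied by the caller, and it assumes the server supports range requests (a server that does not will answer 200 and resend the whole file).

```python
import os
import urllib.request


def build_resume_request(url, dest_path):
    """Return a Request asking only for the bytes that are not yet on disk."""
    have = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if have:
        request.add_header("Range", f"bytes={have}-")
    return request


def resume_download(url, dest_path, chunk=1 << 20):
    """Append the remaining bytes of `url` to `dest_path`."""
    request = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the Range header was honoured; 200 means a full resend
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                block = response.read(chunk)
                if not block:
                    break
                out.write(block)
```

Calling `resume_download` repeatedly after each interruption continues from the current file size. If the file is already complete, the server will typically answer with HTTP 416, which `urlopen` raises as an `HTTPError`.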

running time comparing with Prokka

Hello,

I am wondering why the PGAP pipeline needs so much time, about 2.5 to 3 hours to complete one bacterial genome, compared with Prokka (minutes for hundreds of genomes). Which steps in PGAP take so long?

Thanks!

Fastaval error

Hello, I'm having this error when trying to annotate one of my proprietary genomes. Do you have any idea how to fix it or any suggestions?
Thanks!

[2019-08-28 23:33:48] INFO /usr/bin/cwltool 1.0.20190815141648
[2019-08-28 23:33:48] INFO Resolved 'pgap.cwl' to 'file:///pgap/pgap.cwl'
[2019-08-28 23:35:24] INFO [workflow ] start
[2019-08-28 23:35:24] INFO [workflow ] starting step fastaval
[2019-08-28 23:35:24] INFO [step fastaval] start
[2019-08-28 23:35:24] INFO [job fastaval] /tmp/2qpjr8im$ fastaval.sh
-check_internal_ns
-check_min_seqlen
200
-in
/tmp/tmpqswdwhd2/stgadd6f7f3-7028-4733-b8b2-df74b381cfa2/XF73.fna
-out
fastaval.xml

POSITIONAL=()
[[ 7 -gt 0 ]]
key=-check_internal_ns
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 6 -gt 0 ]]
key=-check_min_seqlen
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 5 -gt 0 ]]
key=200
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 4 -gt 0 ]]
key=-in
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 3 -gt 0 ]]
key=/tmp/tmpqswdwhd2/stgadd6f7f3-7028-4733-b8b2-df74b381cfa2/XF73.fna
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 2 -gt 0 ]]
key=-out
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 1 -gt 0 ]]
key=fastaval.xml
ignore_all_errors=false
case "$key" in
POSITIONAL+=("$key")
shift
[[ 0 -gt 0 ]]
set -- -check_internal_ns -check_min_seqlen 200 -in /tmp/tmpqswdwhd2/stgadd6f7f3-7028-4733-b8b2-df74b381cfa2/XF73.fna -out fastaval.xml
set +e
fastaval -check_internal_ns -check_min_seqlen 200 -in /tmp/tmpqswdwhd2/stgadd6f7f3-7028-4733-b8b2-df74b381cfa2/XF73.fna -out fastaval.xml
result=255
set -e
false
exit 255
[2019-08-28 23:35:24] WARNING [job fastaval] completed permanentFail
[2019-08-28 23:35:24] WARNING [step fastaval] completed permanentFail
[2019-08-28 23:35:24] INFO [workflow ] completed permanentFail
[2019-08-28 23:35:24] WARNING Final process status is permanentFail
{
"gbk": null,
"gff": null,
"input_fasta": {
"class": "File",
"location": "file:///pgap/output/XF73.fna",
"size": 2629651,
"basename": "XF73.fna",
"checksum": "sha1$dac78d9a49bdf5d4c30f298993809f44e0aa1976",
"path": "/pgap/output/XF73.fna"
},
"input_submol": {
"class": "File",
"location": "file:///pgap/output/submol.yaml",
"size": 646,
"basename": "submol.yaml",
"checksum": "sha1$f8a018f47657f575632f81e5ab0786e3087492a5",
"path": "/pgap/output/submol.yaml"
},
"nucleotide_fasta": null,
"protein_fasta": null,
"sqn": null
}
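
The log shows fastaval being called with `-check_internal_ns` and `-check_min_seqlen 200` and exiting with code 255, but not which sequence it objected to. As a rough way to look for likely culprits, the sketch below flags sequences shorter than 200 bases and sequences with N bases away from the ends. It is an assumption that these are the conditions behind the two flags; the actual rules fastaval applies are not shown here, and `read_fasta` and `precheck` are made-up helpers.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def precheck(lines, min_len=200):
    """Return (sequence_id, problem) tuples for suspicious sequences."""
    problems = []
    for header, seq in read_fasta(lines):
        fields = header.split()
        seq_id = fields[0] if fields else "(empty header)"
        if len(seq) < min_len:
            problems.append((seq_id, f"shorter than {min_len} bases"))
        # Ignore N runs at either end; only Ns inside the sequence are reported
        if "N" in seq.upper().strip("N"):
            problems.append((seq_id, "contains internal N bases"))
    return problems
```

With a file on disk, `precheck(open("XF73.fna"))` would list the contigs worth inspecting by hand.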

[BUG] zero tRNAs

Hello,
For every species I analysed with version 2019-11-08.build4137, including the test genome MG37, the annot.gbk file reports zero tRNAs (see the MG37 results below). Moreover, the version is 55.1; I think it should maybe be 5.1.

So I do not get the number of tRNAs and the number of the version seems to be wrong in annot.gbk

I used Singularity version 3.5.0-rc.2 and I ran PGAP the same way it is described in issue: Feature request: support Singularity #14 because I can't use docker.
I do not know where `cwltool.log` is.

##Genome-Annotation-Data-START##
Annotation Provider :: SkyNet consortium
Annotation Date :: 11/20/2019 18:06:39
Annotation Pipeline :: NCBI Prokaryotic Genome
Annotation Pipeline (PGAP)
Annotation Method :: Best-placed reference protein
set; GeneMarkS-2+
Annotation Software revision :: 55.11111123
Features Annotated :: Gene; CDS; rRNA; tRNA; ncRNA;
repeat_region
Genes (total) :: 528
CDSs (total) :: 522
Genes (coding) :: 509
CDSs (with protein) :: 509
Genes (RNA) :: 6
rRNAs :: 1, 1, 1 (5S, 16S, 23S)
complete rRNAs :: 1, 1, 1 (5S, 16S, 23S)
tRNAs :: 0
ncRNAs :: 3
Pseudo Genes (total) :: 13
CDSs (without protein) :: 13
Pseudo Genes (ambiguous residues) :: 0 of 13
Pseudo Genes (frameshifted) :: 9 of 13
Pseudo Genes (incomplete) :: 5 of 13
Pseudo Genes (internal stop) :: 2 of 13
Pseudo Genes (multiple problems) :: 3 of 13
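
To compare counts such as `tRNAs` across many annotated genomes, the `key :: value` block above can be turned into a dictionary. This is a sketch that assumes the block has already been cut out of annot.gbk as plain lines; `parse_annotation_block` is a made-up name, and wrapped lines (those without `::`) are treated as continuations of the previous value.

```python
def parse_annotation_block(lines):
    """Parse 'key :: value' lines; lines without '::' continue the previous value."""
    data, last_key = {}, None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("##"):
            continue  # skip blanks and the START/END marker lines
        if "::" in line:
            key, value = (part.strip() for part in line.split("::", 1))
            data[key] = value
            last_key = key
        elif last_key is not None:
            data[last_key] = (data[last_key] + " " + line).strip()
    return data
```

For the block above, `parse_annotation_block(block_lines)["tRNAs"]` would be the string `"0"`, which makes it easy to flag every genome where the count is zero.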

JSON object must be str, not 'bytes'

Having installed pgap a couple of times without issues, I am now installing it on a more powerful dual Xeon machine running Ubuntu 16.04 with Docker installed, following the Quick Start:

wget https://github.com/ncbi/pgap/raw/prod/scripts/pgap.py # downloads pgap.py
chmod +x pgap.py # makes it executable
./pgap.py --update
the JSON object must be str, not 'bytes' # ??????

default python is 2.7.12 though 3.5.2 is installed but

python3 ./pgap.py --update # same error

and in conda base we have python 3.7.4 but get the same error

It seems like Python might be delivering some input as bytes rather than a string, but I can't find a way to set text=True for Python.

Any help gratefully received
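
The message "the JSON object must be str, not 'bytes'" is what `json.loads` raises on Python 3.5 when it is handed bytes, for example the raw body of an HTTP response; from Python 3.6 onward `json.loads` accepts bytes directly. Whether that is exactly what happens inside pgap.py here is an assumption. A small sketch of the usual workaround, with the hypothetical helper name `loads_compat`:

```python
import json


def loads_compat(payload, encoding="utf-8"):
    """json.loads that also accepts bytes on Python versions older than 3.6."""
    if isinstance(payload, (bytes, bytearray)):
        payload = payload.decode(encoding)
    return json.loads(payload)
```

Checking which interpreter actually runs the script (for instance with `python3 --version`) is a useful companion step, since the system `python3` in this report is 3.5.2.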

annotation running time

Hey,

How long does it generally take to annotate an E. coli genome?
It has been more than 24 hours and only a single CPU is occupied by the run.
I have an 88-CPU server with 512 GB RAM running CentOS 7. PGAP version 2019-08-01.build3919

Thanks
Mayank

error with version 2019-11-08.build4137

Hello,

I am still using the version 2019-11-08.build4137
It was working well but now I get error messages including for MG37.
Should I use a more recent version ?
If not necessary, please find below the error that I got:

Critical: (307.10) [NCBI-MESSAGE] [DISPD#ID2@sutils201] Please make sure your firewall does not block ports 4444-4544 bound for 130.14.29.112
Error: (302.26) SOCK#3000[5]@130.14.29.112:443: [SOCK::Connect] Failed pending connect(): Timeout[5.000000]
Error: (308.6) [ID2] Stateful relay connection failure (Closed) usually indicates possible firewall configuration problem(s); please consult https://www.ncbi.nlm.nih.gov/EB/ToolBox/NETWORK/dispatcher.html#Firewalling
Error: (301.5) [CONN_Open(ID2/SOCK; 130.14.29.112:443)] Failed to open connection: Closed
Error: (302.26) SSOCK#4000[5]@130.14.29.110:443: [SOCK::Connect] Failed pending connect(): Timeout[5.000000]
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=r692.pvt.bridges.psc.edu&platform=x86_64-unknown-linux-gnu] Failed to connect: Timeout[5.000000]
Error: (304.5) [ID2] Unable to create connection with network dispatcher at "https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=r692.pvt.bridges.psc.edu&platform=x86_64-unknown-linux-gnu": Success
Error: (308.5) [ID2] Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf(): NULL connector: Unknown
Error: (302.26) SSOCK#5000[5]@130.14.29.110:443: [SOCK::Connect] Failed pending connect(): Timeout[5.000000]
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=r692.pvt.bridges.psc.edu&platform=x86_64-unknown-linux-gnu] Failed to connect: Timeout[5.000000]
Error: (304.5) [ID2] Unable to create connection with network dispatcher at "https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=r692.pvt.bridges.psc.edu&platform=x86_64-unknown-linux-gnu": Success
Error: (308.5) [ID2] Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf(): NULL connector: Unknown
Critical: (307.10) [NCBI-MESSAGE] [DISPD#ID2@sutils201] Please make sure your firewall does not block ports 4444-4544 bound for 130.14.29.112
Error: (302.26) SOCK#8000[5]@130.14.29.112:443: [SOCK::Connect] Failed pending connect(): Timeout[5.000000]
Error: (308.6) [ID2] Stateful relay connection failure (Closed) usually indicates possible firewall configuration problem(s); please consult https://www.ncbi.nlm.nih.gov/EB/ToolBox/NETWORK/dispatcher.html#Firewalling
Error: (301.5) [CONN_Open(ID2/SOCK; 130.14.29.112:443)] Failed to open connection: Closed
Error: (302.26) SSOCK#10000[5]@130.14.29.110:443: [SOCK::Connect] Failed pending connect(): Timeout[5.000000]
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=r692.pvt.bridges.psc.edu&platform=x86_64-unknown-linux-gnu] Failed to connect: Timeout[5.000000]
Error: (301.5) [CONN_Open(ID2; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi)] Failed to open connection: Timeout[5.000000]
Error: (CLoaderException::eConnectionFailed) cannot open connection: ID2 -> https://www.ncbi.nlm.nih.gov/Service/dispd.cgi
Error: (106.16) Application's execution failed (CLoaderException::eNoConnection) cannot open initial connection
[2020-01-02 15:04:27] INFO [job Resolve_Annotation_Conflicts] Max memory used: 31MiB
[2020-01-02 15:04:28] ERROR [job Resolve_Annotation_Conflicts] Job error:
("Error collecting output for parameter 'annotation':\nprogs/bacterial_resolve_conflicts.cwl:69:13: Did not find output file with glob pattern: '['accept.asn']'", {})
[2020-01-02 15:04:28] WARNING [job Resolve_Annotation_Conflicts] completed permanentFail
[2020-01-02 15:04:28] ERROR [step Resolve_Annotation_Conflicts] Output is missing expected field file:///pgap/bacterial_annot/wf_bacterial_annot_pass1.cwl#Resolve_Annotation_Conflicts/protein_aligns
[2020-01-02 15:04:28] ERROR [step Resolve_Annotation_Conflicts] Output is missing expected field file:///pgap/bacterial_annot/wf_bacterial_annot_pass1.cwl#Resolve_Annotation_Conflicts/annotation
[2020-01-02 15:04:28] WARNING [step Resolve_Annotation_Conflicts] completed permanentFail
[2020-01-02 15:04:28] INFO [workflow bacterial_annot] completed permanentFail
[2020-01-02 15:04:28] WARNING [step bacterial_annot] completed permanentFail
[2020-01-02 15:04:28] INFO [workflow standard_pgap] completed permanentFail
[2020-01-02 15:04:28] WARNING [step standard_pgap] completed permanentFail
[2020-01-02 15:04:28] INFO [workflow ] completed permanentFail
[2020-01-02 15:04:28] WARNING Final process status is permanentFail
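
The log above is long, but the useful part for a network administrator is short: which remote addresses and ports timed out. The sketch below collects them from the `SOCK`/`SSOCK` lines. The regular expression is an assumption based only on the line shapes shown in this log, and `timed_out_endpoints` is a made-up helper.

```python
import re
from collections import Counter

# Matches "...@130.14.29.112:443: ... Timeout[5.000000]" style lines
ENDPOINT_RE = re.compile(r"@(\d{1,3}(?:\.\d{1,3}){3}):(\d+).*Timeout")


def timed_out_endpoints(log_lines):
    """Count (ip, port) pairs that appear in timeout lines of the log."""
    hits = Counter()
    for line in log_lines:
        match = ENDPOINT_RE.search(line)
        if match:
            hits[(match.group(1), int(match.group(2)))] += 1
    return hits
```

The resulting pairs can then be checked against the cluster's firewall rules, alongside the port range 4444-4544 that the `Critical` message asks to keep open.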

Processing more than one genome

I want to process more than one genome at a time. Currently the pipeline writes the annot.* files into the current working directory by default, and that directory must also contain all of the database files, i.e. pgap-2018-11-07.build3190. I tried writing out the full path of each database in the YAML file so that I could use a working directory separate from the pipeline files, but the run failed because the pipeline couldn't find the first database. I've looked in the help documentation for the .cwl files, but the available flags only cover the many databases that the pipeline requires.

As of right now, my only option to run more than one sample is to run one genome out of a copy of the build dir. This is 32G, so it could get a bit crazy if I want to run 100 genomes through the pipeline at once. I'm wondering if there is something I've missed?
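
Other reports on this page invoke the wrapper as `./pgap.py -r -o <output_dir> <input.yaml>`, which gives each run its own output directory. Assuming a pgap.py version that accepts those options (the 2018-11-07 build discussed here may not), a batch of genomes could be driven by a loop like the following sketch. `build_commands` and `run_all` are hypothetical helpers, and the directory layout of one folder per genome is an assumption.

```python
import os
import subprocess


def build_commands(input_yamls, results_root, pgap="./pgap.py"):
    """Build one pgap.py command per input YAML, each with its own output directory."""
    commands = []
    for yaml_path in input_yamls:
        # Name the output directory after the folder that holds the input YAML
        sample = os.path.basename(os.path.dirname(os.path.abspath(yaml_path)))
        outdir = os.path.join(results_root, sample)
        commands.append([pgap, "-r", "-o", outdir, yaml_path])
    return commands


def run_all(commands):
    """Run the commands one after another and return their exit codes."""
    return [subprocess.call(command) for command in commands]
```

Because every run shares the same supplemental data and only the output directory changes, no copy of the build directory is needed per genome in this arrangement.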

Quick Start instructions fail due to 'Missing required input parameter'

Version 2019-08-01.build3919

Original command: scripts/pgap.py --debug -r -o mg37_results MG37/input.yaml

Docker command: /usr/bin/docker run -i --user 1018:1019 --volume /home/justin.payne/pgap/input-2019-08-01.build3919:/pgap/input:ro,z --volume /home/justin.payne/pgap/MG37:/pgap/user_input:z --volume /home/justin.payne/pgap/MG37/pgap_input.yaml:/pgap/user_input/pgap_input.yaml:ro,z --volume /home/justin.payne/pgap/mg37_results.2:/pgap/output:rw,z --volume /home/justin.payne/pgap/mg37_results.2/debug/log:/log/srv:z ncbi/pgap:2019-08-01.build3919 cwltool --timestamps --disable-color --outdir /pgap/output --tmpdir-prefix /pgap/output/debug/tmpdir/ --leave-tmpdir --tmp-outdir-prefix /pgap/output/debug/tmp-outdir/ --copy-outputs pgap.cwl /pgap/user_input/pgap_input.yaml

--- Start YAML Input ---
submit_block_template:
  class: File
  location: MG37/ASM2732v1.1.template
fasta:
  class: File
  location: MG37/ASM2732v1.annotation.nucleotide.1.fa
taxid: 243273
gc_assm_name: MG37
completeness: complete
supplemental_data: { class: Directory, location: /pgap/input }
report_usage: true
--- End YAML Input ---
[2019-08-13 22:07:12] INFO /usr/bin/cwltool 1.0.20190621234233
[2019-08-13 22:07:12] INFO Resolved 'pgap.cwl' to 'file:///pgap/pgap.cwl'
[2019-08-13 22:07:42] ERROR Workflow error, try again with --debug for more information:
Invalid job input record:
pgap.cwl:26:3: Missing required input parameter 'submol'

Quick Start instructions fail due to 'report_usage'

When running the Quickstart commands:

$ chmod +x pgap.py
$ ./pgap.py --update # required files are downloaded and extracted
$ ./pgap.py test_genomes/MG37/input.yaml # watch the progress reports and wait for some time.

I get this error:

/usr/bin/cwltool 1.0.20181217162649
Resolved 'pgap.cwl' to 'file:///pgap/pgap.cwl'
Workflow error, try again with --debug for more information:
Invalid job input record:
pgap.cwl:22:3: Missing required input parameter 'report_usage'
Traceback (most recent call last): ...

It seems to work when I include the additional parameter '-n' to set report_usage to "False"

./pgap.py test_genomes/MG37/input.yaml -n

So I thought you might want to update the Quick Start command or set report_usage to False by default, so that future users don't run into the same problem.

Thank you for making this pipeline available!!!

Error when running on custom data

I am getting an error when running on custom data. I've made the two yaml files as outlined in the Quick Start page of the Wiki. It runs fine on the test data but getting this error on my own genome:

[2019-06-25 15:41:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_annotation
[2019-06-25 15:41:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_ltp
[2019-06-25 15:41:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/input_asn_type
[2019-06-25 15:41:08] [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/taxid

I've double checked my YAML files and the formatting seems fine so not sure what the issue is.

Here is my submol.yaml file:

topology: circular
organism:
    genus_species: 'some_genus some_species'
    strain: 'some_strain'
contact_info:
    last_name: 'my_last_name'
    first_name: 'my_first'
    email: '[email protected]'
    organization: 'my_company'
    department: 'dept'
    street: 'a street'
    city: 'Berkeley'
    state: 'CA'
    postal_code: 'zip'
    country: 'USA'
    authors:
        - author:  
            last_name: 'my_last_name'    
            first_name: 'my_first_name'

I didn't include the other fields because it said they were optional, but maybe this is the error?

The command I'm running then is

./pgap.py -n -o my_results folder/input.yaml

What are the major differences between 2019-02-11 and 2019-08-01?

Hi,

I would like to provide my experience on 2019-02-11 and 2019-08-01.

On my desktop with 8G RAM and one Intel(R) Core i5-3470T (4 processors), I was able to run
2019-02-11 successfully.

However, with 2019-08-01 the same test case got stuck at a step near completion for many hours, whereas it took about 1.5 hours with 2019-02-11. The computer also stopped responding and I had to reboot it. It seems that step consumed all the resources.

So my question is: what are the major changes in terms of memory and CPU usage? Would it be possible to restore the 2019-02-11 settings?

Thanks!

Cannot run PGAP with SElinux enabled

I tried to run ./pgap --update and got an error: urllib.error.HTTPError: HTTP Error 403: Forbidden

It was caused by not being able to access https://s3.amazonaws.com/pgap/input-2019-02-11.build3477.tgz, which returned "AccessDenied: Access Denied" (request/host ID: 5EA3714DAAA8A19BDUHYHrIjJGmuwWTapQA5YS1gsPPSBs81+EwRqg6F7bQMu/UTwi6XHga+G7h1OoYHwZv3USdhDCo=)

What went wrong?

#Edit: I tried it from my station in Switzerland as well as via a US-VPN.

Permanent fail from step 'cluster_blastp_wnode'

I closed the issue by mistake :) ...

Yesterday I tried the test genome with a freshly downloaded latest version of PGAP
on Mac (OS version 10.14.5; 2 different types, 32/64 GB memory; 8/20 threads, 200/300 GB free on the internal SSD),
but the process stopped with the following errors:
[2019-07-15 05:16:37] [step cluster_blastp_wnode] start
[2019-07-15 05:17:42] [step cluster_blastp_wnode] completed permanentFail
[2019-07-15 05:17:42] [workflow cluster_and_qdump] completed permanentFail
[2019-07-15 05:17:42] [step cluster_and_qdump] completed permanentFail
[2019-07-15 05:17:42] [workflow Find_Naming_Protein_Hits_I] completed permanentFail
[2019-07-15 05:17:42] [step Find_Naming_Protein_Hits_I] completed permanentFail
[2019-07-15 05:17:42] [workflow bacterial_annot_2] completed permanentFail
[2019-07-15 05:17:42] [step bacterial_annot_2] completed permanentFail
[2019-07-15 05:17:42] [workflow standard_pgap] completed permanentFail
[2019-07-15 05:17:42] [step standard_pgap] completed permanentFail
[2019-07-15 05:17:42] [workflow ] completed permanentFail
docker exited with rc = 1

What did I do to cause this?

Thanks for the help.

Here is the tail of the log file:
internal_thread=monitor&cpu_count=10&normalized_load=119&total_ram=2094985216&pct_ram_used=90&mem_total=1904762880&mem_peak=1904762880&mem_self=1904762880&ncbi_app_version=0.0.0&ncbi_app_sc_version=22&ncbi_app_vcs_revision=586246
68548/001/0047/RE BB760BC4D2C0C361 0144/0038 2019-07-15T05:17:33.869736 f0a199b4f08d UNK_CLIENT UNK_SESSION cluster_blastp_wnode request-stop 200 0.015216112 0 0
68548/001/0048/RB BB760BC4D2C0C361 0145/0039 2019-07-15T05:17:37.067589 f0a199b4f08d UNK_CLIENT UNK_SESSION cluster_blastp_wnode request-start internal_thread=monitor&cpu_count=10&normalized_load=118&total_ram=2094985216&pct_ram_used=92&mem_total=1935921152&mem_peak=1935921152&mem_self=1935921152&ncbi_app_version=0.0.0&ncbi_app_sc_version=22&ncbi_app_vcs_revision=586246
68548/001/0048/RE BB760BC4D2C0C361 0146/0040 2019-07-15T05:17:37.071790 f0a199b4f08d UNK_CLIENT UNK_SESSION cluster_blastp_wnode request-stop 200 0.008281946 0 0
[2019-07-15 05:17:42] [job cluster_blastp_wnode] Max memory used: 174MiB
[2019-07-15 05:17:42] [job cluster_blastp_wnode] completed permanentFail
[2019-07-15 05:17:42] [step cluster_blastp_wnode] completed permanentFail
[2019-07-15 05:17:42] [workflow cluster_and_qdump] completed permanentFail
[2019-07-15 05:17:42] [step cluster_and_qdump] completed permanentFail
[2019-07-15 05:17:42] [workflow Find_Naming_Protein_Hits_I] completed permanentFail
[2019-07-15 05:17:42] [step Find_Naming_Protein_Hits_I] completed permanentFail
[2019-07-15 05:17:42] [workflow bacterial_annot_2] completed permanentFail
[2019-07-15 05:17:42] [step bacterial_annot_2] completed permanentFail
[2019-07-15 05:17:42] [workflow standard_pgap] completed permanentFail
[2019-07-15 05:17:42] [step standard_pgap] completed permanentFail
[2019-07-15 05:17:42] [workflow ] completed permanentFail
[2019-07-15 05:17:42] Final process status is permanentFail
{
"gbent": null,
"gbk": null,
"gff": null,
"input_fasta": {
"class": "File",
"location": "file:///pgap/output/ASM2732v1.annotation.nucleotide.1.fasta",
"size": 588482,
"basename": "ASM2732v1.annotation.nucleotide.1.fasta",
"checksum": "sha1$f6129783cc8562db7bca3c87310d57d8dd07ce2c",
"path": "/pgap/output/ASM2732v1.annotation.nucleotide.1.fasta"
},
"input_submol": {
"class": "File",
"location": "file:///pgap/output/submol.yaml",
"size": 1529,
"basename": "submol.yaml",
"checksum": "sha1$f9eb1bbbbb1115b94b22b7148647adc3b6107c9f",
"path": "/pgap/output/submol.yaml"
},
"nucleotide_fasta": null,
"protein_fasta": null,
"sqn": null
}
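
The monitor entries in the log tail above report the resources visible to the job. As a rough aid for reading them, here is a minimal Python sketch (not part of PGAP; the function name `container_ram_gib` is made up for illustration) that decodes the `total_ram` field from such an entry. It assumes the value is reported in bytes, which the log itself does not state.

```python
from urllib.parse import parse_qs

def container_ram_gib(monitor_fields: str) -> float:
    """Return the total_ram field of a monitor entry, converted to GiB.

    Assumption: total_ram is given in bytes.
    """
    fields = parse_qs(monitor_fields)
    return int(fields["total_ram"][0]) / 2**30

# Fields copied from the log tail above.
line = ("internal_thread=monitor&cpu_count=10&normalized_load=119"
        "&total_ram=2094985216&pct_ram_used=90")
print(f"{container_ram_gib(line):.2f} GiB")
```

If the bytes assumption holds, this is roughly 2 GiB, far less than the 32/64 GB of the host machines, with 90% already in use. That would be consistent with a memory limit on the Docker side rather than with the physical RAM, but this is only a guess from the log.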

Any parameters or ways to speed up annotation process?

Are there any multi-threading options or other parameters to make the process faster? I couldn't find anything in the README for this.

WARNING [step vecscreen] completed permanentFail when running GCA_000166555

Hello,

I wanted to inquire about some issues I have been having when using PGAP on fasta files containing multiple >headers for the same organism. I have tried running the pipeline using both my own input data and the sample file GCA_000166555 from the test_genome folder. In both cases, the analysis always fails during the same step with the following error message:

[2019-08-19 18:12:48] INFO Could not collect memory usage, job ended before monitoring began.
[2019-08-19 18:12:48] WARNING [job screen_evaluate] completed permanentFail
[2019-08-19 18:12:48] WARNING [step screen_evaluate] completed permanentFail
[2019-08-19 18:12:48] INFO [workflow vecscreen] completed permanentFail
[2019-08-19 18:12:48] WARNING [step vecscreen] completed permanentFail
[2019-08-19 18:12:48] INFO [workflow ] completed permanentFail

I also tried running the analysis using the test files from MG37 and the pipeline seemed to run without issues in that case.

Any ideas how I can solve this issue, or what I might be doing incorrectly? I can provide the complete cwltool.log file as well as the input files used if needed.

Thanks in advance!

[BUG] Failure during Prepare_Unannotated_Sequences_asnvalidate_evaluate step

Describe the bug
PGAP fails at the Prepare_Unannotated_Sequences_asnvalidate_evaluate step with the error <message severity="ERROR" seq-id="lcl|Contig_01" code="SEQ_DESCR_StrainWithEnvironSample">Strain should not be present in an environmental sample</message>.

To Reproduce
My input files look like the following; a different input FASTA file should produce the same error.
metadata.yaml

topology: 'linear'
locus_tag_prefix: 'AK208913'
organism:
    genus_species: 'Alistipes sp. CAG:29'
    strain: AK208913
contact_info:
    <...>

Note that the genus_species value is autogenerated by an upstream script; it is the best-matching genome in NCBI's taxonomy.

input.yaml

report_usage: true
fasta:
    class: File
    location: /pgap/user_input/contigs_min1000.fasta
submol:
  class: File
  location: /pgap/user_input/metadata.yaml

Software versions (please complete the following information):

OS: ubuntu 18.04
pgap version 2019-11-25.build4172

Log Files
It's quite large and I can't attach it here.
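
When the full log is too large to attach, one option is to extract only the validator errors from it. The following Python sketch is not part of PGAP: `error_messages` is a made-up helper, and it assumes every `<message>` element sits on a single line and is well-formed XML, as in the error quoted above.

```python
import xml.etree.ElementTree as ET

CLOSE_TAG = "</message>"

def error_messages(lines):
    """Yield (code, seq_id, text) for each ERROR-severity <message> line."""
    for line in lines:
        start = line.find("<message ")
        end = line.find(CLOSE_TAG)
        if start == -1 or end == -1:
            continue
        try:
            elem = ET.fromstring(line[start:end + len(CLOSE_TAG)])
        except ET.ParseError:
            continue  # not well-formed on a single line; skip it
        if elem.get("severity") == "ERROR":
            yield elem.get("code"), elem.get("seq-id"), (elem.text or "").strip()

# The message quoted in the bug description above.
sample = ['<message severity="ERROR" seq-id="lcl|Contig_01" '
          'code="SEQ_DESCR_StrainWithEnvironSample">Strain should not be '
          'present in an environmental sample</message>']
for code, seq_id, text in error_messages(sample):
    print(code, seq_id, text)
```

To run it on a real log, pass an open file object instead of `sample`; the result should be small enough to paste into an issue.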

Additional context
I suspect that because Alistipes sp. CAG:29 is from an environmental sample, PGAP is also assuming that my genome is from an environmental source and fails. My genome is not from an environmental sample, though; it is likely a new species of Alistipes, and Alistipes sp. CAG:29 is merely the closest relative based on sequence similarity.

What do you think the workaround for this is? Do I need to select an Alistipes species that isn't from an environmental source? Can I just use the genus without the species, since my genome is a new species of Alistipes?

docker exited with rc = 0 when run test

Hello,
I've just finished installing the new release of PGAP. When running the test it gives me the message docker exited with rc = 0. Here is my command, with the log file attached: ./pgap.py -r -o mg37_results test_genomes/MG37/input.yaml -c 16 -m 60g -d
Thanks for your help,

Fety
cwltool.log

Feature request: support Singularity

Description:

Because Docker requires root privileges, it cannot be used on clusters without compromising user management. For this reason, I suggest you add the option to run the pipeline using Singularity, which can run with regular privileges. I suspect it wouldn't be too hard...

How I got it to work:

I already tried it and it seems to work, although not yet on our cluster. In case anyone's interested, I'm posting my solution below:

# set a working directory with lots of available space
CWD=/xxx/xxx/xxx/PGAP

# set current version
DATE=2019-05-13
BUILD=build3740

# set the singularity cache location, meaning the location where singularity images are stored.
# this step is convenient if you don't want the image be stored in your home folder.
export SINGULARITY_CACHEDIR=$CWD/singularity_cache
mkdir $SINGULARITY_CACHEDIR

# download supplemental files
wget https://s3.amazonaws.com/pgap/input-$DATE.$BUILD.tgz
wget https://s3.amazonaws.com/pgap-data/test_genomes-$DATE.$BUILD.tgz

# extract supplemental files, delete tars
tar xzvf input-$DATE.$BUILD.tgz && rm input-$DATE.$BUILD.tgz
tar xzvf test_genomes-$DATE.$BUILD.tgz && rm test_genomes-$DATE.$BUILD.tgz

# get the pgap_input.yaml for the test genome
wget https://campuscloud.unibe.ch:443/ssf/s/readFile/share/29641/5440512207672247602/publicLink/pgap_input.yaml
mv pgap_input.yaml test_genomes/MG37/

# save the PGAP version to file (don't know if this is necessary...)
echo "$DATE.$BUILD" > VERSION

# download the docker container, convert it to singularity, and see if it works
# when the container is first run, it takes some time to download the container and convert it to singularity. if you run the same command a second time, it will be very quick.
mkdir $CWD/mg37_results
singularity run --pwd /pgap \
--bind $CWD/input-$DATE.$BUILD:/pgap/input:ro \
--bind $CWD/test_genomes/MG37:/pgap/user_input \
--bind $CWD/test_genomes/MG37/pgap_input.yaml:/pgap/user_input/pgap_input.yaml:ro \
--bind $CWD/mg37_results:/pgap/output:rw \
docker://ncbi/pgap:$DATE.$BUILD

# if this works, you should be inside the container and the command line should look like this:
# "sh-4.2$"
# you can play around, see if the folders were successfully mounted and finally exit the container.
exit

# run PGAP on the test genome
mkdir -p $CWD/mg37_results
singularity exec \
--bind $CWD/input-$DATE.$BUILD:/pgap/input:ro \
--bind $CWD/test_genomes/MG37:/pgap/user_input \
--bind $CWD/test_genomes/MG37/pgap_input.yaml:/pgap/user_input/pgap_input.yaml:ro \
--bind $CWD/mg37_results:/pgap/output:rw \
--pwd /pgap \
docker://ncbi/pgap:$DATE.$BUILD \
cwltool --timestamps --outdir /pgap/output pgap.cwl /pgap/user_input/pgap_input.yaml

Additional info:

These are the systems I tried it on:

Hardware: Laptop

Fedora 30 - Kernel 5.1.2 - Singularity 3.1.1-1.fc30 - i7-8565U (8 cores) - 16 GB RAM
Time: 1 h 25 min
Note: The resulting files (mg37_results/*) and the ones produced by the regular docker-pipeline are identical (except for timestamps, of course).

Hardware: Cluster

CentOS - Kernel 3.10.0 - Singularity 3.2.0-1 - 14 CPUs - 80 GB RAM
Result: It crashed because it ran out of RAM after ~ 15 min.
I suspect this is a problem with our job scheduler SLURM.

ignore-all-errors doesn't seem to ignore all errors

I am working with a small sequence, ~50kb, and there's no real reference genome. The pipeline fails when trying to look up the taxonomy information despite setting --ignore-all-errors.

./pgap.py -r --ignore-all-errors -o outDir test_genomes/sample/input.yaml
Here are the lines from the cwltool.log showing that ignore_all_errors is not set:

.
.
.
[2019-11-20 16:39:32] INFO [job fastaval] /tmp/9i4l4fg2$ fastaval.sh \
    -check_internal_ns \
    -check_min_seqlen \
    200 \
    -ignore_all_errors \
    -in \
    /tmp/tmpra2z6qx_/stg2f8f5550-4f51-4837-ba13-62c85fba5f8e/sample.fsa \
    -out \
    fastaval.xml
+ POSITIONAL=()
+ ignore_all_errors=false
+ [[ 8 -gt 0 ]]
+ key=-check_internal_ns
+ case "$key" in
+ POSITIONAL+=("$key")
.
.
.

And here's the snippet from the cwltool.log where the genus_species lookup fails:

.
.
.
Error: (106.16) Application's execution failed (CException::eUnknown) Unknown organism sample genome
[2019-11-20 16:40:32] INFO [job pgapx_yaml_ctl] Max memory used: 31MiB
[2019-11-20 16:40:32] ERROR [job pgapx_yaml_ctl] Job error:
("Error collecting output for parameter 'input_asn_type':\nprogs/pgapx_yaml_ctl.cwl:75:13: Did not find output file with glob pattern: '['input_asn_type.txt']'", {})
[2019-11-20 16:40:32] WARNING [job pgapx_yaml_ctl] completed permanentFail
[2019-11-20 16:40:32] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_annotation
[2019-11-20 16:40:32] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_ltp
[2019-11-20 16:40:32] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/input_asn_type
.
.
.

Repeated “completed permanentFail” messages

My last few runs have all ended with multiple “completed permanentFail” messages and “docker exited with rc = 1”, and yet all the output files were created. The timestamps on the output files indicate that they were created fairly early in the process compared to the full runtime, so I am confused. I have all the files for the most recent such run, including a console log. If they would be useful, should I just attach them here?
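
Because a single failing job makes every enclosing step and workflow report “completed permanentFail” as well, the first such line in the log is usually the most informative one. A minimal Python sketch for finding it, assuming log lines in the format seen in the other reports on this page (`first_failure` is a name made up for illustration, not a PGAP or cwltool function):

```python
import re

# Matches e.g. "[job cluster_blastp_wnode] completed permanentFail";
# the name may be empty, as in "[workflow ] completed permanentFail".
FAIL = re.compile(r"\[(job|step|workflow) ([^\]]*)\] completed permanentFail")

def first_failure(log_lines):
    """Return (kind, name) of the first permanentFail entry, or None."""
    for line in log_lines:
        match = FAIL.search(line)
        if match:
            return match.group(1), match.group(2)
    return None

log = [
    "[2019-07-15 05:17:42] [job cluster_blastp_wnode] Max memory used: 174MiB",
    "[2019-07-15 05:17:42] [job cluster_blastp_wnode] completed permanentFail",
    "[2019-07-15 05:17:42] [step cluster_blastp_wnode] completed permanentFail",
    "[2019-07-15 05:17:42] [workflow ] completed permanentFail",
]
print(first_failure(log))
```

The lines just before that first entry in cwltool.log are where the actual error message would be expected.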

Inconsistent annotations

We ran the same E. coli genome with the same versions of PGAP and data, once on a Mac and once on a Linux box. The same features were predicted in both cases (feature type, coordinates, etc.), but there were numerous differences in qualifier annotations: out of 4420 CDS features, 302 had different annotations. 276 differed in /product, 32 in /gene, 22 in /EC_number, and 15 in /note (some CDS features differed in more than one qualifier). This sort of variation makes assessing the output less certain as we look into adopting PGAP.
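
For anyone wanting to reproduce this kind of tally, here is a minimal Python sketch of the per-qualifier comparison. It is not the script used for the numbers above; it assumes each run has already been parsed into a mapping from a feature key (for example its coordinates) to a dict of qualifiers, and the example data is made up.

```python
from collections import Counter

def qualifier_diffs(run_a, run_b):
    """Count, per qualifier, the features whose values differ between runs.

    run_a and run_b map a feature key to a dict of qualifier -> value.
    Only features present in both runs are compared.
    """
    counts = Counter()
    for key in run_a.keys() & run_b.keys():
        quals_a, quals_b = run_a[key], run_b[key]
        for name in quals_a.keys() | quals_b.keys():
            if quals_a.get(name) != quals_b.get(name):
                counts[name] += 1
    return counts

# Made-up example data, not taken from the actual runs.
mac = {"100..400": {"product": "hypothetical protein"},
       "500..900": {"product": "kinase", "gene": "abcA"}}
linux = {"100..400": {"product": "DUF1234 domain-containing protein"},
         "500..900": {"product": "kinase"}}
print(qualifier_diffs(mac, linux))
```

A qualifier that is present in one run and absent in the other counts as a difference, which matches how a missing /gene would be noticed when comparing the files by eye.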

Issue with 'docker exited with rc = 1'

Hello,
I am trying to run the NCBI PGAP program on Ubuntu 14.04 LTS with Python 3. After some struggle I was able to install the program, but when I run it with the command below:
"icar@icar-crijaf:~/Programs/pgap-master$ sudo python3 pgap.py -n -D /usr/local/bin/docker -o /home/icar/Programs/pgap-master/PJRB1 /home/icar/Programs/pgap-master/user_genome/genome.simple2.yaml -d
[sudo] password for icar:"
I am getting the following error:
PGAP version 2019-05-13.build3740 is up to date.
/home/icar/Programs/pgap-master/PJRB1
docker exited with rc = 1

The cwltool.log is below:
Original command: pgap.py -n -D /usr/local/bin/docker -o /home/icar/Programs/pgap-master/PJRB1 /home/icar/Programs/pgap-master/user_genome/genome.simple2.yaml -d

Docker command: /usr/local/bin/docker run -i --user 0:0 --volume /home/icar/Programs/pgap-master/input-2019-05-13.build3740:/pgap/input:ro,z --volume /home/icar/Programs/pgap-master/user_genome:/pgap/user_input:z --volume /home/icar/Programs/pgap-master/user_genome/pgap_input.yaml:/pgap/user_input/pgap_input.yaml:ro,z --volume /home/icar/Programs/pgap-master/PJRB1:/pgap/output:rw,z --volume /home/icar/Programs/pgap-master/PJRB1/debug/log:/log/srv:z ncbi/pgap:2019-05-13.build3740 cwltool --timestamps --outdir /pgap/output --tmpdir-prefix /pgap/output/debug/tmpdir/ --leave-tmpdir --tmp-outdir-prefix /pgap/output/debug/tmp-outdir/ --copy-outputs pgap.cwl /pgap/user_input/pgap_input.yaml

[2019-06-18 11:21:27] /usr/bin/cwltool 1.0.20190228155703
[2019-06-18 11:21:27] Resolved 'pgap.cwl' to 'file:///pgap/pgap.cwl'
[2019-06-18 11:21:48] Failed to create directory: [Errno 30] Read-only file system: '/pgap/output/debug/tmpdir'

RUNTIME.json log looks like this:
{
"CPU cores": 16,
"Docker image": "ncbi/pgap:2019-05-13.build3740",
"cpu model": "Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz",
"max user processes": 524288,
"memory (GiB)": 62.7,
"memory per CPU core (GiB)": 3.9,
"open files": 524288,
"tmp disk space (GiB)": 1740.6,
"virtual memory": "unlimited",
"work disk space (GiB)": 1740.6
}

Please help me to troubleshoot the issue and suggest a measure to run the program successfully.

Thanking you.

-Dip
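
The cwltool.log above ends with `[Errno 30] Read-only file system` for `/pgap/output/debug/tmpdir`. One way to rule out the host side is to test whether a subdirectory can be created in the directory passed to -o. The following Python sketch is not part of PGAP, and `can_create_subdir` is a name made up for illustration. It only checks the host path, so a mount that ends up read-only inside the container could still fail.

```python
import os
import tempfile

def can_create_subdir(directory: str) -> bool:
    """Return True if a temporary subdirectory can be created in `directory`."""
    try:
        # The temporary directory is removed again when the block exits.
        with tempfile.TemporaryDirectory(dir=directory):
            return True
    except OSError:
        return False

# Example: check the current directory. For the run above one would pass
# the host path that was given to -o instead.
print(can_create_subdir(os.getcwd()))
```

Actually creating a directory is used here instead of a permission-bit check because the command above runs under sudo, and permission bits say little about a read-only file system.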

middle_initial in submol.yaml

I noticed that, contrary to your documentation, middle_initial in contact_info leads to a crash.

$ ./pgap.py -n -o mg37_results test_genomes/MG37/input.yaml 
PGAP version 2019-11-08.build4137 is up to date.
/home/username/ncbi_pgap/mg37_results
docker exited with rc = 1

content of mg37_results/cwltool.log

[2019-11-13 15:30:45] INFO [step fastaval] completed success
[2019-11-13 15:30:45] INFO [workflow ] starting step prepare_input_template
[2019-11-13 15:30:45] INFO [step prepare_input_template] start
[2019-11-13 15:30:45] INFO [workflow prepare_input_template] start
[2019-11-13 15:30:45] INFO [workflow prepare_input_template] starting step yaml2json
[2019-11-13 15:30:45] INFO [step yaml2json] start
[2019-11-13 15:30:45] INFO [job yaml2json] /tmp/ag98u2it$ yaml2json.py \
    /tmp/tmpb3v6qk5o/stg302b7ff6-8293-4177-9611-2c829077106a/submol.yaml \
    submol.json
[2019-11-13 15:30:45] INFO [job yaml2json] completed success
[2019-11-13 15:30:45] INFO [step yaml2json] completed success
[2019-11-13 15:30:45] INFO [workflow prepare_input_template] starting step pgapx_yaml_ctl
[2019-11-13 15:30:45] INFO [step pgapx_yaml_ctl] start
[2019-11-13 15:30:45] INFO [job pgapx_yaml_ctl] /tmp/mto3347e$ pgapx_yaml_ctl \
    -ifmt \
    JSON \
    -input \
    /tmp/tmpj0rel97x/stgd4806b11-e4bb-4036-b791-ee3c105196d7/submol.json \
    -input-fasta \
    /tmp/tmpj0rel97x/stg3c68e2b5-7096-4f1d-8d94-6313fb2b7691/ASM2732v1.annotation.nucleotide.1.fasta \
    -ofmt \
    JSON \
    -output-annotation \
    input.asn \
    -output-asn-type \
    input_asn_type.txt \
    -output-ltp \
    genome.ltp.txt \
    -output-taxid \
    taxid.txt \
    -taxon-db \
    /tmp/tmpj0rel97x/stg4b9177c7-7992-43f9-9e55-61fe86155952/taxonomy.sqlite3
Error: (106.16) Application's execution failed (CSerialException::eFormatError) line 1: member city expected ( at JsonValue.contact_info)
[2019-11-13 15:30:47] INFO [job pgapx_yaml_ctl] Max memory used: 24MiB
[2019-11-13 15:30:48] ERROR [job pgapx_yaml_ctl] Job error:
("Error collecting output for parameter 'input_asn_type':\nprogs/pgapx_yaml_ctl.cwl:75:13: Did not find output file with glob pattern: '['input_asn_type.txt']'", {})
[2019-11-13 15:30:48] WARNING [job pgapx_yaml_ctl] completed permanentFail
[2019-11-13 15:30:48] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_annotation
[2019-11-13 15:30:48] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/output_ltp
[2019-11-13 15:30:48] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/input_asn_type
[2019-11-13 15:30:48] ERROR [step pgapx_yaml_ctl] Output is missing expected field file:///pgap/prepare_user_input2.cwl#pgapx_yaml_ctl/taxid
[2019-11-13 15:30:48] WARNING [step pgapx_yaml_ctl] completed permanentFail
[2019-11-13 15:30:48] INFO [workflow prepare_input_template] completed permanentFail
[2019-11-13 15:30:48] WARNING [step prepare_input_template] completed permanentFail
[2019-11-13 15:30:48] INFO [workflow ] completed permanentFail
{
    "gbk": null,
    "gff": null,
    "input_fasta": {
        "class": "File",
        "location": "file:///pgap/output/ASM2732v1.annotation.nucleotide.1.fasta",
        "size": 588482,
        "basename": "ASM2732v1.annotation.nucleotide.1.fasta",
        "checksum": "sha1$f6129783cc8562db7bca3c87310d57d8dd07ce2c",
        "path": "/pgap/output/ASM2732v1.annotation.nucleotide.1.fasta"
    },
    "input_submol": {
        "class": "File",
        "location": "file:///pgap/output/submol.yaml",
        "size": 1702,
        "basename": "submol.yaml",
        "checksum": "sha1$e24e14004e37074c17143f9e5fbb604dd9f3e528",
        "path": "/pgap/output/submol.yaml"
    },
    "nucleotide_fasta": null,
    "protein_fasta": null,
    "sqn": null
}
[2019-11-13 15:30:48] WARNING Final process status is permanentFail

It worked with the regular submol.yaml, but failed when I added middle_initial:

(...)
contact_info:
    last_name: 'Doe'
    middle_initial: 'X'
    first_name: 'Jane'
(...)

I also noticed that characters such as 'í' lead to crashes when used in names. (However, 'ä' in street seems to work.) Please add this to the documentation.

Also, it would be great if pgap gave specific error messages.
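
Until the error messages are more specific, a small pre-check can at least point to the fields worth looking at. The following Python sketch lists string values that contain non-ASCII characters. It is not part of PGAP, `non_ascii_fields` is a made-up name, and the metadata is a hypothetical example of submol.yaml after loading. As noted above, not every non-ASCII character causes a crash, so the result is a list of candidates only.

```python
def non_ascii_fields(node, path=""):
    """Return (path, value) pairs for string values with non-ASCII characters."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += non_ascii_fields(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += non_ascii_fields(value, f"{path}[{index}]")
    elif isinstance(node, str) and not node.isascii():
        found.append((path, node))
    return found

# Hypothetical metadata as it might look after loading submol.yaml.
submol = {"contact_info": {"last_name": "Doe", "first_name": "María",
                           "street": "Hauptstrasse 1"}}
print(non_ascii_fields(submol))
```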

Error when using provided test files

Hello,

I am currently having an issue when testing the sample files provided with pgap. I am running the test locally and I get the following error when following the quick start instructions:

Original command: ./pgap.py -r -o mg37_results test_genomes/MG37/input.yaml

Docker command: /usr/bin/docker run -i --user 1000:1000 --volume /home/aicasti1/input-2019-05-13.build3740:/pgap/input:ro,z --volume /home/aicasti1/test_genomes/MG37:/pgap/user_input:z --volume /home/aicasti1/test_genomes/MG37/pgap_input.yaml:/pgap/user_input/pgap_input.yaml:ro,z --volume /home/aicasti1/mg37_results.6:/pgap/output:rw,z ncbi/pgap:2019-05-13.build3740 cwltool --timestamps --outdir /pgap/output pgap.cwl /pgap/user_input/pgap_input.yaml

[2019-06-19 17:26:03] /usr/bin/cwltool 1.0.20190228155703
[2019-06-19 17:26:03] Resolved 'pgap.cwl' to 'file:///pgap/pgap.cwl'
[2019-06-19 17:26:03] Tool definition failed initialization:
[Errno 21] Is a directory: '/pgap/user_input/pgap_input.yaml'

I tried with a different set of the provided files and I got the same error. Is there anything that I am missing? My apologies for such a basic question, this would be my first time using either docker or pgap.

Thanks a lot for your help,
Andreina
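
One possible explanation, offered as an assumption rather than a diagnosis: when a host path given to `--volume` does not exist, Docker creates it as a directory, which would produce exactly this "Is a directory" error for pgap_input.yaml. A minimal Python sketch for checking the host side before a run (`missing_mount_files` is a made-up helper, not part of PGAP):

```python
import os

def missing_mount_files(paths):
    """Return the paths that are not regular files on the host."""
    return [p for p in paths if not os.path.isfile(p)]

# Host path taken from the Docker command above.
print(missing_mount_files(["/home/aicasti1/test_genomes/MG37/pgap_input.yaml"]))
```

If the path is reported here, it is either missing or has already been created as a directory by an earlier attempt.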

Genome size check fail

Permanent fail in job Find_Best_Evidence_Alignments

Hi,

I'm working with some bacterial genomes but I am getting the following error "WARNING [step Find_Best_Evidence_Alignments] completed permanentFail"

details of cwltool.log
[2019-11-13 00:11:07] INFO [step protein_alignment] completed success
[2019-11-13 00:11:07] INFO [workflow standard_pgap] starting step bacterial_annot_3
[2019-11-13 00:11:07] INFO [step bacterial_annot_3] start
[2019-11-13 00:11:07] INFO [workflow bacterial_annot_3] start
[2019-11-13 00:11:07] INFO [workflow bacterial_annot_3] starting step Find_Best_Evidence_Alignments
[2019-11-13 00:11:07] INFO [step Find_Best_Evidence_Alignments] start
invalid field nameroot, expected one of: 'class', 'location', 'path', 'basename', 'listing'
invalid field nameext, expected one of: 'class', 'location', 'path', 'basename', 'listing'
[2019-11-13 00:11:07] INFO [job Find_Best_Evidence_Alignments] /pgap/output/debug/tmp-outdir/5k9ctb67$ bact_best_evidence_alignments \
    -annotation \
    annotation.mft \
    -asn-cache \
    /tmp/tmpw7ltcwfz/stg8eef8bd6-3a75-4bc2-837c-2d869e59faa3/cache,/tmp/tmpw7ltcwfz/stgdbcb3d69-3029-4886-a1e7-8825bd97fa90/sequence_cache \
    -input-manifest \
    align.mft \
    -max-overlap \
    120 \
    -nogenbank \
    -o \
    best_aligns.asn \
    -start-stop-allowance \
    60 \
    -thr \
    /tmp/tmpw7ltcwfz/stg9cbb28af-3d21-4e55-b03d-2fb546d2b36d/thresholds.xml \
    -unicoll_sqlite \
    /tmp/tmpw7ltcwfz/stgd7877e4d-a89d-4298-b999-f09a76fda69d/naming.sqlite
[2019-11-13 00:11:58] INFO [job Find_Best_Evidence_Alignments] Max memory used: 142MiB
[2019-11-13 00:11:58] WARNING [job Find_Best_Evidence_Alignments] completed permanentFail
[2019-11-13 00:11:58] WARNING [step Find_Best_Evidence_Alignments] completed permanentFail
[2019-11-13 00:11:58] INFO [workflow bacterial_annot_3] completed permanentFail
[2019-11-13 00:11:58] WARNING [step bacterial_annot_3] completed permanentFail
[2019-11-13 00:11:58] INFO [workflow standard_pgap] completed permanentFail
[2019-11-13 00:11:58] WARNING [step standard_pgap] completed permanentFail
[2019-11-13 00:11:58] INFO [workflow ] completed permanentFail

Thanks,

Edson Machado

Missing "/gene=" names in genbank annotations vs. Prokka

I'll start off by saying that this may not necessarily be an issue, just something I've noticed. Compared to Prokka annotations of genomes, PGAP seems not to assign many /gene= names in the GenBank file it produces. The total number of features is pretty similar; it's the short gene names that are missing (e.g. /gene="rpoB"). These are often genomes that are not well studied or annotated in the literature.

Is this something you guys have noticed on your end? In most cases, the features that have /gene=some_gene in Prokka and not in PGAP still correspond pretty well in the /product= line, which makes me think that PGAP is a lot more conservative about providing a good /gene name for the CDS feature than Prokka. If this is the case, are there any parameters I can change to make it less conservative so that it aligns a little closer to Prokka and is more lenient on assigning /gene names?

I'm putting the genus/species into the YAML file along with topology: circular (not sure if this matters), but that's about it for settings.

Error in step passdata when testing the sample files

Hello,

I'm having an issue when testing the sample files provided with pgap. I am running the test locally and I get the following error when following the quick start instructions:

$ python3.6 pgap.py -r -o mg37_results test_genomes/MG37/input.yaml
PGAP version 2019-05-13.build3740 is up to date.
/home/ioc/tools/PGAP/mg37_results
[2019-07-18 18:52:59] [workflow ] start
[2019-07-18 18:52:59] [workflow ] starting step passdata
[2019-07-18 18:52:59] [step passdata] start
[2019-07-18 18:52:59] [step passdata] Output is missing expected field file:///pgap/pgap.cwl#passdata/taxon_db
[2019-07-18 18:52:59] [step passdata] completed permanentFail
[2019-07-18 18:52:59] [workflow ] completed permanentFail
docker exited with rc = 1

$ cat mg37_results/cwltool.log
Original command: pgap.py -r -o mg37_results test_genomes/MG37/input.yaml

Docker command: /usr/bin/docker run -i --user 1000:1000 --volume /home/ioc/tools/PGAP/input-2019-05-13.build3740:/pgap/input:ro,z --volume /home/ioc/tools/PGAP/test_genomes/MG37:/pgap/user_input:z --volume /home/ioc/tools/PGAP/test_genomes/MG37/pgap_input.yaml:/pgap/user_input/pgap_input.yaml:ro,z --volume /home/ioc/tools/PGAP/mg37_results:/pgap/output:rw,z ncbi/pgap:2019-05-13.build3740 cwltool --timestamps --outdir /pgap/output pgap.cwl /pgap/user_input/pgap_input.yaml

[2019-07-18 18:52:44] /usr/bin/cwltool 1.0.20190228155703
[2019-07-18 18:52:44] Resolved 'pgap.cwl' to 'file:///pgap/pgap.cwl'
[2019-07-18 18:52:59] [workflow ] start
[2019-07-18 18:52:59] [workflow ] starting step passdata
[2019-07-18 18:52:59] [step passdata] start
[2019-07-18 18:52:59] [step passdata] Output is missing expected field file:///pgap/pgap.cwl#passdata/taxon_db
[2019-07-18 18:52:59] [step passdata] completed permanentFail
[2019-07-18 18:52:59] [workflow ] completed permanentFail
[2019-07-18 18:52:59] Final process status is permanentFail
{
"gbent": null,
"gbk": null,
"gff": null,
"input_fasta": {
"class": "File",
"location": "file:///pgap/output/ASM2732v1.annotation.nucleotide.1.fasta",
"size": 588482,
"basename": "ASM2732v1.annotation.nucleotide.1.fasta",
"checksum": "sha1$f6129783cc8562db7bca3c87310d57d8dd07ce2c",
"path": "/pgap/output/ASM2732v1.annotation.nucleotide.1.fasta"
},
"input_submol": {
"class": "File",
"location": "file:///pgap/output/submol.yaml",
"size": 1529,
"basename": "submol.yaml",
"checksum": "sha1$f9eb1bbbbb1115b94b22b7148647adc3b6107c9f",
"path": "/pgap/output/submol.yaml"
},
"nucleotide_fasta": null,
"protein_fasta": null,
"sqn": null
}

Can you help me?

Best regards,

Edson Machado

Wrong link for input data

On the page https://github.com/ncbi/pgap/wiki/Installation there is a link for input data: https://s3.amazonaws.com/pgap-data/input-[version].tgz which doesn't work for the latest version (2019-05-13.build3740).
In #15 @whlavina gives another link https://s3.amazonaws.com/pgap/input-2019-05-13.build3740.tgz which works.
I suggest editing https://github.com/ncbi/pgap/wiki/Installation so that the link works for the latest version.

Thanks!
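
For reference, the two URL patterns mentioned above differ only in the bucket name. A small Python sketch that builds both candidates for a given version (the helper `input_urls` is made up for illustration, and which bucket actually serves a given release is not something this sketch can tell):

```python
BUCKETS = ("pgap", "pgap-data")  # both bucket names appear in the links above

def input_urls(version: str):
    """Return candidate download URLs for the supplemental input data."""
    return [f"https://s3.amazonaws.com/{bucket}/input-{version}.tgz"
            for bucket in BUCKETS]

for url in input_urls("2019-05-13.build3740"):
    print(url)
```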

Test genomes can't be accessed

Hello,
I tried to download the genomes as specified in your https://github.com/ncbi/pgap/blob/master/scripts/setup_pgap_standalone.sh with the command:

wget -nc https://s3.amazonaws.com/pgap-data/test_genomes.tgz

--2018-11-29 11:59:04--  https://s3.amazonaws.com/pgap-data/test_genomes.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.224.75
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.224.75|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2018-11-29 11:59:05 ERROR 403: Forbidden.

I was able to download the other file with the input data, though.

problem about format input files

Hi,

I tried to use PGAP-1.2.1 on my metagenome-assembled bins, but an error occurred.
I got the ffn and faa files from the Prokka annotation, then generated a ptt file from the Prokka gbk file using the gb2ptt.pl script. After putting them in one folder, I tried to format them using Converter_draft.pl, but the output .nuc file was empty and the following messages were shown in the terminal:

Argument ">k_01851" isn't numeric in sort at converter_draft.pl line 237, line 27187.
Argument ">k_01852" isn't numeric in sort at converter_draft.pl line 237, line 27196.
Argument ">k_01853" isn't numeric in sort at converter_draft.pl line 237, line 27203.
Argument ">k_01854" isn't numeric in sort at converter_draft.pl line 237, line 27214.
Argument ">k_01855" isn't numeric in sort at converter_draft.pl line 237, line 27219.
Argument ">k_01856" isn't numeric in sort at converter_draft.pl line 237, line 27226.
Argument ">k_01857" isn't numeric in sort at converter_draft.pl line 237, line 27239.
Argument ">k_01858" isn't numeric in sort at converter_draft.pl line 237, line 27249.
Argument ">k_01859" isn't numeric in sort at converter_draft.pl line 237, line 27270.
Argument ">k_01860" isn't numeric in sort at converter_draft.pl line 237, line 27276.

Could you please help me with it? Did I use the wrong .pl script, or are all these formatting scripts only suitable for NCBI files?
By the way, I also tried to submit my data to the PGAweb Analyze server, but it seemed to hang indefinitely once I clicked submit.

Thanks in advance!
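
The warnings suggest that the script sorts identifiers numerically while the Prokka IDs are strings such as ">k_01851". What converter_draft.pl actually expects is not known here, so the following is only a Python sketch of the general idea of sorting such IDs by their numeric suffix (`numeric_suffix` is a made-up helper):

```python
import re

def numeric_suffix(header: str) -> int:
    """Extract the trailing integer from an ID such as '>k_01851'."""
    match = re.search(r"(\d+)\s*$", header)
    if match is None:
        raise ValueError(f"no trailing number in {header!r}")
    return int(match.group(1))

ids = [">k_01853", ">k_01851", ">k_01852"]
print(sorted(ids, key=numeric_suffix))
```

Renaming the IDs to plain numbers before conversion might be another route, but that too is an assumption about what the converter accepts.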

Cannot find "genus_species" on NCBI Taxonomy

Hello,

I want to annotate the genome of Wolbachia from Spodoptera picta, but I can't find any information about it in NCBI Taxonomy. Does this mean that this organism has not yet been registered? If so, what should I do next?

Thanks a lot for your help,
Zhixin Niu

Failed SOCK_gethostbyname

Hello and thank you for making this pipeline available. I have successfully annotated a few bacterial genomes and am really happy with the results.

For some reason, the pipeline seems to have been stuck for the past 2 days, and it is unclear why. Could you please advise? I have pasted what I hope is the relevant part of the cwl log file.

Thanks for your time!

[2019-10-30 10:29:12] INFO [job actual] /tmp/wuvjn2_8$ gp_makeblastdb \
    -asn-cache \
    /tmp/tmpln8ryaa_/stgae4bcb52-f84d-4219-a907-75d29118af69/sequence_cache \
    -dbtype \
    nucl \
    -fasta \
    /tmp/tmpln8ryaa_/stge548da05-915f-4e8c-a02f-b185372cec74/adaptor_fasta.fna \
    -found-ids-output \
    found_ids.txt \
    -found-ids-output-manifest \
    found_ids.mft \
    -db \
    blastdb \
    -output-manifest \
    blastdb.mft \
    -title \
    'BLASTdb created by GPipe'
Error: (302.22) SSOCK#1000[?]@:443: [SOCK::Connect]  Failed SOCK_gethostbyname("www.ncbi.nlm.nih.gov")
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=62ca0bd9e82f(172.17.0.3)&platform=x86_64-unknown-linux-gnu]  Failed to connect: Unknown
Error: (308.5) [ID2]  Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf():  NULL connector: Unknown
Error: (302.22) SSOCK#2000[?]@:443: [SOCK::Connect]  Failed SOCK_gethostbyname("www.ncbi.nlm.nih.gov")
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=62ca0bd9e82f(172.17.0.3)&platform=x86_64-unknown-linux-gnu]  Failed to connect: Unknown
Error: (308.5) [ID2]  Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf():  NULL connector: Unknown
Error: (302.22) SSOCK#3000[?]@:443: [SOCK::Connect]  Failed SOCK_gethostbyname("www.ncbi.nlm.nih.gov")
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=62ca0bd9e82f(172.17.0.3)&platform=x86_64-unknown-linux-gnu]  Failed to connect: Unknown
Error: (308.5) [ID2]  Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf():  NULL connector: Unknown
Error: (302.22) SSOCK#4000[?]@:443: [SOCK::Connect]  Failed SOCK_gethostbyname("www.ncbi.nlm.nih.gov")
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=62ca0bd9e82f(172.17.0.3)&platform=x86_64-unknown-linux-gnu]  Failed to connect: Unknown
Error: (308.5) [ID2]  Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf():  NULL connector: Unknown
Error: (302.22) SSOCK#5000[?]@:443: [SOCK::Connect]  Failed SOCK_gethostbyname("www.ncbi.nlm.nih.gov")
Error: (303.7) [URL_Connect; https://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=ID2&address=62ca0bd9e82f(172.17.0.3)&platform=x86_64-unknown-linux-gnu]  Failed to connect: Unknown
Error: (308.5) [ID2]  Service not found
Error: (315.2) CConn_Streambuf::CConn_Streambuf():  NULL connector: Unknown
Error: (CLoaderException::eConnectionFailed) cannot open connection: ID2
Error: (106.16) Application's execution failed (CLoaderException::eNoConnection) cannot open initial connection
[2019-10-30 10:29:24] INFO [job actual] Max memory used: 35MiB
[2019-10-30 10:29:24] ERROR [job actual] Job error:
("Error collecting output for parameter 'found_ids':\nprogs/gp_makeblastdb.cwl:90:25: Did not find output file with glob pattern: '['found_ids.txt']'", {})
[2019-10-30 10:29:24] WARNING [job actual] completed permanentFail
[2019-10-30 10:29:24] ERROR [step actual] Output is missing expected field file:///pgap/progs/gp_makeblastdb.cwl#actual/blastfiles
[2019-10-30 10:29:24] WARNING [step actual] completed permanentFail
[2019-10-30 10:29:24] INFO [workflow Create_Adaptor_BLASTdb] completed permanentFail
[2019-10-30 10:29:24] WARNING [step Create_Adaptor_BLASTdb] completed permanentFail
[2019-10-30 10:29:24] INFO [workflow default_plane] completed permanentFail
[2019-10-30 10:29:24] WARNING [step default_plane] completed permanentFail
[2019-10-30 10:29:24] INFO [workflow vecscreen] completed permanentFail
[2019-10-30 10:29:24] WARNING [step vecscreen] completed permanentFail
[2019-10-30 10:29:24] INFO [workflow ] completed permanentFail
[2019-10-30 10:29:24] WARNING Final process status is permanentFail
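
The repeated `Failed SOCK_gethostbyname("www.ncbi.nlm.nih.gov")` lines indicate that the hostname could not be resolved from the environment where `gp_makeblastdb` ran (the log reports the client address as 172.17.0.3), so the step could not reach the ID2 service. A quick way to check name resolution from the same environment is sketched below in Python. `can_resolve` is a hypothetical helper, not part of PGAP, and the resolver is injectable only so the function can be exercised without a network.

```python
import socket

def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if hostname resolves to an address, False otherwise."""
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False

if __name__ == "__main__":
    host = "www.ncbi.nlm.nih.gov"
    print(host, "resolves" if can_resolve(host) else "does NOT resolve")
```

If the check fails inside the container but succeeds on the host, the container's DNS configuration is the place to look first.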

Process ends in Permanent Fail with other than mg37

Our group also cannot run Docker on our cluster and has tried using Singularity and conda environments. We've managed to get the test genome that ships with the Docker image to run to completion; however, when trying to run one of our own sequences we run into issues.

This is the error I get when trying to run a closed, single-contig genome:

Processing sequences
  Processing lcl|FWSEC0001_contig1
Error: (CException::eUnknown) GetTaxIdByOrgRef not supported for local execution
Error: (106.16) Application's execution failed (CException::eUnknown) GetTaxIdByOrgRef not supported for local execution
[job Prepare_Unannotated_Sequences] Job error:
Error collecting output for parameter 'master_desc':
bacterial_prepare_unannotated.cwl:45:7: Did not find output file with glob pattern: '['master-desc.asn']'
[job Prepare_Unannotated_Sequences] completed permanentFail
[step Prepare_Unannotated_Sequences] Output is missing expected field file:///PATHTO/pgap-2018-11-07.build3190/wf_pgap.cwl#Prepare_Unannotated_Sequences/master_desc
[step Prepare_Unannotated_Sequences] Output is missing expected field file:///PATHTO/pgap-2018-11-07.build3190/wf_pgap.cwl#Prepare_Unannotated_Sequences/sequences
[step Prepare_Unannotated_Sequences] completed permanentFail
[workflow standard_pgap] completed permanentFail
[step standard_pgap] completed permanentFail
[workflow ] completed permanentFail
Final process status is permanentFail
srun: error: waffles-g-1: task 0: Exited with exit code 1

I've also run into a permanentFail from a draft genome, though this one ended with some type of CRISPR error.

Any help would be appreciated.

Confirm that use of BLAST's `-max_target_seqs` is intentional

Hi there,

This is a semi-automated message from a fellow bioinformatician. Through a GitHub search, I found that the following source files make use of BLAST's -max_target_seqs parameter:

Based on the recently published report, Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows, there is a strong chance that this parameter is misused in your repository.

If the use of this parameter was intentional, please feel free to ignore and close this issue, but I would highly recommend adding a comment to your source code to notify others about this use case. If this is a duplicate issue, please accept my apologies for the redundancy, as this simple automation is not smart enough to identify such issues.

Thank you!
-- Arman (armish/blast-patrol)

Error when running update

I'm trying to update a previously working 2018-11-07.build3190 stand-alone install to the current 2019-02-11.build3477 on Linux Mint 18 (based on Ubuntu 16.04 LTS). Following the instructions in the quick start guide, I grabbed the current version via curl, set the execute bit and ran
$ ./pgap.py --update
Needed updates appeared to download successfully, but the install eventually failed with a type error; relevant output is below. Trying to rerun the update returns the same error. Am I just missing something obvious? Thanks!

$ ./pgap.py --update
Updating PGAP to version 2019-02-11.build3477 (previous version was 2018-11-07.build3190)
Downloading (as needed) PGAP Docker image version 2019-02-11.build3477
2019-02-11.build3477: Pulling from ncbi/pgap
af4b0a2388c6: Already exists 
...
64f7899a4cdc: Pull complete 
Digest: sha256:d8e38a6678fc64507472cdfe02c45d12bffb24ee43136a0f89a8144e6082ede8
Status: Downloaded newer image for ncbi/pgap:2019-02-11.build3477
Downloading PGAP reference data version 2019-02-11.build3477
Downloaded 14045155161 of 14045155161 bytes (100.00%)
Traceback (most recent call last):
  File "./pgap.py", line 282, in <module>
    main()
  File "./pgap.py", line 268, in main
    version = setup(args.update, args.local_runner)
  File "./pgap.py", line 178, in setup
    f.write('{}\n'.format(latest))
TypeError: write() argument 1 must be unicode, not str
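
A `TypeError` of the form `write() argument 1 must be unicode, not str` is what Python 2 raises when a byte string is written to a text-mode file opened through the `io` module, so one plausible reading of the traceback is that `./pgap.py` was started by a Python 2 interpreter. This is an inference from the message, not something the output confirms. The sketch below shows a version guard that a script could run first; `check_python_version` is a hypothetical name and not part of pgap.py.

```python
import sys

def check_python_version(version_info=sys.version_info, minimum=(3, 0)):
    """Return None if the interpreter is new enough, else an error message."""
    current = tuple(version_info[:2])
    if current >= minimum:
        return None
    return "Python {}.{}+ is required, but this is Python {}.{}".format(
        minimum[0], minimum[1], current[0], current[1]
    )

if __name__ == "__main__":
    problem = check_python_version()
    if problem:
        sys.exit(problem)
    print("interpreter OK:", sys.version.split()[0])
```

If the guard reports Python 2, invoking the script explicitly with a Python 3 interpreter would be the first thing to try.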

container, runscript fails

Hi
Due to security concerns, we run only Singularity, not Docker containers, on our infrastructure.
I therefore simply tried to convert your Docker image into a Singularity one, but it fails on a test genome.

singularity exec -i   --bind $inDir:/pgap/user_input:ro \
                              --bind $(pwd -P)/input-2018-11-07.build3190:/pgap/user_input/input:ro \
                              --bind $(pwd -P)/output:/pgap/output:rw $image \
                                cwltool    --outdir ./outdir    /pgap/wf_pgap_simple.cwl  /pgap/user_input/pgap_input.yaml

/usr/bin/cwltool 1.0.20181102182747
Resolved '/pgap/wf_pgap_simple.cwl' to 'file:///pgap/wf_pgap_simple.cwl'
[workflow ] start
[workflow ] starting step prepare_input_template
[step prepare_input_template] start
[job prepare_input_template] /tmp/tmpci7z9td1$ cat \
    /tmp/tmpddgmgg9w/stgf9b1defa-c3e9-4488-9488-9b0639f67093/submit_block_static.template \
    /tmp/tmpddgmgg9w/stg7785d7c4-7c7b-4836-b0cf-8a590fc25c53/molinfo_wgs.asn > /tmp/tmpci7z9td1/complete.template
[job prepare_input_template] completed success
[step prepare_input_template] completed success
[workflow ] starting step standard_pgap
[step standard_pgap] start
[workflow standard_pgap] start
[workflow standard_pgap] starting step genomic_source
[step genomic_source] start
[workflow genomic_source] start
[workflow genomic_source] starting step Cache_FASTA_Sequences
[step Cache_FASTA_Sequences] start
[job Cache_FASTA_Sequences] /tmp/tmpk7g783sv$ prime_cache \
    -cache \
    sequence_cache \
    -ifmt \
    fasta \
    -i \
    /tmp/tmpwxdd6j43/stgb1219d63-cb5e-40b3-82ca-72dc39188469/SAMN09831750-rid6458073.denovo.assembly.contigs.fasta \
    -oseq-ids \
    oseq-ids.seqids \
    -submit-block-template \
    /tmp/tmpwxdd6j43/stg19a555ec-cc3c-41b9-b651-99d46c39abc8/complete.template \
    -taxid \
    1354 \
    -taxon-db \
    /tmp/tmpwxdd6j43/stg5664216c-bc9f-44b9-bfd4-87f368da638a/taxonomy.sqlite3
'prime_cache' not found
[job Cache_FASTA_Sequences] completed permanentFail
[step Cache_FASTA_Sequences] Output is missing expected field file:///pgap/genomic_source/wf_genomic_source.cwl#Cache_FASTA_Sequences/oseq_ids
[step Cache_FASTA_Sequences] Output is missing expected field file:///pgap/genomic_source/wf_genomic_source.cwl#Cache_FASTA_Sequences/asn_cache
[step Cache_FASTA_Sequences] completed permanentFail
[workflow genomic_source] completed permanentFail
[step genomic_source] completed permanentFail
[workflow standard_pgap] completed permanentFail
[step standard_pgap] completed permanentFail
[workflow ] completed permanentFail
{
    "gbent": null,
    "gbk": null,
    "gff": null,
    "nucleotide_fasta": null,
    "protein_fasta": null
}
Final process status is permanentFail

I am wondering whether it might be due to unset environment variables.
Would you mind sharing part or all of the Singularity recipe file?
Currently I have only added the following section, which I fear is not enough:

%environment
export PATH=/pgap/:/bin/${PATH}

It should add all binaries to the path, but the run still fails as shown above.
As I am not an active user of your software and am only packaging it, the cause might be completely unrelated as well.
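
One detail in the `%environment` line above may be worth checking: `/bin/${PATH}` has no `:` between `/bin/` and `${PATH}`, so the first entry of the existing PATH is glued onto `/bin/`. Whether that explains `'prime_cache' not found` is not established here, since the location of `prime_cache` inside the image is not shown. The Python sketch below only illustrates the effect of the missing separator, using a made-up starting PATH.

```python
def path_entries(path_value):
    """Split a PATH-style string into its directory entries."""
    return path_value.split(":")

# Hypothetical starting PATH, chosen only for illustration.
old_path = "/usr/local/bin:/usr/bin"

as_written = "/pgap/:/bin/" + old_path    # mirrors /pgap/:/bin/${PATH}
with_colon = "/pgap/:/bin/:" + old_path   # separator added before ${PATH}

print(path_entries(as_written))
print(path_entries(with_colon))
```

In the first case the entry `/usr/local/bin` disappears and is replaced by the non-existent `/bin//usr/local/bin`, so anything installed only in the first PATH directory would no longer be found.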

pgap.py with no arguments does a stack trace

If you run pgap.py with no arguments it gives a stack trace.

It would be good if it showed the help instead, i.e. behaved like -h.
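
A common way to get this behavior with `argparse` is to print the help and return early when no arguments were given. The sketch below is not taken from pgap.py; the parser, the `input` positional, and the `--update` option are stand-ins for illustration.

```python
import argparse

def build_parser():
    # Stand-in parser; the real pgap.py options are not reproduced here.
    parser = argparse.ArgumentParser(prog="pgap.py")
    parser.add_argument("input", nargs="?", help="input YAML file")
    parser.add_argument("--update", action="store_true",
                        help="update to the latest version")
    return parser

def parse_or_help(argv):
    """Parse argv; with no arguments, print the help text and return None."""
    parser = build_parser()
    if not argv:
        parser.print_help()
        return None
    return parser.parse_args(argv)

args = parse_or_help(["--update"])
print(args.update)
```

In a script, `parse_or_help(sys.argv[1:])` would be the entry point, and a `None` result would lead to a clean exit instead of a stack trace.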

run failed

Hey

I am working with a bacterial genome, but I am getting the following message.

grep -i "fail" cwltool.log
WARNING [job pgapx_yaml_ctl] completed permanentFail
WARNING [step pgapx_yaml_ctl] completed permanentFail
INFO [workflow prepare_input_template] completed permanentFail
WARNING [step prepare_input_template] completed permanentFail
INFO [workflow ] completed permanentFail
WARNING Final process status is permanentFail

Other details:
grep -P '2019.*\[.*\].*\$' cwltool.log | tail
[2019-08-27 12:56:43] INFO [job blastn_wnode] /pgap/output/debug/tmp-outdir/qb9iws6x$ blastn_wnode
[2019-08-27 12:56:50] INFO [job gpx_make_outputs] /pgap/output/debug/tmp-outdir/mydkuqrv$ gpx_make_outputs
[2019-08-27 12:56:50] INFO [job Find_Frequent_contam_in_prok_Hits] /pgap/output/debug/tmp-outdir/9z342mhf$ align_find_frequent
[2019-08-27 12:56:51] INFO [job Filter_contam_in_prok_BLAST_Results] /pgap/output/debug/tmp-outdir/m82o2kgc$ align_filter
[2019-08-27 12:56:54] INFO [job Generate_contam_in_prok_hit_features] /pgap/output/debug/tmp-outdir/yy84irh6$ generate_fscr_feats
[2019-08-27 12:56:58] INFO [job fscr_calls_pass1] /pgap/output/debug/tmp-outdir/p7y20jeh$ fscr_calls_pass1
[2019-08-27 12:57:01] INFO [job fscr_format_calls] /pgap/output/debug/tmp-outdir/orbedbxc$ fscr_format_calls
[2019-08-27 12:57:07] INFO [job screen_evaluate] /pgap/output/debug/tmp-outdir/mpq331kb$ screen_evaluate
[2019-08-27 12:57:07] INFO [job yaml2json] /pgap/output/debug/tmp-outdir/7yg5pgqn$ yaml2json.py
[2019-08-27 12:57:08] INFO [job pgapx_yaml_ctl] /pgap/output/debug/tmp-outdir/gk938743$ pgapx_yaml_ctl \

Thanks
Mayank
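
In the `grep` output above, the only `[job ...]` line that reports `permanentFail` is `pgapx_yaml_ctl`; the step and workflow lines that follow appear to be the same failure propagating outward. The same filtering can be done in Python, which makes it easy to pull out just the job names. `failed_jobs` is a hypothetical helper, and the pattern assumes the cwltool log format seen in this thread.

```python
import re

# Matches e.g. "WARNING [job pgapx_yaml_ctl] completed permanentFail"
FAILED_JOB = re.compile(r"\[job ([^\]]+)\] completed permanentFail")

def failed_jobs(lines):
    """Return the names of jobs reported as permanentFail, in log order."""
    names = []
    for line in lines:
        match = FAILED_JOB.search(line)
        if match:
            names.append(match.group(1))
    return names

sample = [
    "WARNING [job pgapx_yaml_ctl] completed permanentFail",
    "WARNING [step pgapx_yaml_ctl] completed permanentFail",
    "INFO [workflow prepare_input_template] completed permanentFail",
    "WARNING Final process status is permanentFail",
]
print(failed_jobs(sample))
```

Passing an open file handle for `cwltool.log` instead of `sample` works the same way, since the function only iterates over lines.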

pgap.py --update spits out python traceback

I followed the instructions:

wget ....
chmod +x pgap.py
pgap.py --update

And I get this:

The latest version of PGAP is 2019-05-13.build3740, you have nothing installed >
Downloading (as needed) Docker image ncbi/pgap:2019-05-13.build3740
Traceback (most recent call last):
  File "./pgap.py", line 387, in install_docker
    r = subprocess.run([self.dockercmd, 'pull', self.docker_image], check=True)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/subprocess>
    with Popen(*popenargs, **kwargs) as process:
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/subprocess>
    restore_signals, start_new_session)
  File "/home/linuxbrew/.linuxbrew/Cellar/python/3.7.3/lib/python3.7/subprocess>
    executable = os.fsencode(executable)
  File "/home/linuxbrew/.linuxbrew/bin/../Cellar/python/3.7.3/lib/python3.7/os.>
    filename = fspath(filename)  # Does type-checking of `filename`.
TypeError: expected str, bytes or os.PathLike object, not NoneType

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./pgap.py", line 475, in <module>
    main()
  File "./pgap.py", line 465, in main
    params = Setup(args)
  File "./pgap.py", line 267, in __init__
    self.update()
  File "./pgap.py", line 378, in update
    self.install_docker()
  File "./pgap.py", line 389, in install_docker
    except CalledProcessError:
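
The first traceback ends in `expected str, bytes or os.PathLike object, not NoneType`, which means `self.dockercmd` was `None` when it was passed to `subprocess.run`. A likely cause, though not confirmed by the output, is that no `docker` executable was found on the PATH. The sketch below shows the kind of up-front check that turns this into a readable message; `find_container_runtime` is a hypothetical helper and the candidate names are assumptions.

```python
import shutil

def find_container_runtime(candidates=("docker",), which=shutil.which):
    """Return the full path of the first candidate found on PATH.

    Raises RuntimeError with a readable message if none is installed.
    """
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    raise RuntimeError(
        "none of these executables were found on PATH: " + ", ".join(candidates)
    )

try:
    print("using", find_container_runtime())
except RuntimeError as err:
    print("error:", err)
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.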

Execute_CRISPR_wnode failed

Hi,

I am trying to annotate a short region of a bacterial genome, around 30 kb long. I ran the program like this:

./pgap.py --ignore-all-errors -r -v -d -o pgap_annotation XXXXX_input.yaml

I am getting the following error message:

[2019-10-21 12:10:54] INFO [job Execute_CRISPR_submit] Max memory used: 36MiB
[2019-10-21 12:10:54] INFO [job Execute_CRISPR_submit] completed success
[2019-10-21 12:10:54] INFO [step Execute_CRISPR_submit] completed success
[2019-10-21 12:10:54] INFO [workflow bacterial_mobile_elem] starting step Execute_CRISPR_wnode
[2019-10-21 12:10:54] INFO [step Execute_CRISPR_wnode] start
[2019-10-21 12:10:54] INFO [job Execute_CRISPR_wnode] /pgap/output/debug/tmp-outdir/2ikx9n0a$ ncbi_crisper_wnode \
    -asn-cache \
    /tmp/tmpnj6oc9ec/stg02a16832-2d47-46e4-8c77-4d21008e0525/sequence_cache \
    -ncbi-crisper-path \
    /opt/crispr/1.0/bin/ \
    -input-jobs \
    /tmp/tmpnj6oc9ec/stg1e9332d9-e2d5-4c4b-9fb8-e45740ccfe8b/jobs.xml \
    -O \
    output
00324/000/0000/P  320D0144DADA04E1 0006/0006 2019-10-21T12:10:56.528834 a89a6049184f    UNK_CLIENT      320D0144DADA04E1_0000SID ncbi_crisper_wnode Info: LIB "wn_app.cpp", line 273: ncbi::CGPX_WorkerApp::Run() --- output path: /pgap/output/debug/tmp-outdir/2ikx9n0a/output
=== STDOUT: /opt/crispr/1.0/bin//pilercr -in /tmp/tmpn15i4az6/ncbi_crisper_wnode.324.324120404883200QhQlm/fasta_by_scaffold.324.32431717638912PgANHV/0.fa  -out .331.0.fa.pilecr.tmp -noinfo
FATAL: /opt/crispr/1.0/bin//pilercr -in /tmp/tmpn15i4az6/ncbi_crisper_wnode.324.324120404883200QhQlm/fasta_by_scaffold.324.32431717638912PgANHV/0.fa  -out .331.0.fa.pilecr.tmp -noinfo returned non-zero code 139
check out .331.0.fa.pilecr.tmp
[2019-10-21 12:10:59] INFO [job Execute_CRISPR_wnode] Max memory used: 36MiB
[2019-10-21 12:10:59] WARNING [job Execute_CRISPR_wnode] completed permanentFail
[2019-10-21 12:10:59] WARNING [step Execute_CRISPR_wnode] completed permanentFail
[2019-10-21 12:10:59] INFO [workflow bacterial_mobile_elem] completed permanentFail
[2019-10-21 12:10:59] WARNING [step bacterial_mobile_elem] completed permanentFail
[2019-10-21 12:10:59] INFO [workflow standard_pgap] completed permanentFail
[2019-10-21 12:10:59] WARNING [step standard_pgap] completed permanentFail
[2019-10-21 12:10:59] INFO [workflow ] completed permanentFail

And here is the end of the ncbi_crisper_wnode.0.log file:

wnode extra         id=lcl|NZ_JNMI01000006.1&id_offset=0
00324/004/0006/R  320D0144DADA04E1 0028/0006 2019-10-21T12:10:57.022212 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode Error: NCBI_CRISPER "ncbi_crisper_wnode.cpp", line 393: ncbi::CGPX_NcbiCrisperJob::Process() --- Error in the body of Process: (CException::eUnknown) ncbi_crisper call failed: /opt/crispr/1.0/bin//ncbi_crisper /tmp/tmpn15i4az6/ncbi_crisper_wnode.324.324120404883200QhQlm/fasta_by_scaffold.324.32431717638912PgANHV/0.fa
00324/004/0006/R  320D0144DADA04E1 0029/0007 2019-10-21T12:10:57.022578 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode Error: LIB "wn_worker_thread.cpp", line 275: ncbi::CWorkerThread::x_DoJob() --- error processing job: (CException::eUnknown) ncbi_crisper call failed: /opt/crispr/1.0/bin//ncbi_crisper /tmp/tmpn15i4az6/ncbi_crisper_wnode.324.324120404883200QhQlm/fasta_by_scaffold.324.32431717638912PgANHV/0.fa
00324/004/0006/RE 320D0144DADA04E1 0030/0008 2019-10-21T12:10:57.023659 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode request-stop  500 0.447921991 0 0
00324/002/0007/RB 320D0144DADA04E1 0031/0001 2019-10-21T12:10:57.023831 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode request-start internal_thread=output&processing_cycle=1&ncbi_app_version=0.0.0&ncbi_app_sc_version=22&ncbi_app_vcs_revision=591931
00324/002/0007/R  320D0144DADA04E1 0032/0002 2019-10-21T12:10:57.023862 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode extra         success_jobs=0&fail_jobs=1
00324/002/0007/R  320D0144DADA04E1 0033/0003 2019-10-21T12:10:57.024068 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode extra         attempts=1
00324/002/0007/RE 320D0144DADA04E1 0034/0004 2019-10-21T12:10:57.024119 a89a6049184f    UNK_CLIENT      UNK_SESSION              ncbi_crisper_wnode request-stop  200 0.000287771 0 0
00324/000/0000/P  320D0144DADA04E1 0035/0015 2019-10-21T12:10:59.573445 a89a6049184f    UNK_CLIENT      320D0144DADA04E1_0000SID ncbi_crisper_wnode Error: CORELIB(106.16) "ncbiapp.cpp", line 529: ncbi::CNcbiApplication::x_TryMain() --- Application's execution failed (CException::eUnknown) 1 jobs failed
00324/000/0000/PE 320D0144DADA04E1 0036/0016 2019-10-21T12:10:59.588016 a89a6049184f    UNK_CLIENT      320D0144DADA04E1_0000SID ncbi_crisper_wnode extra         ncbi_phid=320D0144DADA04E10000000000000001
00324/000/0000/PE 320D0144DADA04E1 0037/0017 2019-10-21T12:10:59.588113 a89a6049184f    UNK_CLIENT      320D0144DADA04E1_0000SID ncbi_crisper_wnode stop          3 4.776411056

Could you please help me to identify the problem?
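
Regarding the `returned non-zero code 139` line: by shell convention an exit status above 128 means the process was killed by signal `status - 128`, and 139 - 128 = 11 is SIGSEGV on Linux, i.e. `pilercr` crashed with a segmentation fault. Why it crashed on this input is not visible in the log. The small Python helper below (hypothetical name `describe_exit_status`) decodes such statuses.

```python
import signal

def describe_exit_status(status):
    """Describe a shell-style exit status, decoding 128+N as signal N."""
    if status > 128:
        number = status - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal {} ({})".format(number, name)
    return "exited with status {}".format(status)

print(describe_exit_status(139))
```

On Linux this reports signal 11 as SIGSEGV, which points at the external `pilercr` binary instead of the CWL workflow itself.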

Will any information be uploaded to NCBI?

For privacy and IP reasons, I was wondering whether any information is uploaded to NCBI when running locally with the -n flag. I am working with private genomes.
