katholt / srst2 Goto Github PK

View Code? Open in Web Editor NEW

123.0 123.0 65.0 2.5 MB

Short Read Sequence Typing for Bacterial Pathogens

License: Other

Python 91.62% Shell 2.02% R 6.36%

srst2's People

Contributors

Stargazers

Watchers

Forkers

swlong jrherr bjpop hdashnow psgope bioinformaticsarchive darcymyers takadonet jkbf86 niyunyun rguerillot tarah28 holtlab sdwfrost rafalcode andreyto pauruihu jhenriksen-agbiome phylloxera alegione ssarria redmar-van-den-berg alenzhao phac-nml alexjfortuna ys4 jhaan1027 chanzuckerberg senasica-cnrpyc daneshmoradigaravand guilhemroyer amendal jingjiesong wangmz0617 wangpanqiao sadiarahman321 josemunozc iwashita-n indexofire willpitchers amani82 zhaoxia413 nurikim116 fmaguire shaodongyan kant gunzivan28 emollier eliasonaws mjohnpayne amarachike seafoxxsci atgenomics chau1311 cvn001 chen881215 wook2014 hudazuhair alienzj julianneyang zhangxiaodong8315 ztsin gilmahu otakuzerg

srst2's Issues

srst2.py throws error if allele has slash character

I ran srst2 against a gene database that happened to have some slash characters for some of the read names:

70__aec27/clpV__aec27/clpV_EC042_0215__R033079 R033079 aec27/clpV (EC042_0215) - putative type VI secretion system protein [Escherichia coli str. 042 (EAEC O44:H18)]

ATGATCCAGATTGATTTAGCCACGCTGGTAAAGCGGCTTAACCCCTTTGCAAAACAGGCG
...

This causes srst2.py to crash when generating pileups:

Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.5', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1548, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1102, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1164, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1275, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 859, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 765, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '67__aec27/clpV__aec27/clpV_G2583_0230__R033067.ERR024627__ERR024627.Escherichia_VF_clustered.pileup'

Possible solution is to just change the slashes to underscores

mlst results file only contains header when running certain samples

Here are the commands I ran:

getmlst.py --species "Escherichia coli#1"
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028698/ERR028698_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR028/ERR028698/ERR028698_2.fastq.gz
srst2 --input_pe ERR028698*.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt

This is the output I get:
Sample ST adk fumC gyrB icd mdh purA recA mismatches uncertainty depth maxMAF

Note that there is only a header and not even the sample name is printed

Here is the log, which does not indicate any errors:

01/24/2015 19:22:24 program started
01/24/2015 19:22:24 command line: /usr/local/bin/srst2 --input_pe ERR028698_1.fastq.gz ERR028698_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txtsrst2 --input_pe ERR028698_1.fastq.gz ERR028698_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt
01/24/2015 19:22:24 Total paired readsets found:1
01/24/2015 19:22:24 Building bowtie2 index for Escherichia_coli#1.fasta...
01/24/2015 19:22:24 Running: bowtie2-build Escherichia_coli#1.fasta Escherichia_coli#1.fasta
01/24/2015 19:22:29 Processing database Escherichia_coli#1.fasta
01/24/2015 19:22:29 Running: samtools faidx Escherichia_coli#1.fasta
01/24/2015 19:22:30 Processing sample ERR028698
01/24/2015 19:22:30 Output prefix set to: shigella1__ERR028698.Escherichia_coli#1
01/24/2015 19:22:30 Aligning reads to index Escherichia_coli#1.fasta using bowtie2...
01/24/2015 19:22:30 Running: bowtie2 -1 ERR028698_1.fastq.gz -2 ERR028698_2.fastq.gz -S shigella1__ERR028698.Escherichia_coli#1.sam -q --very-sensitive-local --no-unal -a -x Escherichia_coli#1.fasta
01/24/2015 19:22:32 Processing Bowtie2 output with SAMtools...
01/24/2015 19:22:32 Generate and sort BAM file...
01/24/2015 19:22:32 Running: samtools view -b -o shigella1__ERR028698.Escherichia_coli#1.unsorted.bam -q 1 -S shigella1__ERR028698.Escherichia_coli#1.sam.mod
01/24/2015 19:22:32 Running: samtools sort shigella1__ERR028698.Escherichia_coli#1.unsorted.bam shigella1__ERR028698.Escherichia_coli#1.sorted
01/24/2015 19:22:32 Deleting sam and bam files that are not longer needed...
01/24/2015 19:22:32 Deleting shigella1__ERR028698.Escherichia_coli#1.sam
01/24/2015 19:22:32 Deleting shigella1__ERR028698.Escherichia_coli#1.sam.mod
01/24/2015 19:22:32 Deleting shigella1__ERR028698.Escherichia_coli#1.unsorted.bam
01/24/2015 19:22:32 Generate pileup...
01/24/2015 19:22:32 Running: samtools mpileup -L 1000 -f Escherichia_coli#1.fasta -Q 20 -q 1 shigella1__ERR028698.Escherichia_coli#1.sorted.bam
01/24/2015 19:22:32 Processing SAMtools pileup...
01/24/2015 19:22:33 Scoring alleles...
01/24/2015 19:22:39 Finished processing for read set ERR028698 ...
01/24/2015 19:22:39 Finished processing for database Escherichia_coli#1.fasta ...
01/24/2015 19:22:39 MLST output printed to shigella1__mlst__Escherichia_coli#1__results.txt
01/24/2015 19:22:39 SRST2 has finished.

single end and paired end reads for a sample

Hi Kat,

I have single end read (merged from paired end reads) and paired end reads (remaining result of not merged) for a sample and I am using all for one srst2 run. I am not sure if the --merge_paired option is used for this condition? Because when I tried I got this error:

File "/srst2/srst2.py", line 1514, in main
mate2.append(reads[1])
IndexError: list index out of range

Please advise. Many thanks.

Regards,
Cheng

Fails on bowtie 2.2.x series, needs 2.1.0 only?

Seems to require bowtie 2.1.0 specifically? (which is nearly 2 years old now). We have the 2.2.2 version as default, but it fails:

11/28/2014 12:02:04 program started
11/28/2014 12:02:04 command line: /bio/sw/srst2/bin/srst2 --input_pe ../dropbox/E.faecium/ALL/Ef_aus2223/aus2223_S141_L002_R1_001.fastq.gz ../dropbox/E.faecium/ALL/Ef_aus2223/aus2223_S141_L002_R2_001.fastq.gz --output Aus2223_mlst --log --mlst_db Enterococcus_faecium.fasta --mlst_definitions efaecium.txt
11/28/2014 12:02:04 Total paired readsets found:1
11/28/2014 12:02:04 Incorrect version of bowtie installed.
11/28/2014 12:02:04 bowtie version 2.1.0 is required by SRST2.

error in forward/reverse read identification

Couple of fixes for

Dear Kat,

Firstly, thank you for providing such an excellent program. I have been using SRST2 on a few of our more unusual bugs. I encountered a couple of problems that I have fixed. As they were minor fixes, but big problems, I thought you would like to know.

1/ When using the --report_all_consensus. I was getting a bug which originated from line 764 where the output file contained a directory path. It was appending the directory path to the output file and generating an error. I fixed it by replacing

outpileup = allele_name + "." + all_pileup_file

with

outpileup = allele_name + "." + os.path.basename(all_pileup_file)

2/ I was getting a problem where the pileup file was empty but there was clearly coverage when viewing the bam file and sufficient reads were mapping. This seems to be a problem with the default options for samtool mpileup. I fixed this by adding the -A (map anomalous reads) option to line 622. While it may not be good practice to always consider unpaired reads, all of my reads were considered unpaired (all though they were a pe file). It maybe worth either making this the default option or providing an option to consider anomalous reads without having to dig into the code.

All the best,
Sion Bayliss

Change default behaviour to throw away pileups & final sorted bam

Add --keep_final_alignment option

if this and/or keep_interim_alignment is switched on, keep final sorted bam & pileups

otherwise delete pileup & bam

Why are my isolate names appended with a '?' in some cases but not all?

When running the command
srst2 --prev_output *_genes__*__results.txt --output xyz
Why are some isolate names appended with a '?' in the output table?

Sample column blank in outputmlstXYZ__results.txt

Hello SRST2 Team,
I'm running srsts2 on illumina reads.

I'm throwing my reads for Sequencing typing with below command:

sbatch -p 128gb --nodes=1 -J $i --time 12:0:0 -o $i".%j.out" -e $i".%j.stderr" ~/srst2-master/scripts/srst2.py --input_pe $forward_read $reverse_read --forward $forward_prefix --reverse $reverse_prefix --output $temp_dir/$i --log --save_scores --mlst_db Staphylococcus_aureus.fasta --mlst_definitions saureus.txt --mlst_delimiter '-'

$i is the sample/isolate name.
--output $temp_dir/$i
$temp_dir - is the output directory, and the prefix is $i i.e sample name

I get an output like below, where I'm having blank value in column for Sample:

Sample ST arcc aroe glpf gmk_ pta_ tpi_ yqil mismatches uncertainty depth maxMAF

    121     6       5       6       2       7       14      5       0       -       132.005428571   0.0512820512821`

However, when I run on data ERR024070, as in example.txt bundled with package, I get output as expected, with value in Sample column.

I've tried multiple times, but it seems output won't listen to me.

Is this because I'm having reads with different naming scheme?
Or because I'm running script incorrectly?

Kindly advise.

Many thanks.

accommodate slashes in species names

From: Paul Cherng [email protected]
Date: 30 November 2014 08:46:47 GMT+11
To: [email protected]
Subject: SRST2 getmlst.py bug

Hi Dr. Holt,

I am using your SRST2 software and I have come across a bug with the getmlst.py script. It appears that the "Campylobacter concisus/curvus" species from http://pubmlst.org/data/ causes the script to crash because of the slash in the name. Here is the error I get:

getmlst.py --species "Campylobacter concisus/curvus"

Traceback (most recent call last):
File "/usr/local/bin/getmlst.py", line 183, in
main()
File "/usr/local/bin/getmlst.py", line 141, in main
species_all_fasta_file = open(species_all_fasta_filename, 'w')
IOError: [Errno 2] No such file or directory: u'Campylobacter_concisus/curvus.fasta'

I'm not sure what the best solution for this is, perhaps simply replacing slashes with underscores will suffice.

Thanks,
-Paul Cherng

Add example output to doc (as figure since tables don't display properly)

Error in step: Using the VFDB Virulence Factor Database with SRST2

1: The URL to the VFDB has changed and, using wget, should now be:

wget http://www.mgc.ac.cn/VFs/Down/VFDB_setB_nt.fas.gz

2: We need three extra steps for the database parsing scripts to work.

i) gunzip VFDB_setB_nt.fas.gz
ii) mv VFDB_setB_nt.fas VFDB_setB_nt.fas.ffn
iii) save the following code as convert.py and execute as python convert.py, which gets rid of the "(gi:xxxxx)" bit in the sequence headers and the commas (replacing with '_') because downstream processing splits on the commas:

with open("VFDB_setB_nt.fas.ffn", "r") as input_handle:
    with open("VFDB_setB_nt.fas_corrected.ffn", "w") as output_handle:
        for line in input_handle:
            if '(gi:' in line:
                line=line.replace('(gi:', ' (gi:').replace(',','_')
                line = line.split()
                output_handle.write(line[0]+' '+' '.join(line[2:])+'\n')
            else:
                output_handle.write(line)

Without removing the text "(gi:xxxxx)" from the header, this next step will not work:

python VFDB_cdhit_to_csv.py --cluster_file Clostridium_cdhit90.clstr --infile Clostridium.fsa --outfile Clostridium_cdhit90.csv

Also, there is a redundant comment on line 46 in the script VFDB_cdhit_to_csv.py (should read "VFGxxxxxx"):

schultzm@x:gene_dbs $ grep 'the unique ID R0xxx' ~/VFDB_cdhit_to_csv.py -n
46:         database[ClusterNr].append(seqID) # for virulence gene DB, this is the unique ID R0xxx

PIP doesn't have scipy listed as dependency

I did a "pip install srst2/" from a git clone, using virtualenv (blank python canvas):

File "/bio/sw/srst2/local/lib/python2.7/site-packages/srst2/srst2.py", line 19, in
from scipy.stats import binom_test, linregress
ImportError: No module named scipy.stats

I then try to install scipy via PIP, but fails due to lack of numpy.

Once I install numpy, then scipy works, and srst2 works (well, i get the help screen).

I think the PIP file needs to have its dependencies listed in it somehow?

EcOH.fasta has 47 duplicate sequences & some inconsistences

I ran this command to check the database:

cd-hit-est -d 0 -i EcOH.fasta -c 1 -g 1 -o cdhit

and discovered 47 duplicate sequences (and some subsequences), including some which have confounding O/H types - here are some examples:

0       1251nt, >9__wzy__wzy-O123-Gp5__370... *
1       1251nt, >9__wzy__wzy-O123-Gp5__371... at +/100.00%
2       1251nt, >9__wzy__wzy-O123-Gp5__372... at +/100.00%
3       1251nt, >9__wzy__wzy-O186-Gp5__454... at +/100.00%

0       1230nt, >8__wzx__wzx-O153-Gp11__191... *
1       1230nt, >8__wzx__wzx-O178-Gp11__223... at +/100.00%

0       1071nt, >8__wzx__wzx-O28ac-Gp2__250... at +/100.00%
1       1071nt, >8__wzx__wzx-O42-Gp2__266... at +/100.00%
2       1230nt, >8__wzx__wzx-O42-Gp2__267... *

0       1221nt, >8__wzx__wzx-O118-Gp3__140... *
1       1221nt, >8__wzx__wzx-O118-Gp3__141... at +/100.00%
2       1221nt, >8__wzx__wzx-O151-Gp3__189... at +/100.00%

0       1149nt, >9__wzy__wzy-O129-Gp10__383... *
1       1149nt, >9__wzy__wzy-O13-Gp10__384... at +/100.00%
2       1149nt, >9__wzy__wzy-O135-Gp10__390... at +/100.00%
3       1149nt, >9__wzy__wzy-O135-Gp10__391... at +/100.00%

0       1053nt, >9__wzy__wzy-O111__351... *
1       1053nt, >9__wzy__wzy-O7__524... at +/100.00%

0       1074nt, >9__wzy__wzy-O153-Gp11__412... at +/100.00%
1       1098nt, >9__wzy__wzy-O178-Gp11__444... *

0       1380nt, >9__wzy__wzy-O28ac-Gp2__471... *
1       1380nt, >9__wzy__wzy-O42-Gp2__487... at +/100.00%
2       1380nt, >9__wzy__wzy-O42-Gp2__488... at +/100.00%

0       1332nt, >9__wzy__wzy-O17-Gp9__434... *
1       1332nt, >9__wzy__wzy-O44-Gp9__490... at +/100.00%
2       1332nt, >9__wzy__wzy-O77-Gp9__531... at +/100.00%

0       1191nt, >9__wzy__wzy-O18-Gp12__446... *
1       1146nt, >9__wzy__wzy-O18ac__457... at +/100.00%

Analyzing NextSEq data

Hi Kat,

I got NextSeq data which contains 4X forward fastq files and 4X reverse fastq files, in total 8 files for each sample. I concatenated 4 forward fastq files and 4 reverse fastq files to one forward and reverse fastq files, and run SRST2 but it seems like not processing the run.

Please tell me how to handle such multiple fastq files for SRST2.

Best regards,
Tomi

Crash during processing of large number of compiledResults.txt files into a single output

Running into an error while trying to use --prev_output to aggregate a large number of compiledResults from MLST & Antibiotic resistance gene database into a single report. Sample size is 932 samples. Following error is thrown during the processing phase:

Traceback (most recent call last):
File "/share/apps/srst2/bin/srst2.py", line 1592, in
main()
File "/share/apps/srst2/bin/srst2.py", line 1583, in main
compile_results(args,mlst_results_hashes,gene_result_hashes,compiled_output_file)
File "/share/apps/srst2/bin/srst2.py", line 1439, in compile_results
this_st = mlst_results_master[sample].split("\t")[1]
IndexError: list index out of range

I'm assuming it's something about the sheer number of samples? or perhaps the compiledResults files.

Best regards,
S. Wesley Long

Add 'How it works' section to doc

Gene detection reporting - top gene per gene symbol, rather than the top gene per cluster as intended

I found bug in the code whereby we were reporting the top gene per gene symbol, rather than the top gene per cluster.

So, in those cases where there are very distinct groups of genes that share a gene symbol, you would only ever get the top scoring allele amongst them. So you can miss genes.

For example I found some cases today where I was expecting blaOXA-23 to be present, and was getting blaOXA-66 reported as the allele for ‘blaOXA’. Actually blaOXA is a common gene symbol used for genes that span as low as 70% identity, so each of these subtypes of blaOXA need to be treated as different genes that could each have alleles present. We are prepared for this because we have several distinct blaOXA clusters annotated in our resistance gene database, and recommend pre-clustering of all user databases before using with SRST2… but the code was not using the clustering IDs properly. So, instead of having blaOXA-23 (cluster 297) and blaOXA-66 (cluster 299) reported as present, I was just seeing blaOXA-66 (blaOXA) in all the outputs.

Allow bowtie2 versions higher than 2.1.0

Currently we just check for 2.1.0, but I have tested and found it works with the more recent higher versions of bowtie2.
So check_command_version function needs to be modified to allow any versions higher than this FOR BOWTIE2 BUT NOT FOR SAMTOOLS as we need to enforce samtools 0.1.18 and not allow 0.1.19

grep gene name syntax error

Hi,

I have this issue at the end of running srst2 using resistance gene database. I don't find any problem with the result but I want to clarify this error with you. Is this error a significant one or it just output error with the gene naming but doesn't affect the result?

sh: -c: line 0: syntax error near unexpected token (' sh: -c: line 0:grep 194__mph(A)_1_D16251__mph(A)_1_D16251__00057 /database/resfinder2016_2.fasta'

Thanks,
Cheng

fullgenes table not reporting all samples when multiple readsets run in serial

if multiple readsets analysed in serial in a srst2 run, fullgenes reports the results for the last readset only as it was being rewritten each time

example.txt is missing the confident ST after line 106

On line 106 of the example.txt file, it says:

This time there is a confident ST called, which should look like this:

But then it doesn't actually have an example of what the confident ST should look like.

getmlst.py crashes with a cryptic error message when downloading "Campylobacter fetus"

When I run this command:

getmlst.py --species "Campylobacter fetus"

I get this error:

For SRST2, remember to check what separator is being used in this allele database
Traceback (most recent call last):
File "/usr/local/bin/getmlst.py", line 183, in
main()
File "/usr/local/bin/getmlst.py", line 173, in main
m = re.match('>(.)([-])(\d_)',head).groups()
AttributeError: 'NoneType' object has no attribute 'groups'

It seems like everything downloaded though, but cfetus.txt doesn't seem to contain any profiles

root@ac486a86e05b:/home# ls
C_fe_tkt.tfa Campylobacter_fetus.fasta Cfe_aspA.tfa Cfe_glnA.tfa Cfe_gltA.tfa Cfe_glyA.tfa Cfe_pgm.tfa Cfe_uncA.tfa cfetus.txt mlst_data_download_Campylobacter_fetus_2014-11-29.log

root@ac486a86e05b:/home# cat cfetus.txt
Scheme 9 is not available.

This probably isn't something that can be fixed on the SRST2 side since this species doesn't even have any profiles, but it would be nice to have a more meaningful error message when encountering this error mode.

Difference in log files for negatives when using gene_db

Hi Kat, this is in reply to your email, but I have included the original query below also.

The files that are created are:
3__3-paired.gene_db.sam
3__3-paired.gene_db.sam.mod
3__3-paired.gene_db.unsorted.bam
3__genes__gene_db__results.txt
However the top 3 are empty and the genes results table just has the isolate number in it?

I've run 10 different genes using 10 different gene_db's now and have returned this error in the log file on a number of negatives - but a lot of them are also ones I'd expect to be negative. Sometimes when an isolate has a positive result at 90% and negative at 60% or vice versa and this log file error has been present it's indicated a file that needs to be re-run (I'm doing 1500 and the internet hamster in Sydney sometimes drops my VPN connection!) But a lot of them when I've rerun them have returned the same log file error and seem to be fine in all other aspects?

Many thanks in advance

Karen

Original query -
I am currently using SRST2 to map reads to target sequences using gene_db. I keep getting a difference in log files for my negatives? Some seem to run fine and return the usual but some of my log files are returning a message after the samtools portion which I have highlighted below.

I have run a different gene_db on this same isolate and not had this message turn up in the log file and similarly I have used the same gene_db.fasta file on different isolates and they seem to have worked fine?

Is this actually a run error and false negative or should I just assume this is a negative for the target sequence? Am I interpreting the log files incorrectly? The below message (2) occurs on the screen when the log files change also? I have tried forcing 0.1.18 version of samtools but this seems to have no effect?

Log file error
08/27/2015 04:47:29 program started
08/27/2015 04:47:29 command line: /local/software/python/2.7.5/bin/srst2 --input_pe /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_1.fastq.gz /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_2.fastq.gz --log$
08/27/2015 04:47:29 Total paired readsets found:1
08/27/2015 04:47:29 Index for gene_db.fasta is already built...
08/27/2015 04:47:29 Processing database gene_db.fasta
08/27/2015 04:47:29 Processing sample 3-paired
08/27/2015 04:47:29 Output prefix set to: 3__3-paired.gene_db
08/27/2015 04:47:29 Aligning reads to index gene_db.fasta using bowtie2...
08/27/2015 04:47:29 Running: bowtie2 -1 /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_1.fastq.gz -2 /scratch/kc1e12/PhD/WAIT_trimmed/3-paired_2.fastq.gz -S 3__3-paired.gene_db.sam -q --very-sensi$
08/27/2015 04:48:28 Processing Bowtie2 output with SAMtools...
08/27/2015 04:48:28 Generate and sort BAM file...
08/27/2015 04:48:28 Running: samtools view -b -o 3__3-paired.gene_db.unsorted.bam -q 1 -S 3__3-paired.gene_db.sam.mod
08/27/2015 04:48:28 {'message': "Command 'samtools view -b -o 3__3-paired.gene_db.unsorted.bam -q 1 -S 3__3-paired.gene_db.sam.mod' failed with non-zero exit status: 1"}
08/27/2015 04:48:28 failed gene detection
08/27/2015 04:48:28 Tabulating results for database gene_db.fasta ...
08/27/2015 04:48:28 Finished processing for database gene_db.fasta ...
08/27/2015 04:48:28 Gene detection output printed to 3__genes__gene_db__results.txt
08/27/2015 04:48:28 SRST2 has finished.

Screen error
' is recognized as '*'.
[main_samview] truncated file.

Typo in srst2.py causing VFDB searches to fail

I ran into this error exclusively when trying to use a gene database derived from the VFDB files as described on the website. There is a typo on line 649 where the variable seqid is mistyped as "seq_id". it causes the following failure:

foo@computer2:SRST2$ srst2 --input_se foostaph.fastq.gz --output SA_test --log --gene_db Staphylococcus_VF_clustered.fasta
65260 reads; of these:
65260 (100.00%) were unpaired; of these:
61910 (94.87%) aligned 0 times
21 (0.03%) aligned exactly 1 time
3329 (5.10%) aligned >1 times
5.13% overall alignment rate
[samopen] SAM header is present: 1178 sequences.
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.2', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 1316, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 937, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 991, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report)
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 1101, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols)
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 679, in parse_scores
gene_name = get_allele_name_from_db(allele,unique_allele_symbols,unique_cluster_symbols,run_type,args)[0]
File "/usr/local/lib/python2.7/dist-packages/srst2-0.1.2-py2.7.egg/srst2/srst2.py", line 649, in get_allele_name_from_db
allele_name += "_" + seq_id
NameError: global name 'seq_id' is not defined

Does not appear to affect the function of SRST2 with the supplied resistance.fasta database or MLST functions. Correcting the spelling of the variable at line 649 to "seqid" corrects the problem.

Nb. User error using getmlst.py script

I just wanted to note this user error which I managed to run afoul of in the past week. If one updates the MLST database files with the getmlst.py script in the same directory where an older version of the files is stored, it will overwrite the alleles files (tfa and fasta) and the ST definition text file, however it leaves the bowtie2 and samtools generated index files intact.

If you neglect to remember this and fail to remove these files manually, SRST2 will subsequently run quite happily and generate a slew of rather erroneous results using the bad index files and the new data files. As far as I can tell, no errors are generated, per se, but the resulting MLST calls are mostly NF*? due to terrible mismatches as the indices don't match the new data files.

It might be nice if the getmlst script removed the old indexes and perhaps even generated the new ones after downloading the MLST database files, but I leave that decision up to you.

Best,
S. Wesley Long

Redundancy (CARB-2 and CARB-8 genes) in ARGannot.fasta

Hi, the sequence of CARB-2 and CARB-8 in this database are the same.

[Evidence] In our database,

137__CARB_Bla__CARB-2__559 yes;yes;CARB-2;Bla;HQ157204;747-1613;867
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCA
AAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATA
GGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGC
TTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAG
CAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTAT
TCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCA
ACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCC
AAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATT
GAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCA
ATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAA
AAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTG
CCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATT
ACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAA
ACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTT
GACGTTTATACATCACAGTCGCGCTGA

137__CARB_Bla__CARB-8__564 yes;yes;CARB-8;Bla;GQ866976;1345-2211;867
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCA
AAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATA
GGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGC
TTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAG
CAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTAT
TCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCA
ACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCC
AAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATT
GAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCA
ATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAA
AAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTG
CCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATT
ACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAA
ACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTT
GACGTTTATACATCACAGTCGCGCTGA

They are identical as revealed by megaBLAST. This redudancy already exists in the original ARG-ANNOT database as well as in GenBank:

gb|HQ157204.1|:747-1613|CARB-2
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCAAAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATAGGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGCTTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAGCAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTATTCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCAACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCCAAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATTGAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCAATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAAAAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTGCCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATTACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAAACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTTGACGTTTATACATCACAGTCGCGCTGA

gb|GQ866976.1|:1345-2211|CARB-8
ATGAAGTTTTTATTGGCATTTTCGCTTTTAATACCATCCGTGGTTTTTGCAAGTAGTTCAAAGTTTCAGCAAGTTGAACAAGACGTTAAGGCAATTGAAGTTTCTCTTTCTGCTCGTATAGGTGTTTCCGTTCTTGATACTCAAAATGGAGAATATTGGGATTACAATGGCAATCAGCGCTTCCCGTTAACAAGTACTTTTAAAACAATAGCTTGCGCTAAATTACTATATGATGCTGAGCAAGGAAAAGTTAATCCCAATAGTACAGTCGAGATTAAGAAAGCAGATCTTGTGACCTATTCCCCTGTAATAGAAAAGCAAGTAGGGCAGGCAATCACACTCGATGATGCGTGCTTCGCAACTATGACTACAAGTGATAATACTGCGGCAAATATCATCCTAAGTGCTGTAGGTGGCCCCAAAGGCGTTACTGATTTTTTAAGACAAATTGGGGACAAAGAGACTCGTCTAGACCGTATTGAGCCTGATTTAAATGAAGGTAAGCTCGGTGATTTGAGGGATACGACAACTCCTAAGGCAATAGCCAGTACTTTGAATAAATTTTTATTTGGTTCCGCGCTATCTGAAATGAACCAGAAAAAATTAGAGTCTTGGATGGTGAACAATCAAGTCACTGGTAATTTACTACGTTCAGTATTGCCGGCGGGATGGAACATTGCGGATCGCTCAGGTGCTGGCGGATTTGGTGCTCGGAGTATTACAGCAGTTGTGTGGAGTGAGCATCAAGCCCCAATTATTGTGAGCATCTATCTAGCTCAAACACAGGCTTCAATGGCAGAGCGAAATGATGCGATTGTTAAAATTGGTCATTCAATTTTTGACGTTTATACATCACAGTCGCGCTGA

All of these sequences listed above are the same to each other. Therefore, we can consider to remove CARB-8 from the database ARGannot.fasta.

download error

$ git clone https://github.com/katholt/srst2
Initialized empty Git repository in /n/home01/yuanli/07152014/srst2/.git/
fatal: https://github.com/katholt/srst2/info/refs download error - The requested URL returned error: 403 Forbidden

allow multiple read sets per sample, merged to give single result

Crash when trying to create novel allele consensus fasta from ARG database

Howdy all,

Quick question, I am trying to recover the novel allele sequences from some samples being run against the ARG database included with SRST2. I'm getting the following error when trying to create the allele pileup:

06/17/2015 12:44:06 Generate pileup...
06/17/2015 12:44:06 Running: samtools mpileup -L 1000 -f /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta -Q 20 -q 1 -B data_out/KPN166ec__KPN166ec.ARGannot.sorted.bam
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
06/17/2015 12:44:11 Processing SAMtools pileup...
06/17/2015 12:44:34 Scoring alleles...
Traceback (most recent call last):
File "/share/apps/srst2/bin/srst2.py", line 1592, in
main()
File "/share/apps/srst2/bin/srst2.py", line 1548, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/share/apps/srst2/bin/srst2.py", line 1102, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/share/apps/srst2/bin/srst2.py", line 1164, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/share/apps/srst2/bin/srst2.py", line 1275, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/share/apps/srst2/bin/srst2.py", line 859, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/share/apps/srst2/bin/srst2.py", line 765, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '77__OqxA_Flq__OqxA__1044.data_out/KPN166ec__KPN166ec.ARGannot.pileup'

So it seems like I'm having a problem creating a file / directory but I'm not certain why that is... the script is able to generate all the other output files it needs to... this is only an error when I try to run with --report_new_consensus. Any help is appreciated. This is happening for every paired end readset I attempt to generate the --report_new_consensus reports for.

Full program output is appended below, in case it is helpful.

06/17/2015 12:39:54 program started
06/17/2015 12:39:54 command line: /share/apps/srst2/bin/srst2.py --input_pe data_in/KPN166ec_1.fastq.gz data_in/KPN166ec_2.fastq.gz --report_new_consensus --gene_db /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta --output data_out/KPN166ec
06/17/2015 12:39:54 Total paired readsets found:1
06/17/2015 12:39:54 Index for /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta is already built...
06/17/2015 12:39:54 Processing database /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta
06/17/2015 12:39:54 Non-unique:55, CmlA_PheCmlA5
06/17/2015 12:39:54 Non-unique:114, CfrA_Phe
06/17/2015 12:39:54 Processing sample KPN166ec
06/17/2015 12:39:54 Output prefix set to: data_out/KPN166ec__KPN166ec.ARGannot
06/17/2015 12:39:54 Aligning reads to index /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta using bowtie2...
06/17/2015 12:39:54 Running: bowtie2 -1 data_in/KPN166ec_1.fastq.gz -2 data_in/KPN166ec_2.fastq.gz -S data_out/KPN166ec__KPN166ec.ARGannot.sam -q --very-sensitive-local --no-unal -a -x /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta
1764353 reads; of these:
1764353 (100.00%) were paired; of these:
1760384 (99.78%) aligned concordantly 0 times
2397 (0.14%) aligned concordantly exactly 1 time
1572 (0.09%) aligned concordantly >1 times
----
1760384 pairs aligned concordantly 0 times; of these:
156 (0.01%) aligned discordantly 1 time
----
1760228 pairs aligned 0 times concordantly or discordantly; of these:
3520456 mates make up the pairs; of these:
3517876 (99.93%) aligned 0 times
1544 (0.04%) aligned exactly 1 time
1036 (0.03%) aligned >1 times
0.31% overall alignment rate
06/17/2015 12:43:47 Processing Bowtie2 output with SAMtools...
06/17/2015 12:43:47 Generate and sort BAM file...
06/17/2015 12:43:47 Running: samtools view -b -o data_out/KPN166ec__KPN166ec.ARGannot.unsorted.bam -q 1 -S data_out/KPN166ec__KPN166ec.ARGannot.sam.mod
[samopen] SAM header is present: 1683 sequences.
06/17/2015 12:43:50 Running: samtools sort data_out/KPN166ec__KPN166ec.ARGannot.unsorted.bam data_out/KPN166ec__KPN166ec.ARGannot.sorted
06/17/2015 12:44:06 Deleting sam and bam files that are not longer needed...
06/17/2015 12:44:06 Deleting data_out/KPN166ec__KPN166ec.ARGannot.sam
06/17/2015 12:44:06 Deleting data_out/KPN166ec__KPN166ec.ARGannot.sam.mod
06/17/2015 12:44:06 Deleting data_out/KPN166ec__KPN166ec.ARGannot.unsorted.bam
06/17/2015 12:44:06 Generate pileup...
06/17/2015 12:44:06 Running: samtools mpileup -L 1000 -f /home/mresswl/archive/ARGFinder_DB/ARGannot.fasta -Q 20 -q 1 -B data_out/KPN166ec__KPN166ec.ARGannot.sorted.bam
[mpileup] 1 samples in 1 input files
Set max per-file depth to 8000
06/17/2015 12:44:11 Processing SAMtools pileup...
06/17/2015 12:44:34 Scoring alleles...
Traceback (most recent call last):
File "/share/apps/srst2/bin/srst2.py", line 1592, in
main()
File "/share/apps/srst2/bin/srst2.py", line 1548, in main
db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
File "/share/apps/srst2/bin/srst2.py", line 1102, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/share/apps/srst2/bin/srst2.py", line 1164, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/share/apps/srst2/bin/srst2.py", line 1275, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/share/apps/srst2/bin/srst2.py", line 859, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/share/apps/srst2/bin/srst2.py", line 765, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: '77__OqxA_Flq__OqxA__1044.data_out/KPN166ec__KPN166ec.ARGannot.pileup'

The module logging is necessary for slurm_srst2.py

The current version does not import this module, causing an error "global name 'logging' is not defined". A pull request has been submitted to fix this problem.

SRST2v0.1.5 crashes when using gene_db with only gene names in the FastA headers

When using a gene_db with FastA headers in the form ">geneid" or ">geneid additional information", srst2 terminates on a crash (see traceback below).

Attempting to read xxx loci from ST database yyy
Read ST database yyy successfully
Traceback (most recent call last):
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1592, in <module>
    main()
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1548, in main
    db_reports, db_results = run_srst2(args,fileSets,args.gene_db,"genes")
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1102, in run_srst2
    db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1164, in process_fasta_db
    unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 1275, in map_fileSet_to_db
    unique_gene_symbols, unique_allele_symbols, pileup_file)
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 788, in parse_scores
    gene_name = get_allele_name_from_db(allele,unique_allele_symbols,unique_cluster_symbols,run_type,args)[2] # cluster ID
  File "../SRST2/srst2-0.1.5/scripts/srst2.py", line 750, in get_allele_name_from_db
    cluster_id = gene_name = allele_name = seqid = allele_parts[0]
IndexError: list index out of range

As a temporary solution, I modified the code following the patch hereafter.

--- ./srst2orig.py 2015-02-10 14:54:08.000000000 +0100
+++ ./srst2.py  2015-02-10 16:23:16.000000000 +0100
@@ -202,7 +202,8 @@
                                                gene_cluster_symbols[gene_cluster] = cluster_symbol
                                else:
                                        # treat as unclustered database, use whole header
-                                       gene_cluster = cluster_symbol = name
+                                       gene_cluster = cluster_symbol = name.split()[0] #debug: name
+                                       gene_cluster_symbols[gene_cluster] = cluster_symbol #debug
                        else:
                                gene_cluster = name.split(delimiter)[0] # accept gene clusters raw for mlst
                                # check if the delimiter makes sense
@@ -738,7 +739,7 @@
        if run_type != "mlst":
                # header format: >[cluster]___[gene]___[allele]___[uniqueID] [info]
                allele_parts = allele.split()
-               allele_detail = allele_parts.pop(0)
+               allele_detail = allele_parts[0] #debug allele_parts.pop(0)
                allele_info = allele_detail.split("__")

                if len(allele_info)>2:

Please note there is a potential bug in the last line “if len(allele_info)>2:” this should be “if len(allele_info)>3:” instead.

Best Regards,
Christophe

default path to srst2 in the slurm_srst2.py script is 'srst2.py' but should be 'srst2'

This has been fixed in the master branch, but not yet in the release.
It will have to wait until the next release.

Miscalling alleles with only indels differences between variants

This will be a long post so the quick summary is that the new version of SRST2 added a new parameter which causes the miss prediction of allele ldh from 1 to 197. It effect only results where the different their is larger in length between different variant of an allele. Parameter in question is called "--mlst_max_mismatch" with default value of 10.

Possible solutions

Allow users to turn off "mlst_max_mismatch"

Details:

A few new parameters were added to SRST2 between the different version from 0.2.0 to 0.2.1 . One of theses parameters is called "--mlst_max_mismatch". Description from the srst2.py --help is the following :

--mlst_max_mismatch MLST_MAX_MISMATCH

Maximum number of mismatches per read for MLST allele calling (default 10)

The parameter will filter reads AFTER they have been mapped with bowtie2 based on the CIGAR string. If it see that it has more then 10 mismatch/indels, it throws out the read from the alignment completely. So another words, pre-filtering data before it starts determining the alleles.

However, this pre-filtering removes data that helps with making the create call in the case with ldh alelle.

Below is a screenshot of the alignment between ldh 1 and ldh 197. They are 100% identical but ldh_1 has an extra 19 base pairs near the end.

Since both variants of the alleles have a length differences of 19 base pairs, all reads aligning to ldh_197 that have the extra 19 base pairs are thrown out but should be kept! Attached is the pileup view of the region in question against ldh_197. Top windows is the old version of srst 0.20 and bottom is 0.2.1. As you can see, almost all reads are thrown out in that region.

So when it comes time to score each allele variants,both ldh_1 and ldh_197 both report having no indels when it 197 variant should have 19!

Instruction for the srst2_plots.R

Hi! I am trying to use the srst2_plots.R script to create similar plot as shown in Figure 7. When I am generating ST vs gene plot using the geneSTplot function, what should I use as an input file? Similarly for the geneSTbarPlot? Thank you!

Discrepancy between top scoring allele in result file and consensus sequence file

There is a discrepancy between the top scoring allele indicated in the result file versus the consensus sequence file (using --report_all_consensus) when the "mlst_allele/trun" warning is called in the mismatches field.

I believe this is caused by the arguments given to the create_allele_pileup function (specifically top_allele) which does not take into account next_best_allele that is used for reporting when a truncation override is detected.

My apologies if this post is bad form, as I am absolutely brand new to git!

Problem with reading pileup using non-recommended versions of samtools & bowtie

From:
Paola Flórez de Sessions, PhD
Genome Institute of Singapore Efficient Rapid Microbial Sequencing (GERMS) platform leader

Hello Kathryn,

I have been attempting to run srst2 on an mrsa dataset. The output files look really great and I would love to have that information for my dataset. However, I seem to have some trouble. While the program appears to run successfully my *results.txt files is empty while the *.pileup file is populated.

$ bsub -M 1024 -W 560 -n 1 /mnt/software/bin/srst2 --input_pe /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R1.fastq /mnt/pnsg10_projects/desessions/mrsa/WNB012.h5/HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001_R2.fastq --forward _R1 --reverse R2 --output WNB012_31PA --gene_db resistance.fas –log

We had to hack the version of samtools and bowtie as our data czars didn’t want to load older versions. Below is the *.pileup file.

$ [desessions@pegasus pileup]$ head WNB012_31PA__HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001.resistance.pileup
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 1 A 157 ^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^.^,^,^,^,^,^,^,^,^, BIIBIFI<IIIFIFFIFBBBFB<FFBFIFBFFFBIFIFIFIFIIBIIFIFIB7<IIFFFFFIFIBFFIBFFIII<FIIFIFIIIBIIFIIFFIIIIBIFF7IIFF7FIFIFFBIIFFFIIFIIIFIIIBFFFFFIFIIIIIFIIIFI<FBFFIIFFF
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 2 T 158 ...................................................................................................................................................,,,,,,,,,^.^. BIIBIFIIIFFIFBIFBFFIBFIF<FIFF<FFBFIIFIFIFIIBIIFIIIFBFIIFIIFIIFIFBFBFFFIII<FIFFIIIFIBIIFIIFFIIIIBIFFBIIIFFFIFIFFFIIBFFIIFIIIFFIIBFFFIBIIIIIIIIIIIFIFFBBBIIFFB<B
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 3 G 161 ...................................................................................................................................................,,,,,,,,,..^.^,^, FIBFIFFBIFIFIFFIFFFFIFIFFFIIFFBFFFIIIFIIIFFIFIIFIFIF<FIIIIIFI<FIFFFIFFFIIIFIIFIIIIIFIIFIIFFIIIIBIFF<IIFIFBIIIFFFIIFFBIIFIIIFFIIFFFFF<IBIIIIIIIIIFIFFBFFIIFFFBBB<<
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 4 G 159 ...............................................................................................................................................,,,,,,,,,...,,^.^. FIBFBIIBIFIFIF<IFFFFIFIIFFIIFFFIIFBIIFIIIBIF<IIIIFIFFIIIIIFIFBI<FBFFFIIII<FIIIIFIIFIIBIFFFIIIIFFIFIIIIIFIIIIFFIIFFIIFIIIFIIIFFBBIIFIIIIIIIIIFIFFBFIIIFFFBBB<<BB
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 5 C 168 ................................................................................................................................................,,,,,,,,,...,,..^.^.^.^.^.^.^.^. FIFFFIIIFIFIFFIFF7FIFIFFFFIFFFFBFBIIFIIIFIIBIIFIFIFFIIIIIIIB<IIF<FFIIII7FIIIIIIIBII7IIIFFIIIFIIF7IIIIIFIIIIFFIIBIFIIFIIIIFIIFFF<IBIFIIIIIIIIIFFIFBFIIIFFBFFBB<BBBBBBBBBB
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 6 T 177 ..................................................................................................................................................,,,,,,,,,...,,..........^.^.^.^.^.^.^. FIIFIII<IIIFIIFIFBFFIFIIFFFIFF7FFFBIIFIIIIFIBIIFIFIFFIIIIIIIB<I<IIFFIFIIIIIIBIIIIIFIIFIIIFFIIIFIII<IIFIIFIIIFFFIIFIIIIFIIIIIIIFBFBIBIBIIIIIIIIIFIFFFFIIIFFFFFFBBBBBBBBBBBBBBBBBBB
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 7 A 183 ....................................................................................................................................................,,,,,,,,,...,,.................^.^.^.^, FIIFFIIBIIIFIIBIFFFFFFIIFFFIFFFFIFFIIFIIIFIIFIIIIFIF7FIIIIIIIBFIBFIFFIIIIIBIII<IIIIIFIFFIFIIIIIIFFII<IIFFFFIIIIFFII7IIIIIIIIIIIIFFFFI<IFIIIIIIIIIFIIFBBFIIFFIFFFBBFFBBBBBBBBBBBBBBB????
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 8 A 184 ....................................................................................................................................................,,,,,,,,,...,,....................,^. FIIFIIIFIIIIIIFIFFFFFFIIFFBIIFFIFFFIIFIIIFII7IIIIIIF<FIIIIIFIFFIFFIFFIFIIIBIIIFIIIII<IIFIIIIIIIIFFII<IIIIIFIIIIFFIFFIFIIIIIIIIIIFFIFIFIFIIIIIIIIIFIFFFFFIIFFIFFFF<FFFFFFFFFFBB<BBBBAAAA;
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 9 A 189 ....................................................................................................................................................,,,,,,,,,...,,....................,.^.^.^.^.^, FIIFFIIBIIIIIIIIIFFFFFIFIFFIIFFFFFFIIIIIIIIIBIIIIIIFBIIIIIIIIBFIBBI<BIIIIIBIIIFIIIIIFIIIIIIIIIIIFFIIBIIFIFFIIIFFFIFFIFII<IIIIFIFFBFFIFIBIIIIIFIIIBIFFFBFIIIFFBFFF7FFFBFFFFFFFFFFFFFBBBB;=====
347__aph(3)-III__aph(3)-III__4_aph(3)-III_1_M26832_M26832_aminoglycosides 10 A 189 ..................................................................................................................................................,,,,,,,,,...,....................,.....,^.^.^, FIIFIIIBIFIIII<IIFIFIFIIIFFIIFBBFFFFIIIIFFIIBIIIIIIFBIIIIFIFIBFIBFIBFIIIIIBIIIBIIIIFFIFIIFIIIIIIFFIIBIFIFIIIIFIFIFIFIIBIIIIFIIB>>>>>BBE

Then when I check the results it looks like this:
$ [desessions@apollo1 Dec_1st_results]$ cat WNB012_31PA__genes__resistance__results.txt
Sample
HS001-PE-R00149_AHAJ23ADXX.WNB012_ATCACG_L001

The default minimum coverage remains 90% so in theory I should have results, right? Do you have any idea what could be happening?

Ps I have attached the resistance.fas in case you want to have a look.

Best,

Paola Flórez de Sessions, PhD
Genome Institute of Singapore Efficient Rapid Microbial Sequencing (GERMS) platform leader

Error with bowtie

Hi...

I have problem when try to run the command srst2

Input files DNA, FASTA:
salmonella.fasta
Error: could not open salmonella.fasta
Total time for call to driver() for forward index: 00:00:00
Error: Encountered internal Bowtie 2 exception (#1)
Command: bowtie2-build --wrapper basic-0 salmonella.fasta salmonella.fasta
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.7', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1604, in main
bowtie_index(args.mlst_db) # index the MLST database
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 154, in bowtie_index
run_command([bowtie_build_exec, fasta, fasta])
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 134, in run_command
raise CommandError({"message": message})
srst2.srst2.CommandError: {'message': "Command 'bowtie2-build salmonella.fasta salmonella.fasta' failed with non-zero exit status: 1"}

as I can resolve this error?....

I have a second question, if I have four files form miseq and nexseq ( 2file_R1 and 2file_R2 ) belonging to the same sample as I can gather the reads for the same result?

Rcode

Hello Kat,

I love SRST2, but I would like to plot out the results the readme file says "Added R code for plotting SRST2 output in R (plotSRST2data.R). Instructions will be added to the read me." When!! Is it ready? I was hoping to present it at lab meeting this Thursday. Is there any chance?!?

Please and thanks,
Paola

Allow forward and reverse flags as file name

Hi SRST2 team,

Your tool is amazing, very neat and fast.
I'm currently using SRST2 for my illumina data, paired end data.

I've files named as (example): 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz These were not accepted as
--input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005*.fastq.gz
This was probably, as the naming standardized in srst2.py is different.

I was able to run as:
python srst2.py --input_pe ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz --forward 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001 --reverse 111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001

This (above case) was handy, since I'm playing around with tool, and data and can provide prefixes (manually). I'm afraid, things might go wrong, when working on ~1000 isolates, in providing prefixes as in above fashion.

Can something be done in script to overcome/ignore this? This seems redundant; provide paired end flag, read1 file name, read2 file name and prefixes for forward and reverse reads.

Something Like:
python srst2.py --input_pe --forward ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R1_001.fastq.gz --reverse ../../sample_isolates_test/111_CN_04_B26_M3_C8_P1_Kleb_TATGTGGC_L005_R2_001.fastq.gz

Extract new allele sequence

Thanks a lot for this perfect tool! But I can't figure out is it possible to extract new allele sequence after it's identification by mapping?

create_allele_pileup crashes if --output contained a directory path

I ran this command:

"srst2 --input_pe /data/scratch/Bcereus-15/Bcereus-15_1.fastq.gz /data/scratch/Bcereus-15/Bcereus-15_2.fastq.gz --output /data/output/appresults/14091077/Bcereus-15/Bcereus-15 --save_scores --report_all_consensus --mlst_db /opt/gene_databases/mlst/11302014/Bacillus_cereus/Bacillus_cereus.fasta --mlst_definitions /opt/gene_databases/mlst/11302014/Bacillus_cereus/bcereus.txt --gene_db /opt/srst2/data/ARGannot.fasta /opt/srst2/data/PlasmidFinder.fasta /opt/gene_databases/vfdb/11302014/Bacillus/Bacillus_VF_clustered.fasta"

Note that --output contains a directory in the prefix because I want to write the outputs to that folder instead of the cwd. This apparently causes problems if you turn on --report_all_consensus because create_allele_pileup tries to just concatenate the allele name to the front of the directory name:

def create_allele_pileup(allele_name, all_pileup_file):
outpileup = allele_name + "." + all_pileup_file

which results in this error:

12/05/2014 08:11:09 Processing SAMtools pileup...
12/05/2014 08:11:14 Scoring alleles...
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.4', 'console_scripts', 'srst2')()
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1508, in main
mlst_report, mlst_results = run_srst2(args,fileSets,args.mlst_db,"mlst")
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1074, in run_srst2
db_reports, db_results_list = process_fasta_db(args, fileSets, run_type, db_reports, db_results_list, fasta)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1136, in process_fasta_db
unique_gene_symbols, unique_allele_symbols,run_type,ST_db,results,gene_list,db_report,cluster_symbols,max_mismatch)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 1247, in map_fileSet_to_db
unique_gene_symbols, unique_allele_symbols, pileup_file)
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 831, in parse_scores
allele_pileup_file = create_allele_pileup(top_allele, pileup_file) # XXX Creates a new pileup file for that allele. Not currently cleaned up
File "/usr/local/lib/python2.7/dist-packages/srst2/srst2.py", line 737, in create_allele_pileup
with open(outpileup, 'w') as allele_pileup:
IOError: [Errno 2] No such file or directory: 'pta_102./data/output/appresults/14091077/Bcereus-15/Bcereus-15__Bcereus-15.Bacillus_cereus.pileup'

One possible solution is to use the os.path module to split the all_pileup_file string into the directory and filename, concatenate allele_name to the filename, and the rejoin the directory with the new filename.

VFDB_cdhit_to_csv.py throws KeyError for Mycobacterium

I am following the directions from https://github.com/katholt/srst2#using-the-vfbd-virulence-factor-database-with-srst2 to generate a VFDB gene database for Mycobacterium. I have already run these commands:

python VFDBgenus.py --infile CP_VFs.ffn
cd-hit -i Mycobacterium.fsa -o Mycobacterium_cdhit90 -c 0.9 > Mycobacterium_cdhit90.stdout
(FYI, I am using cd-hit-v4.6.1-2012-08-27)

But when I run this command:
python VFDB_cdhit_to_csv.py --cluster_file Mycobacterium_cdhit90.clstr --infile Mycobacterium.fsa --outfile Mycobacterium_cdhit90.csv

I get this error:
Traceback (most recent call last):
File "/install/srst2/database_clustering/VFDB_cdhit_to_csv.py", line 66, in
sys.exit(main())
File "/install/srst2/database_clustering/VFDB_cdhit_to_csv.py", line 59, in main
clusterid = seq2cluster[seqID]
KeyError: 'R027152'

UnboundLocalError for VFDB_cdhit_to_csv.py

Good day,

Upon running the VFDB_cdhit_to_csv.py against my cluster file, the following error ensued:

"Traceback (most recent call last):
File "../database_clustering/VFDB_cdhit_to_csv.py", line 67, in
sys.exit(main())
File "../database_clustering/VFDB_cdhit_to_csv.py", line 61, in main
outstring = ",".join([seqID, clusterid, gene, allele, str(record.seq), re.sub(",","",record.description)]) + "\n"
UnboundLocalError: local variable 'clusterid' referenced before assignment"

What could have caused the error?

Thank you very much,
Sarah

file scores_vs_expected.py seems missing

Hi Kat and others,

I tried to install srst2 today and run into some issues which seems to be related to missing file called scores_vs_expected.py. using the git clone command, followed by pip install.

When I copied that file from srst2-0.1.7.zip version and placed in into my directory I got it installed (i.e. it says it got installed). However, the installation is not very successful as, when asking for the current version, it gives out random things (see below).

Could you please direct me to the right version of that file?
Many thanks in advance,

Pieter Smit

Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.8', 'console_scripts', 'srst2')()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 351, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2363, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2088, in load
entry = import(self.module_name, globals(),globals(), ['name'])
File "/usr/local/lib/python3.4/dist-packages/srst2/srst2.py", line 306
print "Warning! MLST delimiter is " + delimiter + " but these genes may violate the pattern and cause problems:"
^
SyntaxError: Missing parentheses in call to 'print'
Traceback (most recent call last):
File "/usr/local/bin/srst2", line 9, in
load_entry_point('srst2==0.1.8', 'console_scripts', 'srst2')()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 351, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2363, in load_entry_point
return ep.load()
File "/usr/lib/python3/dist-packages/pkg_resources.py", line 2088, in load
entry = import(self.module_name, globals(),globals(), ['name'])
File "/usr/local/lib/python3.4/dist-packages/srst2/srst2.py", line 306
print "Warning! MLST delimiter is " + delimiter + " but these genes may violate the pattern and cause problems:"
^
SyntaxError: Missing parentheses in call to 'print'

global name 'seq_id' is not defined error

I get this error when I use SRST2, How can I resolve it .

SRST2 installation problem with multiple Bowtie versions

Hi,

I saw your paper and I'm really interested in what SRST2 can do.
I've just installed srsts2 following the installation instructions on https://github.com/katholt/srst2.

By default, install all programs in /usr/local. I already had samtools v1.2 and bowtie 2.2.5 in this directory before installing srst2.
So, I downloaded and installed bowtie2 2.2.4 and samtools 0.1.18 in the same directory (/usr/local/) before installing srst2.
During srst2 installation i did the test if the programs were installed properly and it was OK.
I established the
export SRST2_SAMTOOLS=/usr/local/samtools-0.1.18/
export SRST2_BOWTIE2=/usr/local/bowti2-2.2.4/bowtie2
export SRST2_BOWTIE2_BUILD=/usr/local/bowti2-2.2.4/bowtie2-build

I then tried to do a test of the tool using the example.txt. When i run the command
sudo /usr/local/bin/srst2 --input_pe ERR024070_1.fastq.gz ERR024070_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta

I only get the shigella1.log and when I look at the log this is what I see ($ gedit shigella1.log):

01/15/2016 12:21:15 program started
01/15/2016 12:21:15 command line: /usr/local/bin/srst2 --input_pe ERR024070_1.fastq.gz ERR024070_2.fastq.gz --output shigella1 --log --save_scores --mlst_db Escherichia_coli#1.fasta --mlst_definitions ecoli.txt --gene_db ARGannot.fasta
01/15/2016 12:21:15 Total paired readsets found:1
01/15/2016 12:21:15 Failed command: bowtie2 --version
01/15/2016 12:21:15 [Errno 2] No such file or directory
01/15/2016 12:21:15 Could not determine the version of bowtie.
01/15/2016 12:21:15 Do you have bowtie installed in your PATH?

I'm a newbie in bioinformatics (using the command line) so the "bug" must be very simple but I don't see how to solve it...

Perhaps this info is also useful : when I installed bowtie2 2.2.5 I also defined it on my PATH : export BT2_HOME=/usr/local/bowtie2-2.2.5/

Could this create a conflict?

Any help will be welcomed !:-)

Thanks in advance for you answer/advices.

Sincerely,
P

add example command and web link to the text printed upon installation

Need a maximum divergence option to simplify gene reporting.

I noticed that we were sometimes picking up quite distant homologs of genes that are in the database, e.g. calls with 250 SNPs against a gene of total length 1000bp. So 25% divergent. Unless you are actually looking for NOVEL genes rather than known ones, we don’t really want this! We JUST want the confident hits to known resistance genes, not hits to distant homologs. Note this might be useful for identifying new virulence genes, or generating hypotheses about unexplained resistance phenotypes, but just adds confusion to resistance gene typing.

Use a specific version of samtools (and others)

Hi @katholt,

I wanted to test your temperature on a proposed contribution before raising a pull request. I hope you don't mind. To be clear, I'm happy to make a pull request, I just wanted to check that you'd be happy with the principle before I do the work.

We've got a few different versions of samtools installed but the default is 0.1.19; I'd like to use version 0.1.18 with srst2*. There are a few ways I could do this, but the neatest would be to set an environment variable (e.g. SRST2_SAMTOOLS) to point to the preferred binary and then to set this in .bashrc or .bash_profile. What do you think?

I'm happy to make the change myself, if it's something you'd be interested in merging. I think it could possibly be useful to other people as well.

I'd suggest that I make changes to [scripts/srs2.py] and [scripts/slurm_srst2.py] to check if the environment variable is set otherwise default to vanilla 'samtools'. For consistency, I could do something similar for bowtie2-build if you like.

Let me know if this is a contribution you'd be interested in otherwise I'll have a think about something a bit hackier that I could do my end.

Many thanks,

Ben

OK so actually my situation is a bit more complicated in that I'm trying to help my colleagues use this software on a central machine and make it so that for this piece of software they use samtools-0.1.18 but for other bits of software they use a different version, preferably without them needing to notice that they're doing so.