mrolm / instrain Goto Github PK
View Code? Open in Web Editor NEWBioinformatics program inStrain
License: MIT License
Bioinformatics program inStrain
License: MIT License
I'm using inStrain version 1.3.11 and profiling individual genomes after mapping competitively to the entire assembled set. I first used the profile workflow basically as follows: inStrain profile POS-2015-07-16-mapping.sorted.bam POS2015-08-06-bin.38.fa -o results/CAPIIB.IS -p 8 -g all-POS-genomes-genes.fna -s POS-genomes.stb
where I prepared the genes and STB files as described in the docs. For this genome, there is ~20X coverage in this sample. From what I can tell pretty much everything works, except I get empty plots in the Gene Histogram output file.
I initially thought perhaps the profile_genes step in the larger profile workflow didn't work, so I ran profile_genes
on the IS directory as well. It looks like it just remade some of the raw files for gene level profiling, but none of the figures changed. I've attached the log for the profile and profile_genes runs for this IS directory. I think it's a small thing since from what I can tell everything is working and all the raw data is being produced/calculated correctly, but wanted to check.
log.log
Hello,
I am getting the following errors when running instrain plot.
For plotting plot 6:
/home/shengwei/anaconda3/envs/InStrain/lib/python3.8/site-packages/inStrain/plottingUtilities.py:743: RuntimeWarning: invalid value encountered in double_scalars '{:1.0f}%'.format((width/total)*100),
And for plotting plot 7:
Failed to make plot #7: 'midpoint'
Do you know what the problem could be? Thank you!
Hi there!
I've been trying to run a simple profile and keep getting a key error after it finishes profiling the genes. I've run instrain successfully in the past with no problem, so I'm not sure what went wrong/if I did anything wrong. I followed the documentation to generate all my input files/install instrain originally.
This is the command I used: inStrain profile SRR706782395+_mapped_to_GBSF.sort.bam GBSF.AOA2.fasta -g GBSF_prodigal_orfs.fna -o GBSF_SRR7067823_instrain
Here's a picture of the error output:
Any help would be much appreciated!
-Rebecca
Hello,
I have run inStrain compare
and I have got this error in these steps:
..:: inStrain compare Step 3. Auxiliary processing ::..
Could not cluster genomes; heres a traceback:
Traceback (most recent call last):
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 356, in run_genome_clustering
Cdb = inStrain.compare_utils.cluster_genome_strains(Mdb, kwargs)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_utils.py", line 212, in cluster_genome_strains
arr = scipy.spatial.distance.squareform(arr, checks=True)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2184, in squareform
is_valid_dm(X, throw=True, name='X')
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2259, in is_valid_dm
raise ValueError(('Distance matrix '%s' must be '
ValueError: Distance matrix 'X' must be symmetric.
..:: inStrain compare Step 4. Store results ::..
making plots 10
Plotting plot 10
Failed to make plot #10: Distance matrix 'X' must be symmetric.
Any ideas of why this is happening?
Thank you,
Sergio
Hi Dr. Olm,
coverm........................ ! NOT WORKING ! (version=na) (location = None)
Thanks,
Wang
Hello,
I'm running inStrain profile specifying the gene file (-g option) and I'm getting this error. Any ideas? I run it previously without the gene file and I didn't get any error. Gene file extension is .fna
Thank you
.:: inStrain profile Step 2. Profile scaffolds ::..
Traceback (most recent call last):
File "/home/srodriguez/.local/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 50, in main
self.profile_operation(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 76, in profile_operation
ProfileController(args).main()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 136, in main
self.run_profile()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 302, in run_profile
self.ISP = inStrain.profile.profile_bam(self.bam, self.fasta_db,
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/init.py", line 17, in profile_bam
return inStrain.profile.profile_controller.BamProfileController(
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 48, in main
self.gen_prof_args()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 78, in gen_prof_args
self.gen_genes_args()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 89, in gen_genes_args
scaff2geneinfo, scaff2gene2sequence = inStrain.GeneProfile.parse_genes(gene_file, **self.kwargs)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 754, in parse_genes
return parse_prodigal_genes(gene_file_loc)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 781, in parse_prodigal_genes
table['direction'].append(record.description.split("#")[3].strip())
IndexError: list index out of range
Thanks so much for this excellent software!
When I read the Glossary & FAQ https://instrain.readthedocs.io/en/latest/overview.html#term-nucleotide-diversity, I was confused by the formula of nucleotide diversity.
After I checked the source code, I found the right formula is 1 - [(frequency of A)^2 + (frequency of C)^2 + (frequency of G)^2 + (frequency of T)^2 ]
Hi, I'm checking the SNV output and it seems the refFreq is calculated based on the conBase instead of the refBase.
Here the output:
A | C | G | T | conBase | position | refBase | scaffold | varBase | cryptic | baseCoverage | varFreq | refFreq
4 | 0 | 2 | 0 | A | 675 | A | 10 | G | FALSE | 6 | 0.33 | 0.67
3 | 0 | 4 | 0 | G | 702 | G | 10 | A | FALSE | 7 | 0.43 | 0.57
0 | 2 | 0 | 3 | T | 717 | C | 10 | C | FALSE | 5 | 0.40 | 0.60
0 | 1 | 4 | 0 | G | 1741 | C | 10 | C | FALSE | 5 | 0.20 | 0.80
8 | 0 | 0 | 4 | A | 1769 | T | 10 | T | FALSE | 12 | 0.33 | 0.67
As you can see in the 3rd position, the refBase is C, therefore the refFreq should be 0.4, right? But instead, it says 0.6 (which is the conFreq).
I think it would be useful to get the three values, varFreq conFreq (the ones that are given now) and also the refFreq (so then it can be used to compare this value across different samples which share the position)
Hello,
I am trying to get the newest version of inStrain using conda install but I am getting the following error:
UnsatisfiableError: The following specifications were found to be incompatible with each error:
Output in format: Requested package -> Available versions
I also tried to use pip to install it but I got an older version (v1.2.14)
Is there another way to install the newest version?
Thank you,
Irmarie
Hi there,
I try to install inStrain using conda,
got this error by calling inStrain -h
File "/home/dell/.conda/envs/instrain/bin/inStrain", line 19, in <module>
import inStrain.controller
File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 20, in <module>
import inStrain.profile
File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/profile/__init__.py", line 5, in <module>
import inStrain.profile.profile_controller
File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 14, in <module>
import inStrain.SNVprofile
File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/SNVprofile.py", line 21, in <module>
import inStrain.profile.profile_utilities
File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/profile/profile_utilities.py", line 27, in <module>
import inStrain.GeneProfile
File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 26, in <module>
from Bio.codonalign.codonalphabet import default_codon_table
ModuleNotFoundError: No module named 'Bio.codonalign.codonalphabet'
I have tryied reinstall biopython==1.73, 1.74 or 1.78, all gave same error,
any suggestion on this,
thanks.
Liu
Hello,
I was wondering how I could get relative abundance of strains from the genome_wide output. Would that be from the breadth? Also, instead of using dRep, would it be possible to use the pangenome reference that is outputted by tools such as panaroo or roary as the reference database? Also my last question is that sometimes iRep is blank, do you have any guidance on why that is?
Best wishes,
Hi,
I would like to use inStrain profile to call variants from a wastewater sample with SARS-CoV2 virus amplified using the ARTIC pipeline. I would like to use as input the sorted BAM file that is output from the ARTIC fieldBioinformatics profile. However, when I tried this I get the error
"It seems like the .bam file could not be indexed!Make sure you have samtools version 1.6 or greater."
I am using samtools v1.12, so this should not be the issue.
I am using the sorted BAM file that was output from mapping reads to the reference fasta.
Input was as follow: inStrain profile ./primer-schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ./artic_output/samplename.sorted.bam
Your thoughts on this would be appreciated.
Thanks,
Candice
Running instrain with four genomes in the reference file fdaa...fasta to which I've mapped some Illumina reads which are a meta-genome of these. I get some sensible looking output in the output folder but get the error below at Step 4.. Any ideas?
You gave me a sam- I'm going to make it a .bam now
Converting ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sam to ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam
samtools view -S -@ 8 -b ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sam > ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam
sorting ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam
samtools sort ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam -o ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam -@ 8
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Indexing ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam
samtools index ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam.bai -@ 8
***************************************************
..:: inStrain profile Step 1. Filter reads ::..
***************************************************
Filtering reads: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:07<00:00, 16.79s/it]
1.5% of reads were removed during filtering
1,512,766 read pairs remain (0.4567 Gbp)
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************
Profiling splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 127/127 [13:49<00:00, 6.53s/it]
Merging splits and profiling genes: 50%|██████████████████████████████████████████████████████████████████▌ | 1/2 [00:31<00:31, 31.41s/it]
Merging splits and profiling genes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████Merging splits and profiling genes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [16:22<00:00, 491.07s/it]
***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************
Scaffold to bin was made using .stb file
Error- no genomes detected. Example: stb has scaffold 1, database has scaffold 2
Traceback (most recent call last):
File "/well/bag/users/lipworth/miniconda3/envs/instrain/bin/inStrain", line 31, in <module>
inStrain.controller.Controller().main(args)
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 53, in main
self.profile_operation(args)
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 82, in profile_operation
ProfileController(args).main()
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 153, in main
self.profile_genome_wide()
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 367, in profile_genome_wide
Controller().genome_wide_operation(copy.deepcopy(args))
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 94, in genome_wide_operation
inStrain.genomeUtilities.Controller().main(args)
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 55, in main
GIdb = genomeLevel_from_IS(IS, **vargs)
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 181, in genomeLevel_from_IS
GSI_db = _genomeLevel_scaffold_info_v3(gdb, stb, b2l, **kwargs)
File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 538, in _genomeLevel_scaffold_info_v3
for mm in sorted(list(gdb['mm'].unique())):
TypeError: 'NoneType' object is not subscriptable
My Conda env:
`# packages in environment at /well/bag/users/lipworth/miniconda3/envs/instrain:
#
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
asteval 0.9.23 pyhd8ed1ab_0 conda-forge
attrs 20.3.0 pyhd3deb0d_0 conda-forge
biopython 1.74 py38h516909a_0 conda-forge
bowtie2 2.4.2 py38hc2f83ea_2 bioconda
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.17.1 h7f98852_1 conda-forge
ca-certificates 2020.12.5 ha878542_0 conda-forge
cached-property 1.5.2 hd8ed1ab_1 conda-forge
cached_property 1.5.2 pyha770c72_1 conda-forge
capnproto 0.6.1 hfc679d8_1 conda-forge
certifi 2020.12.5 py38h578d9bd_1 conda-forge
checkm-genome 1.1.3 py_1 bioconda
cycler 0.10.0 py_2 conda-forge
decorator 4.4.2 py_0 conda-forge
dendropy 4.5.2 pyh3252c3a_0 bioconda
drep 3.0.0 py_1 bioconda
freetype 2.10.4 h0708190_1 conda-forge
future 0.18.2 py38h578d9bd_3 conda-forge
gettext 0.19.8.1 h0b5b191_1005 conda-forge
gsl 2.6 he838d99_2 conda-forge
h5py 3.1.0 nompi_py38hafa665b_100 conda-forge
hdf5 1.10.6 nompi_h6a2412b_1114 conda-forge
hmmer 3.3.2 h1b792b2_1 bioconda
htslib 1.12 h9093b5e_1 bioconda
iniconfig 1.1.1 pyh9f0ad1d_0 conda-forge
instrain 1.5.3 py_0 bioconda
joblib 1.0.1 pyhd8ed1ab_0 conda-forge
jpeg 9d h36c2ea0_0 conda-forge
kiwisolver 1.3.1 py38h1fd1430_1 conda-forge
krb5 1.17.2 h926e7f8_0 conda-forge
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.35.1 hea4e1c9_2 conda-forge
libblas 3.9.0 8_openblas conda-forge
libcblas 3.9.0 8_openblas conda-forge
libcurl 7.76.0 hc4aaa36_0 conda-forge
libdeflate 1.7 h7f98852_5 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_18 conda-forge
libgfortran-ng 9.3.0 hff62375_18 conda-forge
libgfortran5 9.3.0 hff62375_18 conda-forge
libgomp 9.3.0 h2828fa1_18 conda-forge
libidn2 2.3.0 h516909a_0 conda-forge
liblapack 3.9.0 8_openblas conda-forge
libllvm10 10.0.1 he513fc3_3 conda-forge
libnghttp2 1.43.0 h812cca2_0 conda-forge
libopenblas 0.3.12 pthreads_h4812303_1 conda-forge
libpng 1.6.37 h21135ba_2 conda-forge
libssh2 1.9.0 ha56f1ee_6 conda-forge
libstdcxx-ng 9.3.0 h6de172a_18 conda-forge
libtiff 4.2.0 hdc55705_0 conda-forge
libunistring 0.9.10 h14c3975_0 conda-forge
libwebp-base 1.2.0 h7f98852_2 conda-forge
llvmlite 0.36.0 py38h4630a5e_0 conda-forge
lmfit 1.0.2 pyhd8ed1ab_0 conda-forge
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
mash 2.3 he348c14_1 bioconda
matplotlib-base 3.4.1 py38hcc49a3a_0 conda-forge
more-itertools 8.7.0 pyhd8ed1ab_0 conda-forge
mummer 3.23 4 bioconda
ncurses 6.2 h58526e2_4 conda-forge
networkx 2.5 py_0 conda-forge
numba 0.53.1 py38h0e12cce_0 conda-forge
numpy 1.20.2 py38h9894fe3_0 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openssl 1.1.1k h7f98852_0 conda-forge
packaging 20.9 pyh44b312d_0 conda-forge
pandas 1.2.3 py38h51da96c_0 conda-forge
patsy 0.5.1 py_0 conda-forge
perl 5.32.0 h36c2ea0_0 conda-forge
perl-threaded 5.26.0 0 bioconda
pillow 8.1.2 py38ha0e1e83_0 conda-forge
pip 21.0.1 pyhd8ed1ab_0 conda-forge
pluggy 0.13.1 py38h578d9bd_4 conda-forge
pplacer 1.1.alpha19 h9ee0642_2 bioconda
prodigal 2.6.3 h779adbc_3 bioconda
psutil 5.8.0 py38h497a2fe_1 conda-forge
py 1.10.0 pyhd3deb0d_0 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pysam 0.16.0.1 py38hf7546f9_3 bioconda
pytest 6.2.2 py38h578d9bd_0 conda-forge
python 3.8.8 hffdb5ce_0_cpython conda-forge
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2021.1 pyhd8ed1ab_0 conda-forge
readline 8.0 he28a2e2_2 conda-forge
samtools 1.12 h9aed4be_1 bioconda
scikit-learn 0.24.1 py38h658cfdd_0 conda-forge
scipy 1.6.2 py38h7b17777_0 conda-forge
seaborn 0.11.1 hd8ed1ab_1 conda-forge
seaborn-base 0.11.1 pyhd8ed1ab_1 conda-forge
setuptools 49.6.0 py38h578d9bd_3 conda-forge
six 1.15.0 pyh9f0ad1d_0 conda-forge
sqlite 3.35.3 h74cdb3f_0 conda-forge
statsmodels 0.12.2 py38h5c078b8_0 conda-forge
tbb 2020.2 h4bd325d_4 conda-forge
threadpoolctl 2.1.0 pyh5ca1d4c_0 conda-forge
tk 8.6.10 h21135ba_1 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
tornado 6.1 py38h497a2fe_1 conda-forge
tqdm 4.59.0 pyhd8ed1ab_0 conda-forge
uncertainties 3.1.5 pyhd8ed1ab_0 conda-forge
wget 1.20.1 h22169c7_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.4.9 ha95c52a_0 conda-forge
`
Hello Matt,
I am writing to ask whether the coverage breadth option is the CoverM genome mean option (total position mapped by reads for at least one time divided by length minus 2*75 bp for each contig in the genome)
Thank you,
Jianshu
I ran the profile workflow with inStrain profile 2013-05-13-EBPR-mapping.sam ../bins/R1R2-EBPR-combined-bins.fasta -o 2013-05-13-profiling
using version 1.2.7 of inStrain, and installed with pip. I got past profiling the reads/scaffolds, but got this error at the GenomeWide comparison step:
03-25 15:29 DEBUG Done processing scaffolds- making SNV profile
03-25 16:01 INFO Storing output
03-25 16:01 DEBUG Writing output files now
03-25 16:02 INFO ***************************************************
.:: inStrain profile Step 3. Profile genes ::..
***************************************************
03-25 16:02 INFO Nevermind! You didnt include a genes file
03-25 16:02 INFO ***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************
03-25 16:02 DEBUG Loading scaffold to bin
03-25 16:02 INFO Scaffold to bin will consider all scaffolds the same genome
03-25 16:02 DEBUG GenomeWide compare failed
Traceback (most recent call last):
File "/home/GLBRCORG/emcdaniel/.conda/envs/pipenv/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 508, in main
gdb = genomeWideFromIS(IS, 'read_compaer', mm_level=mm_level)
File "/home/GLBRCORG/emcdaniel/.conda/envs/pipenv/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 92, in genomeWideFromIS
gdb = _add_stb(db, stb)
File "/home/GLBRCORG/emcdaniel/.conda/envs/pipenv/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 130, in _add_stb
gdb = db.copy()
AttributeError: 'NoneType' object has no attribute 'copy'
There are no output or figures files, as this step failed.
Hey!
I'm getting the following error in all my inStrain runs:
making plots 1, 2, 3, 4, 5, 6, 7, 8, 9
Plotting plot 1
Plotting plot 2
Plotting plot 3
Plotting plot 4
Plotting plot 5
Plotting plot 6
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: divide by zero encountered in double_scalars
'{:1.0f}%'.format((width/total)*100),
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: invalid value encountered in double_scalars
'{:1.0f}%'.format((width/total)*100),
Plotting plot 7
Skipping plot 8 - you don't have all required information. You need to run inStrain profile_genes first
Skipping plot 9 - you don't have all required information. You need to run inStrain profile_genes first
The line that is causing the issue is the following:
plt.text(offset + p.get_width(), p.get_y()+0.55*p.get_height(),
'{:1.0f}%'.format((width/total)*100),
ha='center', va='center')
The plots are coming out like this:
Hello. I am able to run the profile command when I don't also use a genes file, but this is the error I get when using a .gbk file. I also get different errors when using the .fna file.
(my-instrain) [cmh574@wind /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater ]$ /home/cmh574/.conda/envs/my-instrain/bin/inStrain profile /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater/SARSCT-WastewaterxxxxSWIFTxxNAUxx02222021-xx-xx-xxx-xxxx-169-CH_S4.sorted.bam /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater/NC_045512.fasta -g /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater/NC_045512_genes.gbk -o WastewaterxxxxNAUxx02222021_Instrain
..:: inStrain profile Step 1. Filter reads ::..
Filtering reads: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.80s/it]
0.5% of reads were removed during filtering
160,839 read pairs remain (0.06530 Gbp)
.:: inStrain profile Step 2. Profile scaffolds ::..
Profiling splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:25<00:00, 68.37s/it]
Merging splits and profiling genes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.00s/it]
Traceback (most recent call last):
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'coverage'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/cmh574/.conda/envs/my-instrain/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 53, in main
self.profile_operation(args)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 82, in profile_operation
ProfileController(args).main()
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 147, in main
self.run_profile()
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 319, in run_profile
self.write_output()
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 331, in write_output
self.ISP.generate(t)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/SNVprofile.py", line 251, in generate
db = db[db['coverage'] > 0]
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in getitem
indexer = self.columns.get_loc(key)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'coverage'
Hey!
I've noticed that inStrain is sorting my input BAM files even though they're already sorted, adding unnecessary computing time to the execution. Maybe it would be a good idea if inStrain check whether the BAM is sorted before calling samtools.
Here's a function I wrote for that:
from Bio import bgzf
def is_sorted_bam(filepath):
"""
Checks if a BAM file is sorted by coordinate.
Parameters
----------
filepath : Path
Path object pointing to BAM file.
Returns
-------
bool
Returns `True` if the BAM file is sorted by coordinate and `False`
otherwise.
"""
with bgzf.BgzfReader(filepath, "rb") as fin:
bam_header = fin.readline().strip()
return b"SO:coordinate" in bam_header
Hi,
After calling inStrain profile (version 1.4.0), I have this issue which causes some SNVs that are intergenic to have no mutation type annotated as "I". For example, the following output:
scaffold position position_coverage allele_count ref_base con_base var_base ref_freq con_freq var_freq A C T G gene mutation mutation_typecryptic class
NODE_999_length_47735_cov_45.183958 41629 107 2 C C T 0.9439252336448598 0.9439252336448598 0.05607476635514018 0 101 6 0 True SNV
NODE_999_length_47735_cov_45.183958 45312 37 2 C C T 0.945945945945946 0.945945945945946 0.054054054054054064 0 35 2 0 True SNV
NODE_999_length_47735_cov_45.183958 47635 80 2 T T G 0.95 0.95 0.05 0 0 76 4 True SNV
It is something that can be fixed manually during data analysis, but I was wondering if this also would need fixing in inStrain, or whether something else could be going on, for example with the data?
This is not entirely consistent over 2 similar datasets, and the mutation_type columns are not always identical concerning "I" or empty annotations.
Perhaps another but comparable issue is that in one dataset an annotation from Prodigal is incorporated under the "gene" column, but not in the other dataset.
Best wishes,
Elmar
Hi,
I was running instrain profile (v1.3.9) with 20 samples. Ten of them ran perfectly, however the other 10 got the following error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The input are sam files.
Do you know why some ran well and some didn't?
I've noticed that inStrain calls samtools sort
with a single thread even if -p
is larger than 1.
Is this intentional? If not, setting samtools sort
to use -p
threads can reduce the execution time.
% inStrain --version
inStrain 1.0.1
to stdout and return 0
Running instrain on 272 metagenomes:
Scaffold to bin was made using .stb file
***************************************************
..:: inStrain compare Step 1. Load data ::..
***************************************************
Loading Profiles into RAM: 100%|██████████| 272/272 [05:25<00:00, 1.20s/it]
36 of 36 scaffolds are in at least 2 samples
***************************************************
..:: inStrain compare Step 2. Run comparisons ::..
***************************************************
Running group 1 of 1
Comparing scaffolds: 100%|██████████| 36/36 [94:23:25<00:00, 9439.05s/it]
***************************************************
..:: inStrain compare Step 3. Auxiliary processing ::..
***************************************************
Cannot cluster genome AZ482__metabat2__Low_004.fna; 102 of 35511 comaprisons involve no genomic overlap at all: see log for more
Could not cluster genomes; heres a traceback:
Traceback (most recent call last):
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/inStrain/compare_controller.py", line 356, in run_genome_clustering
Cdb = inStrain.compare_utils.cluster_genome_strains(Mdb, kwargs)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/inStrain/compare_utils.py", line 234, in cluster_genome_strains
return pd.concat(cdbs).reset_index(drop=True)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 285, in concat
op = _Concatenator(
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
***************************************************
..:: inStrain compare Step 4. Store results ::..
***************************************************
My conda env:
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 1_gnu conda-forge
asteval 0.9.22 pyhd8ed1ab_0 conda-forge
attrs 20.3.0 pyhd3deb0d_0 conda-forge
biopython 1.74 py38h516909a_0 conda-forge
boost 1.70.0 py38h9de70de_1 conda-forge
boost-cpp 1.70.0 h7b93d67_3 conda-forge
bzip2 1.0.8 h7f98852_4 conda-forge
c-ares 1.17.1 h36c2ea0_0 conda-forge
ca-certificates 2021.1.19 h06a4308_0
cached-property 1.5.2 py_0
capnproto 0.6.1 hfc679d8_1 conda-forge
certifi 2020.12.5 py38h578d9bd_1 conda-forge
cycler 0.10.0 py_2 conda-forge
decorator 4.4.2 py_0 conda-forge
drep 3.0.1 py_0 bioconda
fastani 1.32 he1c1bb9_0 bioconda
freetype 2.10.4 h0708190_1 conda-forge
future 0.18.2 py38h578d9bd_3 conda-forge
gsl 2.6 he838d99_2 conda-forge
h5py 3.1.0 nompi_py38hafa665b_100 conda-forge
hdf5 1.10.6 nompi_h6a2412b_1114 conda-forge
htslib 1.11 hd3b49d5_1 bioconda
icu 67.1 he1b5a44_0 conda-forge
iniconfig 1.1.1 pyh9f0ad1d_0 conda-forge
instrain 1.5.1 py_0 bioconda
joblib 1.0.1 pyhd8ed1ab_0 conda-forge
jpeg 9d h516909a_0 conda-forge
kiwisolver 1.3.1 py38h1fd1430_1 conda-forge
krb5 1.17.2 h926e7f8_0 conda-forge
lcms2 2.12 hddcbb42_0 conda-forge
ld_impl_linux-64 2.35.1 hea4e1c9_2 conda-forge
libblas 3.9.0 8_openblas conda-forge
libcblas 3.9.0 8_openblas conda-forge
libcurl 7.71.1 hcdd3856_8 conda-forge
libdeflate 1.6 h516909a_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 h516909a_1 conda-forge
libffi 3.3 h58526e2_2 conda-forge
libgcc-ng 9.3.0 h2828fa1_18 conda-forge
libgfortran-ng 9.3.0 hff62375_18 conda-forge
libgfortran5 9.3.0 hff62375_18 conda-forge
libgomp 9.3.0 h2828fa1_18 conda-forge
liblapack 3.9.0 8_openblas conda-forge
libllvm10 10.0.1 he513fc3_3 conda-forge
libnghttp2 1.43.0 h812cca2_0 conda-forge
libopenblas 0.3.12 pthreads_h4812303_1 conda-forge
libpng 1.6.37 hed695b0_2 conda-forge
libssh2 1.9.0 hab1572f_5 conda-forge
libstdcxx-ng 9.3.0 h6de172a_18 conda-forge
libtiff 4.2.0 hdc55705_0 conda-forge
libwebp-base 1.2.0 h7f98852_0 conda-forge
llvmlite 0.35.0 py38h4630a5e_1 conda-forge
lmfit 1.0.2 pyhd8ed1ab_0 conda-forge
lz4-c 1.9.3 h9c3ff4c_0 conda-forge
mash 2.2.2 ha61e061_2 bioconda
matplotlib-base 3.3.4 py38h0efea84_0 conda-forge
more-itertools 8.7.0 pyhd8ed1ab_0 conda-forge
mummer4 4.0.0rc1 pl526he1b5a44_0 bioconda
ncurses 6.2 h58526e2_4 conda-forge
networkx 2.5 py_0 conda-forge
numba 0.52.0 py38h51da96c_0 conda-forge
numpy 1.20.1 py38h18fd61f_0 conda-forge
olefile 0.46 pyh9f0ad1d_1 conda-forge
openssl 1.1.1j h7f98852_0 conda-forge
packaging 20.9 pyh44b312d_0 conda-forge
pandas 1.2.2 py38h51da96c_0 conda-forge
patsy 0.5.1 py_0 conda-forge
perl 5.26.2 h36c2ea0_1008 conda-forge
pillow 8.1.0 py38ha0e1e83_2 conda-forge
pip 21.0.1 pyhd8ed1ab_0 conda-forge
pluggy 0.13.1 py38h578d9bd_4 conda-forge
prodigal 2.6.3 h516909a_2 bioconda
psutil 5.8.0 py38h497a2fe_1 conda-forge
py 1.10.0 pyhd3deb0d_0 conda-forge
pyparsing 2.4.7 pyh9f0ad1d_0 conda-forge
pysam 0.16.0.1 py38hbdc2ae9_1 bioconda
pytest 6.2.2 py38h578d9bd_0 conda-forge
python 3.8.6 hffdb5ce_5_cpython conda-forge
python-dateutil 2.8.1 py_0 conda-forge
python_abi 3.8 1_cp38 conda-forge
pytz 2021.1 pyhd8ed1ab_0 conda-forge
readline 8.1 h27cfd23_0
samtools 1.11 h6270b1f_0 bioconda
scikit-learn 0.24.1 py38h658cfdd_0 conda-forge
scipy 1.6.0 py38hb2138dd_0 conda-forge
seaborn 0.11.1 ha770c72_0 conda-forge
seaborn-base 0.11.1 pyhd8ed1ab_1 conda-forge
setuptools 52.0.0 py38h06a4308_0
six 1.15.0 pyh9f0ad1d_0 conda-forge
sqlite 3.34.0 h74cdb3f_0 conda-forge
statsmodels 0.12.2 py38h5c078b8_0 conda-forge
threadpoolctl 2.1.0 pyh5ca1d4c_0 conda-forge
tk 8.6.10 hed695b0_1 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
tornado 6.1 py38h497a2fe_1 conda-forge
tqdm 4.56.2 pyhd8ed1ab_0 conda-forge
uncertainties 3.1.5 pyhd8ed1ab_0 conda-forge
wheel 0.36.2 pyhd3deb0d_0 conda-forge
xz 5.2.5 h516909a_1 conda-forge
zlib 1.2.11 h516909a_1010 conda-forge
zstd 1.4.8 ha95c52a_1 conda-forge
Hello,
I was following your tutorial for MAGs and wanted to use MAGs to assess similarity of two metagenomes. I ran dRep compare and have subsetted the comparisons to those with popANI greater than 99.99% and percent bases compared of at least 0.5. Would I compare the absolute number of "same strains" to determine which metagenomes are similar or calculate the proportion of "same strains" over all strain level comparisons between the two samples. I am not sure if we can get a reliable denominator. Any other guidance you would have on showing the similarity between metagenomes using output from inStrain on a MAG catalogue such as that generated by Almeida et al would be appreciated.
Best wishes,
When trying to run the .get_raw_coverage_table
attribute from the documentation, I get the following error:
Type "help", "copyright", "credits" or "license" for more information.
>>> import inStrain
>>> import inStrain.SNVprofile
>>> IS=inStrain.SNVprofile.SNVprofile("results/CAPIIB.IS/")
>>> coverage_table = IS.get_raw_coverage_table()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'SNVprofile' object has no attribute 'get_raw_coverage_table'
I looked in the SNVprofile.py script and couldn't find that attribute defined. The .get_nonredundant_scaffold_table
method worked when I tried that.
I'm getting an KeyError: 'mm'
when inStrain is writing the output plots. I suspect that this may be caused by this genome having a low read coverage in this specific sample, as I didn't get this error with other genomes.
making plots 1, 2, 3, 4, 5, 6, 7, 8, 9
Plotting plot 1
Plotting plot 2
Failed to make plot #2: 'scaffold'
Plotting plot 3
Skipping plot 4 - you don't have all required information. You need to run inStrain genome_wide first
Traceback (most recent call last):
File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py", line 1422, in allele_freq_plot_from_IS
db = db.sort_values('mm').drop_duplicates(subset=['scaffold', 'position'], keep='last')\
File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/frame.py", line 4927, in sort_values
k = self._get_label_or_level_values(by, axis=axis)
File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/generic.py", line 1692, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'mm'
Skipping plot 5 - you don't have all required information. You need to run inStrain genome_wide first
Traceback (most recent call last):
File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py", line 1463, in linkage_decay_from_IS
db = db.sort_values('mm').drop_duplicates(subset=['scaffold', 'position_A', 'position_B'], keep='last')\
File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/frame.py", line 4927, in sort_values
k = self._get_label_or_level_values(by, axis=axis)
File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/generic.py", line 1692, in _get_label_or_level_values
raise KeyError(key)
KeyError: 'mm'
Plotting plot 6
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: divide by zero encountered in double_scalars
'{:1.0f}%'.format((width/total)*100),
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: invalid value encountered in double_scalars
'{:1.0f}%'.format((width/total)*100),
Plotting plot 7
Failed to make plot #7: 'scaffold'
Skipping plot 8 - you don't have all required information. You need to run inStrain profile_genes first
Skipping plot 9 - you don't have all required information. You need to run inStrain profile_genes first
Hi, thanks so much for your effort in developing and maintaining inStrain, it's incredibly useful.
I'm running inStrain v 1.4.0 installed with conda.
I am using two bacterial genomes as references and running profile across multiple metagenomes mapped to these references with bowtie2. I'm supplying a sorted bam, the prodigal nucleotide file, (concatenated) reference genomes, and and reference scaffolds but getting this AssertionError with some (but not all) of inStrain profile analyses.
Is there an obvious reason why neither -s argument method would work for a subset of metagenome analyses? I recognize this is somewhat similar to #15 but not exactly.
thanks!
Mike
## RUN submitting space-separated scaffold list
inStrain profile ${bam_dir}/${ID}.sorted.bam ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fasta -o ${outdir}/${ID}_PM04-derep-isolate-genomes.IS -p 20 -g ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fna -s ${f_dir}/p20105-s030_scaffolds.fasta ${f_dir}/p20105-s034_scaffolds.fasta
## RUN submitting scaffold2bin.json file (pointing to successful profile run) instead of space-separated scaffold list
inStrain profile ${bam_dir}/${ID}.sorted.bam ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fasta -o ${outdir}/${ID}_PM04-derep-isolate-genomes.IS -p 20 -g ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fna -s ${stb_dir}/scaffold2bin.json
############# Identical Assertion errors below
***************************************************
..:: inStrain profile Step 1. Filter reads ::..
***************************************************
Filtering reads: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 377/377 [00:04<00:00, 77.89it/s]
15.7% of reads were removed during filtering
68,100 read pairs remain (0.02590 Gbp)
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************
Profiling splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:25<00:00, 1.20s/it]
Merging splits and profiling genes: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [01:57<00:00, 5.86s/it]
***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************
Could not load the scaffold to bin file!
Traceback (most recent call last):
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/bin/inStrain", line 31, in <module>
inStrain.controller.Controller().main(args)
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 50, in main
self.profile_operation(args)
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 76, in profile_operation
ProfileController(args).main()
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 142, in main
self.profile_genome_wide()
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 356, in profile_genome_wide
Controller().genome_wide_operation(copy.deepcopy(args))
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 88, in genome_wide_operation
inStrain.genomeUtilities.Controller().main(args)
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 40, in main
self.prepare_genome_wide(IS, vargs)
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 127, in prepare_genome_wide
stb = load_scaff2bin(vargs.get('stb'), IS)
File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 875, in load_scaff2bin
assert False
AssertionError
Hi there,
I just packaged inStrain for GNU Guix and noticed that the LICENSE file in the repository is for GPL version 3 or later while the setup.py claims that the license is MIT (Expat). Could you please clarify which license actually applies?
Thank you!
Hi,
I just realised than in some cases the mean clonality value is missing in genomeWide_scaffold_info.tsv. This thing seems to happen in populations with one or more scaffolds having 0 conANI, 0 consensus_SNP, and 0 unmaskedBreadth in the scaffold_info.tsv file. In those scaffolds the mean_clonality | median_clonality | mean_microdiversity | median_microdiversity columns are empty in the scaffold_info.tsv file.
Maybe one solution is to take only into account those scaffolds with coverage > 5, or unmaskedBreadth > 0.5 for the mean clonality calculation?
I have a genome from IMG that has 'N' characters at some positions. The error when trying to profile scaffolds is:
ValueError: 'N' is not in list
whole scaffold exception- Ga0131788_11
We had a failure profiling a scaffold! Not sure which!
Should I remove the N's from the reference genome, or replace with some other character?
I've noticed that tqdm isn't really accurate for the group-level processing. It usually quickly reaches N-1/N
and then "stalls" for quite a while before finishing and moving onto the next group. Maybe it's taking a long time to close the processes:
# Close multi-processing
for proc in processes:
proc.terminate()
As an example of non-accurate tqdm time reporting, this took ~5 hours:
Running group 27 of 148
Comparing scaffolds: 100%|██████████| 687/687 [15:16<00:00, 1.33s/it]
Running group 28 of 148
Comparing scaffolds: 85%|████████▌ | 1371/1608 [1:04:07<14:14, 3.60s/it]
Maybe putting it all in tqdm.process_map()
will help?
Hi, this is a fantastic tool! I have a question about the scaffold to bin file.
In the tutorial it says 'One is to give inStrain a list of the .fasta files that went into making the concatenated .fasta file. The other is to provide inStrain with a “scaffold to bin” file, which lists the genome assignment of each scaffold in a tab-delimited file. In this case we’re going to use the scaffold to bin file provided by inStrain.."
I tried to supply a .txt file that has the universal location of the .fasta files present in the concatenated fasta as per your tutorial, space separated, but I keep getting the following error:
Could not load the scaffold to bin file!
Traceback (most recent call last):
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 35, in main
self.profile_operation(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 61, in profile_operation
ProfileController().main(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 224, in main
Controller().genome_wide_operation(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 73, in genome_wide_operation
inStrain.genomeUtilities.Controller().main(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 468, in main
stb = load_scaff2bin(args.stb, IS)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 568, in load_scaff2bin
assert False
AssertionError
My command:
inStrain profile bt2_peru_mapped/peru_drepbins_SRR2938072.sam peru_drepbins.fasta -o peru_drepbins_SRR2938072.IS -p 6 -s perubins_space.txt
I have a set of ~270 dereplicated MAGs from about 100 samples, and another set of ~270 MAGs from about 180 samples, so I'd prefer not to go through and grep the scaffolds from every MAG then match them to MAG ID -- I am sure there's an easier way to pull scaffolds from fasta and associate them with the bin ID w/ perl or something but I'm not super saavy in that arena. Am I just formatting my .txt file incorrectly or is there something I am missing that can solve this issue? Since the tutorial says inStrain provides the scaffold to bin file I was wondering if there is a way to generate it? I've included one of my fasta files that goes into the concat fasta file (I figured the concat file is too big to upload here) and then the list of fasta files in the concat fasta file.
debug.zip
Thanks!
Came across this when doing a fresh install of inStrain via pip. The Bio.codonalign.codonalphabet class is not present in biopython version 1.78 from what I can tell... no mention of it in the docs for that version.
I was able to install 1.74 via conda, then inStrain functions normally.
Hello, thank you for creating a great tool! I get this error when running inStrain compare with three input objects. Could you help me figure out what the issue is?
..:: inStrain compare Step 3. Auxiliary processing ::..
Could not cluster genomes; heres a traceback:
Traceback (most recent call last):
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/compare_controller.py", line 356, in run_genome_clustering
Cdb = inStrain.compare_utils.cluster_genome_strains(Mdb, kwargs)
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/compare_utils.py", line 212, in cluster_genome_strains
arr = scipy.spatial.distance.squareform(arr, checks=True)
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2215, in squareform
is_valid_dm(X, throw=True, name='X')
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2290, in is_valid_dm
raise ValueError(('Distance matrix '%s' must be '
ValueError: Distance matrix 'X' must be symmetric.
..:: inStrain profile Step 1. Filter reads ::..
Traceback (most recent call last):
File "/var/spool/slurm/d/job2216591/slurm_script", line 4, in
import('pkg_resources').run_script('inStrain==1.5.3', 'inStrain')
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/pkg_resources/init.py", line 651, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/pkg_resources/init.py", line 1448, in run_script
exec(code, namespace, namespace)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/EGG-INFO/scripts/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 53, in main
self.profile_operation(args)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 82, in profile_operation
ProfileController(args).main()
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 144, in main
self.profile_filter_reads()
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 228, in profile_filter_reads
inStrain.filter_reads.Controller().main_from_profile(self.ISP,
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/filter_reads.py", line 64, in main_from_profile
Rdic, RR = load_paired_reads(bam, scaffolds, **kwargs)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/filter_reads.py", line 164, in load_paired_reads
scaff2pair2info = get_paired_reads_multi(bam, scaffolds, **kwargs)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/filter_reads.py", line 764, in get_paired_reads_multi
proc.start()
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/spawn.py", line 183, in get_preparation_data
main_mod_name = getattr(main_module.spec, "name", None)
AttributeError: module 'main' has no attribute 'spec'
slurmstepd: error: *** JOB 2216591 ON r1513 CANCELLED AT 2021-04-12T10:53:58 ***
I'm using instrain 1.2.8 (pypi)
and getting the following error when running inStrain profile
:
Traceback (most recent call last):
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/bin/inStrain", line 31, in <module>
inStrain.controller.Controller().main(args)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 32, in main
self.profile_operation(args)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 56, in profile_operation
ProfileController().main(args)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 181, in main
Controller().profile_genes_operation(args)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 65, in profile_genes_operation
inStrain.GeneProfile.Controller().main(args)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 541, in main
Gdb = get_gene_info(IS)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 479, in get_gene_info
db = db[db['mm'] <= mm].sort_values('mm').drop_duplicates(subset=['gene'], keep='last')
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/pandas/core/frame.py", line 2800, in __getitem__
indexer = self.columns.get_loc(key)
File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'mm'
I'm guessing that it's due to the fact that the input *.bam file has almost nothing in it, but then it would be good if instrain profile
exited gracefully, especially since I'm running it in parallel on ~300 samples; some of which complete, and some are dying with this KeyError.
Hi there, I might have another issue. For the SNV output table, I also tried to manually calculate the position of the SNP in the gene it is located in, and I compared it to the position provided by the "mutation" row, and I noticed something.
For inStrain, the calculations are done as followed I think? Please correct me if I'm wrong, I'd very much like to learn:
So, for example, a SNV at position 8067 is at position 0 in the gene if the 1-based Prodigal gene starts at position 8068 (in the Prodigal file)
A SNV at position 9067 is at position 1000 in the gene if the Prodigal gene starts at position 8068 and goes until 9068.
A SNV at position 9068 will not be in the gene.
And now for my question:
A gene can be on the (+)-strand or (-)-strand (1 or -1 in the Prodigal genes.fna file).
Currently, the SNV position in the gene is calculated in the same way for both, namely as described above, and counting from the 0-based start position.
However, should the SNV position on the (-)-strand not be calculated with respect to the end position?
For example:
A SNV at position 234 will be at position 111 if the Prodigal gene end positions (which is really the start position of the gene) is at 345.
I am also wondering whether this has implications for the determination of the N or S annotations for "mutation" and "mutation_type"?
I hope you could take a look at this issue.
Best wishes,
Elmar
Hello,
I am trying to run the inStrain plot to generate the dendrogram but I am getting the following error:
making plots 10
/shared/centos7/anaconda3/3.7/lib/python3.7/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Plotting plot 10
Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
/home/icotto25/.local/lib/python3.7/site-packages/inStrain/plottingUtilities.py:1065: RuntimeWarning: Mean of empty slice
combo2value["{0}-vs-{1}".format(samp2,samp1)]]))
/home/icotto25/.local/lib/python3.7/site-packages/inStrain/plottingUtilities.py:1067: RuntimeWarning: Mean of empty slice
combo2value2["{0}-vs-{1}".format(samp2,samp1)]]))
Failed to make plot #10: Distance matrix 'X' must be symmetric.
I am not sure how to fix it. Attached is the inStrain genome_wide output.
Thank you,
NEU_nitrifiers_bins.fa.IS.COMPARE_genomeWide_compare.xlsx
I ran the inStrain profile command resulting in this error (screenshot), which is not reported in the log file and the output is as expected.
The exact command I used:
inStrain profile MoCom-vs-gtdb.sam gtdb.fasta -o MoCom -p 10 -g gtdb_genes/gtdb.gene.fna -s scaf2bin.tsv --database_mode
What could cause this error?
Can I still use the output?
Hi @MrOlm ,
I have started to use inStrain and thanks for developing such a great tool.
I have mapped reads with bowtie to a subset of the UHGG database, as described in the tutorial, but then I am only interested in one species and am using inStrain with only one reference genome.
After using inStrain profile
, I am getting multiple warnings such as this:
FAILURE StbError GUT_GENOME284629_65 GUT_GENOME284629.fna no_length will not be considered as part of the genome
Is this simply because my sam/bam files have been mapped against the UHGG, but then I have a single reference genome when using inStrain?
The code I used was:
bowtie2 -p 10 -X 2000 --no-mixed --very-sensitive \
-x UHGG_reps_selected.fasta.bt2 \
-1 cat_trimmed_1.fastq.gz \
-2 cat_trimmed_2.fastq.gz > test.sam
inStrain profile test.sam reference.fna \
-o test.IS -p 10 -g reference.genes.fna \
-s reference.stb
the reference.stb
only has the contigs for my genome of interest reference.fna
.
Thanks for any help,
Ramiro
This is a question versus an issue: do you think that minimap2 could be used instead of bowtie2? minimap2 is 2-3X faster than bowtie2, so it would be nice to use it, if possible. You have done any testing comparing the mappers to see if it affects the inStrain results?
I'm running instrain compare
(v1.2.8) with the following params: inStrain compare --min_cov 5 --min_freq 0.05 --fdr 1e-06 -p 8 -o $OUTDIR -i $INDIRS
, with ~40 genome references and ~280 samples (~1 mil reads per sample). According to the progress report for the command, each of the 8736 iterations is taking ~5.5 hours. More threads will help a bit, but I'm guessing that there still will be a huge time requirement. Is there a good way to substantially increase the speed, or does instrain compare
just not scale well to many samples?
Also, requiring one to list all paths to instrain objects as the input can lead to total command length limitations, at least on some operating systems. I would be great to be able to provide a file that lists all objects instead of actually listing them all in the command.
Running instrain compare
in a snakemake kept failing for certain jobs. I realized that instrain compare
will compress the output table if it is of a certain size:
if store:
out_base = self.get_output_base()
ft = '.tsv'
# If this file is going to be huge, compress it
if len(db) > 1e6:
ft = '.tsv.gz'
Can you please remove this dynamic output (ie., always *.tsv
or *.tsv.gz
)? It makes integrating instrain
into a snakemake pipeline much more complicated.
Hi Matt,
I realized that when converting sam
to bam
, inStrain is using samtools view
with a single thread. Is there a specific reason for this? just wondering if it could speed this step a little by adding more threads.
Best,
Ramiro
Hi Matthew
Thanks for the brianliant tool for the de-novo based population analysis.
I followed the tutorial and had a bechmark of several mags under version 1.2.4. It worked fine.
But when I applied to a bigger set of dereplicated mags, it got stuck at the read-filtering step, running for several hours.
Here is the std and log.log file
std.txt
log.log
From the log.log, it didn't move for several hours since below.
03-03 08:42 INFO Making read report
03-03 08:43 DEBUG running on all reads
03-03 08:54 DEBUG running on individual scaffolds
I am not sure whether it's still running but just slowly. Is there any ways to solve this issue or just speed up? From the top command, I can see instrain is still running but without utilize those defined threads at this step.
Best
Yan
Dear authors,
Many thanks for creating inStrain, and the very thorough documentation that goes with it. I finished the tutorial, and now want to use it on some of our own samples.
In my case, I have metagenomic samples from very different environments (think from complex soil, to more simple dairy samples). So I expect to find many, many different microbes. I also have a set of target microbial genomes (includes multiple strains for some species) that I am interested in and trying to find in all those environments. I can think of the following ways to use inStrain:
Would you be able to give me guidance on if these options make sense, and if you have an idea on which one would be the best approach?
Kind regards
Hello,
I ran inStrain compare using: inStrain compare -i XXXX -o YYYY
and got the following error in the first on one of the genomes:
Loading Profiles into RAM: 0%| | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'covT'The above exception was the direct cause of the following exception:Traceback (most recent call last):
File "/home/srodriguez/.local/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 53, in main
self.compare_operation(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 79, in compare_operation
inStrain.compare_controller.CompareController(args).main()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 53, in main
self.create_scaffoldcomparison_objects()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 162, in create_scaffoldcomparison_objects
valid_SCs, scaffold2length = make_scaffoldcomparison_objects(self.inputs, scaffolds_to_compare)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 513, in make_scaffoldcomparison_objects
scaffolds = list(ISP._get_covt_keys())
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/SNVprofile.py", line 533, in _get_covt_keys
filename = self.get_location('raw_data') + '/' + os.path.basename(Adb.loc['covT', 'value'])
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 873, in getitem
return self._getitem_tuple(key)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 1044, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 786, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 1110, in _getitem_axis
return self._get_label(key, axis=axis)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 1059, in _get_label
return self.obj.xs(label, axis=axis)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/generic.py", line 3488, in xs
loc = self.index.get_loc(key)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
raise KeyError(key) from err
KeyError: 'covT'
However, the rest of the steps ran without errors but only one output was produced (*comparisonsTable.tsv). Is some extra argument or parameter needed?
Thank you,
Sergio
I mostly use BBMap for most of my mapping needs. When I passed the BAM/SAM files to InStrain, I got errors about scaffolds couldn't be found and there weren't any reads in said scaffolds. This was resolved when I followed the tutorial and used Bowtie2 instead. I believe the error comes from the scaffold ID not being in the same column when BBMap outputs the BAM/SAM files, as I've had similar problems with how BBMAp outputs things and then using for other downstream programs. This is no longer an issue for me (since I switched to Bowtie2) - but I would maybe add a disclaimer to use a certain mapping software or to reformat the BAM/SAM files in a certain order than InStrain can take in, so future users know of this.
Hello,
I had a question about filtering the genome wide compare tsv file. I am comparing highly similar metagenomes and I am getting very low percent_compared (<0.005). I was hoping to filter based on a percent_compared of 0.5 and ANI of 99.99%. I am worried that nothing will be left after I filter. Do you know how to make sense of this low percent_compared given that these samples are expected to contain similar strains? Could it be an issue with my dereplicated isolates? I dereplicated a set of isolates at 97% ANI to get this dereplicated set that was inputted to instrain.
Thank you as always,
Just as a side question, do you ever filter based on coverage_overlap?
Thanks for the very nice tool! In the output, there is no parameter for allele frequency. Is there any way to calculate this based on the output data? Thanks!
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.