mrolm / instrain Goto Github PK

View Code? Open in Web Editor NEW

135.0 7.0 33.0 267.33 MB

Bioinformatics program inStrain

License: MIT License

Python 99.51% Dockerfile 0.15% Makefile 0.16% Shell 0.18%

instrain's People

Contributors

Stargazers

Watchers

instrain's Issues

Gene Histogram Plots Emtpy

I'm using inStrain version 1.3.11 and profiling individual genomes after mapping competitively to the entire assembled set. I first used the profile workflow basically as follows: inStrain profile POS-2015-07-16-mapping.sorted.bam POS2015-08-06-bin.38.fa -o results/CAPIIB.IS -p 8 -g all-POS-genomes-genes.fna -s POS-genomes.stb where I prepared the genes and STB files as described in the docs. For this genome, there is ~20X coverage in this sample. From what I can tell pretty much everything works, except I get empty plots in the Gene Histogram output file.
I initially thought perhaps the profile_genes step in the larger profile workflow didn't work, so I ran profile_genes on the IS directory as well. It looks like it just remade some of the raw files for gene level profiling, but none of the figures changed. I've attached the log for the profile and profile_genes runs for this IS directory. I think it's a small thing since from what I can tell everything is working and all the raw data is being produced/calculated correctly, but wanted to check.
log.log

Error plotting plotting

Hello,

I am getting the following errors when running instrain plot.

For plotting plot 6:
/home/shengwei/anaconda3/envs/InStrain/lib/python3.8/site-packages/inStrain/plottingUtilities.py:743: RuntimeWarning: invalid value encountered in double_scalars '{:1.0f}%'.format((width/total)*100),

And for plotting plot 7:
Failed to make plot #7: 'midpoint'

Do you know what the problem could be? Thank you!

KeyError running profile function?

Hi there!

I've been trying to run a simple profile and keep getting a key error after it finishes profiling the genes. I've run instrain successfully in the past with no problem, so I'm not sure what went wrong/if I did anything wrong. I followed the documentation to generate all my input files/install instrain originally.

This is the command I used: inStrain profile SRR706782395+_mapped_to_GBSF.sort.bam GBSF.AOA2.fasta -g GBSF_prodigal_orfs.fna -o GBSF_SRR7067823_instrain

Here's a picture of the error output:

Any help would be much appreciated!

-Rebecca

Distance matrix 'X' must be symmetric

Hello,

I have run inStrain compare and I have got this error in these steps:

..:: inStrain compare Step 3. Auxiliary processing ::..

Could not cluster genomes; heres a traceback:
Traceback (most recent call last):
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 356, in run_genome_clustering
Cdb = inStrain.compare_utils.cluster_genome_strains(Mdb, kwargs)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_utils.py", line 212, in cluster_genome_strains
arr = scipy.spatial.distance.squareform(arr, checks=True)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2184, in squareform
is_valid_dm(X, throw=True, name='X')
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2259, in is_valid_dm
raise ValueError(('Distance matrix '%s' must be '
ValueError: Distance matrix 'X' must be symmetric.

..:: inStrain compare Step 4. Store results ::..

making plots 10
Plotting plot 10
Failed to make plot #10: Distance matrix 'X' must be symmetric.

Any ideas of why this is happening?
Thank you,

Sergio

Is it possible to know the species/strain name

Hi Dr. Olm,

It is very nice to read and understand you newest tool inStrain. I have run it on some of my samples, and one question in my mind right now is whether it is possible to know the strain names that are shared across the samples found by inStrain?
If not running quick_profile, it does not matter whether CoverM is installed or not, right? When I run inStrain profile, I found the CoverM is not passed.
coverm........................ ! NOT WORKING ! (version=na) (location = None)

Thanks,
Wang

Step 2 IndexError: list index out of range

Hello,
I'm running inStrain profile specifying the gene file (-g option) and I'm getting this error. Any ideas? I run it previously without the gene file and I didn't get any error. Gene file extension is .fna
Thank you

.:: inStrain profile Step 2. Profile scaffolds ::..

Traceback (most recent call last):
File "/home/srodriguez/.local/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 50, in main
self.profile_operation(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 76, in profile_operation
ProfileController(args).main()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 136, in main
self.run_profile()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 302, in run_profile
self.ISP = inStrain.profile.profile_bam(self.bam, self.fasta_db,
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/init.py", line 17, in profile_bam
return inStrain.profile.profile_controller.BamProfileController(
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 48, in main
self.gen_prof_args()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 78, in gen_prof_args
self.gen_genes_args()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 89, in gen_genes_args
scaff2geneinfo, scaff2gene2sequence = inStrain.GeneProfile.parse_genes(gene_file, **self.kwargs)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 754, in parse_genes
return parse_prodigal_genes(gene_file_loc)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 781, in parse_prodigal_genes
table['direction'].append(record.description.split("#")[3].strip())
IndexError: list index out of range

A small error in InStrain Docs

Thanks so much for this excellent software!

When I read the Glossary & FAQ https://instrain.readthedocs.io/en/latest/overview.html#term-nucleotide-diversity, I was confused by the formula of nucleotide diversity.

After I checked the source code, I found the right formula is 1 - [(frequency of A)^2 + (frequency of C)^2 + (frequency of G)^2 + (frequency of T)^2 ]

confusion conBase, refBase, and varBase

Hi, I'm checking the SNV output and it seems the refFreq is calculated based on the conBase instead of the refBase.

Here the output:

A | C | G | T | conBase | position | refBase | scaffold | varBase | cryptic | baseCoverage | varFreq | refFreq
4 | 0 | 2 | 0 | A | 675 | A | 10 | G | FALSE | 6 | 0.33 | 0.67
3 | 0 | 4 | 0 | G | 702 | G | 10 | A | FALSE | 7 | 0.43 | 0.57
0 | 2 | 0 | 3 | T | 717 | C | 10 | C | FALSE | 5 | 0.40 | 0.60
0 | 1 | 4 | 0 | G | 1741 | C | 10 | C | FALSE | 5 | 0.20 | 0.80
8 | 0 | 0 | 4 | A | 1769 | T | 10 | T | FALSE | 12 | 0.33 | 0.67

As you can see in the 3rd position, the refBase is C, therefore the refFreq should be 0.4, right? But instead, it says 0.6 (which is the conFreq).
I think it would be useful to get the three values, varFreq conFreq (the ones that are given now) and also the refFreq (so then it can be used to compare this value across different samples which share the position)

Error installing new version

Hello,

I am trying to get the newest version of inStrain using conda install but I am getting the following error:

UnsatisfiableError: The following specifications were found to be incompatible with each error:
Output in format: Requested package -> Available versions

I also tried to use pip to install it but I got an older version (v1.2.14)

Is there another way to install the newest version?

Thank you,
Irmarie

No module named 'Bio.codonalign.codonalphabet'

Hi there,
I try to install inStrain using conda,
got this error by calling inStrain -h

File "/home/dell/.conda/envs/instrain/bin/inStrain", line 19, in <module>
    import inStrain.controller
  File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 20, in <module>
    import inStrain.profile
  File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/profile/__init__.py", line 5, in <module>
    import inStrain.profile.profile_controller
  File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/profile/profile_controller.py", line 14, in <module>
    import inStrain.SNVprofile
  File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/SNVprofile.py", line 21, in <module>
    import inStrain.profile.profile_utilities
  File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/profile/profile_utilities.py", line 27, in <module>
    import inStrain.GeneProfile
  File "/home/dell/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 26, in <module>
    from Bio.codonalign.codonalphabet import default_codon_table
ModuleNotFoundError: No module named 'Bio.codonalign.codonalphabet'

I have tryied reinstall biopython==1.73, 1.74 or 1.78, all gave same error,

any suggestion on this,

thanks.

Liu

Question: estimating relative abundace

Hello,

I was wondering how I could get relative abundance of strains from the genome_wide output. Would that be from the breadth? Also, instead of using dRep, would it be possible to use the pangenome reference that is outputted by tools such as panaroo or roary as the reference database? Also my last question is that sometimes iRep is blank, do you have any guidance on why that is?

Best wishes,

inStrain profile issue with BAM format

Hi,
I would like to use inStrain profile to call variants from a wastewater sample with SARS-CoV2 virus amplified using the ARTIC pipeline. I would like to use as input the sorted BAM file that is output from the ARTIC fieldBioinformatics profile. However, when I tried this I get the error
"It seems like the .bam file could not be indexed!Make sure you have samtools version 1.6 or greater."
I am using samtools v1.12, so this should not be the issue.

I am using the sorted BAM file that was output from mapping reads to the reference fasta.

Input was as follow: inStrain profile ./primer-schemes/nCoV-2019/V3/nCoV-2019.reference.fasta ./artic_output/samplename.sorted.bam

Your thoughts on this would be appreciated.

Thanks,
Candice

No genomes detected

Running instrain with four genomes in the reference file fdaa...fasta to which I've mapped some Illumina reads which are a meta-genome of these. I get some sensible looking output in the output folder but get the error below at Step 4.. Any ideas?

You gave me a sam- I'm going to make it a .bam now
Converting ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sam to ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam
samtools view -S -@ 8 -b ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sam > ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam
sorting ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam
samtools sort ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.bam -o ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam -@ 8
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Indexing ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam
samtools index ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam ./fdaaaaf6-4681-4377-b550-fe7b162149f6/all.sorted.bam.bai -@ 8
***************************************************
    ..:: inStrain profile Step 1. Filter reads ::..
***************************************************
        
Filtering reads: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:07<00:00, 16.79s/it]
1.5% of reads were removed during filtering
1,512,766 read pairs remain (0.4567 Gbp)
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************
        
Profiling splits: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 127/127 [13:49<00:00,  6.53s/it]
Merging splits and profiling genes:  50%|██████████████████████████████████████████████████████████████████▌                                                                  | 1/2 [00:31<00:31, 31.41s/it]
Merging splits and profiling genes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████Merging splits and profiling genes: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [16:22<00:00, 491.07s/it]
***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************
        
Scaffold to bin was made using .stb file
Error- no genomes detected. Example: stb has scaffold 1, database has scaffold 2
Traceback (most recent call last):
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 53, in main
    self.profile_operation(args)
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 82, in profile_operation
    ProfileController(args).main()
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 153, in main
    self.profile_genome_wide()
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 367, in profile_genome_wide
    Controller().genome_wide_operation(copy.deepcopy(args))
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/controller.py", line 94, in genome_wide_operation
    inStrain.genomeUtilities.Controller().main(args)
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 55, in main
    GIdb = genomeLevel_from_IS(IS, **vargs)
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 181, in genomeLevel_from_IS
    GSI_db = _genomeLevel_scaffold_info_v3(gdb, stb, b2l, **kwargs)
  File "/well/bag/users/lipworth/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 538, in _genomeLevel_scaffold_info_v3
    for mm in sorted(list(gdb['mm'].unique())):
TypeError: 'NoneType' object is not subscriptable

My Conda env:

`# packages in environment at /well/bag/users/lipworth/miniconda3/envs/instrain:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
asteval                   0.9.23             pyhd8ed1ab_0    conda-forge
attrs                     20.3.0             pyhd3deb0d_0    conda-forge
biopython                 1.74             py38h516909a_0    conda-forge
bowtie2                   2.4.2            py38hc2f83ea_2    bioconda
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h7f98852_1    conda-forge
ca-certificates           2020.12.5            ha878542_0    conda-forge
cached-property           1.5.2                hd8ed1ab_1    conda-forge
cached_property           1.5.2              pyha770c72_1    conda-forge
capnproto                 0.6.1                hfc679d8_1    conda-forge
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
checkm-genome             1.1.3                      py_1    bioconda
cycler                    0.10.0                     py_2    conda-forge
decorator                 4.4.2                      py_0    conda-forge
dendropy                  4.5.2              pyh3252c3a_0    bioconda
drep                      3.0.0                      py_1    bioconda
freetype                  2.10.4               h0708190_1    conda-forge
future                    0.18.2           py38h578d9bd_3    conda-forge
gettext                   0.19.8.1          h0b5b191_1005    conda-forge
gsl                       2.6                  he838d99_2    conda-forge
h5py                      3.1.0           nompi_py38hafa665b_100    conda-forge
hdf5                      1.10.6          nompi_h6a2412b_1114    conda-forge
hmmer                     3.3.2                h1b792b2_1    bioconda
htslib                    1.12                 h9093b5e_1    bioconda
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
instrain                  1.5.3                      py_0    bioconda
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   h36c2ea0_0    conda-forge
kiwisolver                1.3.1            py38h1fd1430_1    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libblas                   3.9.0                8_openblas    conda-forge
libcblas                  3.9.0                8_openblas    conda-forge
libcurl                   7.76.0               hc4aaa36_0    conda-forge
libdeflate                1.7                  h7f98852_5    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran-ng            9.3.0               hff62375_18    conda-forge
libgfortran5              9.3.0               hff62375_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
libidn2                   2.3.0                h516909a_0    conda-forge
liblapack                 3.9.0                8_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libpng                    1.6.37               h21135ba_2    conda-forge
libssh2                   1.9.0                ha56f1ee_6    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libtiff                   4.2.0                hdc55705_0    conda-forge
libunistring              0.9.10               h14c3975_0    conda-forge
libwebp-base              1.2.0                h7f98852_2    conda-forge
llvmlite                  0.36.0           py38h4630a5e_0    conda-forge
lmfit                     1.0.2              pyhd8ed1ab_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
mash                      2.3                  he348c14_1    bioconda
matplotlib-base           3.4.1            py38hcc49a3a_0    conda-forge
more-itertools            8.7.0              pyhd8ed1ab_0    conda-forge
mummer                    3.23                          4    bioconda
ncurses                   6.2                  h58526e2_4    conda-forge
networkx                  2.5                        py_0    conda-forge
numba                     0.53.1           py38h0e12cce_0    conda-forge
numpy                     1.20.2           py38h9894fe3_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1k               h7f98852_0    conda-forge
packaging                 20.9               pyh44b312d_0    conda-forge
pandas                    1.2.3            py38h51da96c_0    conda-forge
patsy                     0.5.1                      py_0    conda-forge
perl                      5.32.0               h36c2ea0_0    conda-forge
perl-threaded             5.26.0                        0    bioconda
pillow                    8.1.2            py38ha0e1e83_0    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
pluggy                    0.13.1           py38h578d9bd_4    conda-forge
pplacer                   1.1.alpha19          h9ee0642_2    bioconda
prodigal                  2.6.3                h779adbc_3    bioconda
psutil                    5.8.0            py38h497a2fe_1    conda-forge
py                        1.10.0             pyhd3deb0d_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysam                     0.16.0.1         py38hf7546f9_3    bioconda
pytest                    6.2.2            py38h578d9bd_0    conda-forge
python                    3.8.8           hffdb5ce_0_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
readline                  8.0                  he28a2e2_2    conda-forge
samtools                  1.12                 h9aed4be_1    bioconda
scikit-learn              0.24.1           py38h658cfdd_0    conda-forge
scipy                     1.6.2            py38h7b17777_0    conda-forge
seaborn                   0.11.1               hd8ed1ab_1    conda-forge
seaborn-base              0.11.1             pyhd8ed1ab_1    conda-forge
setuptools                49.6.0           py38h578d9bd_3    conda-forge
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.35.3               h74cdb3f_0    conda-forge
statsmodels               0.12.2           py38h5c078b8_0    conda-forge
tbb                       2020.2               h4bd325d_4    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               h21135ba_1    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tornado                   6.1              py38h497a2fe_1    conda-forge
tqdm                      4.59.0             pyhd8ed1ab_0    conda-forge
uncertainties             3.1.5              pyhd8ed1ab_0    conda-forge
wget                      1.20.1               h22169c7_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.9                ha95c52a_0    conda-forge
`

Is coverage breadth for MAG the mean method in CoverM

Hello Matt,

I am writing to ask whether the coverage breadth option is the CoverM genome mean option (total position mapped by reads for at least one time divided by length minus 2*75 bp for each contig in the genome)

Thank you,

Jianshu

GenomeWide compare failed

I ran the profile workflow with inStrain profile 2013-05-13-EBPR-mapping.sam ../bins/R1R2-EBPR-combined-bins.fasta -o 2013-05-13-profiling using version 1.2.7 of inStrain, and installed with pip. I got past profiling the reads/scaffolds, but got this error at the GenomeWide comparison step:

03-25 15:29 DEBUG    Done processing scaffolds- making SNV profile
03-25 16:01 INFO     Storing output
03-25 16:01 DEBUG    Writing output files now
03-25 16:02 INFO     ***************************************************
  .:: inStrain profile Step 3. Profile genes ::..
***************************************************

03-25 16:02 INFO     Nevermind! You didnt include a genes file
03-25 16:02 INFO     ***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************

03-25 16:02 DEBUG    Loading scaffold to bin
03-25 16:02 INFO     Scaffold to bin will consider all scaffolds the same genome
03-25 16:02 DEBUG    GenomeWide compare failed
Traceback (most recent call last):
  File "/home/GLBRCORG/emcdaniel/.conda/envs/pipenv/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 508, in main
    gdb = genomeWideFromIS(IS, 'read_compaer', mm_level=mm_level)
  File "/home/GLBRCORG/emcdaniel/.conda/envs/pipenv/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 92, in genomeWideFromIS
    gdb = _add_stb(db, stb)
  File "/home/GLBRCORG/emcdaniel/.conda/envs/pipenv/lib/python3.8/site-packages/inStrain/genomeUtilities.py", line 130, in _add_stb
    gdb = db.copy()
AttributeError: 'NoneType' object has no attribute 'copy'

There are no output or figures files, as this step failed.

Division by zero in read_filtering_plot

Hey!

I'm getting the following error in all my inStrain runs:

making plots 1, 2, 3, 4, 5, 6, 7, 8, 9
Plotting plot 1
Plotting plot 2
Plotting plot 3
Plotting plot 4
Plotting plot 5
Plotting plot 6
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: divide by zero encountered in double_scalars
  '{:1.0f}%'.format((width/total)*100),
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: invalid value encountered in double_scalars
  '{:1.0f}%'.format((width/total)*100),
Plotting plot 7
Skipping plot 8 - you don't have all required information. You need to run inStrain profile_genes first
Skipping plot 9 - you don't have all required information. You need to run inStrain profile_genes first

The line that is causing the issue is the following:

        plt.text(offset + p.get_width(), p.get_y()+0.55*p.get_height(),
                 '{:1.0f}%'.format((width/total)*100),
                 ha='center', va='center')

The plots are coming out like this:

Error running inStrain when including a genes file

Hello. I am able to run the profile command when I don't also use a genes file, but this is the error I get when using a .gbk file. I also get different errors when using the .fna file.

(my-instrain) [cmh574@wind /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater ]$ /home/cmh574/.conda/envs/my-instrain/bin/inStrain profile /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater/SARSCT-WastewaterxxxxSWIFTxxNAUxx02222021-xx-xx-xxx-xxxx-169-CH_S4.sorted.bam /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater/NC_045512.fasta -g /scratch/cmh574/COVID19/Wastewater/TGN-MiSeq1061/PMI-NAU_Hepp_Hepp-Wastewater/NC_045512_genes.gbk -o WastewaterxxxxNAUxx02222021_Instrain

..:: inStrain profile Step 1. Filter reads ::..

Filtering reads: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:14<00:00, 14.80s/it]
0.5% of reads were removed during filtering
160,839 read pairs remain (0.06530 Gbp)

.:: inStrain profile Step 2. Profile scaffolds ::..

Profiling splits: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [03:25<00:00, 68.37s/it]
Merging splits and profiling genes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.00s/it]
Traceback (most recent call last):
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'coverage'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/cmh574/.conda/envs/my-instrain/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 53, in main
self.profile_operation(args)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 82, in profile_operation
ProfileController(args).main()
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 147, in main
self.run_profile()
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 319, in run_profile
self.write_output()
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/controller.py", line 331, in write_output
self.ISP.generate(t)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/inStrain/SNVprofile.py", line 251, in generate
db = db[db['coverage'] > 0]
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in getitem
indexer = self.columns.get_loc(key)
File "/home/cmh574/.conda/envs/my-instrain/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'coverage'

inStrain is sorting an already sorted BAM

Hey!

I've noticed that inStrain is sorting my input BAM files even though they're already sorted, adding unnecessary computing time to the execution. Maybe it would be a good idea if inStrain check whether the BAM is sorted before calling samtools.

Here's a function I wrote for that:

from Bio import bgzf

def is_sorted_bam(filepath):
    """
    Checks if a BAM file is sorted by coordinate.

    Parameters
    ----------
    filepath : Path
        Path object pointing to BAM file.

    Returns
    -------
    bool
        Returns `True` if the BAM file is sorted by coordinate and `False`
        otherwise.
    """
    with bgzf.BgzfReader(filepath, "rb") as fin:
        bam_header = fin.readline().strip()
        return b"SO:coordinate" in bam_header

mutation type missing for intergenic regions

Hi,
After calling inStrain profile (version 1.4.0), I have this issue which causes some SNVs that are intergenic to have no mutation type annotated as "I". For example, the following output:

scaffold position position_coverage allele_count ref_base con_base var_base ref_freq con_freq var_freq A C T G gene mutation mutation_typecryptic class
NODE_999_length_47735_cov_45.183958 41629 107 2 C C T 0.9439252336448598 0.9439252336448598 0.05607476635514018 0 101 6 0 True SNV
NODE_999_length_47735_cov_45.183958 45312 37 2 C C T 0.945945945945946 0.945945945945946 0.054054054054054064 0 35 2 0 True SNV
NODE_999_length_47735_cov_45.183958 47635 80 2 T T G 0.95 0.95 0.05 0 0 76 4 True SNV

It is something that can be fixed manually during data analysis, but I was wondering if this also would need fixing in inStrain, or whether something else could be going on, for example with the data?
This is not entirely consistent over 2 similar datasets, and the mutation_type columns are not always identical concerning "I" or empty annotations.
Perhaps another but comparable issue is that in one dataset an annotation from Prodigal is incorporated under the "gene" column, but not in the other dataset.

Best wishes,
Elmar

TypeError: ufunc 'isnan'

Hi,

I was running instrain profile (v1.3.9) with 20 samples. Ten of them ran perfectly, however the other 10 got the following error:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

The input are sam files.

Do you know why some ran well and some didn't?

inStrain always uses a single thread to sort BAMs

I've noticed that inStrain calls samtools sort with a single thread even if -p is larger than 1.

Is this intentional? If not, setting samtools sort to use -p threads can reduce the execution time.

Add --version

% inStrain --version
inStrain 1.0.1

to stdout and return 0

ValueError: No objects to concatenate

Running instrain on 272 metagenomes:

Scaffold to bin was made using .stb file
***************************************************
    ..:: inStrain compare Step 1. Load data ::..
***************************************************

Loading Profiles into RAM: 100%|██████████| 272/272 [05:25<00:00,  1.20s/it]
36 of 36 scaffolds are in at least 2 samples
***************************************************
..:: inStrain compare Step 2. Run comparisons ::..
***************************************************

Running group 1 of 1
Comparing scaffolds: 100%|██████████| 36/36 [94:23:25<00:00, 9439.05s/it]
***************************************************
..:: inStrain compare Step 3. Auxiliary processing ::..
***************************************************

Cannot cluster genome AZ482__metabat2__Low_004.fna; 102 of 35511 comaprisons involve no genomic overlap at all: see log for more
Could not cluster genomes; heres a traceback:
Traceback (most recent call last):
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/inStrain/compare_controller.py", line 356, in run_genome_clustering
    Cdb = inStrain.compare_utils.cluster_genome_strains(Mdb, kwargs)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/inStrain/compare_utils.py", line 234, in cluster_genome_strains
    return pd.concat(cdbs).reset_index(drop=True)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 285, in concat
    op = _Concatenator(
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/d20af029/lib/python3.8/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
***************************************************
..:: inStrain compare Step 4. Store results ::..
***************************************************

My conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       1_gnu    conda-forge
asteval                   0.9.22             pyhd8ed1ab_0    conda-forge
attrs                     20.3.0             pyhd3deb0d_0    conda-forge
biopython                 1.74             py38h516909a_0    conda-forge
boost                     1.70.0           py38h9de70de_1    conda-forge
boost-cpp                 1.70.0               h7b93d67_3    conda-forge
bzip2                     1.0.8                h7f98852_4    conda-forge
c-ares                    1.17.1               h36c2ea0_0    conda-forge
ca-certificates           2021.1.19            h06a4308_0
cached-property           1.5.2                      py_0
capnproto                 0.6.1                hfc679d8_1    conda-forge
certifi                   2020.12.5        py38h578d9bd_1    conda-forge
cycler                    0.10.0                     py_2    conda-forge
decorator                 4.4.2                      py_0    conda-forge
drep                      3.0.1                      py_0    bioconda
fastani                   1.32                 he1c1bb9_0    bioconda
freetype                  2.10.4               h0708190_1    conda-forge
future                    0.18.2           py38h578d9bd_3    conda-forge
gsl                       2.6                  he838d99_2    conda-forge
h5py                      3.1.0           nompi_py38hafa665b_100    conda-forge
hdf5                      1.10.6          nompi_h6a2412b_1114    conda-forge
htslib                    1.11                 hd3b49d5_1    bioconda
icu                       67.1                 he1b5a44_0    conda-forge
iniconfig                 1.1.1              pyh9f0ad1d_0    conda-forge
instrain                  1.5.1                      py_0    bioconda
joblib                    1.0.1              pyhd8ed1ab_0    conda-forge
jpeg                      9d                   h516909a_0    conda-forge
kiwisolver                1.3.1            py38h1fd1430_1    conda-forge
krb5                      1.17.2               h926e7f8_0    conda-forge
lcms2                     2.12                 hddcbb42_0    conda-forge
ld_impl_linux-64          2.35.1               hea4e1c9_2    conda-forge
libblas                   3.9.0                8_openblas    conda-forge
libcblas                  3.9.0                8_openblas    conda-forge
libcurl                   7.71.1               hcdd3856_8    conda-forge
libdeflate                1.6                  h516909a_0    conda-forge
libedit                   3.1.20191231         he28a2e2_2    conda-forge
libev                     4.33                 h516909a_1    conda-forge
libffi                    3.3                  h58526e2_2    conda-forge
libgcc-ng                 9.3.0               h2828fa1_18    conda-forge
libgfortran-ng            9.3.0               hff62375_18    conda-forge
libgfortran5              9.3.0               hff62375_18    conda-forge
libgomp                   9.3.0               h2828fa1_18    conda-forge
liblapack                 3.9.0                8_openblas    conda-forge
libllvm10                 10.0.1               he513fc3_3    conda-forge
libnghttp2                1.43.0               h812cca2_0    conda-forge
libopenblas               0.3.12          pthreads_h4812303_1    conda-forge
libpng                    1.6.37               hed695b0_2    conda-forge
libssh2                   1.9.0                hab1572f_5    conda-forge
libstdcxx-ng              9.3.0               h6de172a_18    conda-forge
libtiff                   4.2.0                hdc55705_0    conda-forge
libwebp-base              1.2.0                h7f98852_0    conda-forge
llvmlite                  0.35.0           py38h4630a5e_1    conda-forge
lmfit                     1.0.2              pyhd8ed1ab_0    conda-forge
lz4-c                     1.9.3                h9c3ff4c_0    conda-forge
mash                      2.2.2                ha61e061_2    bioconda
matplotlib-base           3.3.4            py38h0efea84_0    conda-forge
more-itertools            8.7.0              pyhd8ed1ab_0    conda-forge
mummer4                   4.0.0rc1        pl526he1b5a44_0    bioconda
ncurses                   6.2                  h58526e2_4    conda-forge
networkx                  2.5                        py_0    conda-forge
numba                     0.52.0           py38h51da96c_0    conda-forge
numpy                     1.20.1           py38h18fd61f_0    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openssl                   1.1.1j               h7f98852_0    conda-forge
packaging                 20.9               pyh44b312d_0    conda-forge
pandas                    1.2.2            py38h51da96c_0    conda-forge
patsy                     0.5.1                      py_0    conda-forge
perl                      5.26.2            h36c2ea0_1008    conda-forge
pillow                    8.1.0            py38ha0e1e83_2    conda-forge
pip                       21.0.1             pyhd8ed1ab_0    conda-forge
pluggy                    0.13.1           py38h578d9bd_4    conda-forge
prodigal                  2.6.3                h516909a_2    bioconda
psutil                    5.8.0            py38h497a2fe_1    conda-forge
py                        1.10.0             pyhd3deb0d_0    conda-forge
pyparsing                 2.4.7              pyh9f0ad1d_0    conda-forge
pysam                     0.16.0.1         py38hbdc2ae9_1    bioconda
pytest                    6.2.2            py38h578d9bd_0    conda-forge
python                    3.8.6           hffdb5ce_5_cpython    conda-forge
python-dateutil           2.8.1                      py_0    conda-forge
python_abi                3.8                      1_cp38    conda-forge
pytz                      2021.1             pyhd8ed1ab_0    conda-forge
readline                  8.1                  h27cfd23_0
samtools                  1.11                 h6270b1f_0    bioconda
scikit-learn              0.24.1           py38h658cfdd_0    conda-forge
scipy                     1.6.0            py38hb2138dd_0    conda-forge
seaborn                   0.11.1               ha770c72_0    conda-forge
seaborn-base              0.11.1             pyhd8ed1ab_1    conda-forge
setuptools                52.0.0           py38h06a4308_0
six                       1.15.0             pyh9f0ad1d_0    conda-forge
sqlite                    3.34.0               h74cdb3f_0    conda-forge
statsmodels               0.12.2           py38h5c078b8_0    conda-forge
threadpoolctl             2.1.0              pyh5ca1d4c_0    conda-forge
tk                        8.6.10               hed695b0_1    conda-forge
toml                      0.10.2             pyhd8ed1ab_0    conda-forge
tornado                   6.1              py38h497a2fe_1    conda-forge
tqdm                      4.56.2             pyhd8ed1ab_0    conda-forge
uncertainties             3.1.5              pyhd8ed1ab_0    conda-forge
wheel                     0.36.2             pyhd3deb0d_0    conda-forge
xz                        5.2.5                h516909a_1    conda-forge
zlib                      1.2.11            h516909a_1010    conda-forge
zstd                      1.4.8                ha95c52a_1    conda-forge

Question about using instrain to compare metagenomes

Hello,

I was following your tutorial for MAGs and wanted to use MAGs to assess similarity of two metagenomes. I ran dRep compare and have subsetted the comparisons to those with popANI greater than 99.99% and percent bases compared of at least 0.5. Would I compare the absolute number of "same strains" to determine which metagenomes are similar or calculate the proportion of "same strains" over all strain level comparisons between the two samples. I am not sure if we can get a reliable denominator. Any other guidance you would have on showing the similarity between metagenomes using output from inStrain on a MAG catalogue such as that generated by Almeida et al would be appreciated.

Best wishes,

get_raw_coverage_table attribute

When trying to run the .get_raw_coverage_table attribute from the documentation, I get the following error:

Type "help", "copyright", "credits" or "license" for more information.
>>> import inStrain
>>> import inStrain.SNVprofile
>>> IS=inStrain.SNVprofile.SNVprofile("results/CAPIIB.IS/")
>>> coverage_table = IS.get_raw_coverage_table()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SNVprofile' object has no attribute 'get_raw_coverage_table'

I looked in the SNVprofile.py script and couldn't find that attribute defined. The .get_nonredundant_scaffold_table method worked when I tried that.

"KeyError: 'mm'" during plotting

I'm getting an KeyError: 'mm' when inStrain is writing the output plots. I suspect that this may be caused by this genome having a low read coverage in this specific sample, as I didn't get this error with other genomes.

making plots 1, 2, 3, 4, 5, 6, 7, 8, 9
Plotting plot 1
Plotting plot 2
Failed to make plot #2: 'scaffold'
Plotting plot 3
Skipping plot 4 - you don't have all required information. You need to run inStrain genome_wide first
Traceback (most recent call last):
  File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py", line 1422, in allele_freq_plot_from_IS
    db = db.sort_values('mm').drop_duplicates(subset=['scaffold', 'position'], keep='last')\
  File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/frame.py", line 4927, in sort_values
    k = self._get_label_or_level_values(by, axis=axis)
  File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/generic.py", line 1692, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'mm'
Skipping plot 5 - you don't have all required information. You need to run inStrain genome_wide first
Traceback (most recent call last):
  File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py", line 1463, in linkage_decay_from_IS
    db = db.sort_values('mm').drop_duplicates(subset=['scaffold', 'position_A', 'position_B'], keep='last')\
  File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/frame.py", line 4927, in sort_values
    k = self._get_label_or_level_values(by, axis=axis)
  File "/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/pandas/core/generic.py", line 1692, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'mm'
Plotting plot 6
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: divide by zero encountered in double_scalars
  '{:1.0f}%'.format((width/total)*100),
/global/homes/a/acamargo/SCRATCH/.environments/instrain/lib/python3.6/site-packages/inStrain/plottingUtilities.py:762: RuntimeWarning: invalid value encountered in double_scalars
  '{:1.0f}%'.format((width/total)*100),
Plotting plot 7
Failed to make plot #7: 'scaffold'
Skipping plot 8 - you don't have all required information. You need to run inStrain profile_genes first
Skipping plot 9 - you don't have all required information. You need to run inStrain profile_genes first

AssertionError Could not load the scaffold to bin file!

Hi, thanks so much for your effort in developing and maintaining inStrain, it's incredibly useful.

I'm running inStrain v 1.4.0 installed with conda.

I am using two bacterial genomes as references and running profile across multiple metagenomes mapped to these references with bowtie2. I'm supplying a sorted bam, the prodigal nucleotide file, (concatenated) reference genomes, and and reference scaffolds but getting this AssertionError with some (but not all) of inStrain profile analyses.

Is there an obvious reason why neither -s argument method would work for a subset of metagenome analyses? I recognize this is somewhat similar to #15 but not exactly.

thanks!
Mike

## RUN submitting space-separated scaffold list
inStrain profile ${bam_dir}/${ID}.sorted.bam ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fasta -o ${outdir}/${ID}_PM04-derep-isolate-genomes.IS -p 20 -g ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fna -s ${f_dir}/p20105-s030_scaffolds.fasta  ${f_dir}/p20105-s034_scaffolds.fasta

## RUN submitting scaffold2bin.json file (pointing to successful profile run) instead of space-separated scaffold list
inStrain profile ${bam_dir}/${ID}.sorted.bam ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fasta -o ${outdir}/${ID}_PM04-derep-isolate-genomes.IS -p 20 -g ${cat_dir}/PM04_derep_Ecoli_isolate_genomes.fna -s ${stb_dir}/scaffold2bin.json


############# Identical Assertion errors below

***************************************************
    ..:: inStrain profile Step 1. Filter reads ::..
***************************************************
        
Filtering reads: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 377/377 [00:04<00:00, 77.89it/s]
15.7% of reads were removed during filtering
68,100 read pairs remain (0.02590 Gbp)
***************************************************
.:: inStrain profile Step 2. Profile scaffolds ::..
***************************************************
        
Profiling splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:25<00:00,  1.20s/it]
Merging splits and profiling genes: 100%|████████████████████████████████████████████████████████████████████████| 20/20 [01:57<00:00,  5.86s/it]
***************************************************
.:: inStrain profile Step 4. Make genome-wide ::..
***************************************************
        
Could not load the scaffold to bin file!
Traceback (most recent call last):
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 50, in main
    self.profile_operation(args)
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 76, in profile_operation
    ProfileController(args).main()
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 142, in main
    self.profile_genome_wide()
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 356, in profile_genome_wide
    Controller().genome_wide_operation(copy.deepcopy(args))
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 88, in genome_wide_operation
    inStrain.genomeUtilities.Controller().main(args)
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 40, in main
    self.prepare_genome_wide(IS, vargs)
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 127, in prepare_genome_wide
    stb = load_scaff2bin(vargs.get('stb'), IS)
  File "/storage/home/hcoda1/0/mwoodworth8/.conda/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 875, in load_scaff2bin
    assert False
AssertionError

license clarification

Hi there,

I just packaged inStrain for GNU Guix and noticed that the LICENSE file in the repository is for GPL version 3 or later while the setup.py claims that the license is MIT (Expat). Could you please clarify which license actually applies?

Thank you!

missing mean clonality in genomeWide_scaffold_info.tsv

Hi,

I just realised than in some cases the mean clonality value is missing in genomeWide_scaffold_info.tsv. This thing seems to happen in populations with one or more scaffolds having 0 conANI, 0 consensus_SNP, and 0 unmaskedBreadth in the scaffold_info.tsv file. In those scaffolds the mean_clonality | median_clonality | mean_microdiversity | median_microdiversity columns are empty in the scaffold_info.tsv file.

Maybe one solution is to take only into account those scaffolds with coverage > 5, or unmaskedBreadth > 0.5 for the mean clonality calculation?

Scaffold profiling error

I have a genome from IMG that has 'N' characters at some positions. The error when trying to profile scaffolds is:

ValueError: 'N' is not in list
whole scaffold exception- Ga0131788_11
We had a failure profiling a scaffold! Not sure which!

Should I remove the N's from the reference genome, or replace with some other character?

inStrain compare: inaccurate time reporting for tqdm

I've noticed that tqdm isn't really accurate for the group-level processing. It usually quickly reaches N-1/N and then "stalls" for quite a while before finishing and moving onto the next group. Maybe it's taking a long time to close the processes:

    # Close multi-processing
    for proc in processes:
        proc.terminate()

As an example of non-accurate tqdm time reporting, this took ~5 hours:

Running group 27 of 148
Comparing scaffolds: 100%|██████████| 687/687 [15:16<00:00,  1.33s/it]
Running group 28 of 148
Comparing scaffolds:  85%|████████▌ | 1371/1608 [1:04:07<14:14,  3.60s/it]

Maybe putting it all in tqdm.process_map() will help?

Could not load the scaffold to bin file

Hi, this is a fantastic tool! I have a question about the scaffold to bin file.
In the tutorial it says 'One is to give inStrain a list of the .fasta files that went into making the concatenated .fasta file. The other is to provide inStrain with a “scaffold to bin” file, which lists the genome assignment of each scaffold in a tab-delimited file. In this case we’re going to use the scaffold to bin file provided by inStrain.."
I tried to supply a .txt file that has the universal location of the .fasta files present in the concatenated fasta as per your tutorial, space separated, but I keep getting the following error:

Could not load the scaffold to bin file!
Traceback (most recent call last):
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 35, in main
self.profile_operation(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 61, in profile_operation
ProfileController().main(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 224, in main
Controller().genome_wide_operation(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/controller.py", line 73, in genome_wide_operation
inStrain.genomeUtilities.Controller().main(args)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 468, in main
stb = load_scaff2bin(args.stb, IS)
File "/users/PAS1331/osu7930/miniconda3/envs/instrain/lib/python3.6/site-packages/inStrain/genomeUtilities.py", line 568, in load_scaff2bin
assert False
AssertionError

My command:
inStrain profile bt2_peru_mapped/peru_drepbins_SRR2938072.sam peru_drepbins.fasta -o peru_drepbins_SRR2938072.IS -p 6 -s perubins_space.txt

I have a set of ~270 dereplicated MAGs from about 100 samples, and another set of ~270 MAGs from about 180 samples, so I'd prefer not to go through and grep the scaffolds from every MAG then match them to MAG ID -- I am sure there's an easier way to pull scaffolds from fasta and associate them with the bin ID w/ perl or something but I'm not super saavy in that arena. Am I just formatting my .txt file incorrectly or is there something I am missing that can solve this issue? Since the tutorial says inStrain provides the scaffold to bin file I was wondering if there is a way to generate it? I've included one of my fasta files that goes into the concat fasta file (I figured the concat file is too big to upload here) and then the list of fasta files in the concat fasta file.
debug.zip

Thanks!

Bio.codonalign.codonalphabet not in biopython 1.78

Came across this when doing a fresh install of inStrain via pip. The Bio.codonalign.codonalphabet class is not present in biopython version 1.78 from what I can tell... no mention of it in the docs for that version.

I was able to install 1.74 via conda, then inStrain functions normally.

inStrain compare fails at Step3

Hello, thank you for creating a great tool! I get this error when running inStrain compare with three input objects. Could you help me figure out what the issue is?

..:: inStrain compare Step 3. Auxiliary processing ::..

Could not cluster genomes; heres a traceback:
Traceback (most recent call last):
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/compare_controller.py", line 356, in run_genome_clustering
Cdb = inStrain.compare_utils.cluster_genome_strains(Mdb, kwargs)
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/inStrain/compare_utils.py", line 212, in cluster_genome_strains
arr = scipy.spatial.distance.squareform(arr, checks=True)
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2215, in squareform
is_valid_dm(X, throw=True, name='X')
File "/n/home12/yhwang/.conda/envs/instrain/lib/python3.8/site-packages/scipy/spatial/distance.py", line 2290, in is_valid_dm
raise ValueError(('Distance matrix '%s' must be '
ValueError: Distance matrix 'X' must be symmetric.

Traceback (most recent call last): File "/var/spool/slurm/d/job2216591/slurm_script", line 4, in <module> import('pkg_resources').run_script('inStrain==1.5.3', 'inStrain')

..:: inStrain profile Step 1. Filter reads ::..

Traceback (most recent call last):
File "/var/spool/slurm/d/job2216591/slurm_script", line 4, in
import('pkg_resources').run_script('inStrain==1.5.3', 'inStrain')
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/pkg_resources/init.py", line 651, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/pkg_resources/init.py", line 1448, in run_script
exec(code, namespace, namespace)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/EGG-INFO/scripts/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 53, in main
self.profile_operation(args)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 82, in profile_operation
ProfileController(args).main()
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 144, in main
self.profile_filter_reads()
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/controller.py", line 228, in profile_filter_reads
inStrain.filter_reads.Controller().main_from_profile(self.ISP,
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/filter_reads.py", line 64, in main_from_profile
Rdic, RR = load_paired_reads(bam, scaffolds, **kwargs)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/filter_reads.py", line 164, in load_paired_reads
scaff2pair2info = get_paired_reads_multi(bam, scaffolds, **kwargs)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/site-packages/inStrain-1.5.3-py3.8.egg/inStrain/filter_reads.py", line 764, in get_paired_reads_multi
proc.start()
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in init
super().init(process_obj)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init
self._launch(process_obj)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 42, in _launch
prep_data = spawn.get_preparation_data(process_obj._name)
File "/public3/home/sc52870/miniconda3/envs/instrain/lib/python3.8/multiprocessing/spawn.py", line 183, in get_preparation_data
main_mod_name = getattr(main_module.spec, "name", None)
AttributeError: module 'main' has no attribute 'spec'
slurmstepd: error: *** JOB 2216591 ON r1513 CANCELLED AT 2021-04-12T10:53:58 ***

KeyError: 'mm'

I'm using instrain 1.2.8 (pypi) and getting the following error when running inStrain profile:

Traceback (most recent call last):
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/bin/inStrain", line 31, in <module>
    inStrain.controller.Controller().main(args)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 32, in main
    self.profile_operation(args)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 56, in profile_operation
    ProfileController().main(args)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 181, in main
    Controller().profile_genes_operation(args)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/controller.py", line 65, in profile_genes_operation
    inStrain.GeneProfile.Controller().main(args)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 541, in main
    Gdb = get_gene_info(IS)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/inStrain/GeneProfile.py", line 479, in get_gene_info
    db = db[db['mm'] <= mm].sort_values('mm').drop_duplicates(subset=['gene'], keep='last')
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/pandas/core/frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/ebio/abt3_projects/Anxiety_Twins_Metagenomes/bin/llmgps/.snakemake/conda/c47fe51b/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'mm'

I'm guessing that it's due to the fact that the input *.bam file has almost nothing in it, but then it would be good if instrain profile exited gracefully, especially since I'm running it in parallel on ~300 samples; some of which complete, and some are dying with this KeyError.

position in gene for reverse strand

Hi there, I might have another issue. For the SNV output table, I also tried to manually calculate the position of the SNP in the gene it is located in, and I compared it to the position provided by the "mutation" row, and I noticed something.

For inStrain, the calculations are done as followed I think? Please correct me if I'm wrong, I'd very much like to learn:

SNV position in the contig, from the SNV file, is 0-based
The 1-based and inclusive Prodigal start and end positions are converted to 0-based positions.
The SNV position within the gene is calculated as [SNV position within contig] - [0-based start position]

So, for example, a SNV at position 8067 is at position 0 in the gene if the 1-based Prodigal gene starts at position 8068 (in the Prodigal file)
A SNV at position 9067 is at position 1000 in the gene if the Prodigal gene starts at position 8068 and goes until 9068.
A SNV at position 9068 will not be in the gene.

And now for my question:
A gene can be on the (+)-strand or (-)-strand (1 or -1 in the Prodigal genes.fna file).
Currently, the SNV position in the gene is calculated in the same way for both, namely as described above, and counting from the 0-based start position.
However, should the SNV position on the (-)-strand not be calculated with respect to the end position?
For example:
A SNV at position 234 will be at position 111 if the Prodigal gene end positions (which is really the start position of the gene) is at 345.

I am also wondering whether this has implications for the determination of the N or S annotations for "mutation" and "mutation_type"?

I hope you could take a look at this issue.

Best wishes,
Elmar

Failed to make plot #10: Distance matrix 'X' must be symmetric.

Hello,

I am trying to run the inStrain plot to generate the dendrogram but I am getting the following error:

making plots 10
/shared/centos7/anaconda3/3.7/lib/python3.7/site-packages/numpy/lib/arraysetops.py:569: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
mask |= (ar1 == a)
Plotting plot 10
Note: NumExpr detected 40 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
/home/icotto25/.local/lib/python3.7/site-packages/inStrain/plottingUtilities.py:1065: RuntimeWarning: Mean of empty slice
combo2value["{0}-vs-{1}".format(samp2,samp1)]]))
/home/icotto25/.local/lib/python3.7/site-packages/inStrain/plottingUtilities.py:1067: RuntimeWarning: Mean of empty slice
combo2value2["{0}-vs-{1}".format(samp2,samp1)]]))
Failed to make plot #10: Distance matrix 'X' must be symmetric.

I am not sure how to fix it. Attached is the inStrain genome_wide output.
Thank you,
NEU_nitrifiers_bins.fa.IS.COMPARE_genomeWide_compare.xlsx

Unexpected error at end of run; not reported in log file

I ran the inStrain profile command resulting in this error (screenshot), which is not reported in the log file and the output is as expected.

The exact command I used:
inStrain profile MoCom-vs-gtdb.sam gtdb.fasta -o MoCom -p 10 -g gtdb_genes/gtdb.gene.fna -s scaf2bin.tsv --database_mode

What could cause this error?
Can I still use the output?

warning FAILURE StbError

Hi @MrOlm ,

I have started to use inStrain and thanks for developing such a great tool.

I have mapped reads with bowtie to a subset of the UHGG database, as described in the tutorial, but then I am only interested in one species and am using inStrain with only one reference genome.

After using inStrain profile, I am getting multiple warnings such as this:
FAILURE StbError GUT_GENOME284629_65 GUT_GENOME284629.fna no_length will not be considered as part of the genome

Is this simply because my sam/bam files have been mapped against the UHGG, but then I have a single reference genome when using inStrain?

The code I used was:

bowtie2 -p 10 -X 2000 --no-mixed --very-sensitive \
  -x UHGG_reps_selected.fasta.bt2 \
  -1 cat_trimmed_1.fastq.gz \
  -2 cat_trimmed_2.fastq.gz > test.sam

inStrain profile test.sam reference.fna \
  -o test.IS -p 10 -g reference.genes.fna \
  -s reference.stb

the reference.stb only has the contigs for my genome of interest reference.fna.

Thanks for any help,

Ramiro

minimap2 instead of bowtie2?

This is a question versus an issue: do you think that minimap2 could be used instead of bowtie2? minimap2 is 2-3X faster than bowtie2, so it would be nice to use it, if possible. You have done any testing comparing the mappers to see if it affects the inStrain results?

inStrain compare not scaling

I'm running instrain compare (v1.2.8) with the following params: inStrain compare --min_cov 5 --min_freq 0.05 --fdr 1e-06 -p 8 -o $OUTDIR -i $INDIRS, with ~40 genome references and ~280 samples (~1 mil reads per sample). According to the progress report for the command, each of the 8736 iterations is taking ~5.5 hours. More threads will help a bit, but I'm guessing that there still will be a huge time requirement. Is there a good way to substantially increase the speed, or does instrain compare just not scale well to many samples?

Also, requiring one to list all paths to instrain objects as the input can lead to total command length limitations, at least on some operating systems. I would be great to be able to provide a file that lists all objects instead of actually listing them all in the command.

instrain compare: dynamic output

Running instrain compare in a snakemake kept failing for certain jobs. I realized that instrain compare will compress the output table if it is of a certain size:

        if store:
            out_base = self.get_output_base()
            ft = '.tsv'

            # If this file is going to be huge, compress it
            if len(db) > 1e6:
                ft = '.tsv.gz'

Can you please remove this dynamic output (ie., always *.tsv or *.tsv.gz)? It makes integrating instrain into a snakemake pipeline much more complicated.

samtools view not multithreaded

Hi Matt,

I realized that when converting sam to bam, inStrain is using samtools view with a single thread. Is there a specific reason for this? just wondering if it could speed this step a little by adding more threads.

Best,

Ramiro

long-running time of read filtering step with 200+ mags

Hi Matthew

Thanks for the brianliant tool for the de-novo based population analysis.

I followed the tutorial and had a bechmark of several mags under version 1.2.4. It worked fine.
But when I applied to a bigger set of dereplicated mags, it got stuck at the read-filtering step, running for several hours.
Here is the std and log.log file
std.txt
log.log

From the log.log, it didn't move for several hours since below.
03-03 08:42 INFO Making read report
03-03 08:43 DEBUG running on all reads
03-03 08:54 DEBUG running on individual scaffolds

I am not sure whether it's still running but just slowly. Is there any ways to solve this issue or just speed up? From the top command, I can see instrain is still running but without utilize those defined threads at this step.

Best
Yan

building RG fasta file; only certain microbes of interest

Dear authors,

Many thanks for creating inStrain, and the very thorough documentation that goes with it. I finished the tutorial, and now want to use it on some of our own samples.

In my case, I have metagenomic samples from very different environments (think from complex soil, to more simple dairy samples). So I expect to find many, many different microbes. I also have a set of target microbial genomes (includes multiple strains for some species) that I am interested in and trying to find in all those environments. I can think of the following ways to use inStrain:

build a very, very big RG fasta file, with many genomes of commonly occuring microbes and my microbes of interest, and run instrain as usual
like above, build a very big .fasta file that is used to create the .bam file. And then pass a .fasta file for only my genomes of particular interest.
build a RG fasta file of only my genomes of interest. this has a higher risk of mismapping, which I maybe can counter by using the options as explained in the tutorial.
go for the probably more time consuming option of de novo assembly before running instrain

Would you be able to give me guidance on if these options make sense, and if you have an idea on which one would be the best approach?

Kind regards

inStrain compare KeyError: 'covT'

Hello,

I ran inStrain compare using: inStrain compare -i XXXX -o YYYY and got the following error in the first on one of the genomes:

Loading Profiles into RAM: 0%| | 0/4 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2891, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'covT'The above exception was the direct cause of the following exception:Traceback (most recent call last):
File "/home/srodriguez/.local/bin/inStrain", line 31, in
inStrain.controller.Controller().main(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 53, in main
self.compare_operation(args)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/controller.py", line 79, in compare_operation
inStrain.compare_controller.CompareController(args).main()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 53, in main
self.create_scaffoldcomparison_objects()
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 162, in create_scaffoldcomparison_objects
valid_SCs, scaffold2length = make_scaffoldcomparison_objects(self.inputs, scaffolds_to_compare)
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/compare_controller.py", line 513, in make_scaffoldcomparison_objects
scaffolds = list(ISP._get_covt_keys())
File "/home/srodriguez/.local/lib/python3.8/site-packages/inStrain/SNVprofile.py", line 533, in _get_covt_keys
filename = self.get_location('raw_data') + '/' + os.path.basename(Adb.loc['covT', 'value'])
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 873, in getitem
return self._getitem_tuple(key)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 1044, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 786, in _getitem_lowerdim
section = self._getitem_axis(key, axis=i)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 1110, in _getitem_axis
return self._get_label(key, axis=axis)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexing.py", line 1059, in _get_label
return self.obj.xs(label, axis=axis)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/generic.py", line 3488, in xs
loc = self.index.get_loc(key)
File "/opt/ohpc/pub/apps/python/3.8.5/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2893, in get_loc
raise KeyError(key) from err
KeyError: 'covT'

However, the rest of the steps ran without errors but only one output was produced (*comparisonsTable.tsv). Is some extra argument or parameter needed?

Thank you,
Sergio

Could not find scaffold

I mostly use BBMap for most of my mapping needs. When I passed the BAM/SAM files to InStrain, I got errors about scaffolds couldn't be found and there weren't any reads in said scaffolds. This was resolved when I followed the tutorial and used Bowtie2 instead. I believe the error comes from the scaffold ID not being in the same column when BBMap outputs the BAM/SAM files, as I've had similar problems with how BBMAp outputs things and then using for other downstream programs. This is no longer an issue for me (since I switched to Bowtie2) - but I would maybe add a disclaimer to use a certain mapping software or to reformat the BAM/SAM files in a certain order than InStrain can take in, so future users know of this.

Question about filtering genomewidecompare

Hello,

I had a question about filtering the genome wide compare tsv file. I am comparing highly similar metagenomes and I am getting very low percent_compared (<0.005). I was hoping to filter based on a percent_compared of 0.5 and ANI of 99.99%. I am worried that nothing will be left after I filter. Do you know how to make sense of this low percent_compared given that these samples are expected to contain similar strains? Could it be an issue with my dereplicated isolates? I dereplicated a set of isolates at 97% ANI to get this dereplicated set that was inputted to instrain.

Thank you as always,

Just as a side question, do you ever filter based on coverage_overlap?

allele frequency

Thanks for the very nice tool! In the output, there is no parameter for allele frequency. Is there any way to calculate this based on the output data? Thanks!

mrolm / instrain Goto Github PK

instrain's People

Contributors

Stargazers

Watchers

Forkers

instrain's Issues

Recommend Projects

Recommend Topics

Recommend Org