jasonsahl / ls-bsr Goto Github PK

Large scale Blast Score Ratio (BSR) analysis

License: GNU General Public License v3.0

Python 4.37% Pep8 95.63%

ls-bsr's Introduction

LS-BSR (Large Scale Blast Score Ratio) is released under the GPL version 3 license. See "license.txt" for more information

LS-BSR is a method to compare all coding regions in a large set of genomes. Each peptide is compared against it's nucleotide sequence in order to obtain the maximum BLAST bit score. Each peptide is then aligned against each genome in order to find the query BLAST bit score. The query dividied by the reference provides one with the BSR, which can range from 0 to 1; scores slightly higher than 1.0 can be observed due to variable bit scores obtained by BLAST. In my opinion, they should be treated as 1.0. Due to the "-C F" flag added recently, values > 1.00 have not been observed.

contact: jasonsahl at gmail dot com

Minimum requirements, see manual.md for version information

Python >2.7 and <=3.9 (higher versions still work but tests may fail)
BioPython
Prodigal - Required for de novo gene prediction only
VSEARCH - Optional
mmseqs2- Optional
CD-HIT - Optional
Blast+ - Optional
Blat - Optional
Diamond - Optional

-To create an environment and run through conda:
conda create -n ls_bsr python=3.9
conda activate ls_bsr
conda install -c bioconda blast vsearch cd-hit prodigal ucsc-blat diamond biopython mmseqs2
#If you have problems with biopython, try: pip install Biopython
git clone https://github.com/jasonsahl/LS-BSR.git
python setup.py install

-To test the install:
python ls_bsr.py --version
python tests/test_all_functions.py

*See changelog.md for a list of changes
*See manual.md for run directions

ls-bsr's People

Contributors

Stargazers

Watchers

Forkers

gregcaporaso mgalardini tarah28 pryratess druvus joshgutman davised palc thexiyang pythseq wangdi2014 singinglia happywlu drhchen wangmz0617 azuriteglass

ls-bsr's Issues

Error: NameError: global name 'logging' is not defined

Hi there,

I got this while mistyping one of the usage examples. This works fine:
ls_bsr -d genomes -g genes/ecoli_markers.fasta -b blastn -x test

This leads to a Python error (using tblastn as per default):
ls_bsr -d genomes -g genes/ecoli_markers.fasta -x test

The error is:

LOG: 2017/08/30 14:08:08 - starting matrix building
Traceback (most recent call last):
File "/opt/biosw/bin/ls_bsr", line 511, in
options.filter_scaffolds,options.prefix,options.min_pep_length,options.intergenics)
File "/opt/biosw/bin/ls_bsr", line 380, in main
divide_values("bsr_matrix", ref_scores)
File "/opt/biosw/LS-BSR.git/ls_bsr/util.py", line 77, in divide_values
logging.logPrint("The following genes had no hits in datasets or are too short, values changed to 0, check names and output:%s" % "\n".join(nr))
NameError: global name 'logging' is not defined

Best,
Bastien

Use BLAST+ instead of BLAST ?

I notice you need the old legacy blastall 2.2.x instead of the BLAST+ package.

Is there any reason for this?

Error with test 3

Dear Dr Sahl,
I installed LS-BSR and performed tests 1 and 2. But with test 3 I got the following error:
LOG: 2017/10/27 13:06:55 - Prodigal done
Traceback (most recent call last):
File "/Users/AnaPaula/Desktop/LS-BSR/ls_bsr.py", line 596, in
options.filter_scaffolds,options.prefix,options.min_pep_length,options.intergenics)
File "/Users/AnaPaula/Desktop/LS-BSR/ls_bsr.py", line 268, in main
dup_ids = test_duplicate_header_ids("consensus.fasta")
File "/Users/AnaPaula/Desktop/LS-BSR/ls_bsr/util.py", line 759, in test_duplicate_header_ids
for line in open(fasta_file):
IOError: [Errno 2] No such file or directory: 'consensus.fasta'
The missing output file is the consensus.fasta, I have the all_gene_seqs.out, for each strain _all.fasta.new, _all.fasta.new_genes.pep and _all.fasta.new_genes.seqs
I am OS Sierra 10.12.6 with Prodigal 2.6, blast2.2.28, vsearch 2.5.1.
Do you have a clue of what's going wrong?
Thanks in advance,
Nana

Errors with select_seqs_by_IDs.py and filter_BSR_variome.py

Hi,

I was surprised to get "Must provide matrix" erro while executing erroselect_seqs_by_IDs.py and filter_BSR_variome.py even though bsr_matrix_values.txt was in the path of these two programs. I was also able to use bsr_matrix_values.txt for other LS_BSR codes without any problem. Why is this so? I'm I missing something?

A separate issue is:
I would like to know if I have used select_by_IDs.py correctly for getting core genes and unique genes as follows:

python /usr/share/LS-BSR-master/tools/select_seqs_by_IDs.py -i consensus.fasta -d core_gene_ids.txt -o core_genes.fasta

python /usr/share/LS-BSR-master/tools/select_seqs_by_IDs.py -i consensus.fasta -d unique_gene_ids.txt -o unique_genes.fasta

Thanks

AssertionError: Tuples differ

I tested LS-BSR as follows:

# -To create an environment and run through conda:

conda create -n ls_bsr python=3.5

conda activate ls_bsr

conda install -c bioconda blast vsearch cd-hit prodigal ucsc-blat diamond biopython mmseqs2

git clone https://github.com/jasonsahl/LS-BSR.git

cd LS-BSR/

chmod +x ls_bsr.py 

python setup.py install

# -To test the install:

python ls_bsr.py --version

# ls_bsr.py 1.2.2

(python tests/test_all_functions.py &) >& log.test_all_functions.$(date +%F).txt

As shown in attachment, running python tests/test_all_functions.py printed the following messages:

AssertionError: Tuples differ: ([], [], []) != ([[2]], [[2]], [[2]])
AssertionError: Tuples differ: ([], [], []) != ([[2]], [], [])
AssertionError: Tuples differ: ([], [], []) != ([], [], [[2]])
AssertionError: Tuples differ: ([], [], []) != ([], [[2]], [])
AssertionError: Tuples differ: ([], [], []) != ([[1]], [[1]], [[1]])

log.test_all_functions.2020-11-25.txt

Inconsistent interpretation of "processors" argument

The processors argument is interpreted to mean two different things by LS-BSR. It is both the number of threads a program should use and the number of parallel processes to use.

processors used as number of threads a program should use, e.g. blast searches:

"-num_threads", str(processors),

And as number of parallel processes to run:

num_workers=processors))

This means that if we want to cluster using 12 threads in vsearch we also wind up spawning 12 blast processes each with -num_threads set to 12 (A nominal 144 cores required). If run on a system with 12 threads available, the runtime of blast in (the current) 12 blastn x 12 threads is significantly slower than 12x (1 blastn x 12threads) because of contention.

Ideally this flag would be broken into -t (threads) and -p (processors) with -t being the per process thread count and -p taking on number of parallel processes to run.

igs problem

Hello,

using release tarball 1.011
this one is related to issue #16

igs is used in ls_bsr.py
igs is used in tools/extract_core_genome.py

see

c467b007697e LS_BSR-1.011]$ find . -type f | xargs grep igs   
./tools/extract_core_genome.py:    from igs.utils import functional as func
./tools/extract_core_genome.py:    from igs.utils import logging
./tools/extract_core_genome.py:    from igs.threading import functional as p_func
./ls_bsr.py:import igs_logging as logging

igs is missing from the archive, adding igs to the pythonpath, breaks the import

LS-BSR igs_igs_logging is the exact same file as igs/utils/logging.py

either set a dependency to igs in order to have it installed and in the PYTHONPATH of the user
and in this cas in lsbsr use import igs.utils.logging instead of hte embeded igs_logging one.

either embed the necessary igs files as it is done for ig_logging

I prefer the 1st option.

regards

Eric

BSR matrix not as expected

The latest version of LS-BSR outputs a matrix with a different format of what is present in the test directory.

python ../ls_bsr.py -d genomes -g genes/ecoli_markers.fasta

Output:

        O157_H7_sakai_all       E2348_69_all    H10407_all      SSON_046_all
IpaH3   0.00
LT      0.00    0.00
ST2     0.00    0.00    0.00
bfpB    0.00    0.00    0.00    0.00
stx2a

Expected output (gene_screen_output/bsr_matrix_values.txt):

       E2348_69_all    H10407_all      O157_H7_sakai_all       SSON_046_all
IpaH3   0.03    0.03    0.03    1.00
LT      0.00    1.00    0.00    0.00
ST1     0.00    1.00    0.12    0.12
ST2     0.00    0.92    0.11    0.00
bfpB    1.00    0.00    0.00    0.00
stx2a   0.07    0.08    0.98    0.07

Maybe I'm missing something? :)
Thanks for the great work

error message "Clustering chosen, but no method selected...exiting"

have installed LS-BSR on Biolinux8 this weekend, first tests work fine, but the De Novo part does not work

usearch, vsearch and cd-hit-est are all in the path and available, prodigal does the job of predicting the ORFs, but then LS-BSR exits with the error message:
"Clustering chosen, but no method selected...exiting"

This also happened when using the command from the manual on page 5.

Command line:
python /home/arnoud/software/LS-BSR/ls_bsr.py -d lsbsr –c usearch

Same with vsearch and cd-hit-est.

Also, where do the prodigal outputs go? Can't find them :-)

Manual in Markdown format?

It took me a while to find the documentation, I was expecting it to be in the README.md, but I found it in the PDF. I know Github now renders PDF files, but it would be great if it was in Markdown and easy to search and in a single page.

igs missing? (Your environment is not set correctly. Please add LS-BSR to your PYTHONPATH and try again)

Hi there,

I cloned LS-BSR from github; did a: python setup.py install; added path to PYTHONPATH

I still got the "environment is not set correctly" message.

Looking into the code, I find it throws here:
try: from igs.utils import functional as func

Now, doing a `find . -name "* igs *" on the LS-BSR directory, all I got was:
./igs_logging.py

Digging around github, I find files in
https://github.com/jasonsahl/Phylomark/tree/master/igs/utils

to which this seems to refer to. But it's not in LS-BSR.

What am I missing?

Best,
Bastien

Bug: softlinks as input to -d can lead to problems

Hi there,

I ran into an issue when using softlinks in a -d directory. How to reproduce:

cd test_data
mkdir -p lngenomes/level2
cd lngenomes/level2
ln -s ../../genomes/* .
cd ../../
ls_bsr -d lngenomes/level2 -g genes/ecoli_markers.fasta -x test

With that I get at one point:

LOG: 2017/08/30 15:41:10 - starting BLAST
problem found in formatting genome /mnt/work/tst/test_data/test/O157_H7_sakai_all.fasta.new
problem found in ...
...

And the resulting BSR matrix is all 0.00 values.

When copying the files into the genomes/level2 directory instead of softlinking, everything works.

Best,
Bastien

PS: I'm sorry for the number of issues filed today.

Problem with test_all_functions.py

Could you help me with this issue?

..................# of conserved genes = 1

of unique genes = 4

of unique genes per genome = 1.0

.E......./usr/local/lib/python2.7/dist-packages/Bio/Seq.py:1976: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future.
BiopythonWarning)
......accumulation means
1 2.0
unique means
1 2.0
core means
1 2.0
.accumulation means
1 2.0
.core means
1 2.0
.unique means
1 2.0
.accumulation means
1 1.0
unique means
1 1.0
core means
1 1.0

......................

ERROR: test_get_core_gene_stats_empty_line (main.Test17)

tests the case where an empty line is found

Traceback (most recent call last):
File "/home/simone/jsahl/LS-BSR/tests/test_all_functions.py", line 610, in test_get_core_gene_stats_empty_line
self.assertRaises(TypeError, get_core_gene_stats, fpath, 0.8, 0.4)
File "/usr/lib/python2.7/unittest/case.py", line 475, in assertRaises
callableObj(_args, *_kwargs)
File "/home/simone/jsahl/LS-BSR/ls_bsr/util.py", line 460, in get_core_gene_stats
if int(len(presents))/int(totals)>=1:
ZeroDivisionError: division by zero

Ran 60 tests in 0.093s

FAILED (errors=1)

Fix all the markdown headings

1-2 years go github changed markdown spec

##Heading    is wrong
## Heading   is correct, with the space.

Consider free VSEARCH instead of USEARCH

The requirement for the click-through licence version of USEARCH makes it hard to package LS-BSR into Homebrew etc

You may wish to consider this open alternative: https://github.com/torognes/vsearch

FileNotFoundError: Errno 2 No such file or directory: 'duplicate_ids.txt'

Dear all:
today, I try to use ls-bsr to perform several bacterial genomes. however, the error occurs, can someone help me? thank you very much!
the following error information:
python ~/biosoft/LS-BSR/ls_bsr.py -d ../KP_Zeng -i 0.8 -f T -p 40 -c cd-hit -b blastp -t T -e T
LOG: 2021/07/02 17:37:31 - Testing paths of dependencies
/home/anaconda3/envs/pgcgap/bin/blastp
citation: Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, and Lipman DJ. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402
/home/anaconda3/envs/pgcgap/bin/prodigal
citation: Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, and Hauser LJ. 2010. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11:119
/home/anaconda3/envs/pgcgap/bin/cd-hit
citation: Li, W., Godzik, A. 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nuceltodie sequences. Bioinformatics 22(13):1658-1659
LOG: 2021/07/02 17:37:31 - predicting genes with Prodigal
LOG: 2021/07/02 17:38:23 - Prodigal done
LOG: 2021/07/02 17:38:23 - Converting genbank files
LOG: 2021/07/02 17:40:11 - clustering with cd-hit at an ID of 0.8, length percentage of 0.9, using 40 processors
Duplicate header IDs:
626_3
626_4
738_22
...
duplicate headers identified, renaming..
LOG: 2021/07/02 17:41:25 - starting blastp
LOG: 2021/07/02 17:46:13 - BLAST done
LOG: 2021/07/02 17:46:13 - Duplicate searching turned off
LOG: 2021/07/02 17:46:15 - starting matrix building
LOG: 2021/07/02 17:46:16 - The following genes had no hits in datasets or are too short, values changed to 0, check names and output:centroid_1530
centroid_2453
centroid_311
centroid_4813
centroid_5366
centroid_5638
centroid_6109
centroid_6695
LOG: 2021/07/02 17:46:16 - filtering duplicates
Traceback (most recent call last):
File "/home/biosoft/LS-BSR/ls_bsr.py", line 710, in
options.filter_scaffolds,options.prefix,options.intergenics,options.min_len,options.dup_toggle)
File "/home/biosoft/LS-BSR/ls_bsr.py", line 580, in main
num_filtered = filter_paralogs("%s/bsr_matrix_values.txt" % start_dir, "duplicate_ids.txt")
File "/home/biosoft/LS-BSR/ls_bsr/util.py", line 504, in filter_paralogs
with open(ids) as genomes_file:
FileNotFoundError: [Errno 2] No such file or directory: 'duplicate_ids.txt'

Above species level?

Is this software suited for genomes above the species level?

Visualization of output

Dear jason:
Could you give me some tips about how to visualize the LS-BSR output directly in itols?
I used itols several times and seems the heatmap template can be associated with the bsr matrix output.Or there is sth I don't know?
Thanks so much!

ls-bsr > 1

Dear Jason Sahl,
I used ls-bsr with nucleotide data using blastn and got bsr values higher than 1.

Then I used same dataset but protein data with blast separately (not with ls-bsr) with blastp and still got few BSRs higher than 1.

Could you explain why is that or what could I do to avoid BSR values higher than 1 ?

Thanks?

can i retrieve homologues from LS-BSR results or temp files?

I saw that the option -g where i can pass my genes to screen is incompatible with the option -c that allows the user to cluster genes.
I have my list of genes and the genomes, the program gives a cluster file of genes in the temporary folder only with the option -c. I was wondering if there is some way to retrieve homologues from the run with the -g option since there are 2 options to set MAX_PLOG for paralogs and MIN_HLOG for homologs.

thank you

LS-BSR pipeline query

Hello Dr Jason,

I am currently using LS-BSR pipeline for my analysis as implemented in your paper "The Effects of Signal Erosion and Core Genome Reduction on the Identification of Diagnostic Markers". I understand the complete paper and able to implement the analysis till the pan and core genome analysis. However, I am still a bit confused about the implementation of core genome reduction and signal erosion. This is how I am implementing this:-

A) Core Genome Reduction

Random samples -> LS-BSR matrix -> "BSR_to_gene_accumulation-scatter.py" -> accumu-core-replicates.txt-> plotting

B) Signal Erosion

Random Samples+core CDS -> LS-BSR matrix -> "BSR_to_gene_accumulation-scatter.py" ->accumu-unique-replicates.txt -> plotting

I would like to ask - am I implementing it correctly for the analysis of core genome size and signal erosion? Did you plot the replicates value generated from "BSR_to_gene_accumulation-scatter.py" script? I really want to make sure that the way I am using this analysis is correct and would highly appreciate if you could please shed light on this.

I look forward to the reply!

Best Regards,
Reema,

shutil.copy may be better than os.link in line 111 in ls_bsr.py

Dear Devs

When I try to use LS-BSR, an error occur:

Traceback (most recent call last):
File "ls_bsr.py", line 492, in <module>
options.filter_scaffolds,options.prefix,options.temp_dir,options.debug)
File "ls_bsr.py", line 111, in main
os.link("%s" % infile, "%s/%s.new" % (fastadir,name))
OSError: [Errno 18] Invalid cross-device link

I try myself to use shutil.copy to replace os.link in main script. No such error occur again.

Best,

Xiangchen Li

Fix the reference genome to compare

Hi,

I would like to compare the Blast Score Ratio (BSR) of a single reference genome against 10 closely related genomes using their NCBI annotated locus_tags. Currently, when I run the ls_bsr.py script, the reference genome is switching among all the genomes I want to compare. I need help to fix the reference genome and obtain the BSR for the rest of the genomes in comparison. Furthermore, I would like to export the Kegg annotations of each compared gene to an Excel file.

Thank you so much!

BSR values >1

Hello Jason,

From my understanding, LS-BSR should never return BSR values greater than 1 since values are scaled to the highest value within their row. However, we frequently see BSR values >1 regardless of using blastn/tblastn (We've seen values up to ~10 in the past).

We run vsearch outside of LS-BSR to make use of our gene prediction/annotation pipeline. We do annotation > vsearch consensus calls (ID = 0.9) > LS-BSR (using -g)

Command used:

vsearch --cluster_fast $data/predicted_genes/nthi/all_genes.fna --consout $data/vsearch/consensus_nthi.fasta --threads 24  --id 0.9 --clusters $data/vsearch/clusters_nthi_20160706/
python $LSBSR/ls_bsr.py -d $data/nthi/genomes -p 24 -g $data/vsearch/consensus_nthi.fasta

While this doesn't interfere with the downstream analysis we use the BSR matrix for, I'm not sure if this is the intended behavoir.

//Aroon

Error Installing LS_BSR

I had previously installed LS-BSR and everything worked fine, but my system was formated and now when I try to install LS-BSR, this is the error I am seeing-

~/LS-BSR$ python setup.py install
Traceback (most recent call last):
File "setup.py", line 12, in
with open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function

Please let me know what might be the reason for the error. Thanks

Your environment is not set correctly. Please add LS-BSR to your PYTHONPATH and try again

I am having this error: Your environment is not set correctly. Please add LS-BSR to your PYTHONPATH and try again.
I already added the PATH to PYTHONPATH and I still experience same error.
What do you mean with add LS-BSR to you PYTHONPATH? I am trying to install it in /Users/myuser/Downloads/LS-MBR-master/.
When I tried the installation in /opt/LS-BSR I got this problem. Would you please help me with this?
Thanks,
sadiel

Is LS_BSR in conda or brew?

These are the key packagaing systems in bioinfo these days.
THey will increase uptake of your tool greatly.

Transpose a BSR matrix

https://github.com/jasonsahl/LS-BSR/blob/master/manual.md#post-matrix-scripts

I would like to transpose a BSR matrix but do not need to re-order the matrix based on the order of the taxa in a newick-formatted tree. I would be grateful if you could provide example commands to do it. Running the following command printed the message Must provide tree.

touch test.tree # create empty file
python LS-BSR/tools/reorder_BSR_matrix_by_tree.py -b lsbsr_g_bsr_matrix.txt –t test.tree

vsearch vs usearch

Hi,
I tried to run a comparison of two PHAST prediction regions with prodigal and vsearch and there is nothing in the consensus file, but if I use usearch there is. vsearch is in the path and runs, but the output is empty. Are the settings different for vsearch vs usearch for the cutoff?
/home/brigida.rusconi/vsearch/bin/vsearch
LOG: 2016/01/04 17:57:37 - clustering with VSEARCH at an ID of 0.9, using 2 processors
LOG: 2016/01/04 17:57:37 - VSEARCH clustering finished
Best,
Brigida

jasonsahl / ls-bsr Goto Github PK

ls-bsr's Introduction

ls-bsr's People

Contributors

Stargazers

Watchers

Forkers

ls-bsr's Issues

python /usr/share/LS-BSR-master/tools/select_seqs_by_IDs.py -i consensus.fasta -d unique_gene_ids.txt -o unique_genes.fasta

of unique genes = 4

of unique genes per genome = 1.0

......................

tests the case where an empty line is found

Recommend Projects

Recommend Topics

Recommend Org