snayfach / mgv


Supporting code for uncultivated gut virus manuscript

Python 87.53% R 0.20% Shell 12.27%

mgv's Issues

identify_crispr.py: error: unrecognized arguments: -p crt

It seems that the -p option is not recognized by the script. How should I handle this?

Genus- and family-level clustering

Hi Stephen,

Thank you so much! I loved the article, and your code is so useful!

Two questions:

1. MIUViG recommends thresholds for defining species-level vOTUs. Are there equivalent thresholds for family- and genus-level clusters? I would be very grateful if you could share anything on this.
2. I could be wrong, but I think it may be useful to save only the second element of the list in lines 34 and 36, and then change _[2] to _[o] in line 48. Otherwise, the dictionary built while iterating over an all-vs-all blastp table could be quite large, which may prevent people from using your code.

Virus classification annotation

Hello, I am very interested in your work, and I would like to thank you for your great help. May I ask how to classify viruses into DNA, dsDNA, ssDNA, ssDNA-RT, dsRNA, RNA, ssRNA, and ssRNA-RT, and which tools are available for this? Thank you again.

I cannot use VirFinder, can you help me?

Could you please provide the code for the taxonomic annotation in the article?

Hi, Stephen! I recently read your article 'Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome'. It is great work that unravels the virosphere of the human gut. I would like to reproduce the taxonomic annotation method from your article. Could you please provide the code for it? Thanks a lot, and I wish you further academic success.

Virome profiling database recommendation

@snayfach Hi, Stephen, thank you for exploring the viral genome.

I have some questions, maybe you can help me.

awk '{print $1}' mgv_contig_info.tsv | sort | uniq | wc -l
189681

awk '{print $2}' mgv_contig_info.tsv | sort | uniq | wc -l
54119

You provided a total of 189680 virus genomes, which can be clustered into 54118 vOTUs, right?
If I want to do virome profiling based on the MGV database,
should I select all viral genomes or only the representative genome of each vOTU?

Thanks for your work!
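
The two counts above are each one higher than the figures quoted in the question, which would be consistent with the awk commands also counting a header row. A minimal Python sketch of the same count that skips the first line is shown below. It assumes mgv_contig_info.tsv is tab-separated with a single header row (not confirmed here); `count_unique` is a hypothetical helper, and the demo rows and column names are made up for illustration.

```python
import csv
import io

def count_unique(handle, column, has_header=True):
    """Count distinct values in a zero-based column of a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row so it is not counted as a value
    return len({row[column] for row in reader if len(row) > column})

# Tiny in-memory stand-in for the real table (made-up IDs and column names)
demo = io.StringIO("contig_id\tvotu_id\nc1\tv1\nc2\tv1\nc3\tv2\n")
print(count_unique(demo, 0))  # distinct values in the first column
demo.seek(0)
print(count_unique(demo, 1))  # distinct values in the second column
```

For the real file, the same function could be called on `open("mgv_contig_info.tsv", newline="")`.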

I want to know how to do the annotation

Thanks!

I cannot use Marker gene phylogenetic tree.py

select protein clusters (PCs)
write seqs for each PC
build multiple seq alignment (MSA) for each PC, round 1
build hmm from each MSA
search hmms versus original proteins

Error: File format problem in trying to open HMM file ou/all.hmms.
File exists, but appears to be empty?

extract proteins for top hmm hits
Traceback (most recent call last):
  File "/mnt/24t/MGV/marker_gene_tree/marker_tree1.py", line 113, in <module>
    for line in open('%s/hmmsearch.txt' % args['out_dir']):
FileNotFoundError: [Errno 2] No such file or directory: 'ou/hmmsearch.txt'

Question about the protein clusters

Hi Snayfach,
I am studying your excellent MGV work and am a little confused about the protein clusters (PCs).
In your mgv_pc_function.tsv file, the proteins within a PC are annotated with different functions. How should the function of the whole PC be defined?
Looking forward to your reply.
Sincerely,
L

cluster result question

Hi, I have run the entire ani_cluster process. How should the generated clusters.tsv and ani.tsv be interpreted, and how can I pick out the vOTUs that meet the following criteria: ≥95% ANI across ≥85% of the shorter sequence, with the longest sequence as the representative?
Looking forward to your answer!
Best wishes!
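
The criteria in the question can be written down as two small checks. This is only a sketch of the stated thresholds, not a description of what the repository's scripts output: the argument names (`ani`, `qcov`, `tcov`, `qlen`, `tlen`) and both helper functions are assumptions made for illustration.

```python
def same_votu(ani, qcov, tcov, qlen, tlen, min_ani=95.0, min_cov=85.0):
    """True if a pair meets >=95% ANI across >=85% of the shorter sequence."""
    # Use the coverage of whichever sequence is shorter
    shorter_cov = qcov if qlen <= tlen else tcov
    return ani >= min_ani and shorter_cov >= min_cov

def pick_representative(lengths):
    """Return the ID of the longest genome in a cluster (ties broken by ID)."""
    return max(sorted(lengths), key=lambda name: lengths[name])

# Made-up numbers: the query is the shorter sequence and is fully covered
print(same_votu(ani=96.0, qcov=100.0, tcov=40.0, qlen=1000, tlen=5000))
print(pick_representative({"genome_a": 10000, "genome_b": 30000}))
```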

Hello teacher, I cannot use VirFinder

Hello teacher, I cannot use VirFinder. How do you use it?

identify_crispr.py error: TypeError: str() takes at most 1 argument (2 given)

Hello, thanks for this fantastic virome analysis tutorial!
When I ran the crispr_spacers step, the identify_crispr.py script failed as follows:

Traceback (most recent call last):
  File "/public/home/zycheng/Applications/MGV/crispr_spacers/identify_crispr.py", line 485, in <module>
    arrays = run_pipeline(args, offsets, append_ns=50, program=program)
  File "/public/home/zycheng/Applications/MGV/crispr_spacers/identify_crispr.py", line 394, in run_pipeline
    out, err, code = run_crt(tmp_path, args['tempdir'], xmx=args['xmx'])
  File "/public/home/zycheng/Applications/MGV/crispr_spacers/identify_crispr.py", line 99, in run_crt
    return str(o,encoding='utf-8'), e, c
TypeError: str() takes at most 1 argument (2 given)

My Python version is 2.7, and my Biopython version is 1.76. Many thanks for your assistance.
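
The two-argument form `str(o, encoding='utf-8')` is Python 3 syntax; under Python 2.7, `str()` accepts only one argument, which matches the TypeError above. One possible workaround, sketched here and not taken from the repository, is a small helper that decodes only when it is handed bytes. `to_text` is a hypothetical name.

```python
def to_text(data, encoding="utf-8"):
    """Return text from captured subprocess output on Python 2 and Python 3."""
    if isinstance(data, bytes):
        # Python 3: bytes -> str; Python 2: str (an alias of bytes) -> unicode
        return data.decode(encoding)
    return data

print(to_text(b"CRISPR 1 Range: 682229 - 685495"))
print(to_text(u"already text"))
```

Since the failing line uses Python 3 syntax, running the script with a Python 3 interpreter may be the simpler route.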

identify_crispr.py TypeError: argument should be integer or bytes-like object, not 'str'

Hello, and thanks for this great viral detection pipeline! Earlier steps worked seamlessly for me.

Running the example code python identify_crispr.py -i example/GUT_GENOME147678.fna -o out in a cloned repository gives me the following error:

Traceback (most recent call last):
  File "/home/ebel/user_data/FeFiFo/MGV/crispr_spacers/identify_crispr.py", line 485, in <module>
    arrays = run_pipeline(args, offsets, append_ns=50, program=program)
  File "/home/ebel/user_data/FeFiFo/MGV/crispr_spacers/identify_crispr.py", line 405, in run_pipeline
    my_arrays = parse_crt(out, contigs)
  File "/home/ebel/user_data/FeFiFo/MGV/crispr_spacers/identify_crispr.py", line 108, in parse_crt
    if out.find('\nCRISPR') == -1:
TypeError: argument should be integer or bytes-like object, not 'str'

The out directory contains one file, out/temp/seqs/0d239010-1666-403a-9b21-f06c24f1d762, which is a fasta with the 50 Ns at either end.

I'm not overly skilled at Python, but adding some print commands let me see the output of

out, err, code = run_crt(tmp_path, args['tempdir'], xmx=args['xmx']).

out, which seems to be causing the problem, prints like it's a big string...

b'\n\nReading file out/temp/seqs/17b75e28-a677-44b1-bc28-e9d7592c72e4\nReading file complete\ntmpfile\n2833987 bases\nSearching for repeats...\nTime to search for repeats: 453 ms\n2 possible CRISPR(s) found\n\nORGANISM: tmpfile\nBases: 2833987\n\n\nCRISPR 1 Range: 682229 - 685495\nPOSITION\tREPEAT\t\t\t\tSPACER\n--------\t----
-----------------------------\t----------------------------------\n682229\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tAATAATCCAGTGCTCATTTTTTGATCTCCTTCGGT\t[ 33, 35 ]
\n682297\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tATTCCATTTTACATTCCTATGCACTTAGGAGACCC\t[ 33, 35 ]\n682365\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tAAGACAGTTGCAACGCGATTGATGCCGGAAGAAT\t[ 33, 34 ]\n682432\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCGATCTGGAAAACAATGTGTATGAGACGAACGA\t[ 33, 33 ]\n682498\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCTGCGTCTGCGGCCGGAGCATACCGAAGCTCACGTG\t[ 33, 36 ]\n682567\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCGCCTCCGGTGCGCTGTTGCCGTCGTACAGCGTCG\t[ 33, 35 ]\n682635\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTGCTCCGATTTAATGCACCGGACCGCATACCGG\t[ 33, 33 ]\n682701\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tAGCCGCTCTATCGGGACGTTCCGCAGCCCGGACCG\t[ 33, 35 ]\n682769\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tATAAGCAGCCTGCGGGTCAGGGAACAGACCGGGCG\t[ 33, 35 ]\n682837\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTCGCTACCAGTGGCACTAGGCATGGCAGTCAGGCT\t[ 33, 35 ]\n682905\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTCTTGTGTCTCCTTCGATAACGCCCTTGCATCTGT\t[ 33, 35 ]\n682973\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCTCAGTCGTACGGCTCAGTGAGCAAGAGCTGAG\t[ 33, 33 ]\n683039\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTTTTCCAGGCCGCACTCCACCATCTTGCTGCTGGCC\t[ 33, 36 ]\n683108\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACAAATGAAACGGGTAAAAAATTAGTTGGAAATG\t[ 33, 34 ]\n683175\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAATCAGGATAACACCCTTGCAGACACGCTTATCC\t[ 33, 34 ]\n683242\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGACGAACGCAACGGCAATTTCGAGGAAGCAAAGGA\t[ 33, 35 ]\n683310\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCTCCTTGTCATGTAGCGCGTGATAAATCGCGTCC\t[ 33, 34 ]\n683377\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCCGTTCTTCCCGAAAGAAACGGCAGACCAGTTGG\t[ 33, 34 ]\n683444\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATCCACCCGGCGGGCATAGGCCGCGCTGGCAAGAGC\t[ 33, 36 ]\n683513\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGTAAGAAGAAGCTCGGTGCAATCAAAATCGGAAAG\t[ 33, 35 ]\n683581\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCGTCATCAGAGTAGCGGACGGCAAAGTAGCCGCCG\t[ 33, 35 ]\n683649\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATCCAGCTCAGGCGCTCAGTGTTGATACCTGCAT\t[ 33, 34 
]\n683716\t\tGTCGCCCTCCTCGCAGAGGGCGTGGATAGAAAT\tACCTCATAGCCTTTGCCAAGCGTCGCGTCGCAA\t[ 33, 33 ]\n683782\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGGGTCAAACTTGAGGCCGTCCCGGTAGCCCTTC\t[ 33, 33 ]\n683848\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAGCGGTGCAGGCTCAACCACGGTGGGAGTGCTGG\t[ 33, 34 ]\n683915\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAGCGAGCTTTACAGCATCCCGGCATGAACGCCG\t[ 33, 33 ]\n683981\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCTGCTGGCTCGGCTCAACGCCCAGCGGCATCCCA\t[ 33, 34 ]\n684048\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTTTGCGTTGTACCGCGCCGGTCCTGCGACGGCG\t[ 33, 33 ]\n684114\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCCCCCAAGCGTGTGGTAGCTGCGATAGAGGTACT\t[ 33, 34 ]\n684181\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTGTGCGCTGCGCCGCTCCGGTGCAGGCAGGTCAA\t[ 33, 34 ]\n684248\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACCCACGGACGGGAGCTCACATGGCAACCCTCAG\t[ 33, 34 ]\n684315\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAGGCTGTTGTAGTCGCTGAATACGCCGCTCCGGC\t[ 33, 34 ]\n684382\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTACTCTCGTCGTGGATGCAATACCTACAATTTTT\t[ 33, 34 ]\n684449\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTGCCTCGTCACGGGTCAGATGCGGATGCGGCATC\t[ 33, 34 ]\n684516\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATGTACCCGTTAGCAAAGGCCACCAGGCCGTTAGT\t[ 33, 35 ]\n684584\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATATTATCATTGATCGTATGAAGAGATCCAGAAATA\t[ 33, 36 ]\n684653\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACGACAACAGCATACTTTCGCACACCACGGCGG\t[ 33, 33 ]\n684719\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTCGTGGTGGCCTTGCCCATATCCAGTGCTGCTTG\t[ 33, 34 ]\n684786\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTCGCTGGCCAGAGGGCAGCCATGCGCACTGCATC\t[ 33, 34 ]\n684853\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATCAGACCGAAAGCATCCAACCAGTCGAAGAAGTC\t[ 33, 35 ]\n684921\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCGACCTCGAAGCTGCAGGCAGTTTACCGCAAGTTT\t[ 33, 35 ]\n684989\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACGCTGTATTTACGGTTGGATAGCTGGCTGATTTG\t[ 33, 35 ]\n685057\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTGACGGCAAGACCCAGCGCGTGGATGACCTCTTT\t[ 33, 34 
]\n685124\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATGAGAAGCTGGGCATCAAGGATAAGTATACCGAT\t[ 33, 35 ]\n685192\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTTCGATCTTCTCGCGCAGATGGAACGGCTCATAGGC\t[ 33, 36 ]\n685261\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCGGCCAGCCTGACCCTCCAGACCCGCCGCAGAAC\t[ 33, 34 ]\n685328\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGGCAAGAAGTTTGGCTTCGCACTGGATGTTCTCAC\t[ 33, 35 ]\n685396\t\tGTCGCCTTCCTCGCGGAGGGCGTGGATAGAAAT\tACCGCATAACCCCACTGGAAAACGCCATCACGCA\t[ 33, 34 ]\n685463\t\tGTCGCCCTCCTCGCGGAGGGCGCAAGTAGAAAT\t\n--------\t---------------------------------\t----------------------------------\nRepeats: 49\tAverage Length: 33\t\tAverage Length: 34\n\n\n\nCRISPR 2 Range: 1072592 - 1073272\nPOSITION\tREPEAT\t\t\t\tSPACER\n--------\t--------------------------------\t--------------------------------\n1072592\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tGCGAGAGCATAGAAGATGAGCTGTCCGACCTCG\t[ 32, 33 ]\n1072657\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tTCCCCGGTGCAGCCCCGGCGGATGGTCGGCAGAAT\t[ 32, 35 ]\n1072724\t\tATTGCAGTTTATCCCCGCAAGGGGACGATAAC\tAGAACTGCAGTATGGATGTCGTTCACGCGGTAG\t[ 32, 33 ]\n1072789\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tATCAAAGATCTTTCTAAACTCCTCGATATCGTA\t[ 32, 33 ]\n1072854\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tGAGCAAGTGCTTTCTGTTCGTCGTAAGTCATCAC\t[ 32, 34 ]\n1072920\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tATCAAAGATCTTTCTAAACTCCTCGATATCGTA\t[ 32, 33 ]\n1072985\t\tATTGCAGCTTATCCCTGCAAGGGGACGATAAT\tTTCACTTTCTTGATCTGCTTCTTTTCGGCACGG\t[ 32, 33 ]\n1073050\t\tATTGCAGCTTATCCCCGCAAAGGGACGATAAC\tGTCACAGCTTGCTGCTTGTCCAGTTTGGCGTCC\t[ 32, 33 ]\n1073115\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tGACCTTTGCCATTGCCTTCTTGTCCTTGCGGCTC\t[ 32, 34 ]\n1073181\t\tATTGCAGCTTATCCCCGCAAGGGGGCGAAAAC\tAAACTCCATTCTTATTTCAGATTAAATA\t[ 32, 28 ]\n1073241\t\tAATAAGTACCATCCCCGTAAGGGGGCAAAAAC\t\n--------\t--------------------------------\t--------------------------------\nRepeats: 11\tAverage Length: 32\t\tAverage Length: 32\n\n\n\nTime to find repeats: 453 ms\n\n\n\n'

...but if I try to write out to file, just before if out.find('\nCRISPR') == -1, I get:

TypeError: write() argument must be str, not bytes

Any help would be much appreciated! Thank you :)
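
In Python 3, output captured from a subprocess is `bytes` unless it is decoded, and `bytes.find` rejects a `str` argument, which matches both errors above. The sketch below reproduces the behaviour and shows one possible fix (decoding once, right after capture). The sample output is shortened from the one in the question, and nothing here is taken from the repository's code.

```python
raw = b"\n\nReading file complete\n\nCRISPR 1 Range: 682229 - 685495\n"

try:
    raw.find("\nCRISPR")  # str needle on a bytes haystack raises TypeError
except TypeError as exc:
    print("TypeError:", exc)

out = raw.decode("utf-8")  # decode once; later str operations then work
print(out.find("\nCRISPR") != -1)

# Alternative: keep the bytes and search with a bytes needle instead
print(raw.find(b"\nCRISPR") != -1)
```

The same distinction applies when writing to disk: bytes need a file opened in binary mode (`"wb"`), while decoded text can go to a file opened in text mode (`"w"`).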

Expected output for the example file (SRS1735492) in viral prediction pipeline?

Thanks for creating this useful tool. Is it possible to provide the expected output for the SRS1735492 file you used as an example for your viral prediction pipeline?

master_table.py error

Hi,

Thanks for developing this pipeline.

I was able to generate all files up to the master table. It seems that the VirFinder results are parsed incorrectly at that step, which returns the following error:

Traceback (most recent call last):
  File "/Volumes/mikami/bio/apps/12.04/sw/mgv/1.0/bin/master_table.py", line 21, in <module>
    info[id]['vfr_pvalue'] = r[-1]

Would it be possible to provide any insights as to what may be going on? My python version is 3.9.7.

Thanks in advance.

Dieter

Step 4 of Cluster viral genomes based on ANI is not understood

Hello, I got 2465 clusters in the clusters.tsv file from 'Cluster viral genomes based on ANI'. Do I still need to extract the 2465 virus sequences and run blastani.py and cluster.py again? If so, what condition must be met before I can stop?

Thank you!

Identify_crispr.py error bin/pilecr

Hi,
Thanks for the great database and github repository. Really useful.

I am not a regular Python user, so I am struggling a bit with the identify_crispr.py script. I am running the script on the example data (within a conda environment with crt and pilecr installed). This is the error I am receiving:

Warning: skipping sequences due to error: /bin/sh: bin/pilecr: No such file or directory

I have installed the programs via conda and also downloaded the programs specified in your bin folder.

Any help would be greatly appreciated.
Thanks
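
The message suggests that the script calls `bin/pilecr` as a path relative to the current working directory, so the call may fail when the script is run from another directory or when the program is only available on `PATH` (for example, through conda). Below is a hedged sketch for checking where a tool can be found. `find_tool` and the `bin` fallback directory are illustrative assumptions, not part of the repository, and the tool names are copied from the question.

```python
import os
import shutil

def find_tool(name, fallback_dir="bin"):
    """Return an executable path for `name`, or None if it cannot be found."""
    on_path = shutil.which(name)  # search the directories listed in PATH
    if on_path:
        return on_path
    # Otherwise look for an executable file inside the fallback directory
    candidate = os.path.join(fallback_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return os.path.abspath(candidate)
    return None

for tool in ("pilecr", "crt"):
    print(tool, "->", find_tool(tool))
```

If a tool prints `None`, neither `PATH` nor the fallback directory (relative to where the script was started) contains it.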

Could you please provide a pipeline to help taxonomic annotation of viruses?

Hi,
I read your article on viruses from the human gut microbiome, which is very meaningful, and these pipelines are helpful for studying viral metagenomes. I would like to ask whether you can provide a pipeline for the taxonomic annotation of viruses, as done in your article.

How can I update the exclude_hmms files

Hi Stephen,
Thank you so much for your useful code, it helps a lot!
Your article states that, in your strategy for identifying viral signatures, 1,440 protein families commonly found in microbial genomes or plasmids and 452 protein families commonly found in viruses were removed from the IMG/VR and Pfam-A databases, respectively.
I have a question about this.
We are concerned about updates to those two databases, namely 'https://academic.oup.com/nar/article/49/D1/D764/5952208?login=true' and 'http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/'. Do we need to exclude more protein families if we want to use the new databases? If so, could you tell me how you generated the excluded files in 'viral_detection_pipeline/viral_detection_pipeline/input/exclude_hmms' so that I can update them myself?
Looking forward to your reply!
Best wishes!

question about virus taxonomic annotation

Hi, Stephen.

I have a question about the virus taxonomic annotation. Could you please help to explain?

"For each viral genome, we aggregated annotations across proteins after weighting by bit-scores. Each viral genome was then annotated at the lowest taxonomic rank having >70% agreement across annotated proteins." I am not sure about the "70% agreement". Does it mean that we only retain mappings with over 70% coverage? Or that 70% of the proteins in the viral genome must map to the reference?

Thanks.
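
One way to read the quoted sentence is that, at each rank, every annotated protein votes for its taxon with a weight equal to its bit-score, and the genome receives the lowest rank at which a single taxon holds more than 70% of the total weight. The sketch below implements that reading only. It is an interpretation, not the authors' code; the rank list, the input layout, the choice of denominator, and `consensus_taxonomy` itself are all assumptions.

```python
from collections import defaultdict

RANKS = ["family", "genus"]  # assumed ranks, ordered from highest to lowest

def consensus_taxonomy(hits, min_agreement=0.70):
    """hits: list of (bitscore, {rank: taxon}), one entry per annotated protein.

    Returns (rank, taxon) for the lowest rank whose top taxon holds more than
    min_agreement of the summed bit-scores, or None if no rank qualifies.
    """
    # Assumption: the denominator is the bit-score sum over all annotated proteins
    total = sum(bitscore for bitscore, _ in hits)
    if total <= 0:
        return None
    best = None
    for rank in RANKS:  # walk from the highest rank down to the lowest
        weights = defaultdict(float)
        for bitscore, lineage in hits:
            taxon = lineage.get(rank)
            if taxon:
                weights[taxon] += bitscore
        if not weights:
            break
        taxon, weight = max(weights.items(), key=lambda item: item[1])
        if weight / total > min_agreement:
            best = (rank, taxon)  # a lower rank that also passes replaces this
        else:
            break
    return best

# Made-up proteins: the family is agreed on, the genus is split
proteins = [
    (100.0, {"family": "FamX", "genus": "GenA"}),
    (80.0, {"family": "FamX", "genus": "GenB"}),
    (20.0, {"family": "FamY"}),
]
print(consensus_taxonomy(proteins))
```

Under this reading, "70% agreement" refers to the share of bit-score weight that supports one taxon, not to alignment coverage.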

So fast for clustering!

Will the genome IDs be stable/persistent in the new version?

Hi Stephen, I plan to create taxdump files for MGV so it could be used in taxonomic profiling tools including KMCP.

If the genome IDs are stable like these in GTDB (actually the ID are from NCBI), I could create trackable TaxIds for MGV genomes.

I checked them and found that they are not consecutive. Any information is appreciated.

# first five IDs after sorting
MGV-GENOME-0000594
MGV-GENOME-0000601
MGV-GENOME-0000693
MGV-GENOME-0000845
MGV-GENOME-0000895

# last 5 IDs
MGV-GENOME-4435975
MGV-GENOME-4435977
MGV-GENOME-4435989
MGV-GENOME-4435994
MGV-GENOME-4436002

Can the viral detection pipeline be applied to metagenomic bins?

Hello, I have some metagenomic data and obtained bins after processing with metabat2. Can I apply the viral detection pipeline to these bins to find the viruses in them? Thanks!

perform all-vs-all blastn of sequences

Perform all-vs-all BLAST using megablast utility:
blastn -query SRS1735492.fna -db blastdb -out blast.tsv -outfmt '6 std qlen slen' -max_target_seqs 25000

Regarding '-max_target_seqs 25000': if I make this parameter smaller (e.g. -max_target_seqs 100), does it make any difference to the later analysis? Will it affect the clustering?

Looking forward to your reply.

delete

I cannot use "count_hmm_hits_py"

It says: File "/xx/xx/count_hmm_hits_py", line 7

  <!DOCTYPE html>

SyntaxError: invalid syntax
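
A `SyntaxError` pointing at `<!DOCTYPE html>` indicates that the saved file contains an HTML page rather than Python source, which can happen when a web page is saved instead of the raw script. A small check along these lines can confirm that before running the file; `looks_like_html` is a hypothetical helper written for illustration.

```python
def looks_like_html(text):
    """True if the first non-blank line looks like the start of an HTML page."""
    for line in text.splitlines():
        stripped = line.strip().lower()
        if stripped:
            return stripped.startswith(("<!doctype html", "<html"))
    return False  # empty or whitespace-only content

print(looks_like_html("\n\n<!DOCTYPE html>\n<html>"))
print(looks_like_html("#!/usr/bin/env python\nimport sys\n"))
```

To check a file on disk, its contents can be passed in with `looks_like_html(open(path).read())`.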