mgv's People

Contributors

rocknhu, snayfach

mgv's Issues

Genus- and family-level clustering

Hi Stephen,

Thank you so much! I loved the article, and your code is so useful!

Two questions:

  1. MIUViG recommends thresholds for defining species-level vOTUs. Are there equivalent thresholds for family- and genus-level clusters? I would be very grateful if you could share anything on this.
  2. I could be wrong, but I think it may be useful in lines 34 and 36 to save only the second element of the list, and then change _[2] to _[o] in line 48. Otherwise, the dictionary built while iterating over an all-vs-all blastp table could be quite large, preventing people from using your code.

Virus classification annotation

Hello, I am very interested in your work and would like to thank you for your great help. May I ask how to classify viruses into DNA, dsDNA, ssDNA, ssDNA-RT, dsRNA, RNA, ssRNA, and ssRNA-RT, and what tools are available? Thank you again.

Could you please share the code for taxonomic annotation used in the article?

Hi, Stephen! I recently read your article 'Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome'. It is a great piece of work that unravels the virosphere of the human gut. I would like to reproduce the taxonomic annotation method described in your article. Could you please share the code for it? Thanks a lot! I wish you even greater academic achievements.

Virome profiling database recommendation

@snayfach Hi, Stephen, thank you for exploring the viral genome.

I have some questions, maybe you can help me.

awk '{print $1}' mgv_contig_info.tsv | sort | uniq | wc -l
189681

awk '{print $2}' mgv_contig_info.tsv | sort | uniq | wc -l
54119

You provided a total of 189680 viral genomes, which can be clustered into 54118 vOTUs, right?
If I want to do virome profiling based on the MGV database,
should I select all viral genomes or only the representative genome of each vOTU?

Thanks for your work!
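
The extra 1 in each awk count most likely comes from a header row in mgv_contig_info.tsv, which would be consistent with 189680 genomes and 54118 vOTUs. Below is a minimal Python sketch that skips the header before counting. It assumes a tab-separated file whose first column is the genome ID and whose second column is the vOTU ID (inferred from the awk commands above); the function name and the demo column names are illustrative only.

```python
import csv
import io

def count_unique_columns(handle, n_cols=2, has_header=True):
    """Count distinct values in the first n_cols columns of a TSV stream."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row so it is not counted as an ID
    seen = [set() for _ in range(n_cols)]
    for row in reader:
        for i in range(min(n_cols, len(row))):
            seen[i].add(row[i])
    return [len(s) for s in seen]

# Tiny in-memory demo with made-up IDs (not real MGV identifiers).
demo = io.StringIO("contig_id\tvotu_id\nG1\tOTU-1\nG2\tOTU-1\nG3\tOTU-2\n")
print(count_unique_columns(demo))  # [3, 2]
```

For the real file, the same function could be called as `count_unique_columns(open("mgv_contig_info.tsv", newline=""))`.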

I cannot use Marker gene phylogenetic tree.py

select protein clusters (PCs)
write seqs for each PC
build multiple seq alignment (MSA) for each PC, round 1
build hmm from each MSA
search hmms versus original proteins

Error: File format problem in trying to open HMM file ou/all.hmms.
File exists, but appears to be empty?

extract proteins for top hmm hits
Traceback (most recent call last):
  File "/mnt/24t/MGV/marker_gene_tree/marker_tree1.py", line 113, in <module>
    for line in open('%s/hmmsearch.txt' % args['out_dir']):
FileNotFoundError: [Errno 2] No such file or directory: 'ou/hmmsearch.txt'
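
The FileNotFoundError looks like a downstream symptom: the log shows that ou/all.hmms was already empty, so the HMM search had nothing to read and hmmsearch.txt was never written. One way to locate the first failing step is to check each intermediate output before moving on. The sketch below is only an illustration; `require_nonempty` is a hypothetical helper, not part of the MGV scripts.

```python
import os
import tempfile

def require_nonempty(path, step):
    """Raise a descriptive error if a pipeline step left no usable output."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "%s: expected output %r was not created" % (step, path))
    if os.path.getsize(path) == 0:
        raise ValueError(
            "%s: output %r is empty; check that the tool for this step "
            "is installed and ran without errors" % (step, path))
    return path

# Demo: an empty file stands in for the empty all.hmms from the log above.
with tempfile.TemporaryDirectory() as tmp:
    hmm_path = os.path.join(tmp, "all.hmms")
    open(hmm_path, "w").close()
    try:
        require_nonempty(hmm_path, "build hmm from each MSA")
    except ValueError as exc:
        print(exc)
```

Running such a check after the alignment and HMM-building steps would show whether the empty file comes from a missing program or from an earlier step that produced no alignments.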

the protein cluster issues

Hi Snayfach,
I am studying your excellent MGV work, and I am a little confused about the protein clusters (PCs).
In your mgv_pc_function.tsv file, the proteins within a PC are annotated with different functions. How should the function of the whole PC be defined?
Looking forward to your reply.
Sincerely,
L

cluster result question

Hi, I have run the entire ani_cluster process. How should the generated clusters.tsv and ani.tsv be interpreted, and how can I pick out the vOTUs that meet the following criteria: ≥95% ANI across ≥85% of the shorter sequence, with the longest sequence used as the representative?
Looking forward to your answer!
Best wishes!
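
To make the criteria in this question concrete, here is a hedged sketch of greedy, centroid-style clustering in which the longest unassigned genome becomes the representative and absorbs every genome linked to it at ≥95% ANI over ≥85% of the shorter sequence. It is not the repository's implementation, and it does not assume anything about the column layout of ani.tsv or clusters.tsv: the input tuples `(genome_a, genome_b, ani, coverage_of_shorter)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def greedy_votus(lengths, pairs, min_ani=95.0, min_cov=85.0):
    """Group genomes into vOTUs, using the longest member as representative.

    lengths: dict mapping genome ID to sequence length
    pairs:   iterable of (a, b, ani, cov), where cov is the percentage of
             the shorter sequence covered by the alignment (assumed input)
    """
    edges = defaultdict(set)
    for a, b, ani, cov in pairs:
        if a != b and ani >= min_ani and cov >= min_cov:
            edges[a].add(b)
            edges[b].add(a)

    assigned = {}
    # Visit genomes from longest to shortest; ties are broken by ID.
    for genome in sorted(lengths, key=lambda g: (-lengths[g], g)):
        if genome in assigned:
            continue
        assigned[genome] = genome  # new representative
        for neighbor in edges[genome]:
            assigned.setdefault(neighbor, genome)

    clusters = defaultdict(list)
    for member, rep in assigned.items():
        clusters[rep].append(member)
    return dict(clusters)

lengths = {"A": 5000, "B": 4000, "C": 3000, "D": 2000}
pairs = [("A", "B", 97.0, 90.0), ("B", "C", 96.0, 88.0), ("A", "D", 99.0, 50.0)]
print(greedy_votus(lengths, pairs))
```

In this toy input, B joins A, C stays separate because it is linked only to B (which is not a representative), and D stays separate because its coverage is below 85%.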

identify_crispr.py error: TypeError: str() takes at most 1 argument (2 given)

Hello, thanks for this fantastic virome analysis tutorial!
When I ran the crispr_spacers step, identify_crispr.py failed as follows:

Traceback (most recent call last):
  File "/public/home/zycheng/Applications/MGV/crispr_spacers/identify_crispr.py", line 485, in <module>
    arrays = run_pipeline(args, offsets, append_ns=50, program=program)
  File "/public/home/zycheng/Applications/MGV/crispr_spacers/identify_crispr.py", line 394, in run_pipeline
    out, err, code = run_crt(tmp_path, args['tempdir'], xmx=args['xmx'])
  File "/public/home/zycheng/Applications/MGV/crispr_spacers/identify_crispr.py", line 99, in run_crt
    return str(o,encoding='utf-8'), e, c
TypeError: str() takes at most 1 argument (2 given)

My Python version is 2.7, and my Biopython version is 1.76. Many thanks for your assistance.
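
For context, `str(o, encoding='utf-8')` is Python 3 syntax: in Python 2, `str()` accepts a single argument, which matches the TypeError above. This suggests the script targets Python 3, so running it under Python 3 is probably the simplest route. If a version-agnostic decode were needed, a minimal sketch could look like the following; `to_text` is a hypothetical helper, not a function from identify_crispr.py.

```python
def to_text(data, encoding="utf-8"):
    """Return text for subprocess output on both Python 2 and Python 3."""
    if isinstance(data, bytes):
        # bytes.decode() exists in both major versions, unlike the
        # two-argument str(data, encoding=...) form, which is Python 3 only.
        return data.decode(encoding)
    return data

print(to_text(b"CRISPR 1 Range: 682229 - 685495"))
print(to_text("already text"))
```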

identify_crispr.py TypeError: argument should be integer or bytes-like object, not 'str'

Hello, and thanks for this great viral detection pipeline! Earlier steps worked seamlessly for me.

Running the example code python identify_crispr.py -i example/GUT_GENOME147678.fna -o out in a cloned repository gives me the following error:

Traceback (most recent call last):
  File "/home/ebel/user_data/FeFiFo/MGV/crispr_spacers/identify_crispr.py", line 485, in <module>
    arrays = run_pipeline(args, offsets, append_ns=50, program=program)
  File "/home/ebel/user_data/FeFiFo/MGV/crispr_spacers/identify_crispr.py", line 405, in run_pipeline
    my_arrays = parse_crt(out, contigs)
  File "/home/ebel/user_data/FeFiFo/MGV/crispr_spacers/identify_crispr.py", line 108, in parse_crt
    if out.find('\nCRISPR') == -1:
TypeError: argument should be integer or bytes-like object, not 'str'

The out directory contains one file, out/temp/seqs/0d239010-1666-403a-9b21-f06c24f1d762, which is a fasta with the 50 Ns at either end.

I'm not overly skilled at Python, but adding some print commands let me see the output of

out, err, code = run_crt(tmp_path, args['tempdir'], xmx=args['xmx']).

out, which seems to be causing the problem, prints like it's a big string...

b'\n\nReading file out/temp/seqs/17b75e28-a677-44b1-bc28-e9d7592c72e4\nReading file complete\ntmpfile\n2833987 bases\nSearching for repeats...\nTime to search for repeats: 453 ms\n2 possible CRISPR(s) found\n\nORGANISM: tmpfile\nBases: 2833987\n\n\nCRISPR 1 Range: 682229 - 685495\nPOSITION\tREPEAT\t\t\t\tSPACER\n--------\t----
-----------------------------\t----------------------------------\n682229\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tAATAATCCAGTGCTCATTTTTTGATCTCCTTCGGT\t[ 33, 35 ]
\n682297\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tATTCCATTTTACATTCCTATGCACTTAGGAGACCC\t[ 33, 35 ]\n682365\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tAAGACAGTTGCAACGCGATTGATGCCGGAAGAAT\t[ 33, 34 ]\n682432\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCGATCTGGAAAACAATGTGTATGAGACGAACGA\t[ 33, 33 ]\n682498\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCTGCGTCTGCGGCCGGAGCATACCGAAGCTCACGTG\t[ 33, 36 ]\n682567\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCGCCTCCGGTGCGCTGTTGCCGTCGTACAGCGTCG\t[ 33, 35 ]\n682635\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTGCTCCGATTTAATGCACCGGACCGCATACCGG\t[ 33, 33 ]\n682701\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tAGCCGCTCTATCGGGACGTTCCGCAGCCCGGACCG\t[ 33, 35 ]\n682769\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tATAAGCAGCCTGCGGGTCAGGGAACAGACCGGGCG\t[ 33, 35 ]\n682837\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTCGCTACCAGTGGCACTAGGCATGGCAGTCAGGCT\t[ 33, 35 ]\n682905\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTCTTGTGTCTCCTTCGATAACGCCCTTGCATCTGT\t[ 33, 35 ]\n682973\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tCTCAGTCGTACGGCTCAGTGAGCAAGAGCTGAG\t[ 33, 33 ]\n683039\t\tGTCGCCCTCCTTGCGGAGGGCGTGGATAGAAAT\tTTTTCCAGGCCGCACTCCACCATCTTGCTGCTGGCC\t[ 33, 36 ]\n683108\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACAAATGAAACGGGTAAAAAATTAGTTGGAAATG\t[ 33, 34 ]\n683175\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAATCAGGATAACACCCTTGCAGACACGCTTATCC\t[ 33, 34 ]\n683242\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGACGAACGCAACGGCAATTTCGAGGAAGCAAAGGA\t[ 33, 35 ]\n683310\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCTCCTTGTCATGTAGCGCGTGATAAATCGCGTCC\t[ 33, 34 ]\n683377\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCCGTTCTTCCCGAAAGAAACGGCAGACCAGTTGG\t[ 33, 34 ]\n683444\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATCCACCCGGCGGGCATAGGCCGCGCTGGCAAGAGC\t[ 33, 36 ]\n683513\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGTAAGAAGAAGCTCGGTGCAATCAAAATCGGAAAG\t[ 33, 35 ]\n683581\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCGTCATCAGAGTAGCGGACGGCAAAGTAGCCGCCG\t[ 33, 35 ]\n683649\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATCCAGCTCAGGCGCTCAGTGTTGATACCTGCAT\t[ 33, 34 
]\n683716\t\tGTCGCCCTCCTCGCAGAGGGCGTGGATAGAAAT\tACCTCATAGCCTTTGCCAAGCGTCGCGTCGCAA\t[ 33, 33 ]\n683782\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGGGTCAAACTTGAGGCCGTCCCGGTAGCCCTTC\t[ 33, 33 ]\n683848\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAGCGGTGCAGGCTCAACCACGGTGGGAGTGCTGG\t[ 33, 34 ]\n683915\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAGCGAGCTTTACAGCATCCCGGCATGAACGCCG\t[ 33, 33 ]\n683981\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCTGCTGGCTCGGCTCAACGCCCAGCGGCATCCCA\t[ 33, 34 ]\n684048\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTTTGCGTTGTACCGCGCCGGTCCTGCGACGGCG\t[ 33, 33 ]\n684114\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCCCCCAAGCGTGTGGTAGCTGCGATAGAGGTACT\t[ 33, 34 ]\n684181\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTGTGCGCTGCGCCGCTCCGGTGCAGGCAGGTCAA\t[ 33, 34 ]\n684248\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACCCACGGACGGGAGCTCACATGGCAACCCTCAG\t[ 33, 34 ]\n684315\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tAGGCTGTTGTAGTCGCTGAATACGCCGCTCCGGC\t[ 33, 34 ]\n684382\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTACTCTCGTCGTGGATGCAATACCTACAATTTTT\t[ 33, 34 ]\n684449\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTGCCTCGTCACGGGTCAGATGCGGATGCGGCATC\t[ 33, 34 ]\n684516\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATGTACCCGTTAGCAAAGGCCACCAGGCCGTTAGT\t[ 33, 35 ]\n684584\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATATTATCATTGATCGTATGAAGAGATCCAGAAATA\t[ 33, 36 ]\n684653\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACGACAACAGCATACTTTCGCACACCACGGCGG\t[ 33, 33 ]\n684719\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTCGTGGTGGCCTTGCCCATATCCAGTGCTGCTTG\t[ 33, 34 ]\n684786\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTCGCTGGCCAGAGGGCAGCCATGCGCACTGCATC\t[ 33, 34 ]\n684853\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATCAGACCGAAAGCATCCAACCAGTCGAAGAAGTC\t[ 33, 35 ]\n684921\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCGACCTCGAAGCTGCAGGCAGTTTACCGCAAGTTT\t[ 33, 35 ]\n684989\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tACGCTGTATTTACGGTTGGATAGCTGGCTGATTTG\t[ 33, 35 ]\n685057\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTGACGGCAAGACCCAGCGCGTGGATGACCTCTTT\t[ 33, 34 
]\n685124\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tATGAGAAGCTGGGCATCAAGGATAAGTATACCGAT\t[ 33, 35 ]\n685192\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tTTCGATCTTCTCGCGCAGATGGAACGGCTCATAGGC\t[ 33, 36 ]\n685261\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tCGGCCAGCCTGACCCTCCAGACCCGCCGCAGAAC\t[ 33, 34 ]\n685328\t\tGTCGCCCTCCTCGCGGAGGGCGTGGATAGAAAT\tGGCAAGAAGTTTGGCTTCGCACTGGATGTTCTCAC\t[ 33, 35 ]\n685396\t\tGTCGCCTTCCTCGCGGAGGGCGTGGATAGAAAT\tACCGCATAACCCCACTGGAAAACGCCATCACGCA\t[ 33, 34 ]\n685463\t\tGTCGCCCTCCTCGCGGAGGGCGCAAGTAGAAAT\t\n--------\t---------------------------------\t----------------------------------\nRepeats: 49\tAverage Length: 33\t\tAverage Length: 34\n\n\n\nCRISPR 2 Range: 1072592 - 1073272\nPOSITION\tREPEAT\t\t\t\tSPACER\n--------\t--------------------------------\t--------------------------------\n1072592\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tGCGAGAGCATAGAAGATGAGCTGTCCGACCTCG\t[ 32, 33 ]\n1072657\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tTCCCCGGTGCAGCCCCGGCGGATGGTCGGCAGAAT\t[ 32, 35 ]\n1072724\t\tATTGCAGTTTATCCCCGCAAGGGGACGATAAC\tAGAACTGCAGTATGGATGTCGTTCACGCGGTAG\t[ 32, 33 ]\n1072789\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tATCAAAGATCTTTCTAAACTCCTCGATATCGTA\t[ 32, 33 ]\n1072854\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tGAGCAAGTGCTTTCTGTTCGTCGTAAGTCATCAC\t[ 32, 34 ]\n1072920\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tATCAAAGATCTTTCTAAACTCCTCGATATCGTA\t[ 32, 33 ]\n1072985\t\tATTGCAGCTTATCCCTGCAAGGGGACGATAAT\tTTCACTTTCTTGATCTGCTTCTTTTCGGCACGG\t[ 32, 33 ]\n1073050\t\tATTGCAGCTTATCCCCGCAAAGGGACGATAAC\tGTCACAGCTTGCTGCTTGTCCAGTTTGGCGTCC\t[ 32, 33 ]\n1073115\t\tATTGCAGCTTATCCCCGCAAGGGGACGATAAC\tGACCTTTGCCATTGCCTTCTTGTCCTTGCGGCTC\t[ 32, 34 ]\n1073181\t\tATTGCAGCTTATCCCCGCAAGGGGGCGAAAAC\tAAACTCCATTCTTATTTCAGATTAAATA\t[ 32, 28 ]\n1073241\t\tAATAAGTACCATCCCCGTAAGGGGGCAAAAAC\t\n--------\t--------------------------------\t--------------------------------\nRepeats: 11\tAverage Length: 32\t\tAverage Length: 32\n\n\n\nTime to find repeats: 453 ms\n\n\n\n'

...but if I try to write out to file, just before if out.find('\nCRISPR') == -1, I get:

TypeError: write() argument must be str, not bytes

Any help would be much appreciated! Thank you :)
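
The `b'...'` prefix in the printed value shows that `out` is a bytes object. On Python 3, `bytes.find()` rejects a `str` argument and text-mode `write()` rejects bytes, which matches both errors above. The sketch below reproduces the first error on a short sample and shows one possible workaround, decoding the subprocess output before parsing. `run_and_decode` is a hypothetical stand-in for the script's own `run_crt`, whose internals are not shown in this issue.

```python
import subprocess
import sys

def run_and_decode(cmd, encoding="utf-8"):
    """Run a command and return (stdout, stderr, returncode) as text."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout.decode(encoding), proc.stderr.decode(encoding), proc.returncode

raw = b"\nCRISPR 1 Range: 682229 - 685495\n"
try:
    raw.find("\nCRISPR")  # str pattern against bytes data
except TypeError as exc:
    print("bytes.find(str) fails:", exc)

# After decoding, the same search works on text.
print(raw.decode("utf-8").find("\nCRISPR") != -1)

# The child process here is just Python printing a marker line.
out, err, code = run_and_decode([sys.executable, "-c", "print('CRISPR 1')"])
print(out.find("CRISPR") != -1, code)
```

An alternative with the same effect would be to keep the bytes and search with a bytes pattern such as `b'\nCRISPR'`, but decoding once keeps the rest of the parsing in ordinary strings.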

master_table.py error

Hi,

Thanks for developing this pipeline.

I was able to generate all files up to the master table. At that step, the VirFinder results seem to be parsed incorrectly, which returns the following error:

Traceback (most recent call last):
  File "/Volumes/mikami/bio/apps/12.04/sw/mgv/1.0/bin/master_table.py", line 21, in <module>
    info[id]['vfr_pvalue'] = r[-1]

Would it be possible to provide any insights as to what may be going on? My python version is 3.9.7.

Thanks in advance.

Dieter

Step 4 of "Cluster viral genomes based on ANI" is unclear

Hello, I got 2465 clusters in the clusters.tsv file from 'Cluster viral genomes based on ANI'. Do I still need to extract the 2465 representative virus sequences and run blastani.py and cluster.py again? If the process needs to be repeated, what condition must be met before it can stop?

Thank you!

Identify_crispr.py error bin/pilecr

Hi,
Thanks for the great database and github repository. Really useful.

I am not a regular python user so struggling a bit with the identify_crispr.py script. I am running the script with the example data (within a conda environment with installed crt and pilecr). This is the error I am receiving:

Warning: skipping sequences due to error: /bin/sh: bin/pilecr: No such file or directory

I have installed the programs via conda and also downloaded the programs specified in your bin folder.

Any help would be greatly appreciated.
Thanks
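
The message `/bin/sh: bin/pilecr: No such file or directory` suggests that the tool is being invoked through the relative path `bin/pilecr`, which the shell resolves against the current working directory rather than against the script's location or the conda environment. That reading is an assumption based only on the error text. A hedged sketch of a lookup that first tries a `bin/` folder next to the script and then falls back to `PATH` is shown below; `resolve_tool` is a hypothetical helper and is not part of identify_crispr.py.

```python
import os
import shutil
import tempfile

def resolve_tool(name, script_dir):
    """Return a usable path for an external tool, or None if not found."""
    bundled = os.path.join(script_dir, "bin", name)
    if os.path.isfile(bundled) and os.access(bundled, os.X_OK):
        return bundled
    # Fall back to whatever is installed on PATH (for example via conda).
    return shutil.which(name)

# Demo with a throwaway directory and a fake executable.
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "bin"))
    fake = os.path.join(tmp, "bin", "faketool")
    with open(fake, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(fake, 0o755)
    print(resolve_tool("faketool", tmp) == fake)
    print(resolve_tool("no-such-tool-xyz-123", tmp))
```

If the relative-path reading is right, running the script from the directory that contains `bin/` may also be worth trying.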

How can I update the exclude_hmms files?

Hi Stephen,
Thank you so much for your useful code, it helps a lot!
Your article states that 1,440 protein families commonly found in microbial genomes or plasmids and 452 protein families commonly found in viruses were removed from the IMG/VR and Pfam-A databases, respectively, as part of your viral signature identification strategy.
I have a question about this.
Both databases have since been updated (see 'https://academic.oup.com/nar/article/49/D1/D764/5952208?login=true' and 'http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/'). Do we need to exclude more protein families if we want to use the new versions? If so, may I ask how you generated the exclusion files in 'viral_detection_pipeline/viral_detection_pipeline/input/exclude_hmms', so that I can update them myself?
Looking forward to your reply!
Best wishes!

question about virus taxonomic annotation

Hi, Stephen.

I have a question about the virus taxonomic annotation. Could you please help to explain?

"For each viral genome, we aggregated annotations across proteins after weighting by bit-scores. Each viral genome was then annotated at the lowest taxonomic rank having >70% agreement across annotated proteins." I am not sure what "70% agreement" means. Does it mean we only retain mappings with over 70% coverage? Or do we need 70% of the proteins in the viral genome to map to the reference?

Thanks.
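
One possible reading of the quoted sentence is a bit-score-weighted vote: at each rank, sum the bit-scores of the annotated proteins per taxon, then accept the lowest rank at which a single taxon holds more than 70% of the total weight. The sketch below illustrates only that reading and is not the authors' code. The rank list, the input structure, and the choice of denominator (total bit-score of all annotated proteins) are assumptions.

```python
from collections import defaultdict

# Hypothetical rank order, lowest rank first.
RANKS = ["genus", "family", "order"]

def consensus_taxon(hits, min_agreement=0.70, ranks=RANKS):
    """Return (rank, taxon) for the lowest rank with > min_agreement support.

    hits: list of (bitscore, lineage) for annotated proteins, where lineage
          is a dict such as {"genus": "G1", "family": "F1"}
    """
    total = sum(bitscore for bitscore, _ in hits)
    if total <= 0:
        return None
    for rank in ranks:
        weights = defaultdict(float)
        for bitscore, lineage in hits:
            taxon = lineage.get(rank)
            if taxon:
                weights[taxon] += bitscore
        if not weights:
            continue
        best = max(weights, key=weights.get)
        # Agreement is the winning taxon's share of all annotated weight.
        if weights[best] / total > min_agreement:
            return rank, best
    return None

hits = [
    (100.0, {"genus": "G1", "family": "F1"}),
    (100.0, {"genus": "G2", "family": "F1"}),
    (50.0, {"family": "F1"}),
]
print(consensus_taxon(hits))
```

In this toy input no genus reaches 70% of the weight (G1 and G2 each hold 100 of 250), while family F1 holds all of it, so the genome would be annotated at family level.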

Will the genome IDs be stable/persistent in the new version?

Hi Stephen, I plan to create taxdump files for MGV so that it can be used in taxonomic profiling tools, including KMCP.

If the genome IDs are stable like those in GTDB (those IDs actually come from NCBI), I could create trackable TaxIds for MGV genomes.

I checked the IDs and found that they are not consecutive. Any information is appreciated.

# first five IDs after sorting
MGV-GENOME-0000594
MGV-GENOME-0000601
MGV-GENOME-0000693
MGV-GENOME-0000845
MGV-GENOME-0000895

# last 5 IDs
MGV-GENOME-4435975
MGV-GENOME-4435977
MGV-GENOME-4435989
MGV-GENOME-4435994
MGV-GENOME-4436002

perform all-vs-all blastn of sequences

Perform all-vs-all BLAST using megablast utility:
blastn -query SRS1735492.fna -db blastdb -out blast.tsv -outfmt '6 std qlen slen' -max_target_seqs 25000

If I make the -max_target_seqs 25000 parameter smaller (e.g. -max_target_seqs 100), does it make any difference to the later analysis? Will it affect the clustering?

Looking forward to your reply
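
A lower cap can only change the results for queries that have more matching subjects than the cap allows. One way to judge this on a given dataset is to count the distinct subjects reported per query in the tabular output and flag queries that reach the limit. The sketch below assumes the `-outfmt '6 std qlen slen'` layout from the command above, where the first two columns are the query and subject IDs; `queries_at_cap` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

def queries_at_cap(handle, max_target_seqs):
    """List queries whose number of distinct subjects reaches the cap."""
    subjects = defaultdict(set)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            subjects[row[0]].add(row[1])  # one subject may have several HSP rows
    return sorted(q for q, s in subjects.items() if len(s) >= max_target_seqs)

# Made-up rows with only the first three columns filled in.
demo = io.StringIO(
    "q1\ts1\t99.0\n"
    "q1\ts2\t98.0\n"
    "q1\ts2\t97.5\n"
    "q1\ts3\t96.0\n"
    "q2\ts1\t99.0\n"
)
print(queries_at_cap(demo, max_target_seqs=3))  # ['q1']
```

If no query reaches the smaller cap on a trial run, lowering the value is unlikely to remove hits for that dataset; queries that do reach it may be missing some pairwise comparisons that the clustering would otherwise use.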
