VirHostMatcher: matching hosts of viruses based on oligonucleotide frequency (ONF) comparison

License: Other


virhostmatcher's Introduction

Basic tools for computing various oligonucleotide frequency (ONF) based distance/dissimilarity measures

Requirements

The source code is written in C++ and wrapped by a Python script, so it requires Python and a C++ compiler. It works under Windows, Linux or Mac environments. Precompiled executables are provided for these three platforms, and users have the option of compiling the source code for their specific platform if desired (see the Compilation section below).

Usage

This program computes various oligonucleotide frequency (ONF) based distance/dissimilarity measures between a pair of DNA sequences. VirHostMatcher uses these measures to predict the potential host of a query virus by identifying the host to which it has the strongest similarity. Predictions are based on the observation that viruses and hosts often share similar ONF patterns. The measures computed by VirHostMatcher include Euclidean distance (Eu), Manhattan distance (Ma), Chebyshev distance (Ch), Jensen-Shannon divergence (JS), d2 dissimilarity, d2* dissimilarity, d2S dissimilarity, Hao dissimilarity, Teeling dissimilarity, EuF distance and Willner distance. There is also the option to compute only d2* dissimilarity. See the paper "Alignment-free d2* oligonucleotide frequency dissimilarity measure improves accuracy of predicting virus-host interactions" (Ahlgren, Ren et al. submitted) for the definitions. The tool also provides user-friendly visualization of virus-host interactions based on the pairwise distance/dissimilarity between viruses and hosts.
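To make the idea of an ONF-based distance concrete, here is a minimal Python sketch of the three simplest measures listed above (Eu, Ma, Ch) computed on k-mer frequency vectors. It is an illustration only, not the VirHostMatcher implementation: the function names are hypothetical, only the forward strand is counted, and k-mers containing characters other than A, C, G, T are ignored. Whether the tool itself makes the same choices is an assumption.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return relative frequencies of all 4**k DNA k-mers, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T contribute to the total.
    total = sum(counts[w] for w in kmers)
    if total == 0:
        raise ValueError("sequence has no valid k-mers of length %d" % k)
    return [counts[w] / total for w in kmers]


def euclidean(f, g):
    """Euclidean distance between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(f, g)) ** 0.5


def manhattan(f, g):
    """Manhattan distance between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(f, g))


def chebyshev(f, g):
    """Chebyshev distance: the largest single-coordinate difference."""
    return max(abs(a - b) for a, b in zip(f, g))
```

For a virus sequence `v` and a host sequence `h`, something like `euclidean(kmer_frequencies(v, 6), kmer_frequencies(h, 6))` gives one number per virus-host pair; a smaller value means more similar ONF patterns. The d2*-type measures additionally depend on a background model and are defined in the paper.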

To use the tool, follow the steps below, copying and pasting the commands into a terminal. Remember to adjust the path variables to your own (i.e. replace <Path_to_XXX> with your own path).

  • Step 1: Download the whole package from https://github.com/jessieren/VirHostMatcher

  • Step 2: Prepare a folder containing virus fasta files and a folder containing host fasta files. No subfolders are allowed: all fasta files should be placed directly in their folder.

  • Step 3: Prepare a text file for taxonomy of the hosts. Please follow the format in /test/hostTaxa.txt: one line per host sequence, with tab-delimited taxon names. There should be no missing taxon name entries; fill these with text such as 'NA' or 'unknown'. The host names in the hostTaxa.txt file need to be exactly the same as the host fasta file names, including the file name extensions; otherwise the program cannot match the host fasta files to the host taxonomy information. If there is no taxonomy information, a hostTaxa.txt file will be generated in the output directory with all "unknown"s.

  • Step 4: Run the program using the following command.

      python /Path_to_VirHostMatcher/vhm.py -v <Path_to_virus_folder(required)> -b <Path_to_host_folder(required)> -o <Path_to_output(required)> -t <Path_to_hostTaxaFile> -d <1_if_only_compute_d2star>
    

    For a detailed description of the parameter settings, run python /Path_to_VirHostMatcher/vhm.py --help

  • Congratulations! The results can be found in the output folder. The output folder contains:

      [measure Name]_k[k-tuple length].csv	The dissimilarity/distance matrix for pairs of viruses and hosts;
    
      [measure Name]_k[k-tuple length]_main.html	The html file for visualization of the virus-host interactions;
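The requirements in Step 3 (tab-delimited lines, no empty taxon entries, host names identical to the fasta file names) can be checked before a run. The following is a minimal sketch of such a check, not part of VirHostMatcher. The function name is hypothetical, and it assumes that the first column holds the host fasta file name and that the first line is a header row; compare with the example hostTaxa.txt in the test directory before relying on either assumption.

```python
import os


def check_host_taxa(taxa_path, host_dir, has_header=True):
    """Return a list of problems found in a tab-delimited host taxonomy file.

    Assumptions (not confirmed by the README): the first column holds the
    host fasta file name, and the first line may be a header row.
    """
    problems = []
    with open(taxa_path) as handle:
        rows = [line.rstrip("\r\n").split("\t") for line in handle if line.strip()]
    if has_header and rows:
        rows = rows[1:]

    listed = []
    for number, fields in enumerate(rows, start=1):
        # Every taxon entry must be filled, e.g. with 'NA' or 'unknown'.
        if any(field.strip() == "" for field in fields):
            problems.append("entry %d has an empty taxon field" % number)
        listed.append(fields[0])

    fasta_files = sorted(
        name for name in os.listdir(host_dir)
        if os.path.isfile(os.path.join(host_dir, name))
    )
    # Names must match exactly, including the file name extension.
    for name in fasta_files:
        if name not in listed:
            problems.append("no taxonomy line for host file %s" % name)
    for name in listed:
        if name not in fasta_files:
            problems.append("taxonomy line %s matches no host file" % name)
    return problems
```

An empty list means the file passed these checks. Any other file in the host folder (for example a hidden system file) is reported as a host without a taxonomy line, which is consistent with the rule that the host folder should contain only fasta files.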
    

A test example

You can find a directory named "test" in the VirHostMatcher package. Two test examples, one small and one large, have been prepared for users. The small test set named "toyexample" contains 12 viruses and 23 hosts. The larger test set named "352virus" contains 352 viruses and 71 hosts, which is the dataset used in Figure 1 of the paper. Each of the two test sets has three folders, "virus", "host" and "output", and one file, "hostTaxa.txt". The folder "virus" contains the virus fasta files, "host" contains the host fasta files, and "output" contains all the output results, where the distance matrix files (".csv") and visualization files (".html") can be found. The file hostTaxa.txt lists the taxonomy information of the host species. To run the program with the test data, use the following commands after adjusting the path variables to your own (i.e. replace <Path_to_XXX> with your own path).

python /Path_to_VirHostMatcher/vhm.py -v /Path_to_VirHostMatcher/test/toyexample/virus -b /Path_to_VirHostMatcher/test/toyexample/host -o /Path_to_VirHostMatcher/test/toyexample/output -t /Path_to_VirHostMatcher/test/toyexample/hostTaxa.txt

python /Path_to_VirHostMatcher/vhm.py -v /Path_to_VirHostMatcher/test/352virus/virus -b /Path_to_VirHostMatcher/test/352virus/host -o /Path_to_VirHostMatcher/test/352virus/output -t /Path_to_VirHostMatcher/test/352virus/hostTaxa.txt
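Once a run finishes, the distance matrix files (".csv") in the output folder can also be inspected with a short script. The sketch below picks, for each virus, the host with the smallest dissimilarity. It is not part of VirHostMatcher, the function name is hypothetical, and the file layout is an assumption: a header row of host names, then one row per virus with the virus name in the first column. Check a real output file first and adapt the parsing if the layout differs.

```python
import csv


def best_hosts(csv_path):
    """Map each row label to (column label, value) for the smallest value.

    Assumed layout (not confirmed by the README): the first row lists host
    names, and each later row is a virus name followed by its distances.
    """
    best = {}
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        hosts = next(reader)[1:]
        for row in reader:
            if not row:
                continue
            # Pair each numeric cell with its host; skip empty cells.
            pairs = [
                (float(cell), host)
                for cell, host in zip(row[1:], hosts)
                if cell.strip()
            ]
            if pairs:
                value, host = min(pairs)
                best[row[0]] = (host, value)
    return best
```

For example, `best_hosts("/Path_to_VirHostMatcher/test/toyexample/output/d2star_k6.csv")` would return a dictionary keyed by virus name, provided the file follows the assumed layout. The smallest value is used because these are dissimilarities, so a lower number means a stronger similarity between virus and host.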

Visualization

VirHostMatcher provides a convenient way to visualize and analyze the output in a browser. For each distance/dissimilarity measure, a corresponding webpage named '[measure Name]_k[k-tuple length]_main.html' is generated in the output folder, for example 'd2star_k6_main.html' for d2* dissimilarity with k=6. The visualization has three parts: (1) the left panel, where the virus of interest is selected; (2) the middle panel, which plots a heatmap of the distances between the selected virus and the top-ranked hosts, with detailed information shown when the mouse is moved over a grid cell; (3) the right panel, which summarizes the taxonomic consensus information.

Compilation

VirHostMatcher ships precompiled executables under <Path_to_VirHostMatcher/bin> for three main operating systems: Linux, Windows and Mac. It automatically detects the operating system and uses the corresponding executables. On other operating systems, VirHostMatcher will automatically attempt to compile the source code for your platform.

If users desire, the source code and Makefile are provided for compilation on their particular machine. VirHostMatcher can be compiled by running the following commands under the main directory,

cd <Path_to_VirHostMatcher>
make

With this option, the Makefile compiles the source code into executable files in the main folder <Path_to_VirHostMatcher>, where they will be detected and used by the main Python script.

Reference and Citation

If you use VirHostMatcher, please cite the following paper:

Ahlgren, Nathan A., Jie Ren, Yang Young Lu, Jed A. Fuhrman, and Fengzhu Sun. "Alignment-free d2* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences." Nucleic Acids Research (2016): gkw1002.

Note that the supplemental table referenced in the above paper published in Nucleic Acids Research that lists the accession numbers of viral and host genomes used in this paper is also provided here as the file "Supplemental_table_virus_and_host_genomes.xlsx".

Contacts and bug reports

Jie Ren [email protected]

Yang Lu [email protected]

Nathan Ahlgren [email protected]

Fengzhu Sun [email protected]

If you found a bug or mistake in this project, we would like to know about it. Before you send us the bug report though, please check the following:

  1. Are you using the latest version? The bug you found may already have been fixed.
  2. Check that your input is in the correct format and you have selected the correct options.
  3. Please reduce your input to the smallest possible size that still produces the bug; we will need your input data to reproduce the problem, and the smaller you can make it, the easier it will be.

Copyright and License Information

Copyright (C) 2017 University of Southern California, Jie Ren, Nathan Ahlgren, Yang Lu, Fengzhu Sun

This program is free software: you can redistribute it and/or modify it under the terms of USC-RL v1.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

virhostmatcher's Issues

ERROR: zero bytes of file annotation

Hi!
I am getting the following error when using virhostmatcher and I am not sure why:
head:

ERROR: zero bytes of file annotation
Step 1: counting kmers for virus HP-T1-3_contig.fa_sel1_provirus.fa
 (Average time for counting kmers for one virus: 0.1439s) 

tail:

Step 1: counting kmers for host HP-T1-3_bin3_.fa
 (Average time for counting kmers for one host: 1.033s) 
 (ETR for counting kmers for hosts: 61.9818s) 
Step 1: counting kmers for host HP-T1-6_bin5_.fa
 (Average time for counting kmers for one host: 1.0353s) 
 (ETR for counting kmers for hosts: 61.0812s) 

Current code throws IndexError

Trying to run the toy test example results in the following error:

python vhm.py -v test/toyexample/virus -b test/toyexample/host -t test/toyexample/hostTaxa.txt -o ~/output

Traceback (most recent call last):
  File "/users/PAS1117/osu8359/tools/VirHostMatcher/vhm.py", line 236, in
    hostTaxaTable[hostTaxaTable=='']='unknown'
IndexError: in the future, 0-d boolean arrays will be interpreted as a valid boolean index

Run using python 3.5.2, following a successful make.

Threshold of positive result

Hi,

I am using VirHostMatcher to predict the interaction between host and virus. But I don’t know the threshold to pick the positive interaction result from d2star_k6.csv. Could you give me some suggestions?

Thanks in advance

ERROR in counting kmers

Hi Ren:
I am a PhD student from SYSU, China. I have asked you some questions before and you were very patient in answering them. Now I have encountered a problem when using the tool "VirHostMatcher". When I use the command "python /Path_to_VirHostMatcher/vhm.py -v <Path_to_virus_folder(required)> -b <Path_to_host_folder(required)> -o <Path_to_output(required)> -t <Path_to_hostTaxaFile>",
the screen shows "Step 1: counting kmers for virus (Average time for counting kmers for one virus: 0.4454s) (ETR for counting kmers for viruses: 4.8996s) Step 1: counting kmers for virus ERROR in counting kmers for" and the program aborts abnormally. I don't know why this happens or how to solve it. Would it be possible for you to help me? I have sent the input files and output files to your e-mail "[email protected]".
Thanks very much.

Sincerely,
Yuan

Compiling on Mac with clang

Hey, I'm having some trouble when using the default clang compiler on the Mac. I get errors like this

$ ./countKmer.out -l -k 1 -i test/toyexample/virus/AJ609634.fasta -o OUTPUT/tmp/KC/AJ609634.fasta -s AJ609634.fasta

option -l: the input file is longseq (need to concatenate lines).
option -k, the value of k, with value `1'
option -i, the input filename, with value `test/toyexample/virus/AJ609634.fasta'
option -o, the output directory, with value `OUTPUT/tmp/KC/AJ609634.fasta'
option -s, the short name, with value `AJ609634.fasta'
outputFileDir: OUTPUT/tmp/KC/AJ609634.fasta/
combinefileName: OUTPUT/tmp/KC/AJ609634.fasta/AJ609634.fasta_combine
dyld: lazy symbol binding failed: Symbol not found: __ZNKSt9basic_iosIcSt11char_traitsIcEEcvbEv
  Referenced from: VirHostMatcher/./countKmer.out
  Expected in: /usr/lib/libstdc++.6.dylib

dyld: Symbol not found: __ZNKSt9basic_iosIcSt11char_traitsIcEEcvbEv
  Referenced from: VirHostMatcher/./countKmer.out
  Expected in: /usr/lib/libstdc++.6.dylib

Abort trap: 6

Default version of g++ is

$ /usr/bin/g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

When I compile the c++ code using the gnu gcc compiler from Homebrew (in this case g++-7 (Homebrew GCC 7.2.0) 7.2.0) like this

$ CC="$(brew --prefix)/bin/g++-7" make

and then re-run everything, it works fine.

So (at least on my Mac -- OSX 10.12.6), it looks like the default clang gcc doesn't work but the gnu gcc does work.

VirHostMatcher not producing an outfile

We have compiled VirHostMatcher and run the test. We get an output file when running the test, but when we use our own data it runs and fails to produce an output file. We don't have an annotation file for our host data because they are also metagenomic assemblies. Is there a way to run VirHostMatcher without the host annotation file?

Error while running test example

Hi

Thanks for the tool!

I ran the test command given in your readme file:

But I encounter the following error:

Step 1: counting kmers for virus ABC.fasta
ERROR in counting kmers for ABC.fasta

It would be great if you could help!

ERROR: the format of taxaFile is not correct!

This is a very basic question, but it seems that VirHostMatcher won't recognize my HostTaxa file...

I keep getting the "ERROR: the format of taxaFile is not correct!" message and "ERROR: number of hosts in taxa file is not equal to number of host fasta files"
I've checked several times and the number of lines match the number of files. I made sure to get the column name with the right capitals (see attached) and I tried changing between tab and space as separator and nothing works...

What is exactly the format needed?

HostTaxa4.txt

Hello, I want to know how to split my fasta file into one fasta file per taxon.

My combined virus fasta file looks like the following:

k141_110471||full
ATCTTTCTACTAAATATACATTTGATACCTTCTTTCATTTTAGGCAAATATAATTTGCTTTTATCATAAGAGAAATATACACCCTGATAAGTTTGATAAGATTACTAACATAGTATTCACCAGCGGGATTACGACTAATGGTACAAGTCTTGATTTTTCCCTCA
k141_50420||full
ATGCTAAAGTTAGATGTTAGTAATATAAGTCCCCGTAGTGACCGTGAGGCTACTGCGGGGACGCTTTTTCAATTTATAGTATATAGAAACAAAAAGGGCCGGTCTGCGAGGACG
k141_121756||full
GCCTTTGTGCTTCTTTGTTTGCCATTTGCTATCTTGCCTTTCTACTGCCTTGTCATCATATTTCACCATATTGATTTCTTGCATAGCCTCATTCGCAAAATCACGGAGCACAAAAATGGACGGTTGTCGTACACCCTGAATATGTGAGCACTACCAATGTAAACATTGATGTGAACAATATAAAGGTATGGGGTGCATTGGAGTTGCTCAATTCAAAATTTGGTGCGAACTTTGTTATTCGTGGCCGAACAATAACAATCGGTACTGCCGGTATTGCTGTGGGCAATATTTTCAAGTATGGACGTGGAAACGGTTTGTACGAAATTCAACGGCAGGCAGCCGCTGACTATTCACCTCAAGGAAAACTCACAGGCTCTTGGCGACGTGGTAGTAACAGC
k141_75580||full
AGACAGATGCAGAGGTGTTGACCGTTCGTTGCTGGAGTTTCTGAAGCCGAAAGCAGAGGAACGGTTAGCCTTATCAAAGGAAGCCTTAAATACATAGGGAATGCCCAGCTTGTCGCAGATTGCTTTTGTACGCTGACCAATCATCAGGCTGCGCTCATAGCCTTCGAGCACGCAGGGACCGGCCATCAGCATCAGGGGATGGCCCTGACCTACTTCATAATTACCAACTTTAACGATTTGCATAAGTATCCTCCAGCTTATTTATTCATTTTAGCAAACAGCTCGTTGACTTTG

So I don't know how to split them into the format needed.
Thank you.
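A minimal sketch of one way to do this in Python, offered as an illustration rather than a feature of VirHostMatcher. It writes every record of a multi-FASTA file to its own file so that the resulting folder can be passed as the virus or host folder. The function name is hypothetical, and it assumes the real file uses standard FASTA header lines starting with ">" (the excerpt above shows the sequence names without that marker).

```python
import os
import re


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file in out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # A new header starts a new record: close the previous file.
                if out:
                    out.close()
                words = line[1:].split()
                seq_id = words[0] if words else "unnamed"
                # Replace characters that are awkward in file names, e.g. "|".
                safe = re.sub(r"[^A-Za-z0-9._-]", "_", seq_id)
                path = os.path.join(out_dir, safe + ".fasta")
                out = open(path, "w")
                written.append(path)
            if out:
                out.write(line)
    if out:
        out.close()
    return written
```

With a header such as `>k141_110471||full`, the record would be written to `k141_110471__full.fasta`. Two headers that map to the same sanitized name would overwrite each other, so sequence names should be unique after sanitizing.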

Questions about threshold selection

Hi, jessieren!
We are using VirHostMatcher to analyze viruses and hosts in metagenomic data, but we do not know how to select the optimal threshold to analyze d2star_k6.csv, could you please give us some suggestions?

about result

Dear author!

After running VirHostMatcher, it generated d2star_k6.csv (802M) and d2star_k6_main.html (806M). When the html file is opened in a browser, there is no useful display; it just shows the consensus information. The relevant files are as follows:
vhm.log
