namphuon / vifi

Pipeline for identifying viral integration and fusion mRNA reads from NGS data. Manuscript is currently in preparation.

License: GNU General Public License v3.0

Python 65.89% Perl 29.36% Shell 4.75%

vifi's Introduction

UPDATE

Please switch over to FastViFi (https://github.com/sara-javadzadeh/FastViFi). ViFi is no longer under active development and has been discontinued.

ViFi

ViFi is a tool for detecting viral integration and fusion mRNA sequences from Next Generation Sequencing data. Unlike standard approaches that use reference-based read mapping for identification of viral reads, ViFi uses both reference-based read mapping and a phylogenetic-based approach to identify viral reads. ViFi also incorporates mappability scores of the reads to filter out false positive integration detection. The end result is a tool that can accurately and precisely detect integrated viruses, even if the viruses are highly mutated or novel strains.

ViFi is currently in alpha testing and is constantly undergoing revisions. High on the priority list are an easier installation process and an improved user interface. Please report any problems/bugs to Nam Nguyen ([email protected]) so that ViFi can be improved and problems can be quickly corrected.

UPDATE

Due to major incompatibilities between versions of Pysam, Samtools, and Python, as well as software compatibility issues across platforms, we highly recommend that users discontinue use of the Python version of ViFi and instead use the Dockerized version. The Dockerized version is platform independent and only requires Python (either version 2.7 or 3.0) and Docker to be installed; no other software package is needed. We outline below how to set up, install, and run the Dockerized version.

In addition, we include a tutorial covering the different options within ViFi below. We also include instructions on how to run ViFi from the source code, but again, we strongly discourage this usage.

Installation of ViFi for use in Docker

We provide instructions for preparing ViFi for use with Docker below. If Perl is installed, the setup_linux_mac.sh script can be run to perform steps 3-7 automatically. Note that ViFi requires a large amount of disk space to set up and run (10 GB) due to the large size of the initial reference repositories.

Install Dependencies:
Python (2.7 or 3.0; instructions for 2.7 are shown)
Docker (https://docs.docker.com/install/)
Download and run setup_linux_mac.sh (if Perl is installed and you are on a Mac/Linux system). Running this script will automatically download ViFi from GitHub, download the data repositories from Google Drive, pull the latest ViFi Docker image, set all the environment variables for ViFi, build the BWA index for hg19+HPV via Docker, and perform a test run of ViFi via Docker. The full set of tests can take up to an hour to complete. Make sure you have at least 10 GB of free space for the process to complete.

wget https://raw.githubusercontent.com/namphuon/ViFi/master/setup_linux_mac.sh
sh setup_linux_mac.sh

Steps 3-7 are only necessary if Perl is not installed on the machine or if you are on a Windows machine. If Perl is installed, setup_linux_mac.sh can be run to set up ViFi automatically (see Step 2).

Clone the ViFi repository

git clone https://github.com/namphuon/ViFi.git

Set the ViFi directory and add the Python source to your Python path

echo export VIFI_DIR=/path/to/ViFi >> ~/.bashrc
echo export PYTHONPATH=/path/to/ViFi:/path/to/ViFi/src:$PYTHONPATH >> ~/.bashrc

Download the data repositories: While we include some annotations, we are unable to host some large files in the git repository. These may be downloaded from https://drive.google.com/open?id=0ByYcg0axX7udUDRxcTdZZkg0X1k. Thanks to Peter Ulz for noticing an incorrect link earlier.

tar zxf data_repo.tar.gz
echo export AA_DATA_REPO=$PWD/data_repo >> ~/.bashrc
source ~/.bashrc

Download the HMM models: We have pre-built HMM models for HPV and HBV. They can be downloaded from https://drive.google.com/open?id=0Bzp6XgpBhhghSTNMd3RWS2VsVXM.

unzip data.zip
echo export REFERENCE_REPO=$PWD/data >> ~/.bashrc

Build a BWA index on the combined human+viral reference sequences: We show an example of building indexes of human+viral sequences using hg19 with HPV and with HBV below. However, any reference organism+viral family could be used.

cat $AA_DATA_REPO//hg19/hg19full.fa $REFERENCE_REPO/hpv/hpv.unaligned.fas > $REFERENCE_REPO/hpv/hg19_hpv.fas
bwa index $REFERENCE_REPO/hpv/hg19_hpv.fas
cat $AA_DATA_REPO//hg19/hg19full.fa $REFERENCE_REPO/hbv/hbv.unaligned.fas > $REFERENCE_REPO/hbv/hg19_hbv.fas
bwa index $REFERENCE_REPO/hbv/hg19_hbv.fas
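The concatenation step above can also be sketched in Python, which may be convenient on systems without cat. This is only an illustration, not part of ViFi: the function names concat_fasta and bwa_index_command are hypothetical, and unlike plain cat the sketch inserts a newline between files when one is missing, so that the first viral record does not get glued to the last host sequence line.

```python
import shutil
from pathlib import Path


def concat_fasta(inputs, output):
    """Concatenate FASTA files in order, similar to `cat a.fa b.fa > out.fas`.

    Hypothetical helper (not part of ViFi). A newline is added after any
    input that does not already end with one.
    """
    output = Path(output)
    with output.open("wb") as out:
        for path in inputs:
            with Path(path).open("rb") as src:
                shutil.copyfileobj(src, out)
                if src.tell() > 0:
                    # Look at the last byte of the file we just copied.
                    src.seek(-1, 2)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
    return output


def bwa_index_command(fasta):
    """Return the argv for `bwa index <fasta>`; nothing is executed here."""
    return ["bwa", "index", str(fasta)]
```

The command list can then be handed to subprocess.run(..., check=True) on a machine where bwa is installed.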

Running ViFi using Docker (RECOMMENDED)

We have also created a Dockerized version of ViFi to make it easier to run (see the previous section for installation and setup). To get the latest version of the Dockerized ViFi, run:

docker pull namphuon/vifi

To run the dockerized version of ViFi, first create the data repositories as above, including setting the environmental variables. Next, run the following command:

python $VIFI_DIR/scripts/run_vifi.py -f <READ1> -r <READ2> --docker

where <READ1> and <READ2> are the FASTQ files (gzipped or unzipped). Note that the $VIFI_DIR, $AA_DATA_REPO and $REFERENCE_REPO variables must be set in order for the script to find the necessary files.
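Since a missing variable only shows up as an error once the script starts, it can help to check the three variables first. The following is a minimal sketch, not part of ViFi; the helper name missing_vifi_env is hypothetical.

```python
import os

# The three variables named in the instructions above.
REQUIRED_VARS = ("VIFI_DIR", "AA_DATA_REPO", "REFERENCE_REPO")


def missing_vifi_env(env=None):
    """Return the required variables that are unset or empty.

    Hypothetical helper; `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_vifi_env()
if missing:
    print("Unset variables: " + ", ".join(missing))
else:
    print("All ViFi environment variables are set.")
```

Run this in the same shell (and as the same user) that will launch run_vifi.py, since that is the environment the script will see.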

Example (assuming that $VIFI_DIR is set):

python $VIFI_DIR/scripts/run_vifi.py -f $VIFI_DIR/test/data/test_R1.fq.gz -r $VIFI_DIR/test/data/test_R2.fq.gz  --docker

Note that because BAM files can be large (typically 100 GB for 30x coverage) and the hg19+HPV reference genome is large (more than 3 GB), Docker needs a lot of memory to run, since BWA itself is memory intensive. If running Docker on Mac or Windows, you may have to allocate more memory to Docker before running ViFi.

ViFi Output

The output of ViFi is the list of read clusters discovered. For each read cluster, the relaxed, stringent, and exact (if split reads are present) ranges are reported, as well as the names of the reads in the cluster.

The main output files of interest are

<prefix>.clusters.txt
<prefix>.clusters.txt.range

<prefix>.clusters.txt is a tab delimited file that reports the human integration range, the number of reads supporting the integration, and the number of reads mapped to the forward/reverse strand of the human region, as well as the number of viral reads mapping to the virus sequence. It also includes the names of each discordant read supporting the integration.

Below is a sample:

#chr    minpos  maxpos  #reads  #forward        #reverse
##================================================================
chr19   36212224        36212932        7       4       3
##ERR093797.9977893     chr19   36212224        True    False
##ERR093797.7073606     chr19   36212403        True    True

The first line is the header information. Afterward, each integration cluster is separated by a line containing =. The first line of an integration cluster describes the following:

Reference chromosome (chr19)
Minimum reference position of all mapped reads belonging to that cluster (36212224)
Maximum reference position of all mapped reads belonging to that cluster (36212932)
Number of read pairs belonging to this cluster (7)
Number of reads mapped to the forward reference strand (4)
Number of reads mapped to the reverse reference strand (3)

After this line, each read pair that mapped to this cluster is listed. The information shown is:

Read name (ERR093797.9977893)
Reference chromosome (chr19)
Starting read map location (36212224)
Read is on the reverse strand (True)
Read is read1 (False)
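The layout described above can be read with a few lines of Python. This is a minimal sketch based only on the sample shown here, not ViFi's own code: the names Read, Cluster, and parse_clusters are hypothetical, fields are split on any whitespace, and any columns beyond the six described are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Read:
    name: str
    chrom: str
    pos: int
    is_reverse: bool
    is_read1: bool


@dataclass
class Cluster:
    chrom: str
    min_pos: int
    max_pos: int
    n_reads: int
    n_forward: int
    n_reverse: int
    reads: List[Read] = field(default_factory=list)


def parse_clusters(lines):
    """Parse the lines of a <prefix>.clusters.txt file into Cluster objects."""
    clusters = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("##"):
            body = line[2:]
            if body.startswith("="):
                continue  # separator line between clusters
            # Read line: name, chromosome, position, reverse flag, read1 flag.
            name, chrom, pos, rev, first = body.split()[:5]
            if clusters:
                clusters[-1].reads.append(
                    Read(name, chrom, int(pos), rev == "True", first == "True")
                )
        elif line.startswith("#"):
            continue  # column header
        else:
            # Cluster summary line: chromosome followed by five integers.
            f = line.split()
            clusters.append(Cluster(f[0], *map(int, f[1:6])))
    return clusters


SAMPLE = """#chr\tminpos\tmaxpos\t#reads\t#forward\t#reverse
##================================================================
chr19\t36212224\t36212932\t7\t4\t3
##ERR093797.9977893\tchr19\t36212224\tTrue\tFalse
##ERR093797.7073606\tchr19\t36212403\tTrue\tTrue
"""

for cluster in parse_clusters(SAMPLE.splitlines()):
    print(cluster.chrom, cluster.min_pos, cluster.max_pos, len(cluster.reads))
```

For a real run, pass an open file handle instead of SAMPLE.splitlines().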

<prefix>.clusters.txt.range is a much more condensed summary of the results. It shows just the integration range on the human reference (based upon discordant reads) and, if split reads are available, an estimate of the exact integration point.

Below is a sample:

Chr,Min,Max,Split1,Split2
chr19,36212564,36212564,-1,-1

The first line is header information. Afterward, each line is information about the cluster. For example,

Reference chromosome (chr19)
Minimum reference position of all mapped reads belonging to that cluster (36212564)
Maximum reference position of all mapped reads belonging to that cluster (36212564)
If split read exists, minimum split read mapped range, -1 if no split read exists (-1)
If split read exists, maximum split read mapped range, -1 if no split read exists (-1)
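Because the range file is plain CSV with a header row, it can be loaded with Python's standard csv module. The sketch below is an illustration only; the function name parse_ranges and the choice to turn -1 into None are assumptions made here, not ViFi behavior.

```python
import csv
import io


def parse_ranges(text):
    """Parse the contents of a <prefix>.clusters.txt.range file.

    Hypothetical helper: returns one dict per cluster, with `split` set to
    None when either split column holds the -1 sentinel.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        split1, split2 = int(row["Split1"]), int(row["Split2"])
        has_split = split1 != -1 and split2 != -1
        rows.append({
            "chrom": row["Chr"],
            "min": int(row["Min"]),
            "max": int(row["Max"]),
            "split": (split1, split2) if has_split else None,
        })
    return rows


SAMPLE_RANGE = "Chr,Min,Max,Split1,Split2\nchr19,36212564,36212564,-1,-1\n"
print(parse_ranges(SAMPLE_RANGE))
```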

Finally, ViFi outputs several working files that can be deleted after a run. These are:

1. hmms.txt - The list of HMM files used during the run
2. <prefix>.bam - The aligned (name-sorted order) BAM file containing the input reads
3. <prefix>.unknown.bam - A BAM file containing all paired reads in which one or both reads did not align to any known reference. ViFi will then search these reads against the HMMs to identify any viral reads.
4. <prefix>.viral.bam - A BAM file containing all paired reads that only aligned to viral references
5. <prefix>.viral.cs.bam - A coordinate-sorted BAM file containing all paired reads that only aligned to viral references
6. <prefix>.trans.bam - A BAM file containing all paired reads in which one read aligned to the human reference and the other aligned to the viral reference
7. <prefix>.fixed.trans.bam - A BAM file created by merging 6 and any human/viral paired-end reads discovered by running the viral HMMs on 3
8. <prefix>.fixed.trans.cs.bam - A coordinate-sorted BAM file of 7

References

Nguyen ND, Deshpande V, Luebeck J, Mischel PS, Bafna V (2018) ViFi: accurate detection of viral integration and mRNA fusion reveals indiscriminate and unregulated transcription in proximal genomic regions in cervical cancer. Nucleic Acids Res (April):1–17.

Advanced Notes

Building evolutionary models

ViFi can be run with and without evolutionary models (i.e., the HMMs). We outline the steps in building the HMMs below. However, we also include a Docker pipeline that will automatically build the HMMs for the users to use. The pipeline only requires docker to be installed for use.

Using Docker pipeline to build HMMs for use in ViFi

The following command will create HMMs from a set of unaligned sequences. The sequences are assumed to share a common viral ancestor (i.e., don't mix viral families together when running the pipeline).

bash $VIFI_DIR/scripts/build_references.sh <INPUT_SEQ> <OUTPUT_DIR> <PREFIX>

The output in the OUTPUT_DIR folder will be a set of HMMs (files with the suffix .hmmbuild) and a file containing the list of HMMs.
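If the HMM files are later moved or a subset is wanted, the list file can be regenerated. The sketch below is not part of ViFi: it assumes the list is a plain text file with one .hmmbuild path per line, and the function name write_hmm_list and the default file name hmms.txt are choices made for this example.

```python
from pathlib import Path


def write_hmm_list(hmm_dir, list_name="hmms.txt"):
    """Write the absolute paths of all *.hmmbuild files in hmm_dir, one per line.

    Hypothetical helper; the one-path-per-line layout is an assumption.
    """
    hmm_dir = Path(hmm_dir)
    hmms = sorted(p.resolve() for p in hmm_dir.glob("*.hmmbuild"))
    list_path = hmm_dir / list_name
    list_path.write_text("".join(f"{p}\n" for p in hmms))
    return list_path
```

Calling write_hmm_list on the OUTPUT_DIR used above returns the path of the list file it wrote.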

Using customized reference

If you want to use a customized reference or a reference for a different organism, you can inform ViFi of the reference sequences by supplying a chromosome file to ViFi using the --chromosome_list option. The file format is a single line that has the sequence names delimited by spaces. For example:

mouse_chr1 mouse_chr2

would inform ViFi that any sequences found in the BAM file that do not match mouse_chr1 or mouse_chr2 should be considered viral sequences.
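The rule can be illustrated with a short Python sketch. It only mirrors the description above and is not ViFi's implementation; the function names and the viral sequence name HPV16 are hypothetical.

```python
def read_chromosome_list(text):
    """Parse the contents of a chromosome list file (space-delimited names)."""
    return set(text.split())


def split_references(reference_names, host_names):
    """Split reference names into (host, viral) following the rule above.

    Any name not present in the chromosome list is treated as viral.
    """
    host = [name for name in reference_names if name in host_names]
    viral = [name for name in reference_names if name not in host_names]
    return host, viral


host_names = read_chromosome_list("mouse_chr1 mouse_chr2\n")
# "HPV16" stands in for whatever viral sequence names the BAM header contains.
print(split_references(["mouse_chr1", "mouse_chr2", "HPV16"], host_names))
```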

Installation from source code (Deprecated):

We provide instructions for installing ViFi on Linux below.

ViFi download (if you have not already cloned this source code):

git clone https://github.com/namphuon/ViFi.git

Install Dependencies:
1. Python 2.7
```
sudo dnf install python2
```
1. Pysam version 0.9.0 or higher (https://github.com/pysam-developers/pysam):
```
sudo pip install pysam
```
1. Samtools 1.3.1 or higher (www.htslib.org/)
```
sudo apt-get install samtools
```
1. BWA 0.7.15 or higher (bio-bwa.sourceforge.net/)
```
sudo apt-get install bwa
```
1. Install HMMER v3.1b2 and have it on the path (http://hmmer.org/)
```
sudo apt-get install hmmer
```
Set the ViFi directory and add the Python source to your Python path

echo export VIFI_DIR=/path/to/ViFi >> ~/.bashrc
echo export PYTHONPATH=/path/to/ViFi:/path/to/ViFi/src:$PYTHONPATH >> ~/.bashrc

Download the data repositories: While we include some annotations, we are unable to host some large files in the git repository. These may be downloaded from https://drive.google.com/open?id=0ByYcg0axX7udUDRxcTdZZkg0X1k. Thanks to Peter Ulz for noticing an incorrect link earlier.

tar zxf data_repo.tar.gz
echo export AA_DATA_REPO=$PWD/data_repo >> ~/.bashrc
source ~/.bashrc

Download the HMM models: We have pre-built HMM models for HPV and HBV. They can be downloaded from https://drive.google.com/open?id=0Bzp6XgpBhhghSTNMd3RWS2VsVXM.

unzip data.zip
echo export REFERENCE_REPO=$PWD/data >> ~/.bashrc

Build a BWA index on the combined human+viral reference sequences: We show an example of building indexes of human+viral sequences using hg19 with HPV and with HBV below. However, any reference organism+viral family could be used.

cat $AA_DATA_REPO//hg19/hg19full.fa $REFERENCE_REPO/hpv/hpv.unaligned.fas > $REFERENCE_REPO/hpv/hg19_hpv.fas
bwa index $REFERENCE_REPO/hpv/hg19_hpv.fas

cat $AA_DATA_REPO//hg19/hg19full.fa $REFERENCE_REPO/hbv/hbv.unaligned.fas > $REFERENCE_REPO/hbv/hg19_hbv.fas
bwa index $REFERENCE_REPO/hbv/hg19_hbv.fas

Running ViFi (Deprecated)

We show the most basic example of running ViFi below. This example assumes that the user has followed all the previous steps. More advanced options, such as using a customized reference organism or viral family, are provided in the Advanced Notes section.

python run_vifi.py -f <input_R1.fq.gz> -r <input_R2.fq.gz> -o <output_dir>

Note that this version defaults to searching for HPV. To search for HBV, run the following command.

python run_vifi.py -f <input_R1.fq.gz> -r <input_R2.fq.gz> -o <output_dir> -v hbv
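When running many samples, it may be easier to assemble these command lines programmatically. The sketch below is only an illustration: the function name vifi_command is hypothetical, and it uses only the options shown in this README (-f, -r, -o, -v, --docker).

```python
def vifi_command(read1, read2, output_dir=None, virus=None, docker=False,
                 script="run_vifi.py"):
    """Build the argv for a run_vifi.py call; nothing is executed here.

    Hypothetical helper. Omitting `virus` leaves ViFi on its default (HPV).
    """
    cmd = ["python", script, "-f", read1, "-r", read2]
    if output_dir:
        cmd += ["-o", output_dir]
    if virus:
        cmd += ["-v", virus]
    if docker:
        cmd.append("--docker")
    return cmd


print(" ".join(vifi_command("input_R1.fq.gz", "input_R2.fq.gz", "out", virus="hbv")))
```

The returned list can be passed to subprocess.run(..., check=True) once the environment is set up as described earlier.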

vifi's Issues

EBV support

Dear ViFi team,
Do you plan to support detection of EBV? If not, could you please direct me on how to build HMM models for EBV?

Error; hg19util

Hello,
I am trying to run ViFi, but every time I get this error.

Traceback (most recent call last):
File "run_vifi.py", line 4, in <module>
import hg19util as hg19
ImportError: No module named hg19util

Kindly help me with this error.

Thanks
AKS

Problems with understanding output.clusters.txt and *.range files

Hi!

I have launched vifi on a TCGA patient and now I have a couple of questions. Shouldn't the min and max columns of the output.clusters.txt and output.clusters.txt.range be the same? According to the docs for output.clusters.txt:

The first line is the header information. Afterward, each integration cluster is separated by a line containing =. The first line of an integration cluster describes the following:

Reference chromosome (chr19)

Minimum reference position of all mapped reads belonging to that cluster (36212224)

Maximum reference positions of all mapped reads belonging to that cluster (36212932)

And for output.clusters.txt.range:

The first line is header information. Afterward, each line is information about the cluster. For example,
Reference chromosome (chr19)
Minimum reference position of all mapped reads belonging to that cluster (36212224)
Maximum reference positions of all mapped reads belonging to that cluster (36212932)

However, the min, max columns of output.clusters.txt and output.clusters.txt.range in the test sample provided on github do not correspond to each other. Similarly, on my sample vifi output was the following:
output.clusters.txt (grepped only header lines):

chr9 102677385 102677660 54 53 1
chr9 102691056 102691123 6 5 1
chr9 102713338 102713581 54 52 2
chr9 102714438 102714529 80 2 78
chr9 102716962 102717461 1091 1087 4
chr9 102719829 102720390 265 265 0
chr9 103083283 103084075 886 1 885
chr9 103088570 103088778 157 0 157
chr9 103090067 103090202 13 0 13

And output.clusters.txt.range:

Chr,Min,Max,Split1,Split2
chr9,102677612,102677643,-1,-1
chr9,102691059,102691122,-1,-1
chr9,102713404,102713566,-1,-1
chr9,102714438,102714472,-1,-1
chr9,102716962,102717446,-1,-1
chr9,102720073,102720373,-1,-1
chr9,103083283,103083410,-1,-1
chr9,103088570,103088870,-1,-1
chr9,103090067,103090367,-1,-1

Also it seems that there were no split reads found in the TCGA sample despite having many reads in integration clusters - could you suggest what might have been the reason for that?

PS: I created a custom Docker image from namphuon/vifi and launched ViFi through Python inside the container without the --docker flag, so in principle it should work OK this way, shouldn't it?

Thanks in advance,
Sergei

hg19util module not found

Hi,
When trying to run for the first time after installing all dependencies in the README, I got an error that the hg19util module could not be found. The only reference to a python module with this name on google is one in AmpliconArchitect. You reference AmpliconArchitect in the ViFi paper but it is not listed as a dependency. Is this a dependency that has gone unlisted or does the hg19util module come from somewhere else?

Problems with --disable_hmms

Hi,

I have encountered problems running ViFi on EBV genome with --disable_hmms. The possible bug leads to 0 clusters in output.clusters.txt and output.clusters.txt.range for several samples clearly having traces of EBV integration. The outputs.trans.bam files contain hundreds of reads though, so that ViFi seems to have successfully identified the integrations.

This is a possible duplicate of issue #4 but the issue doesn't seem to have been answered explicitly. The traceback generated indicates that there's script merge_viral_reads.py that is being run regardless of --disable_hmms option, nevertheless it seems to require the file tmp/temp/reduced.csv that can only be generated in run_hmms.py.

The actual command and the lines of traceback:
python ${VIFI_DIR}/scripts/run_vifi.py -f ${FQ1} -r ${FQ2} -o ${vifi_output_dir} -v ebv --cpus 8 --disable_hmms 1

4017.630011 45100000 reads done: #(Trans reads) = 995 38 D7ZQJ5M1:683:C4BGFACXX:6:2315:14661:71501 D7ZQJ5M1:683:C4BGFACXX:6:2315:14367:71714
4026.487438 45200000 reads done: #(Trans reads) = 998 38 D7ZQJ5M1:683:C4BGFACXX:6:2316:9952:35091 D7ZQJ5M1:683:C4BGFACXX:6:2316:9757:35047
Traceback (most recent call last):
File "/home/scripts/get_trans_new.py", line 238, in
miscFile.write(b)
AttributeError: 'NoneType' object has no attribute 'write'
[Finished identifying chimeric reads]: 6156.258875
[Cluster and identify integration points]: 6156.258919
scores = read_scores_file(args.reducedName[0])
Traceback (most recent call last):
File "/home/scripts/merge_viral_reads.py", line 128, in
IOError: [Errno 2] No such file or directory: 'tmp/temp/reduced.csv'
File "/home/scripts/merge_viral_reads.py", line 21, in read_scores_file
input = open(hmm_file, 'r')
0
[Finished cluster and identify integration points]: 6158.720271

Thank you in advance,
Sergei

AlignmentHeader does not support item assignment

Hello !!,
I am getting this error while running vifi.
Any suggestion regarding this error will be really helpful.

Traceback (most recent call last):
File "/home/anurag/tools/ViFi/scripts/merge_viral_reads.py", line 118, in <module>
outputFile.header['SQ'] = references
File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment (use header.to_dict()

I am using pysam version 0.14.1 and samtools version 1.7

Thank you,
AKS

No chimeric results generated from the test data

Hello, after running the test data with the default settings, I got no results in the .clusters.txt and .clusters.txt.range files. I also got some warnings during the run: "File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__ TypeError: AlignmentHeader does not support item assignment". Also, is it normal to see "[M::bwa_idx_load_from_disk] read 0 ALT contigs"?

When I tested other samples, the results were similar to this. And I got another warning:
"File ".../ViFi/scripts/get_trans_new.py", line 238, in <module>
miscFile.write(b)
AttributeError: 'NoneType' object has no attribute 'write'".

I am new to Python and currently have no idea how to fix these. Could you help with this? Thank you in advance.

Below is the run log for the test data:
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 39718 sequences (4964750 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (0, 986, 34, 0)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (251, 281, 307)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (139, 419)
[M::mem_pestat] mean and std.dev: (279.25, 41.03)
[M::mem_pestat] low and high boundaries for proper pairs: (83, 475)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (7613, 7658, 7687)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (7465, 7835)
[M::mem_pestat] mean and std.dev: (7650.35, 41.77)
[M::mem_pestat] low and high boundaries for proper pairs: (7391, 7909)
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation RF
[M::mem_process_seqs] Processed 39718 reads in 19.878 CPU sec, 19.911 real sec
[main] Version: 0.7.12-r1039
[main] CMD: bwa mem -t 1 -M .../ViFi/data//hpv/hg19_hpv.fas .../ViFi/test/data/test_R1.fq.gz .../ViFi/test/data/test_R2.fq.gz
[main] Real time: 56.086 sec; CPU: 28.238 sec
19859 17365 13
Prepared sequences for searching against HMMs: 0.120441s
Running HMMs
Running HMM .......

Finished running against HMMs: 6113.827747s
Processing results

Traceback (most recent call last):
File ".../ViFi/scripts/merge_viral_reads.py", line 118, in <module>
outputFile.header['SQ'] = references
File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment (use header.to_dict()
0
[Running BWA]: 0.032352
[Finished BWA]: 56.216900
[Identifying chimeric reads]: 56.231233
[Finished identifying chimeric reads]: 61.031476
[Running HMMS]: 61.031561
[Finished running HMMS]: 6177.171714
[Cluster and identify integration points]: 6177.172175
[Finished cluster and identify integration points]: 6184.909763

I got an error: File "/data/program/ViFi/ViFi/scripts/run_vifi.py", line 118, in <module> reference_dir = os.environ['REFERENCE_REPO'] File "/usr/lib/python2.7/UserDict.py", line 40, in __getitem__ raise KeyError(key) KeyError: 'REFERENCE_REPO'

Hi
I installed Docker and executed setup_linux_mac.sh.
my cmd: sudo python $VIFI_DIR/scripts/run_vifi.py --cpus 2 --hmm_list $VIFI_DIR/data/hbv/hmms/hmms.txt -f $VIFI_DIR/test/data/test_R1.fq.gz -r $VIFI_DIR/test/data/test_R2.fq.gz -o $VIFI_DIR/tmp/docker/ --docke
but, I got the following error
File "/data/program/ViFi/ViFi/scripts/run_vifi.py", line 118, in <module>
reference_dir = os.environ['REFERENCE_REPO']
File "/usr/lib/python2.7/UserDict.py", line 40, in __getitem__
raise KeyError(key)
KeyError: 'REFERENCE_REPO'

I confirmed my environmental variables.

breakpoint position left shift and large mem required

Hi,
We ran ViFi on HCC RNA-seq data downloaded from PRJNA337887 (testing only one sample).
After manually comparing the ViFi results with supplementary table 3 of the original paper, the integration sites reported by ViFi are all shifted 2 bp to the left of the breakpoints, both those reported in the original paper and those found by manually checking the mapped BAM file with samtools tview.
The ViFi reported:

The original paper reported:

The samtools tview on one of the integrated site:

In addition, the cluster_trans_new.py step consumed about 60 GB of memory on a small input BAM file. Is this normal?

Errors when running with other references

I would like to run ViFi with a different reference than the one provided (hg19). I concatenated the reference together with the viral sequence, and indexed it. I provided this index, as well as a list of chromosomes in the reference, to ViFi. However, it seems like there are some other files required - a bed file with mappability scores, gff files with genes, etc, which aren't currently documented. These are listed for hg19 in data_repo/hg19/file_list.txt:

fa_file 		                hg19full.fa
chrLen_file 		            hg19full.fa.fai
duke35_filename 		        wgEncodeDukeMapabilityUniqueness35bp_sorted.bedGraph
mapability_exclude_filename     wgMapabilityExcludable.bed
gene_filename 		            human_hg19_september_2011/Genes_July_2010_hg19.gff
exon_file 		                human_hg19_september_2011/Exon-Intron_July_2010_hg19.gff
oncogene_filename 		        cancer/oncogenes/Census_oncomerge.gff
centromere_filename 		    hg19_centromere.bed
conserved_regions_filename 		conserved.bed #conserved.gain5.bed (readdepth >2 samples in turner controls KT51-59) + lumpy XYM
segdup_filename 		        annotations/hg19GenomicSuperDup.tab

The fasta file and index are easy enough, so I made a file with this information for my reference and tried to run ViFi (without HMMs), but I ran into these warnings and errors:

WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.048	 interval_list: Unable to open interval file "/home/data_repo/test_human/".
WARNING:root:#TIME 0.050	 rep_content: Unable to open mapability file "/home/data_repo/test_human/".
Traceback (most recent call last):
  File "/home/scripts/cluster_trans_new.py", line 161, in <module>
    if hg.interval(a, bamfile=bamFile).rep_content() <= 3 and a.mapq >= 10:
  File "/home/scripts/hg19util.py", line 384, in rep_content
    m = interval(duke35[p])
  File "/home/scripts/hg19util.py", line 172, in __init__
    self.load_line(line, file_format)
  File "/home/scripts/hg19util.py", line 182, in load_line
    if len(line.strip().split()) == 1:
AttributeError: 'list' object has no attribute 'strip'

Seems like the issue is something to do with the missing mappability scores, which aren't available for my reference genome. Are these required for running ViFi, and are there any options for running with a reference for which they aren't available?

reduced.csv not being created

Hello,
When running my cases I am continually getting an empty table for output. I am faced with the following error:

"Traceback (most recent call last):
File "/home/dnygard/ViFi/scripts/merge_viral_reads.py", line 116, in <module>
scores = read_scores_file(args.reducedName[0])
File "/home/dnygard/ViFi/scripts/merge_viral_reads.py", line 19, in read_scores_file
input = open(hmm_file, 'r')
IOError: [Errno 2] No such file or directory: 'tmp/temp/reduced.csv'
0"

I am wondering in what step is tmp/temp/reduced.csv produced so I might be able to trace the source of this error. If you have any suggestions they would be much appreciated. Thank you.

can convert bam version speed up

thanks a lot

Running ViFi with Custom Reference Files

Hello,

I would like to use ViFi with a specific reference file that I have. Therefore, I'm referring to the following content:
`#Set up reference for alignment
HUMAN_REF="GRCh38"
HUMAN_REF_FILE_NAME="hg38full.fa"
for virus in "hpv" "hbv" "hcv"; do
if [ ! -d $REFERENCE_REPO/${virus} ]; then
echo "Reference for virus $virus is not downloaded. Contact the author to get access to the viral references."
else
HUMAN_VIRAL_REF="grch38_${virus}.fas"
echo "Building the ${HUMAN_REF}+${virus} reference"
cat $AA_DATA_REPO//${HUMAN_REF}/${HUMAN_REF_FILE_NAME} $REFERENCE_REPO/${virus}/${virus}.unaligned.fas > $REFERENCE_REPO/${virus}/${HUMAN_VIRAL_REF}
docker run -v $REFERENCE_REPO/${virus}/:/home/${virus}/ docker.io/namphuon/vifi bwa index /home/${virus}/${HUMAN_VIRAL_REF}

    #Build reduced list of HMMs for testing
    echo "Creating the list of hmms for testing in $VIFI_DIR"
    ls $VIFI_DIR/viral_data/${virus}/hmms/*.hmmbuild > $VIFI_DIR/viral_data/${virus}/hmms/hmms.txt
    ls $VIFI_DIR/viral_data/${virus}/hmms/*.[0-9].hmmbuild > $VIFI_DIR/viral_data/${virus}/hmms/partial_hmms.txt`

running command
docker run -v $REFERENCE_REPO/AB033550/:/home/AB033550/ docker.io/namphuon/vifi bwa index /home/AB033550/hybrid_hg19nAB033550.fas

However, the HMM and TRE files were not generated, so I tried running "ViFi/scripts/build_references.sh". I ran the following command on June 13, but it still hasn't completed.

running command
sh /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/scripts/build_references.sh /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/viral_data/AB033550/hybrid_hg19nAB033550.fas /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/viral_data/AB033550/output hybrid /home/kde/PROJECTS/VirusIntegrationTools/download/ViFi/scripts

If it's appropriate to use "build_references.sh" to run ViFi with my reference files, why is it taking so long? Is there a solution?

Thanks

Inquiry

Hello!
May I ask for help?
I am not sure what the problem is. I cannot find pysam/libcalignmentfile.pyx; I can only find a file called libcalignmentfile.pxd. I used the HMM files and test fa files provided by ViFi, and the HMM step finished running. Could you please tell me how to deal with the following error?

Traceback (most recent call last):
File "/home/brz/wsy/vifi/ViFi/scripts/merge_viral_reads.py", line 118, in <module>
outputFile.header['SQ'] = references
File "pysam/libcalignmentfile.pyx", line 537, in pysam.libcalignmentfile.AlignmentHeader.__setitem__
TypeError: AlignmentHeader does not support item assignment (use header.to_dict()

Usage: samtools sort [options] <in.bam> <out.prefix>

404 not found for Google Drive link

Hi,
I noticed that this URL: https://drive.google.com/open?id=0ByYcg0axX7udUDRxcTdZZkg0X1k no longer appears to be valid (404 not found). Could you please update it?
Thanks for reading.
Best.
Zhang

Would ViFi remove duplicate reads?

Hi,

I would like to know whether ViFi removes duplicate reads before determining the integration sites. If so, in which step does ViFi perform deduplication?

Thanks in advance.

Gina

AttributeError: 'NoneType' object has no attribute 'write'

Hi, Nam,

I always got this warning when running my data with vifi:

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 4 -M ../db/data/hpv/hg19_hpv.fas ../CL100076810_L01_582_clean_1.fq.gz ../CL100076810_L01_582_clean_2.fq.gz
[main] Real time: 613.215 sec; CPU: 2120.682 sec
Traceback (most recent call last):
File "../program/ViFi/scripts/get_trans_new.py", line 238, in <module>
miscFile.write(b)
AttributeError: 'NoneType' object has no attribute 'write'

Do you have any idea why I have this warning and how could I fix it?

Looking forward to your kind reply.

Best Regards,
Zhihua

Incorrect link of Step4

Hi,

I am trying to download the data repository for Step 4, but the link I click is empty. Could you help me fix it?

FYI, the cluster server in our school does not have docker installed and we are not allowed to install it as well. So I think I need to go through Step4?

Thanks,
Wenjin

ViFi For germline

Hi everyone,

I have a question about ViFi. I want to detect germline virus integrations in my samples, and I want to know whether it is possible to use ViFi for this purpose. Also, if I have several files each for fastq_1 and fastq_2, do I have to merge them into one file for _1 and another for _2?

Thanks for your help

Jordi

Failed to open file output.bam

I followed the instructions (installed dependencies, set paths, downloaded data and indexed the reference sequence), and everything went well. Then I ran run_vifi.py using the test data (test_R1.fq.gz and test_R2.fq.gz) and ran into a problem: Failed to open file output.bam.

IOError: [Errno 2] No such file or directory: 'tmp/temp/hmmsearch.0'

Hi
I followed the install instructions (for the source code version, not the Dockerized version, because my server platform is not in the platform list for Docker installation) and everything seemed to go well. Then I ran run_vifi.py using the test data (test_R1.fq.gz and test_R2.fq.gz) but ran into a problem:

I wonder what could be the cause of the problem, thanks

hg19util.py

Hi,
This script has a possible error at line 202. Is this a problematic for loop?
Thanks.

ViFi can be used for non-human genome and virus??

Dear ViFi Team

I have sequence data from a virus-infected fish. I would like to try running ViFi on my data.
I have both the reference genome and the viral sequence.
Is it possible to use ViFi to identify fusion genes between the virus and the fish genome?

Kind Regards
Sri

What is the source of prebuilt hbv hmm model

In the prebuilt HBV hmm model, the sequences are annotated with hbv_ref34, etc. Do you have a detailed annotation of these sequences? Can I download all of the HBV sequence data in NCBI and build the hmm model myself?

Change output to use position of most representative strain in cluster

As multiple different viral strains can exist in the reference database, a cluster of chimeric reads might end up having the viral portion map to multiple different strains if the region is highly similar. To get a better output, we should report the viral position of the most representative strain.

Please version release ViFi

Hi Nam

We were hoping to use ViFi, but it would be useful for organizational purposes if you could version the software.

For example, this would be useful for the Docker image at https://hub.docker.com/r/namphuon/vifi

As opposed to the tag latest, we could use a specific version.

Thanks, Evan
