dhcai21 / rvhaplo
A viral haplotype reconstruction tool for long reads.
License: GNU General Public License v3.0
Hello,
I noticed that every exit in the shell script is an `exit 1`, indicating an error. However, I am running RVHaplo in a pipeline, and even when the program looks like it completes correctly it throws an error, halting the whole pipeline. I am wondering if some of the exit commands should be `exit 0` instead of `exit 1`.
Cheers,
Johnny
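For context, the failure mode described here is easy to reproduce outside RVHaplo: under `set -e` (or in most workflow managers), any non-zero exit status aborts the run, whether or not the step actually finished its work. A minimal standalone sketch (not RVHaplo itself):

```shell
#!/bin/bash
# A step that calls `exit 1` after doing its work is indistinguishable
# from a genuine failure: under `bash -e` the pipeline aborts immediately.
if bash -e -c '
  ( echo "step A finished"; exit 0 )   # exit 0: execution continues
  echo "pipeline continues after A"
  ( echo "step B finished"; exit 1 )   # exit 1: -e aborts the pipeline here
  echo "never reached"
'; then
  echo "pipeline completed"
else
  echo "pipeline aborted with non-zero status"
fi
```

This is why the convention is `exit 0` on every successful path and a non-zero status only for real errors.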
Hello, my name is Andrea and I am a bioinformatician from Spain.
Currently, I am working with SARS-CoV-2 data obtained using Nanopore sequencing, and I am interested in using RVHaplo for haplotype identification. Has it already been validated on SARS-CoV-2 data?
Thanks in advance!
Hi, I was running RVHaplo with my own data, but I ran into the following errors; the log was:
count nucleotide occurrence
SNV detection
Traceback (most recent call last):
File "./src/two_binomial.py", line 44, in <module>
beta = ref_init[1]
IndexError: invalid index to scalar variable
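For what it's worth, that `IndexError` is NumPy's message for indexing into a scalar, which suggests `ref_init` collapsed to a single value where an array was expected. A minimal repro of the error class (the `ref_init` value here is hypothetical, not RVHaplo's actual data flow):

```python
import numpy as np

# A NumPy scalar (not an array) raises this IndexError when indexed;
# upstream parsing producing one value instead of a sequence causes it.
ref_init = np.float64(0.5)
try:
    beta = ref_init[1]
except IndexError as err:
    print(f"IndexError: {err}")
```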
Hi @dhcai21
I am trying to run my ONT reads (aligned to my reference fasta file with minimap2), but I keep getting the same error
error:
..Killed
___ [mcxIO] cannot open file </home/ngs/RVHaplo/resul/haplo_reads_graph.mci> in mode r
___ [mcl] no jive
___ [mcl] failed
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_reads_graph.mci': No such file or directory
___ [mcxIOopen] r stream </home/ngs/RVHaplo/resul/haplo_mcl_result.icl> cannae be opened
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_mcl_result.icl': No such file or directory
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_reads_graph.tab': No such file or directory
hierarchical clustering
Traceback (most recent call last):
File "./src/hierarchical_cluster.py", line 222, in <module>
haplo_cluster = hierarchical_clustering(c_more,seq_mat,snv_base,major_base,weight_cluster,depth)
File "./src/hierarchical_cluster.py", line 156, in hierarchical_clustering
contig_mat = hierarchical_weight(res_cluster,seq_mat,snv_base,major_base,thres,depth)
File "./src/hierarchical_cluster.py", line 148, in hierarchical_weight
mat = contig_weight(seqs,node_se)
File "./src/hierarchical_cluster.py", line 116, in contig_weight
index = np.argsort(nod[:, 0])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
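That error class is reproducible in isolation: `nod[:, 0]` requires a 2-D array, and NumPy raises exactly this message when the array has collapsed to 1-D (e.g. a degenerate cluster). A sketch with hypothetical data, not RVHaplo's actual arrays:

```python
import numpy as np

# 2-D input: column indexing works as intended.
nod_2d = np.array([[3, 0.9], [1, 0.8], [2, 0.7]])
print(np.argsort(nod_2d[:, 0]))     # sorts rows by first column -> [1 2 0]

# 1-D input (e.g. a single-member cluster): the same indexing fails.
nod_1d = np.array([3, 1, 2])
try:
    np.argsort(nod_1d[:, 0])
except IndexError as err:
    print(f"IndexError: {err}")
```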
haplotypes reconstruction
Traceback (most recent call last):
File "./src/out_haplotypes.py", line 47, in <module>
f = open(file_final,'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ngs/RVHaplo/resul/haplo_clusters.pickle'
haplotypes polishment (medaka)
Traceback (most recent call last):
File "./src/extract_reads.py", line 18, in <module>
f = open(f'{file_path}/{file_prefix}_clusters.pickle','rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ngs/RVHaplo/resul/haplo_clusters.pickle'
Traceback (most recent call last):
File "./src/run_medaka.py", line 15, in <module>
f = open(f'{file_path}/{file_prefix}_clusters.pickle','rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ngs/RVHaplo/resul/haplo_clusters.pickle'
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_clusters.pickle': No such file or directory
complete reconstructing haplotypes
The command is given here :
(rvhaplo) ngs@ngs-ThinkCentre-M93p:~/RVHaplo$ ./rvhaplo.sh -i alignedBPC2.sam -r final_consensus_JEV_stock.FST -o '/home/ngs/RVHaplo/resul' -p haplo -t 8
I have attached a snapshot of the same for your review.
Thanks
Hello,
I have noticed that some of my outputs do not create a consensus or haplotype fasta file. I wanted to make sure I understood the issue, so I looked at the rvhaplo.sh script and captured the output. I added some `echo` commands to the script so that I could double-check where things were happening in the code.
From what I can see, it looks like the size variable is being calculated incorrectly using `wc -l` (I had to look this up). Because there is no newline character in the rvhaplo_snv.txt file (the SNV output), `wc` is returning the size to be 0.
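The `wc -l` behavior is easy to confirm: it counts newline characters, so a one-line file with no trailing newline reports 0 lines, whereas `grep -c ''` counts the final line either way. A standalone demonstration (file names hypothetical):

```shell
#!/bin/bash
# `wc -l` counts '\n' characters; a final line without a trailing
# newline is not counted. `grep -c ''` counts every line regardless.
printf '100\tA\n' > with_newline.txt
printf '100\tA'   > without_newline.txt
wc -l < with_newline.txt        # 1
wc -l < without_newline.txt     # 0  <- the under-count described above
grep -c '' without_newline.txt  # 1
rm -f with_newline.txt without_newline.txt
```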
Here is the "modified" rvhaplo.sh file code:
#! /bin/bash
### required arguments
file_sam=""
file_ref=""
### optional arguments
file_path='./result'
prefix="rvhaplo"
mq=0
thread=8
error_rate=0.1
signi_level=0.05
cond_pro=0.65
fre_snv=0.8
num_read_1=10
num_read_2=5
gap=15
smallest_snv=20
only_snv=0
ovlap_read=5
weight_read=0.85
mcl_inflation=2
lar_cluster=50
ovlap_cluster=10
depth=5
weight_cluster=0.8
abundance=0.005
s_pos=1
e_pos=10000000000
sub_graph=1
function help_info() {
echo "Usage: $0 -i alignment.sam -r ref_genome.fasta [options]"
echo ""
echo "RVHaplo: Reconstructing viral haplotypes using long reads"
echo ""
echo "Author: Dehan CAI"
echo "Date: May 2022"
echo "Version 2: Support multi-thread processing; Use a C package of MCL; Cost less memory"
echo ""
echo " -i | --input: alignment file (sam)"
echo "	-r | --ref_genome:	reference genome (fasta)"
echo ""
echo " Options:"
echo " -o | --out: Path where to output the results. (default:./result)"
echo " -p | --prefix STR: Prefix of output file. (default: rvhaplo)"
echo " -t | --thread INT: Number of CPU cores for multiprocessing. (default:8)"
echo " -e | --error_rate FLOAT: Sequencing error rate. (default: 0.1)"
echo "	-mq | --map_qual INT:	Smallest mapping quality for reads. (default:0)"
echo " -s | --signi_level FLOAT: Significance level for binomial tests. (default: 0.05)"
echo " -c | --cond_pro FLOAT: A threshold of the maximum conditional probability for SNV sites. (default: 0.65)"
echo "	-f  | --fre_snv FLOAT:	The most dominant base's frequency at a to-be-verified site should be >= fre_snv. (default: 0.80)"
echo " -n1 | --num_read_1 INT: Minimum # of reads for calculating the conditional probability given one conditional site. (default:10)"
echo " -n2 | --num_read_2 INT: Minimum # of reads for calculating the conditional probability given more than one conditional sites. (default: 5)"
echo " -g | --gap INT: Minimum length of gap between SNV sites for calculating the conditional probability. (default:15)"
echo " -ss | --smallest_snv INT: Minimum # of SNV sites for haplotype construction. (default:20)"
echo " -os | --only_snv (0 or 1) : Only output the SNV sites without running the haplotype reconstruction part. (default: 0)"
echo "	-or | --overlap_read INT:	Minimum length of overlap for creating edges between two reads in the read graph. (default: 5)"
echo " -wr | --weight_read FLOAT: Minimum weights of edges in the read graph. (default:0.85)"
echo " -sg | --sub_graph INT: Number of subgraphs to run MCL (default:1)"
echo " -m | --mcl_inflation FLOAT: Inflation of MCL algorithm. (default:2)"
echo "	-l  | --lar_cluster INT:	A threshold for separating clusters into two groups based on sizes of clusters. (default:50)"
echo " -oc | --overlap_cluster INT: A parameter related to the minimum overlap between consensus sequences. (default:10) "
echo " -d | --depth INT: Depth limitation for consensus sequences generated from clusters. (default:5) "
echo " -wc | --weight_cluster FLOAT: Minimum weights between clusters in the hierarchical clustering. (default: 0.8)"
echo " -sp | --start_pos INT: Starting position for generating consensus sequences (default: 1)"
echo " -ep | --end_pos INT: Ending position for generating consensus sequences. (default: 1e10)"
echo " -a | --abundance FLOAT: A threshold for filtering low-abundance haplotypes. (default: 0.005)"
echo " -h | --help : Print help message."
echo ""
echo " For further details of above arguments, please refer to https://github.com/dhcai21/RVHaplo "
echo ""
exit 0
}
if [[ "$1" == "" ]];then
help_info
exit 1
fi
while [[ "$1" != "" ]]; do
case "$1" in
-h | --help ) ## print help message
help_info
exit 1
;;
-i | --input ) ### input sam file
case "$2" in
"" )
echo "Error: no sam file as input"
exit 1
;;
*)
file_sam="$2"
if [[ "${file_sam:0:1}" == "-" ]]
then
echo "Error: no sam file as input"
exit 1
fi
shift 2
;;
esac
;;
-r | --ref_genome) ### input reference genome
case "$2" in
"")
echo "Error: no fasta file as input"
exit 1
;;
*)
file_ref="$2"
if [[ "${file_ref:0:1}" == "-" ]]
then
echo "Error: no fasta file as input"
exit 1
fi
shift 2
;;
esac
;;
-o | --out ) ### output path
case "$2" in
"" )
echo "Error: no output path"
exit 1
;;
*)
file_path="$2"
if [[ "${file_path:0:1}" == "-" ]]
then
echo "Error: no output path"
exit 1
fi
shift 2
;;
esac
;;
-p | --prefix ) ### prefix
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
prefix="$2"
shift 2
;;
esac
;;
-mq | --map_qual ) ### mapping quality
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
mq="$2"
shift 2
;;
esac
;;
-t | --thread ) ### threads
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
thread="$2"
shift 2
;;
esac
;;
-e | --error_rate ) ### error_rate
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
error_rate="$2"
shift 2
;;
esac
;;
-s | --signi_level ) ### significance_level
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
signi_level="$2"
shift 2
;;
esac
;;
-c | --cond_pro ) ### conditional_probability
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
cond_pro="$2"
shift 2
;;
esac
;;
-f | --fre_snv ) ### determine the set of to-be-verified SNV sites
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
fre_snv="$2"
shift 2
;;
esac
;;
-n1 | --num_read_1 ) ### number of reads for p(ai|aj)
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
num_read_1="$2"
shift 2
;;
esac
;;
-n2 | --num_read_2 ) ### number of reads for p(ai|aj1,aj2,...,ajp)
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
num_read_2="$2"
shift 2
;;
esac
;;
-g | --gap ) ### Minimum distance between SNV sites
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
gap="$2"
shift 2
;;
esac
;;
-ss | --smallest_snv ) ### Minimum number of SNV sites for haplotype reconstruction
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
smallest_snv="$2"
shift 2
;;
esac
;;
-os | --only_snv ) ### Only output the SNV sites without running the haplotype reconstruction part.
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
only_snv="$2"
shift 2
;;
esac
;;
-or | --ovlap_read ) ### overlap_read
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
ovlap_read="$2"
shift 2
;;
esac
;;
-wr | --weight_read ) ### weight_read
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
weight_read="$2"
shift 2
;;
esac
;;
-sg | --sub_graph ) ### Number of subgraphs
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
sub_graph="$2"
shift 2
;;
esac
;;
-m | --mcl_inflation ) ### inflation of MCL
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
mcl_inflation="$2"
shift 2
;;
esac
;;
-oc | --ovlap_cluster ) ### overlap_cluster
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
ovlap_cluster="$2"
shift 2
;;
esac
;;
-wc | --weight_cluster ) ### weight_cluster
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
weight_cluster="$2"
shift 2
;;
esac
;;
-d | --depth ) ### depth limitation
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
depth="$2"
shift 2
;;
esac
;;
-l | --lar_cluster ) ### large cluster size
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
lar_cluster="$2"
shift 2
;;
esac
;;
-sp | --start_pos ) ### start_pos
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
s_pos="$2"
shift 2
;;
esac
;;
-ep | --end_pos ) ### end_pos
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
e_pos="$2"
shift 2
;;
esac
;;
-a | --abundance ) ### smallest abundance
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
abundance="$2"
shift 2
;;
esac
;;
*)
echo "Error: unknown parameter $1"
exit 1
esac
done
if [[ "$file_sam" == "" ]];then
echo "Error: no sam file input"
exit 1
fi
if [[ "$file_ref" == "" ]];then
echo "Error: no reference genome input"
exit 1
fi
if [[ ${file_path:0-1:1} == "/" ]];then
path_len=`expr ${#file_path} - 1`
file_prefix=$file_path$prefix
file_path=${file_path:0:path_len}
else
file_prefix=$file_path"/"$prefix
fi
########## count nucleotide occurrence ##########
echo "count nucleotide occurrence"
if [[ "$file_path" != "." ]];then
rm -rf $file_path
mkdir $file_path
fi
rm -rf $file_path"/alignment"
mkdir $file_path"/alignment"
file_len=`expr ${#file_sam}-4`
unique_sam=$file_path"/alignment/"$prefix".sam"
samtools view -h -F 0x900 -q $mq $file_sam > $unique_sam
file_bam=$file_path"/alignment/"$prefix".bam"
samtools view -b -S $unique_sam > $file_bam
rm $unique_sam
file_bam_sorted=$file_path"/alignment/"$prefix"_sorted.bam"
samtools sort $file_bam -o $file_bam_sorted
samtools index $file_bam_sorted
file_acgt=$file_prefix"_acgt.txt"
python ./RVHaplo/src/count_frequency.py $file_bam_sorted $file_acgt
########## two binomial tests ##########
echo "SNV detection"
file_snv=$file_prefix"_snv.txt"
python ./RVHaplo/src/two_binomial.py $error_rate $signi_level $file_acgt $file_snv $thread $s_pos $e_pos
## judge number of detected SNV sites
size="$(wc -l <"$file_snv")"
size="${size:0-1:1}"
echo "size one: ${size}"
echo "snv file one\n"
cat $file_snv
echo " "
if [[ $size != "0" ]];then
echo "haplo one"
python ./RVHaplo/src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt 1 $file_prefix"_consensus.fasta" $s_pos $e_pos
python ./RVHaplo/src/extract_reads.py $file_path $prefix 1
python ./RVHaplo/src/run_medaka.py $file_path $prefix 1
exit 0
fi
## maximum conditional probability and construct reads graph
python ./RVHaplo/src/mcp_read_graph.py $file_bam_sorted $file_snv $cond_pro $smallest_snv $num_read_1 $num_read_2 $gap \
$weight_read $ovlap_read $file_prefix $fre_snv $thread $only_snv $sub_graph
## judge number of detected SNV sites
size="$(wc -l <"$file_snv")"
size="${size:0-1:1}"
echo "size two: ${size}"
echo "snv file two\n"
cat $file_snv
echo " "
if [[ $size != "0" ]];then
echo "haplo two"
python ./RVHaplo/src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt 1 $file_prefix"_consensus.fasta" $s_pos $e_pos
python ./RVHaplo/src/extract_reads.py $file_path $prefix 1
python ./RVHaplo/src/run_medaka.py $file_path $prefix 1
exit 0
fi
if [[ $only_snv != 0 ]];then
exit 0
fi
## check the number of reads with overlaps
size="$(wc -l <"$file_prefix"_reads_graph.txt)"
size="${size:0-1:1}"
if [[ $size == "0" ]];then
echo "No enough reads with overlaps"
exit 0
fi
# MCL clustering
echo "MCL clustering"
python ./RVHaplo/src/run_mcl.py "$file_prefix" "$thread" "$mcl_inflation" "$sub_graph"
## hierarchical clustering
echo "hierarchical clustering"
python ./RVHaplo/src/hierarchical_cluster.py $file_prefix"_matrix.pickle" $lar_cluster $depth \
$ovlap_cluster $weight_cluster $abundance $file_prefix
## reconstruct haplotypes
rm -rf $file_path"/clusters"
mkdir $file_path"/clusters"
echo "haplotypes reconstruction"
python ./RVHaplo/src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt x $file_prefix"_consensus.fasta" $s_pos $e_pos
echo "haplotypes polishment (medaka)"
python ./RVHaplo/src/extract_reads.py $file_path $prefix x
python ./RVHaplo/src/run_medaka.py $file_path $prefix x
rm $file_prefix"_matrix.pickle"
rm $file_prefix"_reads_cluster.txt"
rm $file_prefix"_clusters.pickle"
rm -rf $file_path/medaka
echo "complete reconstructing haplotypes"
exit 0
Attached is the output for SAMPLE1 virus KM403390.1 (hasn't finished running yet, but you can see what I mean) and the output for SAMPLE1 virus KY382409 (finished). I am fairly confident that there should be an output fasta for both, because before running RVHaplo we do a viral strain selection where we require at minimum an 80% genome fraction (coverage along the length of the reference) and then select the best viral strain based on depth.
rvhaplo_KM403390.1_std.txt
rvhaplo_KY382409_std.txt
Cheers,
Johnny
Hey Dehan,
I was processing some of my samples and they are taking much longer than the run time mentioned in the paper (~100 min with 8 CPUs). Three of the samples I was testing did not finish creating the mcp_read_graph and had been running for over 8 hours with 32 CPU cores. I also made sure that the `-sg n` flag was used for samples with > 50k reads.
I was wondering if you might be able to help me figure out why it could be taking so long.
Cheers,
Johnny
Hello! I am currently running RVHaplo with my own data and encountered the following error. Any help would be greatly appreciated! Thanks so much!
SNV detection
Maximum conditional probability and Graph clustering
Traceback (most recent call last):
File "./src/read_graph_mcl.py", line 96, in <module>
to_be_verified_sites = to_be_verified_sites[nucl_fre[:,-1]>=fre_most_base]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
hierarchical clustering
Traceback (most recent call last):
File "./src/hierarchical_cluster.py", line 176, in <module>
f = open(sys.argv[1], 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_graph.pickle'
haplotypes reconstruction
Traceback (most recent call last):
File "./src/out_haplotypes.py", line 10, in <module>
f = open(file_final,'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_final.pickle'
rm: cannot remove ‘/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_graph.pickle’: No such file or directory
rm: cannot remove ‘/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_final.pickle’: No such file or directory
complete reconstructing haplotypes