dhcai21 / rvhaplo
A viral haplotype reconstruction tool for long reads.
License: GNU General Public License v3.0
Hello,
I noticed that every exit in the shell script is an `exit 1`, indicating an error. However, I am running RVHaplo in a pipeline, and even when the program looks like it completes correctly it throws an error, halting the whole pipeline. I am wondering if some of the exit commands should be `exit 0` instead of `exit 1`.
Cheers,
Johnny
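For context, the failure mode described here is easy to reproduce outside RVHaplo: under `set -e` (or in most workflow managers), any non-zero exit status aborts the run, whether or not the step actually finished its work. A minimal standalone sketch (not RVHaplo itself):

```shell
#!/bin/bash
# A step that calls `exit 1` after doing its work is indistinguishable
# from a genuine failure: under `bash -e` the pipeline aborts immediately.
if bash -e -c '
  ( echo "step A finished"; exit 0 )   # exit 0: execution continues
  echo "pipeline continues after A"
  ( echo "step B finished"; exit 1 )   # exit 1: -e aborts the pipeline here
  echo "never reached"
'; then
  echo "pipeline completed"
else
  echo "pipeline aborted with non-zero status"
fi
```

This is why the convention is `exit 0` on every successful path and a non-zero status only for real errors.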
Hello, my name is Andrea and I am a bioinformatician from Spain.
Currently, I am working with SARS-CoV-2 data obtained using Nanopore sequencing, and I am interested in using RVHaplo for haplotype identification. Has it already been validated on SARS-CoV-2 data?
Thanks in advance!
Hi, I was running RVHaplo with my own data, but I ran into the following errors; the log was:
count nucleotide occurrence
SNV detection
Traceback (most recent call last):
File "./src/two_binomial.py", line 44, in <module>
beta = ref_init[1]
IndexError: invalid index to scalar variable
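For what it's worth, that `IndexError` is NumPy's message for indexing into a scalar, which suggests `ref_init` collapsed to a single value where an array was expected. A minimal repro of the error class (the `ref_init` value here is hypothetical, not RVHaplo's actual data flow):

```python
import numpy as np

# A NumPy scalar (not an array) raises this IndexError when indexed;
# upstream parsing producing one value instead of a sequence causes it.
ref_init = np.float64(0.5)
try:
    beta = ref_init[1]
except IndexError as err:
    print(f"IndexError: {err}")
```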
Hi @dhcai21
I am trying to run my ONT reads (aligned to my reference fasta file with minimap2), but I keep getting the same error
error:
..Killed
___ [mcxIO] cannot open file </home/ngs/RVHaplo/resul/haplo_reads_graph.mci> in mode r
___ [mcl] no jive
___ [mcl] failed
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_reads_graph.mci': No such file or directory
___ [mcxIOopen] r stream </home/ngs/RVHaplo/resul/haplo_mcl_result.icl> cannae be opened
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_mcl_result.icl': No such file or directory
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_reads_graph.tab': No such file or directory
hierarchical clustering
Traceback (most recent call last):
File "./src/hierarchical_cluster.py", line 222, in <module>
haplo_cluster = hierarchical_clustering(c_more,seq_mat,snv_base,major_base,weight_cluster,depth)
File "./src/hierarchical_cluster.py", line 156, in hierarchical_clustering
contig_mat = hierarchical_weight(res_cluster,seq_mat,snv_base,major_base,thres,depth)
File "./src/hierarchical_cluster.py", line 148, in hierarchical_weight
mat = contig_weight(seqs,node_se)
File "./src/hierarchical_cluster.py", line 116, in contig_weight
index = np.argsort(nod[:, 0])
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
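That error class is reproducible in isolation: `nod[:, 0]` requires a 2-D array, and NumPy raises exactly this message when the array has collapsed to 1-D (e.g. a degenerate cluster). A sketch with hypothetical data, not RVHaplo's actual arrays:

```python
import numpy as np

# 2-D input: column indexing works as intended.
nod_2d = np.array([[3, 0.9], [1, 0.8], [2, 0.7]])
print(np.argsort(nod_2d[:, 0]))     # sorts rows by first column -> [1 2 0]

# 1-D input (e.g. a single-member cluster): the same indexing fails.
nod_1d = np.array([3, 1, 2])
try:
    np.argsort(nod_1d[:, 0])
except IndexError as err:
    print(f"IndexError: {err}")
```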
haplotypes reconstruction
Traceback (most recent call last):
File "./src/out_haplotypes.py", line 47, in <module>
f = open(file_final,'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ngs/RVHaplo/resul/haplo_clusters.pickle'
haplotypes polishment (medaka)
Traceback (most recent call last):
File "./src/extract_reads.py", line 18, in <module>
f = open(f'{file_path}/{file_prefix}_clusters.pickle','rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ngs/RVHaplo/resul/haplo_clusters.pickle'
Traceback (most recent call last):
File "./src/run_medaka.py", line 15, in <module>
f = open(f'{file_path}/{file_prefix}_clusters.pickle','rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/ngs/RVHaplo/resul/haplo_clusters.pickle'
rm: cannot remove '/home/ngs/RVHaplo/resul/haplo_clusters.pickle': No such file or directory
complete reconstructing haplotypes
The command is given here :
(rvhaplo) ngs@ngs-ThinkCentre-M93p:~/RVHaplo$ ./rvhaplo.sh -i alignedBPC2.sam -r final_consensus_JEV_stock.FST -o '/home/ngs/RVHaplo/resul' -p haplo -t 8
I have attached a snapshot of the same for your review.
Thanks
Hello,
I have noticed that some of my outputs do not create a consensus or haplotype fasta file. I wanted to make sure I understood the issue, so I looked at the rvhaplo.sh script and captured the output. I added some `echo` commands to the script so that I could double-check where things were happening in the code.
From what I can see, it looks like the size variable is being calculated incorrectly using `wc -l` (I had to look this up). Because there is no newline character in the rvhaplo_snv.txt file (the SNV output), `wc` is returning the size to be 0.
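The `wc -l` behavior is easy to confirm: it counts newline characters, so a one-line file with no trailing newline reports 0 lines, whereas `grep -c ''` counts the final line either way. A standalone demonstration (file names hypothetical):

```shell
#!/bin/bash
# `wc -l` counts '\n' characters; a final line without a trailing
# newline is not counted. `grep -c ''` counts every line regardless.
printf '100\tA\n' > with_newline.txt
printf '100\tA'   > without_newline.txt
wc -l < with_newline.txt        # 1
wc -l < without_newline.txt     # 0  <- the under-count described above
grep -c '' without_newline.txt  # 1
rm -f with_newline.txt without_newline.txt
```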
Here is the "modified" rvhaplo.sh file code:
#! /bin/bash
### required arguments
file_sam=""
file_ref=""
### optional arguments
file_path='./result'
prefix="rvhaplo"
mq=0
thread=8
error_rate=0.1
signi_level=0.05
cond_pro=0.65
fre_snv=0.8
num_read_1=10
num_read_2=5
gap=15
smallest_snv=20
only_snv=0
ovlap_read=5
weight_read=0.85
mcl_inflation=2
lar_cluster=50
ovlap_cluster=10
depth=5
weight_cluster=0.8
abundance=0.005
s_pos=1
e_pos=10000000000
sub_graph=1
function help_info() {
echo "Usage: $0 -i alignment.sam -r ref_genome.fasta [options]"
echo ""
echo "RVHaplo: Reconstructing viral haplotypes using long reads"
echo ""
echo "Author: Dehan CAI"
echo "Date: May 2022"
echo "Version 2: Support multi-thread processing; Use a C package of MCL; Cost less memory"
echo ""
echo " -i | --input: alignment file (sam)"
echo "	-r | --ref_genome:	reference genome (fasta)"
echo ""
echo " Options:"
echo " -o | --out: Path where to output the results. (default:./result)"
echo " -p | --prefix STR: Prefix of output file. (default: rvhaplo)"
echo " -t | --thread INT: Number of CPU cores for multiprocessing. (default:8)"
echo " -e | --error_rate FLOAT: Sequencing error rate. (default: 0.1)"
echo "	-mq | --map_qual INT:	Smallest mapping quality for reads. (default:0)"
echo " -s | --signi_level FLOAT: Significance level for binomial tests. (default: 0.05)"
echo " -c | --cond_pro FLOAT: A threshold of the maximum conditional probability for SNV sites. (default: 0.65)"
echo "	-f  | --fre_snv FLOAT:	The most dominant base's frequency at a to-be-verified site should be >= fre_snv. (default: 0.80)"
echo " -n1 | --num_read_1 INT: Minimum # of reads for calculating the conditional probability given one conditional site. (default:10)"
echo " -n2 | --num_read_2 INT: Minimum # of reads for calculating the conditional probability given more than one conditional sites. (default: 5)"
echo " -g | --gap INT: Minimum length of gap between SNV sites for calculating the conditional probability. (default:15)"
echo " -ss | --smallest_snv INT: Minimum # of SNV sites for haplotype construction. (default:20)"
echo " -os | --only_snv (0 or 1) : Only output the SNV sites without running the haplotype reconstruction part. (default: 0)"
echo "	-or | --overlap_read INT:	Minimum length of overlap for creating edges between two reads in the read graph. (default: 5)"
echo " -wr | --weight_read FLOAT: Minimum weights of edges in the read graph. (default:0.85)"
echo " -sg | --sub_graph INT: Number of subgraphs to run MCL (default:1)"
echo " -m | --mcl_inflation FLOAT: Inflation of MCL algorithm. (default:2)"
echo "	-l  | --lar_cluster INT:	A threshold for separating clusters into two groups based on sizes of clusters. (default:50)"
echo " -oc | --overlap_cluster INT: A parameter related to the minimum overlap between consensus sequences. (default:10) "
echo " -d | --depth INT: Depth limitation for consensus sequences generated from clusters. (default:5) "
echo " -wc | --weight_cluster FLOAT: Minimum weights between clusters in the hierarchical clustering. (default: 0.8)"
echo " -sp | --start_pos INT: Starting position for generating consensus sequences (default: 1)"
echo " -ep | --end_pos INT: Ending position for generating consensus sequences. (default: 1e10)"
echo " -a | --abundance FLOAT: A threshold for filtering low-abundance haplotypes. (default: 0.005)"
echo " -h | --help : Print help message."
echo ""
echo " For further details of above arguments, please refer to https://github.com/dhcai21/RVHaplo "
echo ""
exit 0
}
if [[ "$1" == "" ]];then
help_info
exit 1
fi
while [[ "$1" != "" ]]; do
case "$1" in
-h | --help ) ## print help message
help_info
exit 1
;;
-i | --input ) ### input sam file
case "$2" in
"" )
echo "Error: no sam file as input"
exit 1
;;
*)
file_sam="$2"
if [[ "${file_sam:0:1}" == "-" ]]
then
echo "Error: no sam file as input"
exit 1
fi
shift 2
;;
esac
;;
-r | --ref_genome) ### input reference genome
case "$2" in
"")
echo "Error: no fasta file as input"
exit 1
;;
*)
file_ref="$2"
if [[ "${file_ref:0:1}" == "-" ]]
then
echo "Error: no fasta file as input"
exit 1
fi
shift 2
;;
esac
;;
-o | --out ) ### output path
case "$2" in
"" )
echo "Error: no output path"
exit 1
;;
*)
file_path="$2"
if [[ "${file_path:0:1}" == "-" ]]
then
echo "Error: no output path"
exit 1
fi
shift 2
;;
esac
;;
-p | --prefix ) ### prefix
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
prefix="$2"
shift 2
;;
esac
;;
-mq | --map_qual ) ### mapping quality
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
mq="$2"
shift 2
;;
esac
;;
-t | --thread ) ### threads
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
thread="$2"
shift 2
;;
esac
;;
-e | --error_rate ) ### error_rate
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
error_rate="$2"
shift 2
;;
esac
;;
-s | --signi_level ) ### significance_level
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
signi_level="$2"
shift 2
;;
esac
;;
-c | --cond_pro ) ### conditional_probability
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
cond_pro="$2"
shift 2
;;
esac
;;
-f | --fre_snv ) ### determine the set of to-be-verified SNV sites
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
fre_snv="$2"
shift 2
;;
esac
;;
-n1 | --num_read_1 ) ### number of reads for p(ai|aj)
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
num_read_1="$2"
shift 2
;;
esac
;;
-n2 | --num_read_2 ) ### number of reads for p(ai|aj1,aj2,...,ajp)
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
num_read_2="$2"
shift 2
;;
esac
;;
-g | --gap ) ### Minimum distance between SNV sites
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
gap="$2"
shift 2
;;
esac
;;
-ss | --smallest_snv ) ### Minimum number of SNV sites for haplotype reconstruction
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
smallest_snv="$2"
shift 2
;;
esac
;;
-os | --only_snv ) ### Only output the SNV sites without running the haplotype reconstruction part.
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
only_snv="$2"
shift 2
;;
esac
;;
-or | --ovlap_read ) ### overlap_read
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
ovlap_read="$2"
shift 2
;;
esac
;;
-wr | --weight_read ) ### weight_read
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
weight_read="$2"
shift 2
;;
esac
;;
-sg | --sub_graph ) ### Number of subgraphs
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
sub_graph="$2"
shift 2
;;
esac
;;
-m | --mcl_inflation ) ### inflation of MCL
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
mcl_inflation="$2"
shift 2
;;
esac
;;
-oc | --ovlap_cluster ) ### overlap_cluster
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
ovlap_cluster="$2"
shift 2
;;
esac
;;
-wc | --weight_cluster ) ### weight_cluster
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
weight_cluster="$2"
shift 2
;;
esac
;;
-d | --depth ) ### depth limitation
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
depth="$2"
shift 2
;;
esac
;;
-l | --lar_cluster ) ### large cluster size
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
lar_cluster="$2"
shift 2
;;
esac
;;
-sp | --start_pos ) ### start_pos
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
s_pos="$2"
shift 2
;;
esac
;;
-ep | --end_pos ) ### end_pos
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
e_pos="$2"
shift 2
;;
esac
;;
-a | --abundance ) ### smallest abundance
case "$2" in
"" )
echo "Error: no input for $1"
exit 1
;;
*)
abundance="$2"
shift 2
;;
esac
;;
*)
echo "Error: unknown parameter $1"
exit 1
esac
done
if [[ "$file_sam" == "" ]];then
echo "Error: no sam file input"
exit 1
fi
if [[ "$file_ref" == "" ]];then
echo "Error: no reference genome input"
exit 1
fi
if [[ ${file_path:0-1:1} == "/" ]];then
path_len=`expr ${#file_path} - 1`
file_prefix=$file_path$prefix
file_path=${file_path:0:path_len}
else
file_prefix=$file_path"/"$prefix
fi
########## count nucleotide occurrence ##########
echo "count nucleotide occurrence"
if [[ "$file_path" != "." ]];then
rm -rf $file_path
mkdir $file_path
fi
rm -rf $file_path"/alignment"
mkdir $file_path"/alignment"
file_len=`expr ${#file_sam}-4`
unique_sam=$file_path"/alignment/"$prefix".sam"
samtools view -h -F 0x900 -q $mq $file_sam > $unique_sam
file_bam=$file_path"/alignment/"$prefix".bam"
samtools view -b -S $unique_sam > $file_bam
rm $unique_sam
file_bam_sorted=$file_path"/alignment/"$prefix"_sorted.bam"
samtools sort $file_bam -o $file_bam_sorted
samtools index $file_bam_sorted
file_acgt=$file_prefix"_acgt.txt"
python ./RVHaplo/src/count_frequency.py $file_bam_sorted $file_acgt
########## two binomial tests ##########
echo "SNV detection"
file_snv=$file_prefix"_snv.txt"
python ./RVHaplo/src/two_binomial.py $error_rate $signi_level $file_acgt $file_snv $thread $s_pos $e_pos
## judge number of detected SNV sites
size="$(wc -l <"$file_snv")"
size="${size:0-1:1}"
echo "size one: ${size}"
echo "snv file one\n"
cat $file_snv
echo " "
if [[ $size != "0" ]];then
echo "haplo one"
python ./RVHaplo/src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt 1 $file_prefix"_consensus.fasta" $s_pos $e_pos
python ./RVHaplo/src/extract_reads.py $file_path $prefix 1
python ./RVHaplo/src/run_medaka.py $file_path $prefix 1
exit 0
fi
## maximum conditional probability and construct reads graph
python ./RVHaplo/src/mcp_read_graph.py $file_bam_sorted $file_snv $cond_pro $smallest_snv $num_read_1 $num_read_2 $gap \
$weight_read $ovlap_read $file_prefix $fre_snv $thread $only_snv $sub_graph
## judge number of detected SNV sites
size="$(wc -l <"$file_snv")"
size="${size:0-1:1}"
echo "size two: ${size}"
echo "snv file two\n"
cat $file_snv
echo " "
if [[ $size != "0" ]];then
echo "haplo two"
python ./RVHaplo/src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt 1 $file_prefix"_consensus.fasta" $s_pos $e_pos
python ./RVHaplo/src/extract_reads.py $file_path $prefix 1
python ./RVHaplo/src/run_medaka.py $file_path $prefix 1
exit 0
fi
if [[ $only_snv != 0 ]];then
exit 0
fi
## check the number of reads with overlaps
size="$(wc -l <"$file_prefix"_reads_graph.txt)"
size="${size:0-1:1}"
if [[ $size == "0" ]];then
echo "No enough reads with overlaps"
exit 0
fi
# MCL clustering
echo "MCL clustering"
python ./RVHaplo/src/run_mcl.py "$file_prefix" "$thread" "$mcl_inflation" "$sub_graph"
## hierarchical clustering
echo "hierarchical clustering"
python ./RVHaplo/src/hierarchical_cluster.py $file_prefix"_matrix.pickle" $lar_cluster $depth \
$ovlap_cluster $weight_cluster $abundance $file_prefix
## reconstruct haplotypes
rm -rf $file_path"/clusters"
mkdir $file_path"/clusters"
echo "haplotypes reconstruction"
python ./RVHaplo/src/out_haplotypes.py $file_prefix"_clusters.pickle" $file_bam_sorted $file_path $file_acgt x $file_prefix"_consensus.fasta" $s_pos $e_pos
echo "haplotypes polishment (medaka)"
python ./RVHaplo/src/extract_reads.py $file_path $prefix x
python ./RVHaplo/src/run_medaka.py $file_path $prefix x
rm $file_prefix"_matrix.pickle"
rm $file_prefix"_reads_cluster.txt"
rm $file_prefix"_clusters.pickle"
rm -rf $file_path/medaka
echo "complete reconstructing haplotypes"
exit 0
Attached is the output for SAMPLE1 virus KM403390.1 (hasn't finished running yet, but you can see what I mean) and the output for SAMPLE1 virus KY382409 (finished). I am fairly confident that there should be an output fasta for both, because before running RVHaplo we do a viral strain selection where we require at minimum an 80% genome fraction (coverage along the length of the reference) and then select the best viral strain based on depth.
rvhaplo_KM403390.1_std.txt
rvhaplo_KY382409_std.txt
Cheers,
Johnny
Hey Dehan,
I was processing some of my samples and they are taking much longer than the run time mentioned in the paper (~100 min with 8 CPUs). Three of the samples I was testing did not finish creating the mcp_read_graph and had been running for over 8 hours with 32 CPU cores. I also made sure that the `-sg n` flag was used for samples with > 50k reads.
I was wondering if you might be able to help me figure out why it could be taking so long.
Cheers,
Johnny
Hello! I am currently running RVHaplo with my own data and encountered the following error. Any help would be greatly appreciated! Thanks so much!
SNV detection
Maximum conditional probability and Graph clustering
Traceback (most recent call last):
File "./src/read_graph_mcl.py", line 96, in <module>
to_be_verified_sites = to_be_verified_sites[nucl_fre[:,-1]>=fre_most_base]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
hierarchical clustering
Traceback (most recent call last):
File "./src/hierarchical_cluster.py", line 176, in <module>
f = open(sys.argv[1], 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_graph.pickle'
haplotypes reconstruction
Traceback (most recent call last):
File "./src/out_haplotypes.py", line 10, in <module>
f = open(file_final,'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_final.pickle'
rm: cannot remove ‘/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_graph.pickle’: No such file or directory
rm: cannot remove ‘/home/noyes046/nnoyes/RVHaplo_Test_Barcode05/rvhaplo_final.pickle’: No such file or directory
complete reconstructing haplotypes