NovoGraph: building whole genome graphs from long-read-based de novo assemblies

License: MIT License


novograph's Introduction

NovoGraph: Genome Graph of Long-read De Novo Assemblies

NovoGraph is an algorithmically novel approach to constructing a genome graph representation of long-read-based de novo sequence assemblies. As a proof of principle, we provide a genome graph of seven ethnically diverse human genomes.

Motivation

Employing a linear, monoploid reference genome is restrictive for genomic research, as such a reference both constrains and biases our understanding of the full diversity of subpopulation variation. For regions of high allelic and structural diversity, such as the major histocompatibility complex (MHC) on chromosome 6 in humans, mapping reads to a one-dimensional character string results in poor genomic characterisation for individuals who carry a sequence that is either missing from or highly divergent from the single reference. Motivated by the potential of genome graphs to address these shortcomings, we present a pipeline for constructing a genome graph from multiple de novo assemblies.

The incentive for using de novo assembled genomes is to overcome the limitations posed by simply depending upon call sets derived from short-read sequencing. Genome graphs constructed from such call sets will contain SNPs but relatively few structural variants, especially large-scale structural variation. To correct this bias, our algorithm has been designed to employ de novo contigs directly: these contigs not only incorporate SNPs but also intrinsically contain more structural variants, with their breakpoints at base resolution. Using our approach, the resulting genome graph should be correspondingly enriched in large-scale structural variation. NovoGraph constructs a whole-genome graph by connecting the input assembly sequences at positions of homology; that is, it implements a model of homologous recombination between the input genomes.

Genome Graph of Seven Human Assemblies

We focused on building a genome graph composed of seven human assemblies. The following assemblies were included:

  • AK1, Korean
  • CHM1, European
  • CHM13, European
  • HG003, Ashkenazim
  • HG004, Ashkenazim
  • HX1, Han Chinese
  • NA19240, Yoruba

Given that this genome graph has been designed to incorporate larger structural variation, we encourage the community to use this resource for further investigation and testing.

Genome Graph Construction Pipeline

Recent changes

(August 2019)

  • NovoGraph now supports minimap2 for the initial mapping step and the PBSPro scheduler. The Bio::DB::HTS Perl module is no longer a dependency.
  • The primary input is now an unsorted BAM.

Inputs:

  • A single FASTA file with the de novo assembly contigs of all input samples (e.g. allContigs.fa)
  • A reference genome FASTA without ALT contigs (e.g. GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa)

Requirements:

  • Perl, samtools, minimap2 (for the initial mapping) and MAFFT
  • The compiled C++ helpers sam2alignment and CRAM2VCF from /src (run 'make all' within /src; see Step 3)
  • Optionally, an SGE or PBSPro scheduler for the MAFFT step

Quick start:

perl suggestCommands.pl --inputContigs allContigs.fa --referenceGenome GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa --outputDirectory myGraph > commands_for_myGraph.txt

This will produce a file commands_for_myGraph.txt, which will contain the commands for a complete NovoGraph run.

Some important notes:

  • The commands produced by suggestCommands.pl are not meant for fully automated execution; instead, execute the commands manually (at least when you run NovoGraph for the first time) and read the integrated comments. The comments will also tell you how to use minimap2 or a PBSPro scheduling system.
  • The utilized reference genome (here: GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa) should always be a primary reference, i.e. it should not contain any ALT contigs (a quick check is sketched below). You can, however, include ALT contigs in your input file (here: allContigs.fa).
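
A minimal sketch of such a check, assuming GRCh38-style contig naming in which ALT contigs carry an '_alt' suffix:

## Count ALT contigs in the reference FASTA; the result should be 0
grep '^>' GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa | grep -c '_alt'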

Preparation:

Note: The following commands assume that you are in the /scripts subdirectory of NovoGraph, and most intermediate output also goes there. In most circumstances, it is preferable to use a dedicated output directory; when using suggestCommands.pl, this is specified via the --outputDirectory parameter.
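
For example (a minimal sketch; the directory layout matches the paths used in the commands below):

cd NovoGraph/scripts
mkdir -p ../intermediate_files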

## If you have long stretches of undetermined nucleotides (Ns) in your input FASTA files, or if you want
## to add a specific prefix to your input FASTAs, you can use this helper script
python split_identify_fasta.py --Ns 5000 assembly1.fa prefix_for_assembly1 assembly1_split.fa

## Concatenate the input fasta files
cat assembly1_split.fa assembly2_split.fa assembly3_split.fa > allContigs.fa

## Align the contigs FASTA against the reference, outputting a single BAM
minimap2 -a -x asm20 GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa allContigs.fa | samtools view -Sb - > allContigs_unsorted.bam

## Check that there are no unmapped reads in the input BAM, as this might lead to unknown behaviour
samtools view -c -f 0x4 allContigs_unsorted.bam

## If the reported count is 0, continue.
## Otherwise, if there are unmapped reads in the input BAM,
## please remove them as follows and use 'allContigs_unsorted.filtered.bam' for the remainder of the pipeline
samtools view -F 0x4 -bo allContigs_unsorted.filtered.bam allContigs_unsorted.bam

## Finally, check that these inputs are in the correct format for the MAFFT step
perl checkBAM_SVs_and_INDELs.pl --BAM allContigs_unsorted.bam \
                                --referenceFasta GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa \
                                --readsFasta allContigs.fa \
                                --sam2alignment_executable ../src/sam2alignment
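
## The script above writes a summary file, 'fromBAM_lengthStatistics.txt' (see the issue reports further
## below). As a rough sanity check (a sketch, assuming the file is written to the current directory),
## verify that the contigs were matched to primary alignments (read_got_primary should be > 0):
grep 'read_' fromBAM_lengthStatistics.txt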

Step 1: Find global alignments between individual input contigs and GRCh38
## Execute BAM2ALIGNMENT.pl
## This first step will output a text file '*.sortedWithHeader' which is to be input into the next script, FIND_GLOBAL_ALIGNMENTS.pl
## (Here we place outputs into the subdirectory '../intermediate_files'.)
perl BAM2ALIGNMENT.pl --BAM allContigs_unsorted.bam \
                      --referenceFasta GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa \
                      --readsFasta allContigs.fa \
                      --outputFile ../intermediate_files/AlignmentInput.txt \
                      --sam2alignment_executable ../src/sam2alignment

## Output:
## AlignmentInput.txt.sortedWithHeader


## Next, we perform local-to-global alignment via the calculation of a global alignment matrix.
## The combined output for all contigs from all input assemblies is represented in a single SAM/CRAM file.
perl FIND_GLOBAL_ALIGNMENTS.pl --alignmentsFile ../intermediate_files/AlignmentInput.txt.sortedWithHeader \
                               --referenceFasta GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa \
                               --outputFile forMAFFT.sam \
                               --outputTruncatedReads ../intermediate_files/truncatedReads \
                               --outputReadLengths ../intermediate_files/postGlobalAlignment_readLengths

## Output:
## forMAFFT.sam
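
## A quick sanity check (a sketch): count the globally aligned contig records in the SAM
grep -vc '^@' forMAFFT.sam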

Step 2: Multiple Sequence Alignment (MSA) computation
## Execute BAM2MAFFT.pl
perl BAM2MAFFT.pl --SAM forMAFFT.sam \
                  --referenceFasta GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa \
                  --readsFasta allContigs.fa \
                  --outputDirectory ../intermediate_files/forMAFFT \
                  --inputTruncatedReads ../intermediate_files/truncatedReads \
                  --sam2alignment_executable ../src/sam2alignment \
                  --samtools_path /usr/local/bin/samtools

## The next step is to execute CALLMAFFT.pl
## This step assumes you are using the Sun Grid Engine (SGE) job scheduler to submit jobs
## If you use PBSPro, you can also add arguments like the following (modified for your local environment):
##    --PBSPro 1 --PBSPro_select 'select=1:ncpus=16:mem=48GB' --PBSPro_A IMMGEN --preExec 'module load Perl; module load SamTools; module load Mafft/7.407' --chunkSize 500
## The --chunkSize parameter can be used to control the number of multiple sequence alignments combined into a single cluster job
## (i.e. larger --chunkSize values lead to fewer submitted jobs)
perl CALLMAFFT.pl --action kickOff --mafftDirectory ../intermediate_files/forMAFFT --qsub 1 \
                  --mafft_executable /mafft/mafft-7.273-with-extensions/install/bin/mafft \
                  --fas2bam_path fas2bam.pl --samtools_path /usr/local/bin/samtools --bamheader windowbam.header.txt

## This script also contains commands to check submitted jobs and re-submit if necessary
perl CALLMAFFT.pl --action check --mafftDirectory ../intermediate_files/forMAFFT \
                  --mafft_executable /mafft/mafft-7.273-with-extensions/install/bin/mafft \
                  --fas2bam_path fas2bam.pl --samtools_path /usr/local/bin/samtools --bamheader windowbam.header.txt
perl CALLMAFFT.pl --action reprocess --mafftDirectory ../intermediate_files/forMAFFT \
                  --mafft_executable /mafft/mafft-7.273-with-extensions/install/bin/mafft \
                  --fas2bam_path fas2bam.pl --samtools_path /usr/local/bin/samtools --bamheader windowbam.header.txt
perl CALLMAFFT.pl --action processChunk --mafftDirectory ../intermediate_files/forMAFFT --chunkI 0 \
                  --mafft_executable /mafft/mafft-7.273-with-extensions/install/bin/mafft \
                  --fas2bam_path fas2bam.pl --samtools_path /usr/local/bin/samtools --bamheader windowbam.header.txt
                  
## If there are still individual alignment jobs that don't complete successfully, try adding --usePreClustering 1.
## When --usePreClustering is active, the algorithm will try increasingly aggressive multiple sequence alignment strategies. 
perl CALLMAFFT.pl --action reprocess --mafftDirectory ../intermediate_files/forMAFFT \
                  --mafft_executable /mafft/mafft-7.273-with-extensions/install/bin/mafft \
                  --fas2bam_path fas2bam.pl --samtools_path /usr/local/bin/samtools --bamheader windowbam.header.txt \
                  --usePreClustering 1
				  
##
## Note: For the majority of use cases, the file 'windowbam.header.txt' should be used as provided.
## Only change it if an input assembly contains a contig longer than chr1 of hg38 (248,956,422 bp);
## in that case, set the value in this file to the length of the largest contig (see the check sketched below).
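
## A quick way to check the length of the longest input contig (a minimal sketch, assuming samtools is
## available and allContigs.fa is the concatenated input FASTA):
samtools faidx allContigs.fa
sort -k2,2nr allContigs.fa.fai | head -n 1   ## the second column is the contig length in bp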

## Next, we concatenate windows into a global MSA, outputting a single SAM file
perl globalize_windowbams.pl --fastadir ../intermediate_files/forMAFFT/ \
                             --msadir ../intermediate_files/forMAFFT/ \
                             --contigs ../intermediate_files/postGlobalAlignment_readLengths \
                             --output combined.sam


## Convert SAM to CRAM, and then index
samtools view -h -t GRCh38.headerfile.txt combined.sam > combined_with_header.sam
samtools sort combined_with_header.sam -o combined_with_header_sorted.sam
cat combined_with_header_sorted.sam | samtools view -C -T GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa - > combined.cram
samtools index combined.cram


## Validate that the CRAM is correct
perl checkMAFFT_input_and_output.pl --MAFFTdir ../intermediate_files/forMAFFT/ \
                                    --contigLengths ../intermediate_files/postGlobalAlignment_readLengths \
                                    --preMAFFTBAM forMAFFT.sam \
                                    --finalOutputCRAM combined.cram \
                                    --fas2bam_path fas2bam.pl \
                                    --samtools_path /usr/local/bin/samtools \
                                    --bamheader windowbam.header.txt

Step 3: Compute an acyclic directed graph structure from the global MSA
## First, users are required to compile the C++ code within /src to create the executable 'CRAM2VCF'.
## To do so, execute 'make all' within /src (a minimal sketch follows below).
## Users must then point CRAM2VCF.pl to this executable via --CRAM2VCF_executable.
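
## For example, from the /scripts subdirectory (a minimal sketch; the path matches the commands below):
(cd ../src && make all)
ls -l ../src/CRAM2VCF   ## the executable referenced below should now exist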

## Now we convert the CRAM into a VCF 
perl CRAM2VCF.pl --CRAM combined.cram \
                 --referenceFasta GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa \
                 --prefix graph \
                 --contigLengths ../intermediate_files/postGlobalAlignment_readLengths \
                 --CRAM2VCF_executable ../src/CRAM2VCF \
                 --sam2alignment_executable ../src/sam2alignment

 

## Next, execute launch_CRAM2VCF_C++.pl
perl launch_CRAM2VCF_C++.pl --prefix graph --number_processes 10


## Finally, run CRAM2VCF_createFinalVCF.pl 
perl CRAM2VCF_createFinalVCF.pl --CRAM combined.cram \
                                --referenceFasta GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa \
                                --prefix graph
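
The resulting VCF can be used with downstream graph tools such as vg. A minimal sketch, adapted from the vg commands in the issue reports below (it assumes vg is installed and that the final NovoGraph VCF has been bgzip-compressed and tabix-indexed; the file name graph.vcf.gz is illustrative):

REFFASTA=GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa
VCFFILE=graph.vcf.gz        ## bgzip-compressed, tabix-indexed NovoGraph VCF (illustrative name)
vg construct -r $REFFASTA -v $VCFFILE > graph.vg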

Instructions to Download and Process Input Human Assemblies

The following commands were used to download the assembly FASTAs used for this project:

## AK1, Korean:
for file in `echo LPVO02.1.fsa_nt.gz LPVO02.2.fsa_nt.gz LPVO02.3.fsa_nt.gz LPVO02.4.fsa_nt.gz LPVO02.5.fsa_nt.gz LPVO02.6.fsa_nt.gz`; do wget ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/LP/VO/LPVO02/$file; done

## CHM1, European:
for file in `echo LJII01.1.fsa_nt.gz LJII01.10.fsa_nt.gz LJII01.11.fsa_nt.gz LJII01.12.fsa_nt.gz LJII01.13.fsa_nt.gz LJII01.14.fsa_nt.gz LJII01.15.fsa_nt.gz LJII01.2.fsa_nt.gz LJII01.3.fsa_nt.gz LJII01.4.fsa_nt.gz LJII01.5.fsa_nt.gz LJII01.6.fsa_nt.gz LJII01.7.fsa_nt.gz LJII01.8.fsa_nt.gz LJII01.9.fsa_nt.gz`; do echo $file; wget ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/LJ/II/LJII01/$file; done
 
## CHM13, European:
for file in `echo LDOC03.1.fsa_nt.gz LDOC03.2.fsa_nt.gz LDOC03.3.fsa_nt.gz LDOC03.4.fsa_nt.gz LDOC03.5.fsa_nt.gz LDOC03.6.fsa_nt.gz LDOC03.7.fsa_nt.gz`; do echo $file; wget ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/LD/OC/LDOC03/$file; done

## HX1, Han Chinese:
wget http://hx1.wglab.org/data/hx1f4.3rdfixedv2.fa.gz

## HG003, Ashkenazim father:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/MtSinai_PacBio_Assembly_falcon_03282016/hg003_p_and_a_ctg.fa
 
## HG004, Ashkenazim mother:
wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/MtSinai_PacBio_Assembly_falcon_03282016/hg004_p_and_a_ctg.fa

## NA19240, Yoruba:
for file in `echo LKPB01.1.fsa_nt.gz LKPB01.2.fsa_nt.gz LKPB01.3.fsa_nt.gz LKPB01.4.fsa_nt.gz LKPB01.5.fsa_nt.gz LKPB01.6.fsa_nt.gz`; do wget ftp://ftp.ncbi.nlm.nih.gov/sra/wgs_aux/LK/PB/LKPB01/$file; done

Upon the successful download of these FASTAs, users should concatenate the individual assemblies into a single FASTA, allContigs.fa (see the sketch below).
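
For example (a minimal sketch, assuming all downloads sit in the current working directory and use the file names above; the helper script split_identify_fasta.py from the Preparation section can optionally be applied to each assembly first):

gunzip *.fsa_nt.gz hx1f4.3rdfixedv2.fa.gz
cat *.fsa_nt hx1f4.3rdfixedv2.fa hg003_p_and_a_ctg.fa hg004_p_and_a_ctg.fa > allContigs.fa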

Docker

The most up-to-date Docker image can be found here: https://hub.docker.com/r/diltheygenomicslab/novograph

Please see the Dockerfile within the repository.
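
To obtain the image (assuming a working Docker installation; the image name follows the Docker Hub link above):

docker pull diltheygenomicslab/novograph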

Data Availability

The genome graph of seven ethnically diverse human genomes can be accessed at the following Open Science Framework (OSF) project: NovoGraph

  • NovoGraph-Simple-VCF novograph_simple_v1.vcf.gz (202.6 MB)

    The genome graph of seven ethnically diverse human genomes in VCF format, with overlapping variant alleles

  • NovoGraph-Universal-VCF novograph_universal_v1.vcf.gz (588.3 MB)

    The genome graph of seven ethnically diverse human genomes in VCF format, with non-overlapping variant alleles

  • NovoGraph CRAM novograph_v1.cram (212.2 MB)

    The global multiple sequence alignment of all input sequences in CRAM format

Contributors

  • Evan Biederstedt (NYGC, WCM)
  • Alexander Dilthey (NHGRI-NIH, HHU/UKD)
  • Nathan Dunn (LBNL)
  • Aarti Jajoo (Baylor)
  • Nancy Hansen (NIH)
  • Jeff Oliver (Arizona)
  • Andrew Olson (CSHL)

This project was initiated at an NCBI-style hackathon held before the Biological Data Science meeting at Cold Spring Harbor Laboratory in October 2016.

Citation

Biederstedt E, Oliver JC, Hansen NF, Jajoo A, Dunn N, Olson A, Busby B, Dilthey AT
NovoGraph: Genome graph construction from multiple long-read de novo assemblies.
F1000Research 2018, 7:1391 (doi: 10.12688/f1000research.15895.2)


novograph's Issues

CSHL graph genome talk

Just heard a talk by Jacob Pritt (a student with Steven Salzberg and Ben Langmead at JHU) about their attempt at a graph genome. Other hackathon members were around as well (@nhansen, @nathandunn, @DCGenomics).

Discussed some challenges they were trying to tackle with regards to constructing a graph genome (e.g. should we treat all variants as equally relevant? If treated the same, what bias are we introducing?)

Introduced hisat2: https://github.com/infphilo/hisat2, with this background: http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html

Their project could benefit from ours (or at least, there's a collaboration to be made here). They're naturally having issues with scaling. We could definitely use their benchmarks for our own paper.

What details do you have to add, @DCGenomics? @nhansen? @nathandunn ?

Genomes other than human

Hi,

This tool is very interesting. I was wondering if NovoGraph is currently restricted to the human genome? If so, how hard do you think it would be to adapt it to other genomes?

Thanks,
Guilherme

running BAM2MAFFT. public.q issue

Hello there,

I am running the CALLMAFFT.pl script

I get this error message

Identified 16 directories.
 - chr11
 - chr5
 - chr6
 - chr15
 - chr1
 - chr10
 - chr14
 - chr4
 - chr12
 - chr16
 - chr13
 - chr3
 - chr9
 - chr7
 - chr2
 - chr8
Unable to run job: Job was rejected because job requests unknown queue "public.q".
Exiting.
qsub qsub temp_qsub failed at ../../../NovoGraph-master/scripts/CALLMAFFT.pl line 435.

It appears to be a problem due to the qsub command
It doesn't like the -q option given

When I check the grid engine being used here with "qsub -help"
It tells me I am running "GE 6.2u5" in case that is important

Thank you

Multiple Sequence Alignment tool?

MAFFT may be able to handle considerably longer sequences than 10kb: http://mafft.cbrc.jp/alignment/software/tips.html#longsequences
Perhaps windows could be entire chromosomes? #2
But also see note in MAFFT documentation: "Note that MAFFT assumes that the all input sequences share the order of homologous sites or blocks. If the sequences have repeat or inversion, use other tools such as FASTA and MUMmer." [emphasis added]

The suggested alternative MUMmer could be another option, but there is a steeper learning curve, and it is not clear if it is doing multiple sequence alignment or just several pairwise alignments.

vg call throwing error from tabix

REFFASTA="/home/data/reference/GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa"
VCFFILE="/home/devsci4/Graph_Genomes_CSHL/examples/vgInput.vcf.gz"
VGOUTFILE="/home/devsci4/Graph_Genomes_CSHL/examples/vgoutput.vg"
vg/bin/vg construct -r $REFFASTA -v $VCFFILE > $VGOUTFILE

Returns:

[tabix++] the index file is older than the vcf file. Please use '-f' to overwrite or reindex.

Create optimal window "size" algorithm

At the moment, we have the following ideas with regards to window size:

  • try several options (e.g. ±500 bp, ±200 bp, ±800 bp) and then evaluate how these perform with a metric chosen by majority

The optimal solution would be to have varying window sizes, with boundaries at the locations of greatest coverage. How can we implement this? Do any better algorithms come to mind? At the moment, this is a linear search algorithm, and it would be the bottleneck for this code.

sam2alignment aborted

Hi,

When I run the Step 1 to get the AlignmentInput.txt file, it occurs with the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Missing primary non-supplementary alignment for read 000351F_pilon_pilon_pilon
sh: line 1: 22461 Aborted                 (core dumped) /home/cuixb/tools/biosoft/NovoGraph-1.0.0/src/sam2alignment all.genomes.filtered.bam.extract.filtered.SAM westar.fa > all.genomes.filtered.bam.extract.alignments
Cannot execute command: /home/cuixb/tools/biosoft/NovoGraph-1.0.0/src/sam2alignment all.genomes.filtered.bam.extract.filtered.SAM westar.fa > all.genomes.filtered.bam.extract.alignments at ./BAM2ALIGNMENT.pl line 67.

Before I run this step, I also filter the bam file with

samtools view -F 0x4 
samtools view -q 30 -F 256

So, how to solve this error?

Thank you in advance!

re-parallelize the MAFFT layer

The parallel command chokes when the number of entries is too high (though it is fine for a single chromosome). This may not be too big of a problem.

Moved back to running sequentially, but with an automatic thread count. However, that seems really slow.

Parallelization per chromosome or running through a proper parallelization language would make the most sense.

Samples argument for vcf2bam

@ajo2995 Could we get more info on the --samples argument in scripts/gg-wrapper-bam2vcf.pl? It sounds like a flat text file with assembly names? Is that right? Is there an example available?

CALLMAFFT.pl for torque6.1.12

Hello!

Is it okay to change the following line in CALLMAFFT.pl?

from

#PBS -J ${minJobID}-${maxJobID}

to

#PBS -t ${minJobID}-${maxJobID}

I tried using the changed CALLMAFFT.pl with the following command

time perl ${novograph}/CALLMAFFT.pl --action kickOff --mafftDirectory PM-AU-0011-N-A1_q7-nanofilt_adptrim.fastq.ctg.fa_novograph-02-forMAFFT_convert2 --qsub 1 \
                  --PBSPro 1 --PBSPro_select 'nodes=1:ppn=16,mem=48gb' --PBSPro_A root  --chunkSize 500 \
                  --mafft_executable ${mafft} \
                  --fas2bam_path ${novograph}/fas2bam.pl --samtools_path /appl/samtools/samtools-1.9/samtools --bamheader ${novograph_base}/windowbam.header.txt

and this seemed to be working at first since a Job array showed up in my qstat output.

but most of the queued jobs failed with the following error.

expr: syntax error
File /data2/hd00ljy/data_processing/PMI_Nanopore_WGS/7_analysis_jy/analysis_jy/PM-AU-0011-T-A1/Sniffles/PM-AU-0011-T-A1/test-cluster-length-mapq-distance-depth/over3000bp_mapq30_1000bp-dist_4support_tra-to
-bnd.vcf/test_graph_genome/minimap2/PM-AU-0011-N-A1_q7-nanofilt_adptrim.fastq.ctg.fa_novograph-02-forMAFFT_convert2/chr22_KI270734v1_random/chr22_KI270734v1_random_12.mfa does not contain any sequences. at
/data/hd00ljy/data_processing/PMI_Nanopore_WGS/tools/11_graph_genome/novograph/NovoGraph/scripts/CALLMAFFT.pl line 557.

could not find `dealWithTooManyCIGAROperations.pl`

Hi Hack,
I'm following the instructions; I reached the step FIND_GLOBAL_ALIGNMENTS.pl but an error occurred:
`
Statistics:
Alignments: 10
Avg chains per alignment: 154.7
Alignments with both ends trimmed: 9
Alignments with left end trimmed: 0
Alignments with right end trimmed: 1

Done. Produced SAM file forMAFFT.bam.sam.unfiltered

Filtering SAM with command: perl dealWithTooManyCIGAROperations.pl --input forMAFFT.bam.sam.unfiltered --output forMAFFT.bam.sam.unfiltered.filtered
Can't open perl script "dealWithTooManyCIGAROperations.pl": No such file or directory
`
And indeed I could not find the Perl script.
Could you please help me with this?

Many thanks
WANG

bsub for CALLMAFFT.pl

Hi Evan,

Simply changing --qsub to --bsub won't help... can we have a script for the bsub batch system?

Thanks

Cheers,
Chen

CALLMAFFT.pl with --qsub 1

Hello,

I found a small problem when I used CALLMAFFT.pl in reprocess mode with --qsub 1. When the number of regions in the files_to_reprocess file is smaller than $chunkSize, the calculation my $n_chunks = ceil($n_files / $chunkSize); leads to $n_chunks being 1. This in turn means that the -J parameter for the submission script is set to 1-1, which is not allowed, and this error is the result: qsub: illegal -J value

So probably it would be a good idea to check whether the number of files to process is smaller than $chunkSize and submit a job array only if necessary.

CRAM2VCF.cpp needs a lot of memory

When running launch_CRAM2VCF_C++.pl I have noticed that running 10 processes in parallel (while(scalar(keys %still_running) >= 10)) will use up a lot of memory; I got a job killed even with 1024 GB of RAM.

So I had to change it to while(scalar(keys %still_running) >= 2) in the code.

  • IMHO it would be good to make this an adjustable parameter.
  • It might be a good idea to look at why CRAM2VCF.cpp needs so much memory and whether this can be changed.

Name all the things

Obviously not the most pressing concern for this project, but we should come up with a name for this. This includes properly naming our "meta Needleman-Wunsch" algorithm.

It's something to think about, crew.

Died at ../../../NovoGraph-1.0.0/scripts/BAM2ALIGNMENT.pl line 246.

Hi there,

I have run the following
minimap2 -a -x asm20 ../ref/S288C_reference_sequence_R64-2-1_20150113_renamed.fsa ../homozygous_genomes/RENAMED/AAB.FINAL.canu_RENAMED.fa | samtools view -Sb - > AAB.bam
Mapping one genome to a reference
I checked and all contigs map
When I run the next step:
perl ../../../NovoGraph-1.0.0/scripts/BAM2ALIGNMENT.pl --BAM AAB.bam --referenceFasta ../ref/S288C_reference_sequence_R64-2-1_20150113_renamed.fsa --readsFasta ../homozygous_genomes/RENAMED/AAB.FINAL.canu_RENAMED.fa --outputFile intermediate_files/AlignmentInput.txt
I get
Died at ../../../NovoGraph-1.0.0/scripts/BAM2ALIGNMENT.pl line 246.

Any idea what the problem is?

Thanks
Samuel

parsedAlternates error on vg call

REFFASTA="/home/data/reference/GRCh38_full_plus_hs38d1_analysis_set_minus_alts.fa"
VCFFILE="/home/devsci4/Graph_Genomes_CSHL/examples/vgInput.vcf.gz"
VGOUTFILE="/home/devsci4/Graph_Genomes_CSHL/examples/vgoutput.vg"
vg/bin/vg construct -r $REFFASTA -v $VCFFILE > $VGOUTFILE

Returns:

parsedAlternates: alignment does not start with match over padded sequence
11M2D9M
ZZZZZZZZZZQTTZZZZZZZZZZ
ZZZZZZZZZZQZZZZZZZZZ

update CRAM2VCF_createFinalVCF.pl

The final script needs an update. It would probably be best to modularize it, so that the user can choose which VCF to merge, e.g. --vcf [overlapping | separated | pseudoSample]. It should not fail when the other VCFs are not present.

bam to vcf converter

Downstream tools like vg will need a vcf file to handle our graph genome.

Either find a tool that can export a vcf file, or write one.

Issues with alignment processing

Hi,
I've aligned contigs from two non-human assemblies to a reference genome. All assemblies have been generated with PacBio reads. I've performed the alignment of the contigs using bwa (0.7.17) separately for the different contigs, joined everything, sorted, and filtered out the unmapped contigs from the new BAM file.
When I run checkBAM_SVs_and_INDELs.pl I get the following output in fromBAM_lengthStatistics.txt:

Produced output file fromBAM_lengthStatistics.txt (only from reads that have primary information)
read_got_primary: 0
read_no_primary: 0

Then, when I run the next steps using BAM2ALIGNMENT.pl, I get no AlignmentInput.txt.sortedWithHeader file, and get the following error:

Mismatch query at /exports/cmvm/eddie/eb/groups/prendergast_grp/Andrea/NovoGraph/scripts/BAM2ALIGNMENT.pl line 351.

Do you know whether this is a problem related to the length of the reads? Should I use an alternative aligner that works with extremely long contigs (>1 Mb)?

Thank you in advance

Andrea

MetaNWAlgo Pipeline

Hello, I missed the conversation on the Google group about the Dec 9 deadline.

Evan, what are you working on presently? I updated MetaNeedalmanWunchAlgo.py a few weeks ago; it is complete now. It takes a local assembly as input and gives a global assembly. But we need to give it the correct input using Alex's file. I think you were writing some code for that. Let me know what part you are working on and/or what you would want me to work on, so we can divide the work.

Best,
Aarti

Is it possible to get the contig list for VCF records?

Hello!

I want to find the contigs for individual rows of the output VCF, so that I can identify the sample from which they originated.

If I can get the list of contigs that generated a specific VCF record, I will be able to identify the sample.

Is there an option to do this task?

Thank you
Jinyoung
