Giter Site home page Giter Site logo

yagcloser's Introduction

YAGCloser

Yet Another Gap Closer based on spanning long reads

A program for correction of genome assemblies focused on gap closing and filling, based on gap-spanning of long reads.

Installation

Requires the following python modules:

  • pysam,biopython, argparse,collections,copy,csv,datetime,filetype, glob,gzip,logging,os,random,string,sys, numpy

Easy installation with pip:

pip3 install --user --upgrade pip
pip3 install --user pysam biopython numpy 

The input alignment

We recommend the input alignment to be generated following these instructions.

  1. Map the long reads against the genome assembly using minimap2 (Li, 2018) and tag secondary alignments (option --secondary=yes).
  2. Convert the alignment into its binary form (BAM file), sort it by coordinates and index it with samtools (Li et al., 2009) [Version tested: 1.7 (using htslib 1.8)].
minimap2 --secondary=yes -ax [map-pb | map-ont] reference.fasta reads.fastq | samtools view -hSb - | samtools sort - > aln.s.bam
  1. [Optional | Will help with processing speed later] Filter a subset of primary alignments with MAPQ > 20 to identify the closable gaps.
minimap2 --secondary=yes -ax [map-pb | map-ont] reference.fasta reads.fastq | samtools view --hSb -q 20 - | samtools sort - > aln.s.bam
  1. Index BAM file
samtools index aln.s.bam
  1. Get a description of the gaps present in the reference file, you can use the detgaps too from asset to do so. Check on the corresponding repository how to install it.
detgaps reference.fasta > .gaps.bed &

Usage

usage: yagcloser [-h] -g FASTA_FILE_PATH -b BED_FILE_PATH -a BAM_FILE_PATH 
                  -o FOLDER_PATH -s STR [-mins INT] [-f INT]
                 [-mapq <MAPQ_threshold>] [-mcc INT] [-prt FLOAT] [-eft FLOAT]
                 [-l <log_level>] [-v]

Required arguments:

  • -g FASTA FILE PATH, --genome FASTA FILE PATH: Filepath of the reference genome file to be used. Accepts compressed files (GZIP) (default: None)
  • -b BED FILE PATH, --bed BED FILE PATH: Filepath of the bed file describing the gaps of the genome. Accepts compressed files (GZIP) (default:None)
  • -a BAM FILE PATH, --aln BAM FILE PATH: Filepath of the alignment of reads to the reference genome in BAM format. This file needs to be indexed before running. (default: None)
  • -o FOLDER_PATH, --output FOLDER_PATH: Output path folder. (default: None)
  • -s STR, --samplename STR: Short sample name that will be used for naming OUTPUT files. (default: None)

Optional arguments:

  • -mins INT, --min-support INT: Minimum number of reads needed spanning a gap to be considered for closing or filling, (default: 5).
  • -f INT, --flanksize INT: Flank size to be used to select the reads that are in the surroundings of the gap and determine whether there are reads that span the gap or not, (default: 20).
  • -mapq <MAPQ_threshold>, --mapping-guality-threshold <MAPQ_threshold>: MAPQ value used to filter alignments for posterior processing. Discarding alingments where: alignment_mapq < MAPQ_threshold, (default: 20).
  • -mcc INT, --min-coverage-consensus INT: Require that more than INT sequences to be part of an alignment to put in the consensus. Modify if -mins/--min-support < default, (default: 2).
  • -prt FLOAT, --percent-reads-threshold FLOAT: Require that more than INT sequences to remain after the length agreement to be considered for consensus. (default: 0.5)
  • -eft FLOAT, --empty-flanks-threshold: FLOAT Percentage of empty flanks required to skip an ambiguous decision on a gap. (default: 0.2)
  • -l <log_level>, --log <log_level>: Verbosity levels, (default: INFO).

Information arguments:

  • -v, --version: Show program's version number and exit
  • -h, --help: show this help message and exit

Output

File/directory Name Description
D <samplename>.consensus Folder with FASTA files that are used in the consensus phase.
D <samplename>.flanks Folder with FASTA files with the sequences corresponding to the flank regions.
D <samplename>.fullsupport Folder with FASTA files. Content are the sequences that include the flanks and the sequencing between the flanks.
D <samplename>.log
D <samplename>.msa Output of the alignment of the subreads.
D <samplename>.pre
D <samplename>.reads Read subsequences that are aligned to the flank and sequences between the flanks.
D <samplename>.support Folder with FASTA files. Content are the sequences between the flanks.
D <samplename>.consensus.log.txt Log of the consensus result
F <samplename>.alignment.err Output of all the alignments generated by MAFFT
F <samplename>.ambiguous.txt List of gaps with ambiguous decisions
F <samplename>.edits.txt File that describes the final edits that should be made on the reference sequence
F <samplename>.gaps.closed.original_coordinates.txt
F <samplename>.no.length.agreement.txt List of gaps filtered out after the lenght agreement step
F <samplename>.no.support.gaps.txt List of gaps with not enough support
F <samplename>.potential.fillable.gaps.txt List of gaps that have enough support to be considered for the gap closing process.

Examples

  1. Run to identify potential gaps/edits that will be done to your reference
 python yagcloser -g reference.fasta \
    -a reference.sorted.bam \
    -b gaps.bed \
    -o yagcloser_output \
    -f 40 - mins 2 -s reference_data
  1. Edit the reference file
python update_assembly_edits_and_breaks.py \
    -i reference.fasta \
    -o reference.v2.fasta \
    -e yagcloser_output/yagcloser_output.edits.txt

yagcloser's People

Contributors

merlyescalona avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

yagcloser's Issues

curious what to use for describe-assembly

Hello,

Your repo looks like an interesting additional usage of HiFi reads and thought I'd give it a go with some of our hifiasm created assemblies.

I was having trouble determining how to create the gap bed file via describe-assembly -g reference.fasta -B > reference.gaps.bed

I suspect this is something simple I am missing. Any insight appreciated.

best,

Jim Henderson
Research Associate, California Academy of Sciences

ERROR when identifying potential gaps

Hello, I wanted to use yagcloser to fill gap in my genome( about 130MB),But when running the program to identify potential gaps, the following error appears. And I don't know how to fix it.
here is my log:
[10/03/2023 12:07:02 AM] INFO (identify_potential_gaps|215): ================================================================================
[10/03/2023 12:07:02 AM] INFO (identify_potential_gaps|216): Identifying potential gaps
[10/03/2023 12:07:02 AM] INFO (identify_potential_gaps|217): ================================================================================
Traceback (most recent call last):
File "/data/zhoujie/software/yagcloser/yagcloser.py", line 1073, in main
return run(parameters)
File "/data/zhoujie/software/yagcloser/yagcloser.py", line 907, in run
data=identify_potential_gaps(bed,scaffolds, parameters)
File "/data/zhoujie/software/yagcloser/yagcloser.py", line 285, in identify_potential_gaps
for read in samfile.fetch(region=region):
File "pysam/libcalignmentfile.pyx", line 1091, in pysam.libcalignmentfile.AlignmentFile.fetch
File "pysam/libchtslib.pyx", line 665, in pysam.libchtslib.HTSFile.parse_region
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/data/zhoujie/software/yagcloser/yagcloser.py", line 1081, in
SIGNAL_EXIT,MESSAGE=main()
File "/data/zhoujie/software/yagcloser/yagcloser.py", line 1078, in main
return -1, "{}\t{}".format(e.class, e.message)
AttributeError: 'ValueError' object has no attribute 'message'

empty edits file

Hi:
I just ran yagcloser and it completed without any errors or issues. The 'edits' file is empty and my assembly didn't change in any metrics. I was using hifi data to fill gaps with a genome generated with YAHS. The genome from YAHS has several hundred scaffolds for our diploid species that has a large genome size (6 GB) and 23 chromosomes. Should I be concerned that something didn't work or can this be expected in some cases? Thank you.
Andreas

ERROR when getting output.potential.fillable.gaps.txt

hello,,I wanted to use yagcloser to fill gap in my genome( about 500MB) ,but get empty output.potential.fillable.gaps.txt,and get some record if no.support.gaps, I want to know why,Is it because I have a big genome?

here is my “yagcloser_output.no.support.gaps.txt”:

Chr2:38336522-38336622 100
Chr3:19743855-19743955 100
Chr3:29807753-29807853 100
Chr3:34631046-34631146 100
Chr3:35615854-35615954 100
Chr3:36136983-36137083 100
Chr3:36361212-36361312 100
Chr3:36450609-36450709 100

here is my log:

[03/03/2023 05:42:06 PM] WARNING (create_output_folders|865): Take into account that if these folders already exist files will be overwritten.
[03/03/2023 05:42:06 PM] INFO (parse_bed_file|62): ================================================================================
[03/03/2023 05:42:06 PM] INFO (parse_bed_file|63): Reading BED file (/work4/Denovo/JZP202211GM039-01_carrot_wangyaoxin/12.yagcloser/Z21_170/gaps.bed)
[03/03/2023 05:42:06 PM] INFO (parse_bed_file|64): ================================================================================
[03/03/2023 05:42:06 PM] INFO (parse_bed_file|86): Done reading BED file > 0:00:00.001071
[03/03/2023 05:42:06 PM] INFO (parse_reference_file|95): ================================================================================
[03/03/2023 05:42:06 PM] INFO (parse_reference_file|96): Reading scaffolds from reference file...
[03/03/2023 05:42:06 PM] INFO (parse_reference_file|97): ================================================================================
[03/03/2023 05:42:08 PM] INFO (generate_flank_table|146): ================================================================================
[03/03/2023 05:42:08 PM] INFO (generate_flank_table|147): Extracting flank regions...
[03/03/2023 05:42:08 PM] INFO (generate_flank_table|148): ================================================================================
[03/03/2023 05:42:08 PM] INFO (generate_flank_table|180): Done with flanks extraction: > 0:00:00.000091
[03/03/2023 05:42:08 PM] INFO (parse_reference_file|139): Done reading reference file (/work4/Denovo/JZP202211GM039-01_carrot_wangyaoxin/06.Mumer/Z21_170/final.fa) > 0:00:01.648593
[03/03/2023 05:42:08 PM] INFO (identify_potential_gaps|215): ================================================================================
[03/03/2023 05:42:08 PM] INFO (identify_potential_gaps|216): Identifying potential gaps
[03/03/2023 05:42:08 PM] INFO (identify_potential_gaps|217): ================================================================================
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|322): Done with the info extraction... > 0:00:05.947712
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|326): ================================================================================
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|327): Removing gaps with no support from analysis...
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|328): ================================================================================
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|343): Removed 8 gaps.. > 0:00:00.000135
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|344): --------------------------------------------------------------------------------
[03/03/2023 05:42:13 PM] INFO (identify_potential_gaps|345): Reporting removed gaps to file: /work4/Denovo/JZP202211GM039-01_carrot_wangyaoxin/12.yagcloser/Z21_170/yagcloser_output/yagcloser_output.no.support.gaps.txt
[03/03/2023 05:42:14 PM] INFO (identify_potential_gaps|390): Reporting potential fillable gaps (/work4/Denovo/JZP202211GM039-01_carrot_wangyaoxin/12.yagcloser/Z21_170/yagcloser_output/yagcloser_output.potential.fillable.gaps.txt)...
[03/03/2023 05:42:14 PM] INFO (extract_support_data|418): ================================================================================
[03/03/2023 05:42:14 PM] INFO (extract_support_data|419): Extracting support data
[03/03/2023 05:42:14 PM] INFO (extract_support_data|420): ================================================================================
[03/03/2023 05:42:14 PM] INFO (extract_support_data|518): End of extracting support data > 0:00:00.007475
[03/03/2023 05:42:14 PM] INFO (extract_support_data|519): ================================================================================
[03/03/2023 05:42:14 PM] INFO (extract_support_data|520): Checking for ambiguos decisions...
[03/03/2023 05:42:14 PM] INFO (extract_support_data|554): ================================================================================
[03/03/2023 05:42:14 PM] INFO (extract_support_data|555): Writing support files
[03/03/2023 05:42:14 PM] INFO (extract_support_data|556): ================================================================================
[03/03/2023 05:42:14 PM] ERROR (|1083): Done with file writing...
[03/03/2023 05:42:14 PM] INFO (|1087): Done with file writing...

Gap filling only for Ns

Hello,

I'm trying to fill N gaps using yagcloser

I wonder if I could fill only my gaps, not affecting other nucleotide sequences (ATGC).

Is it possible? or are there any options I could use to fill my gaps (Ns) without affecting the other sequences?

Best wishes,

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.