Giter Site home page Giter Site logo

pacbio-utilities's Introduction

PacBio-utilities

A collection of scripts useful for working with PacBio-based assemblies.

The pacbio-util script requires BioPerl and samtools >= 0.1.19.

Correction of small indels in polished assemblies

We have noticed that PacBio assemblies can show single-base deletions, in most cases, in proximity to even fairly short homopolymer runs. This does not occur at all homopolymer runs, and in our limited samples incidence rate is perhaps 1 in 10kbp. Single-base insertions and longer insertions may occur but are much rarer.

There are three steps to applying this correction. First, map a set of Illumina reads to the PacBio assembly to generate sorted BAM file(s). Reads from multiple BAMs will be combined. These reads should be from the same individual/strain from which the assembly was created. Reads from other sequencing technologies, or from another individual/strain, will reduce the effectiveness of the correction and may introduce further errors. In all subsequent steps it is required that the order of the sequences in the assembly and the BAMs is identical.

Second, use the pacbio-util script here to generate a set of indel targets for correction. The command is pacbio-util indel-targets, and the criteria for determining targets may be modified via options. Use the -h option to see available options and current defaults.

The third and final step is to use the command pacbio-util indel-apply to apply the targets generated in the second step to the assembly to create a corrected assembly. Options may also be used here to restrict the set of targets applied; use the -h option to see available options and current defaults.

For an assembly pacbio_assembly.fasta, and mappings of Illumina paired-end and single-end reads to this assembly in BAM files pe.sorted.bam and se.sorted.bam, the following will generated the list of indel targets to targets.txt and a corrected assembly to corrected_assembly.fasta.

pacbio-util indel-targets -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam > targets.txt
pacbio-util indel-apply -f pacbio_assembly.fasta -t targets.txt > corrected_assembly.fasta

The indel-apply command can find the assembly used for determining the targets by using the first line of the list of targets. This provides a useful shortcut for combining both commands into a one-line piped command. The tee is in place to save the list of targets as they pass through to indel-apply.

pacbio-util indel-targets -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam \
| tee targets.txt \
| pacbio-util indel-apply > corrected_assembly.fasta

Here is example tool output (with -v) and scaffold size results before and after correction.

$ pacbio-util indel-targets -v -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam | tee targets.txt | pacbio-util indel-apply -v > corrected_assembly.fasta
samtools pipe command: 'samtools mpileup -s -B -d 10000 -L 10000 --ff 1280 -q 1 -Q 13 -f pacbio_assembly.fasta pe.sorted.bam se.sorted.bam |'
[mpileup] 2 samples in 2 input files
Merging columns from 2 BAM files ...
pacbio-util indel-targets: 1281 targets generated for assembly pacbio_assembly.fasta
pacbio-util indel-apply: Opening target assembly 'pacbio_assembly.fasta'
pacbio-util indel-apply: unitig_7, no targets
pacbio-util indel-apply: unitig_52, no targets
pacbio-util indel-apply: unitig_46, no targets
pacbio-util indel-apply: unitig_14, no targets
pacbio-util indel-apply: unitig_54, no targets
pacbio-util indel-apply: unitig_44, no targets
pacbio-util indel-apply: unitig_13, no targets
pacbio-util indel-apply: unitig_56, no targets
pacbio-util indel-apply: unitig_55, no targets
pacbio-util indel-apply: unitig_43, no targets
pacbio-util indel-apply: unitig_48, no targets
pacbio-util indel-apply: unitig_45, no targets
pacbio-util indel-apply: unitig_15, no targets
pacbio-util indel-apply: unitig_47, no targets
pacbio-util indel-apply: unitig_60, no targets
pacbio-util indel-apply: unitig_61, no targets
pacbio-util indel-apply: unitig_51, no targets
pacbio-util indel-apply: unitig_59, no targets
pacbio-util indel-apply: unitig_4, no targets
pacbio-util indel-apply: 1281 targets applied to assembly pacbio_assembly.fasta
$
Scaffold bp before bp after
unitig_1 6250549 6250752
unitig_10 1811588 1811655
unitig_7 15299 15299
unitig_52 7932 7932
unitig_46 8606 8606
unitig_14 1884 1884
unitig_0 10120325 10120600
unitig_8 3934972 3935082
unitig_12 240123 240143
unitig_3 16252 16253
unitig_54 22065 22065
unitig_44 12520 12520
unitig_53 5951772 5951911
unitig_13 86224 86224
unitig_56 30475 30475
unitig_55 15275 15275
unitig_41 19231 19232
unitig_43 9344 9344
unitig_58 6608738 6608943
unitig_9 2363594 2363680
unitig_48 30325 30325
unitig_45 14763 14763
unitig_57 18691 18685
unitig_15 2004 2004
unitig_47 15558 15558
unitig_6 4486498 4486633
unitig_60 26006 26006
unitig_61 16086 16086
unitig_51 9410 9410
unitig_59 14740 14740
unitig_4 19898 19898

pacbio-util indel-targets

Generate indel targets for correction. Uses samtools mpileup for indel detection. Targets are written to standard output.

Targets will only be generated where all read mappings containing an indel agree on the size and sequence of the indel. Further criteria for minimum indel fraction, minimum coverage, and maximum indel size may be adjusted via options.

pacbio-util indel-targets -f FASTA FILE1.bam [ FILE2.bam ... ]

OPTIONS

    -f FILE, --fasta FILE    PacBio assembly in Fasta format

    FILE1.bam [ FILE2.bam ]  BAM files of reads mapped to

    -                        Read mpileup from stdin rather than running
                             samtools. Note that mapping qualities
                             (option `-s`) are expected in the mpileup input.

    --indel-frac FLOAT       Do not report indels at a position if the
                             fraction of reads containing them is below FLOAT
                             [default 0.8]
    --indel-min-cov INT      Minimum read coverage to consider an indel
                             [default 10]
    --include-bad-indels     Include indels in target set that do not pass
                             our criteria, these lines are marked 'bad' in
                             the sixth column
    --max-indel-size INT     Maximum indel size to accept, beyond this the
                             position is considered in error [default 1000]
    --mapping-quality FLOAT  Minimum mean mapping quality of reads covering a
                             target site. Note that a value of 0 means no
                             minimum applied; reads with 0 mapQ are still
                             excluded by default (see --all-reads) [default 0]
    --all-reads              Include all reads.  Default is to exclude
                             multiply-mapped reads (mapQ < 1), alignments
                             that are not the primary (SAM flag 256), and
                             duplicate alignments (flag 1024).
    --all-bases              Include all bases.  Default is to exclude bases
                             below quality 13.
    --samtools FILE          Location of samtools executable
                             [default samtools]

    --verbose                Print informational messages
    --help, -?               help message

Below is an example subset of targets generated with this command. The first line of the targets output is the name of the assembly Fasta file against which the targets were generated. Following that, the columns are scaffold name, target position, reference base, total read coverage, type of target (always indel for this command), whether the target passed quality criteria (good or bad), the frequency of coverage supporting the indel, the size of the indel (positive for insertion, negative for deletion), a string describing the indel size and sequence, the number of reads supporting the indel, and the mean mapping quality of reads.

#assembly:pacbio_assembly.fasta
unitig_1	101522	C	72	indel	good	0.9444	1	+1A	68	60
unitig_1	119319	G	127	indel	good	0.8110	1	+1A	103	60
unitig_1	120891	A	104	indel	good	0.9231	1	+1T	96	60
unitig_1	122801	A	93	indel	good	0.9032	1	+1C	84	60
unitig_1	125303	G	84	indel	good	0.9286	1	+1A	78	60
unitig_1	131734	T	67	indel	good	0.8657	1	+1A	58	60
unitig_1	136434	A	28	indel	good	0.8214	1	+1C	23	60
unitig_1	170949	A	40	indel	good	0.9500	1	+1G	38	60
unitig_1	171182	A	80	indel	good	0.9500	1	+1G	76	60
unitig_1	172016	A	67	indel	good	0.9403	1	+1C	63	60
unitig_1	190068	A	18	indel	good	1.0000	1	+1G	18	60
unitig_1	190154	T	11	indel	good	0.9091	1	+1G	10	60
unitig_1	191027	C	104	indel	good	0.9231	1	+1G	96	60
unitig_1	191173	C	96	indel	good	0.9167	1	+1T	88	60
unitig_1	230981	T	63	indel	good	0.9683	1	+1C	61	60
unitig_1	270949	G	100	indel	good	0.9800	1	+1C	98	60
unitig_1	307836	A	38	indel	good	0.8947	1	+1C	34	60
unitig_1	310152	C	105	indel	good	0.8857	1	+1A	93	60
unitig_1	326536	G	105	indel	good	0.9143	1	+1A	96	60
unitig_1	360786	G	39	indel	good	0.8205	1	+1A	32	58.5
unitig_1	369367	C	98	indel	good	0.9592	1	+1A	94	60
unitig_1	371657	T	48	indel	good	0.9375	1	+1G	45	60
unitig_1	371776	G	43	indel	good	1.0000	1	+1T	43	60
unitig_1	373159	C	113	indel	good	0.9469	1	+1A	107	60

pacbio-util indel-apply

Apply indel targets to correct an assembly. The corrected assembly is written to standard output. The order of sequences in the assembly must be the same as that in the targets.

pacbio-util indel-apply [ -f FASTA ] [ -t TARGETS ]

OPTIONS

    -f | --fasta FILE        PacBio assembly in Fasta format, to be corrected.
                             Must be the same assembly used for
                             './pacbio-util indel-targets'.
                             If the assembly is not specified, it is determined
                             from the first line of the target list.
    -t | --targets TARGETS   List of indel targets to apply, standard output
                             from './pacbio-util indel-targets -f FILE ...'.  If no
                             targets are specified, they are read from stdin.

    --include-bad-indels     Apply indels in target set that are marked 'bad'
    --max-indel-size INT     Maximum indel size to apply [default 1000]
    --min-indel-frac FLOAT   Minimim indel fraction to apply, a value of 0
                             means apply all indels in target set [default 0]
    --skip-deletion-verify   Do not verify that a deletion matches the
                             sequence at that position in the assembly
                             [default 0]
    --skip-softmasked        Do not apply targets to softmasked regions of the
                             assembly, as determined by use of lowercase
                             [default 0]

    --verbose                Print informational messages
    --help, -?               help message

pacbio-utilities's People

Contributors

douglasgscofield avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

pacbio-utilities's Issues

-f and target assembly do not match

Hi all and Dr Scofield,

As the first step seemed to run perfectly, I am encountering a new issue using indel-apply option. I used the same assembly to build the target file, but I receive this message:
./pacbio-util indel-apply: *** Warning: -f and target assembly do not match

It creates a new file (slightly bigger). To check if corrections were applied, I run the indel-target on this supposably corrected assembly but it produces the exact same targets. So, I guess no correction was applied.

Am I missing something?

Thanks in advance for any help,

Jonathan

multiple issues running pacbio-util indel-apply

I ran the following without an issue:

perl pacbio-util indel-targets -f file.fa bwa.sorted.bam > targets.txt

I then ran the following:

cat targets.txt | perl pacbio-util indel-apply -v > corrected_assembly.fa

This first failed first because targets.txt had truncated deflines.

instead of: Blarg2.0001_pilon it had Blarg2.0001

To fix this I removed _pilon from the FASTA file.

I then reran the indel-apply step and got the following error:

pacbio-util indel-apply: Opening target assembly 'blarg_pacbio_pilon_diploid_default.fasta' deletion 'N' mismatch vs assembly Blarg2.0001:6311, 'G' at pacbio-util line 427, <TARGETS> line 2.

I noticed that the third column in targets.txt is always N and that the ninth column in all of the deletion rows use N instead of actual nucleotide identities (e.g. -1N or -4NNNN). This seems fine to me since they are being deleted. so I tried commenting out line 427. Here are lines 426โ€“428:

if ($indel ne uc($del)) { die "deletion '$indel' mismatch vs assembly ".$seq->id.":$seq_start, '$del'"; }

When I reran with this line commented I get the following error:

`
pacbio-util indel-apply: Opening target assembly 'blarg_pacbio_pilon_diploid_default.fasta'

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Bad end parameter (85598). End must be less than the total length of sequence (total=85458)
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/local/share/perl5/Bio/Root/Root.pm:449
STACK: Bio::PrimarySeq::subseq /usr/local/share/perl5/Bio/PrimarySeq.pm:452
STACK: Bio::Seq::subseq /usr/local/share/perl5/Bio/Seq.pm:633
STACK: main::indel_apply pacbio-util:413
STACK: pacbio-util:43

`

Thanks for providing this code. Any help on resolving these issues would be greatly appreciated.

running pacbio-util using qsub

Hi all,

Thanks for the tools, seems great!
I am having an issue running pacbio-util on a cluster using qsub... It seems that pacbio-util is running, creating a target.txt file but containing only one line indicating the path to my assembly, but no list of the indels...
The only message I get is:
isdigit is not a valid POSIX macro at pacbio-util line 247

The command line I enter is:
pacbio-util indel-targets --fasta /Path_to/assembly.fasta /Path_to/illumina.aln_sorted.bam > targets.txt

If I type this command in the terminal, I do get the indels in the target file but with these lines in the terminal:
Calling POSIX::isdigit() is deprecated at pacbio-util line 247, line 1.
Use of uninitialized value in string eq at pacbio-util line 83, line 251675.
Use of uninitialized value in string eq at pacbio-util line 83, line 281153.
etc...

Any thoughts?

Thanks!

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.