rrwick / porechop Goto Github PK

adapter trimmer for Oxford Nanopore reads

License: GNU General Public License v3.0

Makefile 0.01% Python 1.58% C++ 94.61% C 3.80%

porechop's Introduction

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

Oct 2018 update: Porechop is officially unsupported

While I'm happy Porechop has so many users, it has always been a bit klugey and a pain to maintain. I don't have the time to give it the attention it deserves, so I'm going to now officially declare Porechop as abandonware (though the unanswered issues and pull requests reveal that it already has been for some time). I've added a known issues section to the README to outline what I think is wrong with Porechop and how a reimplementation should look. I may someday (no promises though 😛) try to rewrite it from a blank canvas to address its faults.

Requirements
Installation
- Install from source
- Build and run without installation
Quick usage examples
How it works
Known adapters
Full usage
Known issues
Acknowledgements
License

Requirements

Linux or macOS
Python 3.4 or later
C++ compiler
- If you're using GCC, version 4.9.1 or later is required (check with g++ --version).
- Recent versions of Clang and ICC should also work (C++14 support is required).

I haven't tried to make Porechop run on Windows, but it should be possible. If you have any success on this front, let me know and I'll add instructions to this README!

Installation

Install from source

Running the setup.py script will compile the C++ components of Porechop and install a porechop executable:

git clone https://github.com/rrwick/Porechop.git
cd Porechop
python3 setup.py install
porechop -h

Notes:

If the last command complains about permissions, you may need to run it with sudo.
Install just for your user: python3 setup.py install --user
- If you get a strange "can't combine user with prefix" error, read this.
Install to a specific location: python3 setup.py install --prefix=$HOME/.local
Install with pip (local copy): pip3 install path/to/Porechop
Install with pip (from GitHub): pip3 install git+https://github.com/rrwick/Porechop.git
If you'd like to specify which compiler to use, set the CXX variable: export CXX=g++-6; python3 setup.py install
Porechop includes ez_setup.py for users who don't have setuptools installed, though that script is deprecated. So if you run into any installation problems, make sure setuptools is installed on your computer: pip3 install setuptools

Build and run without installation

By simply running make in Porechop's directory, you can compile the C++ components but not install an executable. The program can then be executed by directly calling the porechop-runner.py script.

git clone https://github.com/rrwick/Porechop.git
cd Porechop
make
./porechop-runner.py -h

Quick usage examples

Basic adapter trimming:
porechop -i input_reads.fastq.gz -o output_reads.fastq.gz

Trimmed reads to stdout, if you prefer:
porechop -i input_reads.fastq.gz > output_reads.fastq

Demultiplex barcoded reads:
porechop -i input_reads.fastq.gz -b output_dir

Demultiplex barcoded reads, straight from Albacore output directory:
porechop -i albacore_dir -b output_dir

Also works with FASTA:
porechop -i input_reads.fasta -o output_reads.fasta

More verbose output:
porechop -i input_reads.fastq.gz -o output_reads.fastq.gz --verbosity 2

Got a big server?
porechop -i input_reads.fastq.gz -o output_reads.fastq.gz --threads 40

How it works

Find matching adapter sets

Porechop first aligns a subset of reads (default 10000 reads, change with --check_reads) to all known adapter sets. Adapter sets with at least one high identity match (default 90%, change with --adapter_threshold) are deemed present in the sample.

Identity in this step is measured over the full length of the adapter. E.g. in order to qualify for a 90% match, an adapter could be present at 90% identity over its full length, or it could be present at 100% identity over 90% of its length, but a 90% identity match over 90% of the adapter length would not be sufficient.

The alignment scoring scheme used in this and subsequent alignments can be modified using the --scoring_scheme option (default: match = 3, mismatch = -6, gap open = -5, gap extend = -2).

Trim adapters from read ends

The first and last bases in each read (default 150 bases, change with --end_size) are aligned to each present adapter set. When a long enough (default 4, change with --min_trim_size) and strong enough (default 75%, change with --end_threshold) match is found, the read is trimmed. A few extra bases (default 2, change with --extra_end_trim) past the adapter match are removed as well to ensure it's all removed.

Identity in this step is measured over the aligned part of the adapter, not its full length. E.g. if the last 5 bases of an adapter exactly match the first 5 bases of a read, that counts as a 100% identity match and those bases will be trimmed off. This allows Porechop to effectively trim partially present barcodes.

The default --end_threshold is low (75%) because false positives (trimming off some sequence that wasn't really an adapter) shouldn't be too much of a problem with long reads, as only a tiny fraction of the read is lost.

Split reads with internal adapters

The entirety of each read is aligned to the present adapter sets to spot cases where an adapter is in the middle of the read, indicating a chimera. When a strong enough match is found (default 85%, change with --middle_threshold), the read is split. If the resulting parts are too short (default less than 1000 bp, change with --min_split_read_size), they are discarded.

The default --middle_threshold (85%) is higher than the default --end_threshold (75%) because false positives in this step (splitting a read that is not chimeric) could be more problematic than false positives in the end trimming step. If false negatives (failing to split a chimera) are worse for you than false positives (splitting a non-chimera), you should reduce this threshold (e.g. --middle_threshold 75).

Extra bases are also removed next to the hit, and how many depends on the side of the adapter. If we find an adapter that's expected at the start of a read, it's likely that what follows is good sequence but what precedes it may not be. Therefore, a few bases are trimmed after the adapter (default 10, change with --extra_middle_trim_good_side) and more bases are trimmed before the adapter (default 100, change with --extra_middle_trim_bad_side). If the found adapter is one we'd expect at the end of the read, then the "good side" is before the adapter and the "bad side" is after the adapter.

Here is a real example of the "good" and "bad" sides of an adapter. The adapter is in the middle of this snippet (SQK-NSK007_Y_Top at about 90% identity). The bases to the left are the "bad" side and their repetitive nature is clear. The bases to the right are the "good" side and represent real biological sequence.

TGTTGTTGTTGTTATTGTTGTTATTGTTGTTGTATTGTTGTTATTGTTGTTGTTGTACATTGTTATTGTTGTATTGTTGTTATTGTTGTTGTATTATCGGTGTACTTCGTTCAGTTACGTATTACTATCGCTATTGTTTGCAGTGAGAGGTGGCGGTGAGCGTTTTCAAATGGCCCTGTACAATCATGGGATAACAACATAAGGAACGGACCATGAAGTCACTTCT

Discard reads with internal adapters

If you run Porechop with --discard_middle, the reads with internal adapters will be thrown out instead of split.

If you plan on using your reads with Nanopolish, then the --discard_middle option is required. This is because Nanopolish first runs nanopolish index to find a one-to-one association between FASTQ reads and fast5 files. If you ran Porechop without --discard_middle, then you could end up with multiple separate FASTQ reads which are from a single fast5, and this is incompatible with Nanopolish.

This option is also recommended if you are trimming reads from a demultiplexed barcoded sequencing run. This is because chimeric reads may contain two sequences from two separate barcodes, so throwing them out is the safer option to reduce cross-barcode contamination.

Barcode demultiplexing

Porechop can also demultiplex the reads into bins based on which barcode was found. This is done by using the -b option, which specifies an output directory for the trimmed reads (each barcode in a separate file), instead of -o.

Porechop looks for barcodes at the start and end of each read. All barcode matches are found, and if the best match is strong enough (default 75%, change with --barcode_threshold) and sufficiently better than the second-best barcode match (default 5% better, change with --barcode_diff), then the read is assigned to that barcode bin. E.g. with default settings, if BC01 was found at 79% identity and BC02 was found at 76% identity, the read will not be assigned to a barcode bin because the results were too close. But if you used --barcode_diff 1, then that read would be assigned to the BC01 bin.

By default, Porechop only requires a single barcode match to bin a read. If you use the --require_two_barcodes option, then it will be much more stringent and assess the start and end of the read independently. I.e. to be binned, the start of a read must have a good match for a barcode and the end of the read must also have a good match for the same barcode. This will result in far more reads failing to be assigned to a bin, but the reads which are assigned have a very high confidence. Note that for some library preps (e.g. the rapid barcoding kit), barcodes may only be at the start of reads, in which case the --require_two_barcodes option is not appropriate.

Note that the --discard_middle option is always active when demultiplexing barcoded reads. This is because a read with a middle adapter is likely chimeric and the pieces of chimeric reads may belong in separate bins.

Usage examples:

Default settings:
porechop -i input_reads.fastq.gz -b output_dir
Stringent binning (more reads in "none" bin but low risk of misclassified reads):
porechop -i input_reads.fastq.gz -b output_dir --barcode_threshold 85 --require_two_barcodes
Lenient binning (fewer reads in "none" bin but higher risk of misclassification):
porechop -i input_reads.fastq.gz -b output_dir --barcode_threshold 60 --barcode_diff 1

Barcode demultiplexing with Albacore

'What about Albacore's barcode demultiplexing?' I hear you say. 'Does this make Porechop's demultiplexing redundant?' Yes, Albacore v1.0 and later can demultiplex Nanopore reads during basecalling, which is a very nice feature. But Albacore and Porechop sometimes disagree on the appropriate bin for a read.

So if you use Albacore's output directory as input, here's what Porechop will do:

Load reads from all FASTQ files it finds, noting for each read what Albacore's bin was.
Call a barcode for each read using its normal demultiplexing logic.
Take all reads where Porechop and Albacore disagree and put them in the 'none' bin.

For example, Albacore may have put reads into the barcode02 directory. When Porechop trims and bins these reads, it may put 95% of them in the BC02 bin, but 4% go in the 'none' bin and 1% go into bins for other barcodes. By keeping only the 95% of reads where Albacore and Porechop agree, the risk of misclassification is reduced.

Output

If Porechop is run with the output file specified using -o, it will display progress info to stdout. It will try to deduce the format of the output reads using the output filename (can handle .fastq, .fastq.gz, .fasta and .fasta.gz). The --format option can be used to override this automatic detection.

Alternately, you can run Porechop with -b which specifies a directory for barcode bins. Porechop will then make separate read files in this directory for each barcode sequence (see Barcode demultiplexing for more details on the process). The files will be named using the barcode name or "none" if no barcode call was made (e.g. BC01.fastq.gz, BC02.fastq.gz, none.fastq.gz). The reads will be outputted in either fasta, fastq, fasta.gz or fastq.gz format, as determined by the input read format or the --format option.

If Porechop is run without -o or -b, then it will output the trimmed reads to stdout and print its progress info to stderr. The output format of the reads will be FASTA/FASTQ based on the input reads, or else can be specified using --format.

The --verbosity option will change the amount of progress info:

--verbosity 0 gives no progress output.
--verbosity 1 (default) gives summary info about end adapter trimming and middle adapter splitting.
--verbosity 2 shows the actual trimmed/split sequences for each read (described more below).
--verbosity 3 shows tons of data (mainly for debugging).

Verbose output

If you call Porechop with --verbosity 2, then it will display the start/end of each read show the trimming in colour. Red indicates the adapter sequence and yellow indicates additional trimmed bases:

If you are demultiplexing barcodes, then this output will also show the barcodes found at the start/end of each read, along with the final call for the read's bin:

The same colour scheme is used for middle adapters, but only reads with a positive hit are displayed:

Known adapters

The known Nanopore adapters that Porechop looks for are defined in the adapters.py file.

They are:

Ligation kit adapters
Rapid kit adapters
PCR kit adapters
Barcodes
Native barcoding
Rapid barcoding

If you want to add your own adapter sequences to Porechop, you can do so by editing the adapters.py file (instructions are in that file). And if you know of any adapter sequences that I've missed, please let me know and I'll add them!

Full usage

usage: porechop -i INPUT [-o OUTPUT] [--format {auto,fasta,fastq,fasta.gz,fastq.gz}] [-v VERBOSITY]
                [-t THREADS] [-b BARCODE_DIR] [--barcode_threshold BARCODE_THRESHOLD]
                [--barcode_diff BARCODE_DIFF] [--require_two_barcodes] [--untrimmed]
                [--discard_unassigned] [--adapter_threshold ADAPTER_THRESHOLD]
                [--check_reads CHECK_READS] [--scoring_scheme SCORING_SCHEME] [--end_size END_SIZE]
                [--min_trim_size MIN_TRIM_SIZE] [--extra_end_trim EXTRA_END_TRIM]
                [--end_threshold END_THRESHOLD] [--no_split] [--discard_middle]
                [--middle_threshold MIDDLE_THRESHOLD]
                [--extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE]
                [--extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE]
                [--min_split_read_size MIN_SPLIT_READ_SIZE] [-h] [--version]

Porechop: a tool for finding adapters in Oxford Nanopore reads, trimming them from the ends and
splitting reads with internal adapters

Main options:
  -i INPUT, --input INPUT        FASTA/FASTQ of input reads or a directory which will be
                                 recursively searched for FASTQ files (required)
  -o OUTPUT, --output OUTPUT     Filename for FASTA or FASTQ of trimmed reads (if not set, trimmed
                                 reads will be printed to stdout)
  --format {auto,fasta,fastq,fasta.gz,fastq.gz}
                                 Output format for the reads - if auto, the format will be chosen
                                 based on the output filename or the input read format (default:
                                 auto)
  -v VERBOSITY, --verbosity VERBOSITY
                                 Level of progress information: 0 = none, 1 = some, 2 = lots, 3 =
                                 full - output will go to stdout if reads are saved to a file and
                                 stderr if reads are printed to stdout (default: 1)
  -t THREADS, --threads THREADS  Number of threads to use for adapter alignment (default: 8)

Barcode binning settings:
  Control the binning of reads based on barcodes (i.e. barcode demultiplexing)

  -b BARCODE_DIR, --barcode_dir BARCODE_DIR
                                 Reads will be binned based on their barcode and saved to separate
                                 files in this directory (incompatible with --output)
  --barcode_threshold BARCODE_THRESHOLD
                                 A read must have at least this percent identity to a barcode to be
                                 binned (default: 75.0)
  --barcode_diff BARCODE_DIFF    If the difference between a read's best barcode identity and its
                                 second-best barcode identity is less than this value, it will not
                                 be put in a barcode bin (to exclude cases which are too close to
                                 call) (default: 5.0)
  --require_two_barcodes         Reads will only be put in barcode bins if they have a strong match
                                 for the barcode on both their start and end (default: a read can
                                 be binned with a match at its start or end)
  --untrimmed                    Bin reads but do not trim them (default: trim the reads)
  --discard_unassigned           Discard unassigned reads (instead of creating a "none" bin)
                                 (default: False)

Adapter search settings:
  Control how the program determines which adapter sets are present

  --adapter_threshold ADAPTER_THRESHOLD
                                 An adapter set has to have at least this percent identity to be
                                 labelled as present and trimmed off (0 to 100) (default: 90.0)
  --check_reads CHECK_READS      This many reads will be aligned to all possible adapters to
                                 determine which adapter sets are present (default: 10000)
  --scoring_scheme SCORING_SCHEME
                                 Comma-delimited string of alignment scores: match, mismatch, gap
                                 open, gap extend (default: 3,-6,-5,-2)

End adapter settings:
  Control the trimming of adapters from read ends

  --end_size END_SIZE            The number of base pairs at each end of the read which will be
                                 searched for adapter sequences (default: 150)
  --min_trim_size MIN_TRIM_SIZE  Adapter alignments smaller than this will be ignored (default: 4)
  --extra_end_trim EXTRA_END_TRIM
                                 This many additional bases will be removed next to adapters found
                                 at the ends of reads (default: 2)
  --end_threshold END_THRESHOLD  Adapters at the ends of reads must have at least this percent
                                 identity to be removed (0 to 100) (default: 75.0)

Middle adapter settings:
  Control the splitting of read from middle adapters

  --no_split                     Skip splitting reads based on middle adapters (default: split
                                 reads when an adapter is found in the middle)
  --discard_middle               Reads with middle adapters will be discarded (default: reads with
                                 middle adapters are split) (required for reads to be used with
                                 Nanopolish, this option is on by default when outputting reads
                                 into barcode bins)
  --middle_threshold MIDDLE_THRESHOLD
                                 Adapters in the middle of reads must have at least this percent
                                 identity to be found (0 to 100) (default: 85.0)
  --extra_middle_trim_good_side EXTRA_MIDDLE_TRIM_GOOD_SIDE
                                 This many additional bases will be removed next to middle adapters
                                 on their "good" side (default: 10)
  --extra_middle_trim_bad_side EXTRA_MIDDLE_TRIM_BAD_SIDE
                                 This many additional bases will be removed next to middle adapters
                                 on their "bad" side (default: 100)
  --min_split_read_size MIN_SPLIT_READ_SIZE
                                 Post-split read pieces smaller than this many base pairs will not
                                 be outputted (default: 1000)

Help:
  -h, --help                     Show this help message and exit
  --version                      Show program's version number and exit

Known issues

Adapter search

Porechop tries to automatically determine which adapters are present by looking at the reads, but this approach has a few issues:

As the number of kits/barcodes has grown, adapter-search part of the Porechop's pipeline has become increasingly slow.
Porechop only does the adapter search on a subset of reads, which means there can be problems with non-randomly ordered read sets (e.g. all barcode 1 reads at the start of a file, followed by barcode 2 reads, etc).
Many ONT adapters share common sequence with each other, making false positive adapter finds possible.

A simpler solution (and in hindsight what I should have done) would be to require the kit and/or adapters from the user. E.g. porechop --sqk-lsk109 or porechop --start_adapt ACGCTAGCATACGT.

Performance

Porechop uses SeqAn to perform its alignments in C++. This library is very flexible, but not as fast as some alternatives, such as Edlib.

Another performance issue is that Porechop uses ctypes to interface with its C++ code. Function calls with ctypes can have a bit of overhead, which means that Porechop cannot use threads very efficiently (it spends too much of its time in the Python code, which is intrinsically non-parallel).

Barcode demultiplexing

I added demultiplexing to Porechop as an afterthought – it was already looking for barcodes to trim, so why not sort reads by barcodes too? This turned out to be a very useful feature, but in hindsight I think it might be simpler (and easier to maintain) if trimming and demultiplexing functionality were in separate tools.

Base-space problems

I've encountered a couple of issues where adapter sequences are not properly basecalled, resulting in inconsistent sequence. Porechop trims in base-space, so this is a somewhat intractable problem. See issue #40 for an example.

These have made me wonder if the adapter trimming should be done in signal-space instead, though that would be a more complex problem to solve. I hope that in the future ONT can integrate this kind of functionality directly into their basecallers.

Acknowledgements

Porechop was inspired by (and largely coded during) Porecamp Australia 2017. Thanks to the organisers and attendees who helped me realise that a Nanopore adapter trimmer might be a useful tool! I later met David Stoddart from Oxford Nanopore at London Calling 2017, and he helped me get many of the adapter sequences right.

Also I'd like to thank the SeqAn developers for their great library (Porechop uses SeqAn to perform its alignments).

And of course, many thanks to Kat Holt and Louise Judd for keeping me well supplied with Nanopore reads!

License

GNU General Public License, version 3

porechop's People

Contributors

Stargazers

Watchers

Forkers

scottgigante zibraproject animalbiome intellia punkrockscience jvolkening psy-fer xxz19900 runsheng phiweger sclamons nick297 xzhuo joshquick alex84425 devindrown sn0flingan artic-network 0820ll kpalin macieksk marionilab t-neumann looselab babramso scalefreegan callumparr chrishah alexpersa7 gvaihir imwantaek w1bw dr3y markushaak gsc0107 radnovogene colinanthony oschwengers pythseq bonsai-team rluis codesteph snunzi winterlich xujing90ss alegione mattdmem glin0311 biorb kevinzjy huangziyan11111 graeme-smith talalhossain dnieuw smm001 el41na ashrosewood milesjor ms-gx n-damo jxw659 kamalfartiyal84 sajjadasaf driftbio standardgalactic boyle-lab shulp2211 flavianhm hongyhong pbreum 1383385 gsarfo-boateng bzrry pastvir oracle5th davymoon kenmurithi ruth-moraa thanhleviet jagadeeshselvam marvillar tetukas muhammedhasan sotakan sean-workman seclemens zhengyabiao hcstubbe phurbawangdi hfischl eliavw ssyamoako anukrati0306 junho05160516 biowilko weirir swart-lab 14davidmachuca kamalicmr dehui333

porechop's Issues

direct RNA adapter sequences are available

hi Ryan,

here

I would need to incorporate them in the adapter.py script, correct?

thank you very much,
Adrian

Demultiplexing fail / pass folders

A simple question.

I use the albacore output directory as input folder for porechop, but then the failed reads seemes to be included. Should i only use the "pass" folder from the albacore output as input?

user defined adapter.py file

Howdy,

I had an issue where a few people wanted to use porechop at the same time, but with different adapters.py settings.

I have forked and implemented a dirty hack for the user to define the path to the file and the module name (which is just the name of the file without .py....i got lazy ok)

Would be cool if this was part of the main package, with some defaults set up, or, if you could just give it a fasta file with barcodes, and porechop built the adapters file from it.

anyway, code is over here https://github.com/Psy-Fer/Porechop

command to define alt_adapters.py
./porechop-runner.py -i test/test_format_barcodes.fasta.gz -b test/ -AP ./alt_adapters.py -AN alt_adapters

Cheers!
(i love the pretty interface btw. So good!)

Report progress while porechop is doing its thing

When I run porechop, even in verbose mode 2, it reports nothing to the screen until the run has finished. This means that I can't tell the difference between a frozen program and a program that is working properly. Runtime output in verbose mode would be also useful because it allows me to work out fairly promptly if all the adapters are being detected, or if something needs to be changed.

Demonstration of this situation, testing with 10 sequences (breaks as soon as 10 lines have been seen):

$ time porechop -v 2 -i porechop_test_10.fastq -o out.fastq | head -n 10

Looking for known adapter sets
                                        Best               
                                        read       Best    
                                        start      read end
  Set                                   %ID        %ID     
  SQK-NSK007                                86.2       72.7
  Rapid                                     57.9        0.0
  SQK-MAP006                                83.3       95.5
  SQK-MAP006 Short                          65.4       44.8
Traceback (most recent call last):
  File "/usr/local/bin/porechop", line 11, in <module>
    load_entry_point('porechop==0.2.1', 'console_scripts', 'porechop')()
  File "/usr/local/lib/python3.5/dist-packages/porechop/porechop.py", line 43, in main
    display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
  File "/usr/local/lib/python3.5/dist-packages/porechop/porechop.py", line 285, in display_adapter_set_results
    fixed_col_widths=[35, 8, 8])
  File "/usr/local/lib/python3.5/dist-packages/porechop/misc.py", line 270, in print_table
    print(indenter + row_str, flush=True, file=print_dest)
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe

real	0m1.757s
user	0m0.928s
sys	0m0.028s

Same operation, but with 100 sequences (breaks as soon as 10 lines have been seen). Note that the run time (shown at the end) is substantially greater:

$ time porechop -v 2 -i porechop_test_100.fastq -o out.fastq | head -n 10

Looking for known adapter sets
                                        Best               
                                        read       Best    
                                        start      read end
  Set                                   %ID        %ID     
  SQK-NSK007                                96.4       80.0
  Rapid                                     60.7        0.0
  SQK-MAP006                                86.2       95.5
  SQK-MAP006 Short                          72.0       68.0
Traceback (most recent call last):
  File "/usr/local/bin/porechop", line 11, in <module>
    load_entry_point('porechop==0.2.1', 'console_scripts', 'porechop')()
  File "/usr/local/lib/python3.5/dist-packages/porechop/porechop.py", line 43, in main
    display_adapter_set_results(matching_sets, args.verbosity, args.print_dest)
  File "/usr/local/lib/python3.5/dist-packages/porechop/porechop.py", line 285, in display_adapter_set_results
    fixed_col_widths=[35, 8, 8])
  File "/usr/local/lib/python3.5/dist-packages/porechop/misc.py", line 270, in print_table
    print(indenter + row_str, flush=True, file=print_dest)
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe

real	0m13.819s
user	0m4.540s
sys	0m0.068s

Add PCR 96 Barcode sequence

Can you add the PCR96 barcodes (EXP-PBC001) to the adapters list?

https://community.nanoporetech.com/protocols/pcr-96-barcoding-amplicons-100/v/pbae96_9016_v108_revm_18oc/introduction-to-the-pcr-96

Adding functionaliy to strand reads from cDNA sequencing

I am interested in stranding (assigning if the orignial mRNA came from the + or - DNA strand) for all my reads from minion cDNA-seq. Running porechop effectively finds the end adapter on (~70%) of my reads, but the default function is to then remove the adapter sequence in the processed file.

Maybe one could hack a solution with the de-multiplex option, by supplying the adapter as the barcode, and getting it to break the reads into '+strand-barcode', '-strand-barcode', and 'strand not found'. But I was wondering if you have a comment on this or would be interested to add the option.

I think one would want the output to be a fastq, fasta, etc file where the reads are correctly oriented based on the strand in which they were transcribed from, and another file where the strand cannot be determined. One could specify the stringency by deciding which end of the read (5' or 3' or both) the adapter needs to be found. That is, the 3' polyA adapter is most often found in a read, but the 5' end adapter is more rare, I guess owing to the fact that many cDNA are not truly full length.

Thanks for your thoughts.

Trim a fixed number of bases of read ends

Thanks for a great tool,

I was wondering if there is a way to trim a fixed number of bases of each read end, even if no adaptor sequence was discovered. Such a flag/feature could be useful when trimming 1D and 1D^2 reads from SQK-LSK308 sequencing runs.

Cheers
Jon

1d^2 adapter basecalls

# But when looking at actual reads, I found two parts. The first corresponds to one end
# of the provided sequences (through slightly different):

Was talking to John Tyson about this and he pointed out that the 1d^2 adapter contains a chemical modification which is what gives it affinity to the R9.5 pore so it looks like it is throwing the basecalling off here as you observe the reads don't match the documentation. As it is quite a long adapter it probably doesn't matter but just wanted to point it out.

Include trimmed adapters in header line

It would be useful to include the name of the adapters in the header line of FASTA/FASTQ output:

@2928ef06-e15a-4d9f-9dbe-b7d18e172dbc_t runid=d3311426410a829f91b5484f951a6b26f0f43cc3 read=723 ch=42 start_time=2016-08-04T15:24:56Z [trimmed=Hairpin_2_RC;PCR_1_start]

I'm currently trying to use Porechop on the Albertsen Lab sequences, and it would be nice to know which (if any) adapters are found in sequences so that I can identify reads that are likely to be high quality.

Rapid barcodes

As far as I can tell the rapid barcodes differ slightly in their sequence from the native barcodes (although this post seems to contradict this slightly https://community.nanoporetech.com/posts/rapid-barcoding-analysi#comment_5049). Assuming they do differ (and they are available somewhere), can you add those in?

All reads ending up in none bin when demultiplexing

Hello,

We're currently testing nanopore runs from our GridION. However, the live basecalling software guppy does not seem to support demultiplexing at the moment. As such I wanted to give Porechop a go on this data to compare to Albacore basecalling and demultiplexing.
However, I'm running into the problem that Porechop will recognize the correct barcodes at good identity levels, but keeps binning all reads into the none bin, no matter how lenient I set the barcode recognition. Do you have any idea what might be causing this behaviour? This occurs on both the guppy called fastq files and the albacore fastq files.

Thanks in forward!
Joost

Example commands:
porechop -i total.fastq -b Porechop_Test/
porechop -i total.fastq -b Porechop_Test/ --barcode_threshold 60 --barcode_diff 1 --threads 20
Log:

Loading reads
total.fastq
1,255,086 reads loaded


Looking for known adapter sets
10,000 / 10,000 (100.0%)
                                        Best
                                        read       Best
                                        start      read end
  Set                                   %ID        %ID
  SQK-NSK007                                82.8       78.3
  Rapid                                     70.4        0.0
  SQK-MAP006                                80.0       78.3
  SQK-MAP006 Short                          84.0       79.2
  PCR adapters 1                            78.3       77.3
  PCR tail 1                                75.0       75.0
  PCR tail 2                                73.3       75.9
  1D^2 part 1                               71.0       71.4
  1D^2 part 2                               79.4       78.1
  Barcode 1 (reverse)                       76.0      100.0
  Barcode 2 (reverse)                       76.9      100.0
  Barcode 3 (reverse)                       76.0       79.2
  Barcode 4 (reverse)                       72.0       73.3
  Barcode 5 (reverse)                       76.9       75.0
  Barcode 6 (reverse)                       76.0       76.9
  Barcode 7 (reverse)                       76.0       75.0
  Barcode 8 (reverse)                       77.8       79.2
  Barcode 9 (reverse)                       76.0       73.1
  Barcode 10 (reverse)                      75.0       80.0
  Barcode 11 (reverse)                      79.2       75.0
  Barcode 12 (reverse)                      79.2       79.2
  Barcode 1 (forward)                      100.0      100.0
  Barcode 2 (forward)                      100.0      100.0
  Barcode 3 (forward)                       79.2       79.2
  Barcode 4 (forward)                       83.3       76.0
  Barcode 5 (forward)                       76.0       76.0
  Barcode 6 (forward)                       77.8       76.9
  Barcode 7 (forward)                       76.0       75.0
  Barcode 8 (forward)                       75.0       72.0
  Barcode 9 (forward)                       73.1       76.0
  Barcode 10 (forward)                      75.0       74.1
  Barcode 11 (forward)                      77.8       75.0
  Barcode 12 (forward)                      79.2       79.2
  Barcode 13 (forward)                      76.0       76.0
  Barcode 14 (forward)                      75.0       74.1
  Barcode 15 (forward)                      76.0       76.9
  Barcode 16 (forward)                      75.0       76.0
  Barcode 17 (forward)                      79.2       76.0
  Barcode 18 (forward)                      72.0       72.0
  Barcode 19 (forward)                      79.2       76.0
  Barcode 20 (forward)                      76.9       76.9
  Barcode 21 (forward)                      76.0       76.9
  Barcode 22 (forward)                      76.0       76.9
  Barcode 23 (forward)                      76.0       76.9
  Barcode 24 (forward)                      75.0       79.2
  Barcode 25 (forward)                      76.9       76.0
  Barcode 26 (forward)                      79.2       76.9
  Barcode 27 (forward)                      77.8       76.0
  Barcode 28 (forward)                      75.0       74.1
  Barcode 29 (forward)                      76.9       76.0
  Barcode 30 (forward)                      76.9       76.9
  Barcode 31 (forward)                      80.0       79.2
  Barcode 32 (forward)                      75.0       80.0
  Barcode 33 (forward)                      76.9       80.0
  Barcode 34 (forward)                      75.0       74.1
  Barcode 35 (forward)                      73.1       75.0
  Barcode 36 (forward)                      76.9       77.8
  Barcode 37 (forward)                      79.2       76.9
  Barcode 38 (forward)                      75.0       79.2
  Barcode 39 (forward)                      76.0       76.9
  Barcode 40 (forward)                      76.9       73.1
  Barcode 41 (forward)                      75.0       80.0
  Barcode 42 (forward)                      76.9       76.9
  Barcode 43 (forward)                      76.9       76.0
  Barcode 44 (forward)                      74.1       75.0
  Barcode 45 (forward)                      76.0       76.0
  Barcode 46 (forward)                      76.0       76.9
  Barcode 47 (forward)                      76.9       76.9
  Barcode 48 (forward)                      76.0       76.9
  Barcode 49 (forward)                      76.0       76.9
  Barcode 50 (forward)                      75.0       76.0
  Barcode 51 (forward)                      80.0       76.0
  Barcode 52 (forward)                      76.9       79.2
  Barcode 53 (forward)                      76.0       75.0
  Barcode 54 (forward)                      76.9       76.9
  Barcode 55 (forward)                      73.1       76.0
  Barcode 56 (forward)                      76.0       76.0
  Barcode 57 (forward)                      76.9       77.8
  Barcode 58 (forward)                      80.8       79.2
  Barcode 59 (forward)                      76.0       76.0
  Barcode 60 (forward)                      76.0       76.9
  Barcode 61 (forward)                      73.1       79.2
  Barcode 62 (forward)                      75.0       75.0
  Barcode 63 (forward)                      79.2       76.0
  Barcode 64 (forward)                      76.9       79.2
  Barcode 65 (forward)                      76.9       80.8
  Barcode 66 (forward)                      79.2       76.9
  Barcode 67 (forward)                      79.2       76.9
  Barcode 68 (forward)                      79.2       76.0
  Barcode 69 (forward)                      74.1       75.0
  Barcode 70 (forward)                      80.0       80.0
  Barcode 71 (forward)                      76.0       76.0
  Barcode 72 (forward)                      79.2       80.0
  Barcode 73 (forward)                      76.0       76.9
  Barcode 74 (forward)                      75.0       76.9
  Barcode 75 (forward)                      77.8       73.1
  Barcode 76 (forward)                      76.0       76.9
  Barcode 77 (forward)                      76.0       76.0
  Barcode 78 (forward)                      76.0       75.0
  Barcode 79 (forward)                      76.0       76.9
  Barcode 80 (forward)                      80.0       76.0
  Barcode 81 (forward)                      76.9       76.9
  Barcode 82 (forward)                      79.2       76.0
  Barcode 83 (forward)                      80.0       79.2
  Barcode 84 (forward)                      76.9       76.9
  Barcode 85 (forward)                      76.0       76.0
  Barcode 86 (forward)                      72.0       75.0
  Barcode 87 (forward)                      76.9       76.9
  Barcode 88 (forward)                      77.8       79.2
  Barcode 89 (forward)                      75.9       75.0
  Barcode 90 (forward)                      74.1       73.1
  Barcode 91 (forward)                      76.0       76.9
  Barcode 92 (forward)                      75.0       76.9
  Barcode 93 (forward)                      79.2       83.3
  Barcode 94 (forward)                      74.1       76.0
  Barcode 95 (forward)                      76.0       76.9
  Barcode 96 (forward)                      75.0       79.2


Trimming adapters from read ends
  BC01_rev: CACAAAGACACCGACAACTTTCTT
      BC01: AAGAAAGTTGTCGGTGTCTTTGTG
  BC02_rev: ACAGACGACTACAAACGGAATCGA
      BC02: TCGATTCCGTTTGTAGTCGTCTGT
      BC01: AAGAAAGTTGTCGGTGTCTTTGTG
  BC01_rev: CACAAAGACACCGACAACTTTCTT
      BC02: TCGATTCCGTTTGTAGTCGTCTGT
  BC02_rev: ACAGACGACTACAAACGGAATCGA

1,255,086 / 1,255,086 (100.0%)

1,084,784 / 1,255,086 reads had adapters trimmed from their start (89,476,779 bp removed)
  954,999 / 1,255,086 reads had adapters trimmed from their end (32,552,203 bp removed)


Discarding reads containing middle adapters
1,255,086 / 1,255,086 (100.0%)

15,082 / 1,255,086 reads were discarded based on middle adapters


Saving untrimmed reads to barcode-specific files

  Barcode  Reads      Bases          File
  none     1,240,004  1,878,862,050  Porechop_Test/none.fastq

Dealing with multiple adapters in read

I'm wondering what porechop's behavior would be if there are multiple adapters present in the middle of a read? I saw that it'll split it in half, does it check the split reads for additional adapters in the middle?

Edit: Forgot to ask, can you run it recursively in such a case?

error: [Errno 2] No such file or directory: 'porechop/cpp_functions.so'

Hi,

I tried to compile several versions of Porechop on my Ubuntu server and always get the same error:

error: [Errno 2] No such file or directory: 'porechop/cpp_functions.so'

Help would be much appreciated!

Cheers,
Stefan

Auto format to select compression choice for output

Hello Ryan,

First of all, thank you very much for Porechop, it is fabulous! I am using it with custom barcoding in adapters.py and my first testings are promising.

This is more a feature request, but I would expect to have an uncompressed output if the input is uncompressed, specially with the auto choice for the format flag. Another option is to extend the choices to something like {auto,fasta,fastq,fastagz,fastqgz} and avoid forced gzipping the output. Whatever... this is just a suggestion and a good excuse to thank you for Porechop.

Install problem?

Hi, right after installing I try to run (the "porechop -h" works) on sample data i get the following error:

porechop -i ./unclass.fastq -o ./b.fastq
Traceback (most recent call last):
File "/usr/local/bin/porechop", line 11, in
load_entry_point('porechop==0.2.1', 'console_scripts', 'porechop')()
File "/usr/local/lib/python3.4/dist-packages/porechop/porechop.py", line 33, in main
reads, read_type = load_reads(args.input, args.verbosity, args.print_dest)
File "/usr/local/lib/python3.4/dist-packages/porechop/porechop.py", line 196, in load_reads
reads, read_type = load_fasta_or_fastq(input_filename)
File "/usr/local/lib/python3.4/dist-packages/porechop/misc.py", line 125, in load_fasta_or_fastq
return load_fastq(filename), 'FASTQ'
File "/usr/local/lib/python3.4/dist-packages/porechop/misc.py", line 168, in load_fastq
short_name = full_name.split()[0]
IndexError: list index out of range

Any ideas on how to fix it?

(Installed via python3 setup.py ... method; source from github)

Unittests passed, also when porechop -i ... -o ... is run on the test set it runs without problem.

The sample that issues the error above was basecalled with albacore with basecalling already on (but with a high number of unclassified reads).

Thanks in advance!

definition of "end" vs "middle" adapters

Hi Ryan,

I'm processing a MinION dataset produced from an SQK-LSK108 kit and attempting to use Porechop to demux and trim adapters. It seems to be doing a reasonable job of identifying the adapters/barcodes based on a visual inspection of verbose output (nice highlighting), although I think I will need to tweak adapters.py slightly to match the adapter sequences listed in the albacore source. However, ~ 85% of reads are being discarded "based on middle adapters". These are predominantly "short" reads as far as nanopore reads go because they are based on an RT-PCR amplicon, with median length ~ 800bp.

The verbose output shows that for most reads, adapters are identified a short distance from, but not exactly at, the ends. However, these seem to be considered middle adapters. Is there a threshold defined somewhere that controls what is considered an "end" vs "middle" adapter? If so, is this threshold/distance absolute or relative to read length?

Skipping barcode search altogether

Hi Ryan,
I have a bunch of reads after sequencing my library on MinION, prepared using only two adapters that are included in the SQK-LSK108 ligation sequencing kit from Oxford Nanopore. The simplest case really. I replaced existing non-barcode adapters with the two Y-adapters according to the note by ONP. But when I tried commenting out the barcodes in the ADAPTERS list (adapters.py file), the program threw an error saying that the barcodes string cannot be empty (source code string cannot contain null bytes). Through trial and error I found out that I can only exclude the native barcodes set - but keep at least one of them intact - for the program to proceed without error. Not knowing anything about python syntax and being short of time I couldn't find a nicer fix. The problem got worse because I had to lower the adapter threshold and suddenly the program started finding barcode adapters in my reads (or so it seemed according to the report in my terminal).
Would you consider making barcodes search optional? And why include all the available adapters in the set? If I know that most of them are simply not there in the first place, some of them probably no longer manufactured, why waste CPU time or worse trim parts of the reads that align to adapters by pure chance?

Barcode demultiplexing after albacore

Dear all,
I have encountered something that I can't seem to explain in the results of running Porechop on Albacore demultiplexed data.
As I understand, the final results for all barcodes should only contain the reads for which both Porechop and Albacore assigned the same barcode, otherwise the read should be in the 'none.fastq' file.

When looking at the results I receive, I see some cases where the final assignment of porechop is different than the assignment of albacore and the reads are assigned to a certain barcode and not filtered to the 'none' file.

Albacore adds a tag to the read ID in the fastq file with the barcode ID that was chosen.
When I look at 'BC01.fastq' file, I expect all headers to contain the value barcode=barcode01 (since disagreements should be filtered out), but instead I see the following histogram of barcodes:

It is clear that in most cases they indeed agree, but ~7,500 reads were assigned differently.

In order to explore this further, I have ran Porechop using verbosity = 3 and looked at a specific read which was assigned to BC02 by Albacore but to BC01 after running Porechop on it's output and this is the relevant results I see:

ab8c241b-5a89-420a-b723-98b7790d8e44 runid=77c14fc9c92785d9274f3d85723a4061540e1a40 read=1970 ch=34 start_time=2017-12-11T12:44:48Z barcode=barcode02
start: TCATGCTTCGTTCAGTTACGTATTGCTTCAGTTCGTTTACATGAAAGTCGTCTGTGTTTTCGCATTTATCGACAAACGCTTTCGCGTTTCGTGCGCCGCTTCACGCAGGCCTTCATCTTTCTGGCTGATGTACCACATTGCCTGCCGCGA...
start alignments:
SQK-NSK007, full score=89.285714, partial score=89.285714, read position: 2-27
Rapid, full score=92.0, partial score=92.0, read position: 55-103
1D^2 part 2, full score=81.818182, partial score=81.818182, read position: 5-33
Rapid barcoding 1 (full sequence), full score=76.315789, partial score=76.315789, read position: 2-103
Rapid barcoding 2 (full sequence), full score=85.185185, partial score=85.185185, read position: 2-103
Rapid barcoding 3 (full sequence), full score=75.221239, partial score=75.221239, read position: 2-103
Rapid barcoding 5 (full sequence), full score=78.378378, partial score=78.378378, read position: 2-103
Rapid barcoding 6 (full sequence), full score=79.279279, partial score=79.279279, read position: 2-103
Rapid barcoding 7 (full sequence), full score=79.464286, partial score=79.464286, read position: 2-103
Rapid barcoding 8 (full sequence), full score=77.876106, partial score=77.876106, read position: 2-103
Rapid barcoding 9 (full sequence), full score=78.378378, partial score=78.378378, read position: 2-103
Rapid barcoding 10 (full sequence), full score=78.378378, partial score=78.378378, read position: 2-103
Rapid barcoding 11 (full sequence), full score=75.438596, partial score=75.438596, read position: 2-103
Rapid barcoding 12 (full sequence), full score=79.090909, partial score=79.090909, read position: 2-103
end: ...CTGGACATTAGTACTACTCATTTTACATTAATGCTGGAATGCGTCTGGAGAAACGGGAATATGACGTTACCTCTCCCCAAAGAGAATGGGGAGAAGATTTTCTGCATGTGCATTTTATGCACGGCGATAAAGAATGCTGCTGGGTGGAGT
Barcodes:
start barcodes: BC01 (75.0%), BC02 (70.0%), BC03 (69.0%), BC04 (65.5%), BC05 (58.6%), BC06 (60.0%), BC07 (55.2%), BC08 (45.8%), BC09 (25.0%), BC10 (8.3%), BC11 (59.3%), BC12 (12.5%)
end barcodes: BC01 (8.3%), BC02 (64.3%), BC03 (46.4%), BC04 (55.6%), BC05 (34.6%), BC06 (66.7%), BC07 (4.2%), BC08 (16.7%), BC09 (35.7%), BC10 (12.0%), BC11 (4.2%), BC12 (12.5%)
best start barcode: BC01 (75.0%)
best end barcode: BC06 (66.7%)
final barcode call: BC01

I will add that I see around the same percentage of reads with different assignments in all the barcodes (not only BC01).

Am I missing something? What is the possible explanation for this?

Thank you in advance,
Olga Karinsku

barcode identification requires randomised fastq

When running porechop, I came across unexpected output when identifying the barcodes on 10k sample reads.

It seems it takes the first 10k reads, however I concatenated the outputs of the albacore reads after demultiplexing, so they were ordered from barcode01->barcode12->unclassified.
So the first 10k reads were all barcode01.

I wrote a quick script to shuffle a fastq file (python 2.7ish) shuffle_fastq.py
see here: https://github.com/Psy-Fer/bioinf_tools

When I ran porechop on this new shuffled file, it detected all the correct barcodes (better than albacore i might add) and seems to be running smoothly.

A feature request would be to modify the barcode detection function to randomly sample the ingested fastq. Otherwise note in the docs would do :)

cheers.

Effectiveness of --min_trim_size

Hello,

We are trying to get a hand on the demultiplexing part. According to the manual the --min_trim_size is the minimal size of the alignment against an adapter.
With the verbose 3 option I can see that the highlighted part in red changes if I modify this parameter. However the outputted sequences in bins do not change.

For example with

"-check_reads 100 --no_split **--min_trim_size 50** --end_size 75"
.....
  Barcode  Reads  Bases   File                       
  BC01         4   4,662  /project/method4/BC01.fasta
  BC02         4   3,971  /project/method4/BC02.fasta
  BC03         2   5,182  /project/method4/BC03.fasta
  BC04         5   5,951  /project/method4/BC04.fasta
  none        35  40,121  /project/method4/none.fasta

and with

"--check_reads 100 --no_split -**-min_trim_size 3** --end_size 75"

  Barcode  Reads  Bases   File                       
  BC01         4   4,546  /project/method4/BC01.fasta
  BC02         4   3,828  /project/method4/BC02.fasta
  BC03         2   5,046  /project/method4/BC03.fasta
  BC04         5   5,684  /project/method4/BC04.fasta
  none        35  39,534  /project/method4/none.fasta

With the verbose 3 option for example read # 43 get the following asignment:
first Image --min_trim_size 3 --end_size 100 : (second image:-min_trim_size 50 --end_size 100)

As you can see, the highlighting in red (the barcode matching) changes but the final assignment does not.
Is this expected behavior? If so, How do I set for example a minimal alignment length of 30bp on either ends of the sequence as a requirement for barcode matching and binning?

Make pip installable

Deploying this and its deps would be much easier if i could just pip3 install porechop :)

could not find cpp_functions.so - please reinstall

Hi all/ rrwick,
I got the error above triying to install porechop following the commands below:
$git clone https://github.com/rrwick/Porechop.git
$cd Porechop/
$python setup.py install

Indeed I tried without installation and I got the same error:
$./porechop-runner.py

I am running Python 3.4.5
Could you help me with that issue?
Thanks a lot

error running in porechop

Hello,

I am trying to use porechop for adaptor trimming on nanopore reads. I constantly get the following error:

Error: input_reads.fastq could not be parsed - is it formatted correctly?

I double checked the file. It is fine.

I am using the following command:

porechop -i input_reads.fastq -o output_reads.fastq
Am I missing anything?

TIA

Adapter info

Hello,
If possible, could I ask some questions (may be a little naive) about the adpater info for nanopore sequencing?

Whether the adapter sequences are the same for different version of sequencing kits? If so, we probabally to remove adapters based on the kit version.
Is there any good way to visualize the enrichment of adapter info before and after porechop processing? I tested on fastqc and it seemed that it can not detect the adapter info before trimming (maybe not in the fastqc library).

Thanks!

missing file?

when i try to install im getting this error:
error: [Errno 2] No such file or directory: 'porechop/cpp_functions.so'

file does not appear to exist on github either?

( i had a colleague send me an old version of the repo which installed without this issue )

stream reads instead of loading whole fastq into memory

Would it be possible/ easy to implement to stream the first say 10k reads, identify the barcodes and then stream all reads for chopping? With good sequencing runs, loading all reads into memory becomes difficult for a standard laptop.

thank you for providing this very helpful software,
Adrian

Citing Porechop

How would like to have pore chop referenced in a manuscript?

Low barcode assignment %

Hello,

I'm trying to run a fasta file produced from the SQK-RLB001 kit and am getting nearly half of my sequences in the 'none' file, even after lowering the barcode threshold to 20. Is this something other users have experienced? I know MinION data isn't exactly error-free. Have the barcodes in the known adapters been updated to the new kit releases?

My apologies if this is a silly question. Thanks for making a great open source program, Ryan!

Catherine

Split output based on native barcodes?

Hi Ryan,
This looks like a great tool. Would it be possible to modify this so that the native barcodes can be split into individual output files? e.g. NB001.trim.fasta, NB002.trim.fasta, unassigned.fa etc. I'm looking for a faster way than metrichor to demulitplex the reads. I could write something but seems like you have a very fast implementation here and maybe wouldn't be that hard to provide an option to split outputs?
Thanks,
Jon

After back-transcription, can porechop be used for direct RNA seq kit data?

Or might it even work on the raw data?

Thank you,
Adrian

FASTQ output qual encodings are all "+"

When working with FASTQ, seems like the quality scores are not being retained, i.e.
Here is original read:

@channel_37_d6b322e5-b47b-4400-b152-ab15b84b12ce_template /Library/MinKNOW/data/reads/fail/0/cfmrs_mac_pro_fpl_fs_fed_us_20170307_FNFAF13190_MN18073_mux_scan_SWJ_Anid_54932_ch37_read188_strand.fast5
TCTGGTGTGCTTCGTTCAGTTACGTATTGTAAGGTTAACACAAGACGCCGTGGCTTTCTTCAGCACCTGGTCTCTCCGACCAGGTCTTTAATGTGGTGTTCATTAATTCTCCGATCGTCTTGGTCCATGGTGCATGCATCGCGTGCTGGGATCTTTGCCCTTCTCGCAAACGCAAATCCGCACTGATCTCGACAAGCATCATCAACAGCCAGCGGACTATAAGCCACAGCGGCATGCAGCGCAGTCTCCCATCTCCGTTGGCCGTAATGTTGAAACAACTCGGCTGACCCTCGGATGATTCGCTAACCAGCGGTCCCCTTCATGCCAAGATGCACGGCTATCCTCTTCTCGTCCTGGGGTCAGCATCTGCCTCCTTCGTCGCCGCTACCGGCGGACATCTGCATCCGTTCTCTGCGGTTGCGAGGACCGCGGGTGCGCGTTGGTGGCTCGCGAAGGGCGCGCTGATCATCCTGGCCGCGGTGGGCCATGCGGGATGGAGATCTGCGCGCGCACGCGCGTACGCGCTA
+
##$'('$.(.101+3/0)'++&(&*$%(#%'&,-1+-**#$$*)*1%&'&*+)'*.-/3.-%'')-++,$&,.,(/.*'%'$%'&#*+).02/1.)(&)(%&$)$$'),,,1/112-*)%$%*()+&$#)*&&$''*)*3-,0/,./01,&$'&*,*,2+*,--0('/1..#$$$()+-*,(+,1/2+(-&+2++)+**))(.(,1%+0((&.&)+**)+,'(.2)/*'+'),+1-*))*$(&%(*(*3)-',*2/)*.%''((&%$+*.2.,+-))&$&(#$##%&25++-+1/,*(-1*))*+0,/(*+,*)..-)'+*.)$*)(&*()%+))*,,(%(00--0/,,/.)+%$%/1-,-.'0+*)#%./*0./31/4,(+./11)..,),(.*)(*(&&&$*31/-+,-.*-/*/1/0(&#'(&%&&&'&&*++,(('&)))&&'&'&%($&'**1+,)%%+)(*))))-0.1.((*&((+'')#$%)',)'+(.'**++'+&')((&$#%$&%%###&$$#%"#

And here is "trimmed" output read.

@channel_37_d6b322e5-b47b-4400-b152-ab15b84b12ce_template /Library/MinKNOW/data/reads/fail/0/cfmrs_mac_pro_fpl_fs_fed_us_20170307_FNFAF13190_MN18073_mux_scan_SWJ_Anid_54932_ch37_read188_strand.fast5
GTCTCTCCGACCAGGTCTTTAATGTGGTGTTCATTAATTCTCCGATCGTCTTGGTCCATGGTGCATGCATCGCGTGCTGGGATCTTTGCCCTTCTCGCAAACGCAAATCCGCACTGATCTCGACAAGCATCATCAACAGCCAGCGGACTATAAGCCACAGCGGCATGCAGCGCAGTCTCCCATCTCCGTTGGCCGTAATGTTGAAACAACTCGGCTGACCCTCGGATGATTCGCTAACCAGCGGTCCCCTTCATGCCAAGATGCACGGCTATCCTCTTCTCGTCCTGGGGTCAGCATCTGCCTCCTTCGTCGCCGCTACCGGCGGACATCTGCATCCGTTCTCTGCGGTTGCGAGGACCGCGGGTGCGCGTTGGTGGCTCGCGAAGGGCGCGCTGATCATCCTGGCCGCGGTGGGCCATGCGGGATGGAGATCTGCGCGCGCACGCGCGTACGCGCTA
+
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DRS read adapters...

Will porechop identify and trim adapters in the Nanopore reads from the new DRS protocol?

Add a custom adapter sequence option

In this issue Miles described changing the source code to include a new adapter. I should make an option where users can pass in new adapter sequences via FASTA so they can do this more easily.

Custom barcodes not binned correctly

Hi,

I am trying to demultiplex sequence reads that where pooled with custom barcodes. I followed your instructions and added my barcode sequences in the adapters.py file and removed the previous barcode sequences. Below is an example of how I added the barcodes for two samples (Barcode 1 & Barcode 2):

Adapter('Barcode 1 (forward)', start_sequence=('BC1', 'CAAACTGCAGGAGGGTAGCGCTAC'), end_sequence=('BC1_rev', 'GCAGGTGAAAGGCATTAGCGCAGG')), Adapter('Barcode 2 (forward)', start_sequence=('BC2', 'CAAACTGCAGGAGGGTAGCGCTAC'), end_sequence=('BC2_rev', 'AAGGACATAATGCATGAGCGGATC'))]

In the example above, both samples share the same forward barcode but have a different reverse barcode. Each of my samples was barcoded with a forward and reverse barcodes. The reverse barcodes are NOT the reverse complement of the forward but unique from the forward barcodes.

Now, when I run porechop with this adapter file, the barcodes are found in the sequences but reads are binned to the start sequence names instead of the actual Barcode names (The example below shows most reads in none because I am only using 2 instead of 100 barcodes for simplicity):

Looking for known adapter sets
4,000 / 4,000 (100.0%)
Best
read Best
start read end
Set %ID %ID
SQK-NSK007 100.0 81.8
Rapid 68.4 0.0
SQK-MAP006 80.0 83.3
SQK-MAP006 Short 80.8 77.8
PCR adapters 1 80.0 79.2
PCR tail 1 76.7 76.7
PCR tail 2 72.4 73.3
1D^2 part 1 73.3 76.7
1D^2 part 2 88.9 80.6
Barcode 1 (forward) 100.0 79.2
Barcode 2 (forward) 100.0 92.3

Barcodes determined to be in forward orientation

Trimming adapters from read ends
SQK-NSK007_Y_Top: AATGTACTTCGTTCAGTTACGTATTGCT
SQK-NSK007_Y_Bottom: GCAATACGTAACTGAACGAAGT
BC1: CAAACTGCAGGAGGGTAGCGCTAC
BC1_rev: GCAGGTGAAAGGCATTAGCGCAGG
BC2: CAAACTGCAGGAGGGTAGCGCTAC
BC2_rev: AAGGACATAATGCATGAGCGGATC

4,000 / 4,000 (100.0%)

3,183 / 4,000 reads had adapters trimmed from their start (102,974 bp removed)
1,489 / 4,000 reads had adapters trimmed from their end (22,429 bp removed)

Discarding reads containing middle adapters
4,000 / 4,000 (100.0%)

15 / 4,000 reads were discarded based on middle adapters

Saving trimmed reads to barcode-specific files

Barcode Reads Bases File
BC1 7 6,112 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/BC1.fastq
BC2 6 3,774 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/BC2.fastq
none 3,972 2,786,813 /Volumes/TOSHIBA EXT/3_Demultiplexed_trimmed/none.fastq

I assume that the issue is that barcode sets from ONT use unique barcodes for each sample (as in start_sequence would be barcode sequence and end_sequence would be the same sequence in reverse complement). Hence, when using an ONT barcode set, no sample would share any barcode sequence. And therefore reads are binned to the start_sequence "name" (BCx). This logic does not work in my case since some of my samples share the same start_sequence and porechop bins reads sharing a start_sequence to the same bin eventhough they have a different end_sequence.

For troubleshooting I tried to run the verbosity 2 setting and I can see that all start and end sequences are found in the reads. I ran with --require_two_barcodes and all reads get binned to "none".
I used reverse, complement, and reverse-complement of the end_sequence barcode, with both default and --require_two_barcodes setting.

Nothing worked so far.

Could you please confirm on my suspicion of a logic problem and share your thoughts on how I could resolve this issue and get my samples demultiplexed?

I would be forever grateful!

All the best and many thanks in advance,
Scott

adapters file in readme has bad link

You link to the adapters.py file in README.md, but the link is broken.

https://github.com/rrwick/Porechop/blob/blob/master/porechop/adapters.py

Lambda SQK-NSK007 and crossbarcoding detection

Hi,
I´m running porechop and used multiplex outpu on one sample that was built with barcode01 only.

Here below the results.

From the image below it looks that porechop is checking for lambda DNA. The green colors means it was found ??
How can i check if lambda DNA was found ??

I also see that other barcodes were match although the sample was prepared with only barcode01. Are those mismatches or contaminations ?

Thanks !!!!

Running pore adaptor on 1 D^2 data

Hello,
1.
I am running porechop on 1 D^2 data using the following command.

porechop -i input_reads.fastq.gz -o output_reads.fastq.gz

For 1D ~98% reads were trimmed for adaptors.
however, for 1D^2 only ~ 65% reads were trimmed for adaptors.

I am unable to understand why is there this difference?

Is there any way to see which adaptor porechop is trimming?

optional compression?

It seems that all demultiplexed output is gzip-compressed by default. Is there a way to turn off compression on these output files?

Installation in Singularity xenial image

Hi Ryan,

This isn't an issue really, more just noting this somewhere incase someone else comes across it.
I was having some difficulty with getting porechop to work in a singularity container and thought I would share a couple of simple things I had to do to get it working

PORECHOP_VERSION=0.2.3
apt-get install -y python3-setuptools python3-pkg-resources
wget https://github.com/rrwick/Porechop/archive/v${PORECHOP_VERSION}.tar.gz
tar xzf v${PORECHOP_VERSION}.tar.gz
rm v${PORECHOP_VERSION}.tar.gz
cd Porechop-${PORECHOP_VERSION}
python3 setup.py install

This is the code I had to run in my Singularity recipe to get porechop working, otherwise it was throwing an error complaining about ImportError: No module named 'pkg_resources'. I know you noted something about this in the README.
The key was the apt-get install -y python3-setuptools python3-pkg-resources. I had previously been trying to pip3 install setuptools but wasn't having any luck.

Hope this helps someone.

C++14 requirement

Can you tell me what parts of the code require C++14 support? I'm trying to get this on bioconda to make Galaxy packaging easier, but the GCC currently packaged by bioconda doesn't support C++14. I'm wondering whether a patch to get around the C++14 requirement would be a short-term solution until bioconda supports GCC 4.9.

Possible to restrict to a barcode set?

Hi Ryan

I've noticed a few occasions where Porechop has picked both a native barcoding set (i.e. labeled as 'reverse') and also PCR barcoding set (labeled 'forward') both of the same barcode number during adaptor screening.

This then seems to result in bad things happening such as all or many reads getting thrown into the 'None' bin.

Is there a way of disabling certain barcode sets if I know in advance which ones I am using. I.e. I would like it just to look for native barcodes and not PCR barcodes.

Thanks in advance!
Nick

Use pigz for parallel compression

In some cases the gzip compression time clearly dominates the whole running time. The reason is that plain gzip is used. I would suggest checking whether pigz is present, and if yes - then forward ---threads option to pigz to allow much faster parallel compression.

Chopping out vector backbones

Hi @rrwick,

I was directed to Porechop by @skoren when asking him a question about the removal of vector sequences during long-read assembly. He suggested I look into Porechop. My problem is that I would like to assemble genomes where the genomes have been cloned/sequenced as pools of cosmid/fosmid/BACs and I want to remove the cloning vector sequence form the reads before assembly. ( I'm using PacBio). I will try Porechop but I am wondering if you have used Porechop in this way or if there any potential hiccups you would anticipate with this use case.

Thank you,
zahc cp

editing adapters.py

Hello
I would like to use porechop to remove BAC vector sequence from my reads, or at least split the reads at the vector sequence which in most cases is somewhere in the middle of the read. I wondered if I edited the adapters.py file if I could achieve this. Porechop ran fine un-edited and found and trimmed many instances of the NSK007 Y adapter from the start of my reads. I replaced the NB12 adapter sequence with my own in adapters.py and ran porechop on the same fastq reads (many of which contain at least an 85% match to the 38 bp I specified. The output was the same. In fact it doesn't seem to matter what changes I make to adapters.py, the outcome is the same. For example, editing an adapter name is not reflected in the verbose output.
I have never written a python script so I imagine I'm missing something basic.

thanks for reading
Miles

Porechop on windows

Hello,

I could install porechopp on windows using bioconda

Repository not found?

Hi, accessing via linux on terminal Mac OX Sierra and encounter this upon installation?
fatal repository 'git clone https://github.com/rrwick/Porechop.git' not found?

feature request: specify expected barcodes

For use in automated workflows, it would be nice to be able to specify expected barcodes on the command line to be included in output. For instance, if a user knows they only have libraries tagged with 'BC01' and 'BC02', they don't expect and don't care about reads (mis)classified into the 'BC06' bin. I'm imagining something like:

porechop --include_barcodes BC01,BC02 [...]

where any bins not on the inclusion list would be reported but not actually included in output. Of course, this can be dealt with using a downstream filter, but it seems this is logical enough to possibly include in Porechop itself.

Overcoming limitations on the number of reads that can be processed

Hi
There seems to be a limitation in the number of reads that can be processed by porechop at a time, is there a way to remove this limit.

Thanks

Add hairpin adapters to detected sequences

>Hairpin_1
CGTTCTGTTTATGTTTCTTGGACACTGATTGACACGGTTTAGTAGAAC
>Hairpin_2
CAAGAAACATAAACAGAACGT
>adapt_Y_top_1
GGCGTCTGCTTGGGTGTTTAACCTTTTTTTTTT
>adapt_Y_top_2
AATGTACTTCGTTCAGTTACGTATTGCT
>adapt_Y_bottom
GCAATACGTAACTGAACGAAGT

The nice thing about adding hairpin adapters is that you could then do hairpin detection and [possibly automatic] consensus calling all in base space

Required two barcodes set as default?

Hi there,

After running porechop with default settings and --verbosity 3 I noticed that it seems a lot of my reads aren't being classified because there isn't a threshold match at both ends. As I understand it, in the default settings this shouldn't be happening. Have I understood the info below correctly? If I have, is there a way to stop this?

Thanks,

Tim.

...ACGGCTCATCGGGACTAAAAGATCACCTGCATCCACGAGTGACATGACGTCGTCTATACGGGCTCAGCAACGCCCATCCGTTTGGAGCCTCATCAGGTAATTCTTTACTCCCTCAGGATGCAAGCCTATATATAAGGTTTGCGGTGCGGG
Barcodes:
start barcodes: BC02 (78.6%), BC01 (48.0%)
end barcodes: BC02 (64.0%), BC01 (59.3%)
best start barcode: BC02 (78.6%)
best end barcode: BC02 (64.0%)
barcode call: none

rrwick / porechop Goto Github PK

porechop's Introduction

Oct 2018 update: Porechop is officially unsupported

Table of contents

Requirements

Installation

Install from source

Build and run without installation

Quick usage examples

How it works

Find matching adapter sets

Trim adapters from read ends

Split reads with internal adapters

Discard reads with internal adapters

Barcode demultiplexing

Barcode demultiplexing with Albacore

Output

Verbose output

Known adapters

Full usage

Known issues

Adapter search

Performance

Barcode demultiplexing

Base-space problems

Acknowledgements

License

porechop's People

Contributors

Stargazers

Watchers

Forkers

porechop's Issues

Recommend Projects

Recommend Topics

Recommend Org