
pycoQC computes metrics and generates interactive QC plots from the sequencing summary report generated by Oxford Nanopore Technologies basecallers (Albacore/Guppy)

Home Page: https://a-slide.github.io/pycoQC/

License: GNU General Public License v3.0


pycoqc's Introduction

pycoQC v2.5.2


PycoQC computes metrics and generates interactive QC plots for Oxford Nanopore Technologies sequencing data

PycoQC relies on the sequencing_summary.txt file generated by Albacore and Guppy, but if needed it can also generate a summary file from basecalled fast5 files. The package supports 1D and 1D2 runs generated with MinION, GridION and PromethION devices, basecalled with Albacore 1.2.1+ or Guppy 2.1.3+. PycoQC is written in pure Python 3; Python 2 is not supported. For a quick introduction, see the tutorial by Tim Kahlke available at https://timkahlke.github.io/LongRead_tutorials/QC_P.html

Full documentation is available at https://a-slide.github.io/pycoQC

Gallery

[Images: summary, reads_len_1D_example, reads_qual_len_2D_example, channels_activity, output_over_time, qual_over_time, len_over_time, align_len, align_score, align_score_len_2D, alignment_coverage, alignment_rate, alignment_summary]

Example HTML reports

Example JSON reports

Disclaimer

Please be aware that pycoQC is a research package that is still under development.

It was tested under Linux Ubuntu 16.04 and in an HPC environment running under Red Hat Enterprise 7.1.

Thank you

Classifiers

  • Development Status :: 3 - Alpha
  • Intended Audience :: Science/Research
  • Topic :: Scientific/Engineering :: Bio-Informatics
  • License :: OSI Approved :: GNU General Public License v3 (GPLv3)
  • Programming Language :: Python :: 3

licence

GPLv3 (https://www.gnu.org/licenses/gpl-3.0.en.html)

Copyright © 2020 Adrien Leger & Tommaso Leonardi

Authors

pycoqc's People

Contributors

a-slide, danielskatz, liujamin, tleonardi

pycoqc's Issues

import sys addition

I know this is a tutorial that uses a dataset designed for it, but it may be worth adding a section to the workflow about using a real dataset. It is an easy addition. I know these manuals are often written for people who understand the code, but it would help people who aren't great at coding to include a portion in the README:
after:
from pycoQC.pycoQC import jprint as print
import sys
example_file_1D = '/path/to/real/sequencing/sequencing_summary.txt'
print(example_file_1D)
/path/to/real/sequencing/sequencing_summary.txt
Then all analyses following this portion will point to real data rather than your preloaded set. I mention this because I have pycoQC installed on my workplace cluster and several people have asked me why they always get the same output for their data. I then refer them to this or meet with them to show them, but having it in the README would make it more inclusive.
Thanks

Tweak HTML layout

  • Remove the side menu and move it to the top?
  • Add sample name (as a cli option)
  • Add path to summary file ? maybe.

EXP-NBD114 support

Describe the bug
pycoQC does not seem to support the EXP-NBD114 expansion.

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
The generation of the pycoQC report in HTML, including the distribution of the reads over the barcodes.

Screenshots

This is the error log.
PARSE DATA FILES
Import raw data from sequencing summary files
3,099,954 reads found in initial file
Import barcode information from barcode summary files
Traceback (most recent call last):
  File "/path/Anaconda3-5.1.0/envs/pycoqc/bin/pycoQC", line 10, in <module>
    sys.exit(main_pycoQC())
  File "/path/Anaconda3-5.1.0/envs/pycoqc/lib/python3.6/site-packages/pycoQC/cli.py", line 169, in main_pycoQC
    title=args.title)
  File "/path/Anaconda3-5.1.0/envs/pycoqc/lib/python3.6/site-packages/pycoQC/cli.py", line 196, in generate_report
    filter_calibration=filter_calibration)
  File "/path/Anaconda3-5.1.0/envs/pycoqc/lib/python3.6/site-packages/pycoQC/pycoQC.py", line 94, in __init__
    raise pycoQCError ("File {} does not contain required barcode information".format(fp))
pycoQC.common.pycoQCError: File ./fastq_demux/barcoding_summary.txt does not contain required barcode information

Desktop (please complete the following information):

  • OS: Linux 3.10.0-957.10.1.el7.x86_64 x86_64
  • Browser: N.A.
  • Version: pycoQC-2

Additional context
The same version of pycoQC is processing EXP-NBD104 barcodes flawlessly.

Is there a lack of compatibility?

The channel activity plot is not working as expected for Promethion data

When loading a PromethION sequencing summary file, there is an issue in how data are collected for the channel activity plot, which results in a massive array being stored in the HTML file.

See example of plot below

newplot

Although it doesn't prevent pycoQC from working, this should be solved, as the generated file is massive and the plot is mostly empty.

Combining several sequencing_summary.txt files

My sequencing was divided into two runs, and I therefore have two summary-files. I tried to combine them by just copy-pasting them together, but pycoQC doesn't count the reads from the second run. Is there a simple way to solve this?
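
One possible cause, offered here as an assumption rather than a diagnosis, is that pasting the files together leaves the second file's header row in the middle of the data. A minimal sketch that merges several summary files while writing the header only once; `combine_summaries` is a hypothetical helper and not part of pycoQC:

```python
def combine_summaries(paths, out_path):
    """Concatenate tab-separated summary files, writing the header only once.

    Hypothetical helper, not part of pycoQC. Assumes every input file
    starts with the same header line. Returns the number of data rows written.
    """
    header = None
    n_rows = 0
    with open(out_path, "w") as out:
        for path in paths:
            with open(path) as handle:
                first = handle.readline().rstrip("\n")
                if header is None:
                    # First file: remember and write the header
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError("Header mismatch in {}".format(path))
                for line in handle:
                    if line.strip():  # skip blank lines
                        out.write(line.rstrip("\n") + "\n")
                        n_rows += 1
    return n_rows
```

Calling `combine_summaries(["run1.txt", "run2.txt"], "combined.txt")` (placeholder file names) would then produce a single file with one header followed by the rows of both runs.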

Option for static image generation in HTML report

Large summary files (e.g. from PromethION) lead to a massive pycoQC report file because the data are self-contained in the HTML file.

One option to dramatically reduce the size would be to have static images instead of dynamic JS plots. This is apparently feasible but not very straightforward, and it requires a package not available through pip: https://plot.ly/python/static-image-export/

@tleonardi could you have a look at this issue when you have a bit of time?

Windows-error with HOME - variable

Describe the bug
When running pycoQC v2.2.2 on Windows I get the following error message:

PARSE CONFIGURATION FILE
Traceback (most recent call last):
  File "c:\users\snorres\appdata\local\continuum\anaconda3\envs\pycoqc\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\snorres\appdata\local\continuum\anaconda3\envs\pycoqc\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\snorres\AppData\Local\Continuum\Anaconda3\envs\pycoQC\Scripts\pycoQC.exe\__main__.py", line 9, in <module>
  File "c:\users\snorres\appdata\local\continuum\anaconda3\envs\pycoqc\lib\site-packages\pycoQC\cli.py", line 169, in main_pycoQC
    title=args.title)
  File "c:\users\snorres\appdata\local\continuum\anaconda3\envs\pycoqc\lib\site-packages\pycoQC\cli.py", line 186, in generate_report
    config_dict = parse_config_file(config)
  File "c:\users\snorres\appdata\local\continuum\anaconda3\envs\pycoqc\lib\site-packages\pycoQC\cli.py", line 280, in parse_config_file
    home=os.environ['HOME']
  File "c:\users\snorres\appdata\local\continuum\anaconda3\envs\pycoqc\lib\os.py", line 725, in __getitem__
    raise KeyError(key) from None
KeyError: 'HOME'

To Reproduce
On Windows:

pycoQC --file sequencing_summary.text

Expected behavior
Run as normal

Desktop (please complete the following information):

  • Windows 10
  • Chrome
  • 2.2.2

Additional context
The issue is solved by adding "HOME" as an environment variable with

set HOME=%USERPROFILE%

as described in https://stackoverflow.com/questions/14742064/python-os-environhome-works-on-idle-but-not-in-a-script
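
The traceback shows `os.environ['HOME']` raising `KeyError` because `HOME` is not defined on this Windows machine. A minimal sketch of a lookup that does not depend on that variable, using the standard library's `os.path.expanduser`; the helper name `get_home_dir` is hypothetical, and this is not necessarily how pycoQC addressed the issue:

```python
import os


def get_home_dir():
    """Return the user's home directory without requiring a HOME variable.

    os.path.expanduser("~") uses platform-specific sources, such as
    USERPROFILE on Windows, to resolve the home directory.
    """
    return os.path.expanduser("~")
```

Reading the variable with `os.environ.get("HOME")` and falling back to this helper would be another option, since `.get` returns `None` instead of raising.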

Export tables from report as *.tsv.

Is your feature request related to a problem? Please describe.
The results in the tables (made by plotly) are locked inside the *.html file.

Describe the solution you'd like
A button/handle in order to save the tabular results.

Describe alternatives you've considered
Looking into the source.

Additional context
N.A.

Error when using pycoQC on small files

If the sequencing_summary.txt file only contains a few reads, the binning in __output_over_time_data() fails:

PARSE DATA FILES
GENERATES PLOTS
Traceback (most recent call last):
  File "/home/nanopore/.local/bin/pycoQC", line 11, in <module>
    sys.exit(main())
  File "/home/nanopore/.local/lib/python3.6/site-packages/pycoQC/cli.py", line 108, in main
    verbose_level=args.verbose_level)
  File "/home/nanopore/.local/lib/python3.6/site-packages/pycoQC/cli.py", line 143, in generate_report
    fig = method(**method_args)
  File "/home/nanopore/.local/lib/python3.6/site-packages/pycoQC/pycoQC.py", line 513, in output_over_time
    dd1, ld1 = args=self.__output_over_time_data (all_df, level="reads")
  File "/home/nanopore/.local/lib/python3.6/site-packages/pycoQC/pycoQC.py", line 558, in __output_over_time_data
    t = np.digitize (t, bins=x, right=True)
ValueError: bins must have non-zero length
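
The error message suggests that the array of bin edges passed to `np.digitize` was empty. A minimal sketch of a guard follows. It is written with the standard library's `bisect` rather than NumPy so that it runs without dependencies. `safe_digitize` is a hypothetical helper, and placing every value in a single bin when no edges exist is an assumption about the desired behavior:

```python
from bisect import bisect_left


def safe_digitize(values, bins):
    """Assign each value to a bin index, tolerating an empty list of bin edges.

    For increasing bins, bisect_left returns the index i such that
    bins[i-1] < value <= bins[i], the same convention as
    np.digitize(values, bins, right=True).
    """
    if len(bins) == 0:
        # Assumption: with no bin edges, place every value in bin 0
        return [0 for _ in values]
    return [bisect_left(bins, v) for v in values]
```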

Possible new functionality

A useful addition would be to show cumulative yield over time. This could also report how long into the run 1/4, 1/2, 3/4 of the output was generated.
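
As an illustration of the idea, here is a minimal sketch that computes the time at which given fractions of the total output were reached. The function name `yield_milestones` and the `(start_time, n_bases)` input layout are hypothetical and do not reflect pycoQC's internals:

```python
from itertools import accumulate


def yield_milestones(reads, fractions=(0.25, 0.5, 0.75)):
    """Return {fraction: time} at which each fraction of total bases was reached.

    `reads` is an iterable of (start_time, n_bases) tuples, a hypothetical
    input layout.
    """
    ordered = sorted(reads)  # order reads by start time
    if not ordered:
        return {}
    # Running total of bases in time order
    cumulative = list(accumulate(n for _, n in ordered))
    total = cumulative[-1]
    milestones = {}
    for frac in fractions:
        for (time, _), running in zip(ordered, cumulative):
            if running >= frac * total:
                milestones[frac] = time
                break
    return milestones
```

The `cumulative` list is also what a cumulative yield plot would show on its y axis.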

Another possible addition would be to allow temporal ordering of multiple sequencing runs within an experiment. When a sequencing run is stopped or crashes and is then restarted, the runID changes. Currently the analyses that output yield or quality over time overlay these multiple runs when they should (ideally) be consecutive. I don't think it is possible to tell solely from the sequencing_summary.txt what the order of the runs was, but maybe the user could specify the order of the runIDs?

Trouble installing pycoQC

Describe the bug
Hi I am having trouble installing pycoQC.
I followed the instructions and created a venv running Python 3.5.2.
I installed using Option 1: Installation with pip from pypi.
After the installation was completed and I tried to run pycoQC I got the following error:


Traceback (most recent call last):
  File "/data/laurensl/venv/bin/pycoQC", line 6, in <module>
    from pycoQC.__main__ import main_pycoQC
  File "/data/laurensl/venv/lib/python3.5/site-packages/pycoQC/__main__.py", line 18, in <module>
    from pycoQC.pycoQC import pycoQC
  File "/data/laurensl/venv/lib/python3.5/site-packages/pycoQC/pycoQC.py", line 13, in <module>
    from pycoQC.pycoQC_plot import pycoQC_plot
  File "/data/laurensl/venv/lib/python3.5/site-packages/pycoQC/pycoQC_plot.py", line 179
    height:int=300+(30*self.all_df[groupby].nunique()) if groupby else 300
          ^
SyntaxError: invalid syntax

How can I fix this?
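
The caret points at `height:int=...`, which appears to be an annotated assignment. That syntax (PEP 526) was added in Python 3.6, so a 3.5 interpreter rejects it when the module is imported. A minimal sketch of an explicit version check that would report the problem more clearly; `check_python_version` is hypothetical and not part of pycoQC:

```python
import sys


def check_python_version(version_info=sys.version_info, minimum=(3, 6)):
    """Raise a clear error when the interpreter is older than `minimum`."""
    if tuple(version_info[:2]) < minimum:
        raise RuntimeError(
            "Python {}.{}+ is required, found {}.{}".format(
                minimum[0], minimum[1], version_info[0], version_info[1]))
    return True
```

Such a check only helps if it runs before any module using the newer syntax is imported, for example at the top of the entry-point script.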

Cannot write seq_summary_fn

Describe the bug
I tried creating a report using a FAST5 folder inside a Docker container.

To Reproduce

1.) Use nfcore/bacass:dev container with pycoQC recipe in it.
2.) Try running Fast5_to_seq_summary -f FAST5 -s testme

Fast5_to_seq_summary -f FAST5 -s testme
Traceback (most recent call last):
  File "/opt/conda/envs/nf-core-bacass-1.1.0dev/bin/Fast5_to_seq_summary", line 12, in <module>
    sys.exit(main_Fast5_to_seq_summary())
  File "/opt/conda/envs/nf-core-bacass-1.1.0dev/lib/python3.6/site-packages/pycoQC/cli.py", line 80, in main_Fast5_to_seq_summary
    verbose_level = args.verbose_level)
  File "/opt/conda/envs/nf-core-bacass-1.1.0dev/lib/python3.6/site-packages/pycoQC/Fast5_to_seq_summary.py", line 114, in __init__
    raise pycoQCError ("Cannot write the indicated seq_summary_fn")
pycoQC.common.pycoQCError: Cannot write the indicated seq_summary_fn

Though the directory is writable from inside the container without issues.

touch test_seq
(base) root@083663bc4c59:/home/apeltzer/bacass_test/work/7b/b16ef3592d07d01f0febf80be61d67# ls
FAST5  barcode01_NB01_Burkholderia.fastq  test_seq

Expected behavior
I expect the file to be written out :-)

Desktop (please complete the following information):

  • OS: CentOS 7

numpy.dtype size changed

Describe the bug
Upon running the following code an error is reported.

bsub -q bio -o pycoQClog.txt -e pycoQCerror.txt -n 13 pycoQC -f fastq/sequencing_summary.txt -o pycoQC_output.html

path/to/envs/pycoqc/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
path/to/envs/pycoqc/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed
sequencing_summary.txt
, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
path/to/envs/pycoqc/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning:

numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88

Reduce size of PromethION .html files

The PycoQC .html files are quite large (~500 MB) for a few PromethION summaries that I've run. It would be great to have an option to reduce this size.

Parse Barcode values when available + plot

Barcode values are indicated in the field barcode_arrangement if Albacore (2.0+?) was called with the option --barcoding

They could be easily parsed and could be used to generate a barcode distribution plot per runid.
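
A minimal sketch of that parsing step, counting reads per run and barcode with the standard library. The column name `run_id` is an assumption about the summary file layout, and `barcode_counts` is a hypothetical helper:

```python
import csv
from collections import Counter


def barcode_counts(handle):
    """Count reads per (run_id, barcode_arrangement) in a tab-separated summary.

    Assumes the file has columns named `run_id` and `barcode_arrangement`.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        counts[(row["run_id"], row["barcode_arrangement"])] += 1
    return counts
```

The resulting counts could feed a bar chart with one group of bars per run.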

Add fastq support

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Be able to pass a fastq file as input rather than a sequencing summary file.

Describe alternatives you've considered
Using other tools? But their plots and reports aren't as nice as yours.

Additional context
I think a big use case for this is when I demultiplex a sample: I have a bunch of single fastq files and it would be nice to be able to see the quality metrics for just these.
Additionally, I don't always have the sequencing summary file for a fastq.

pycoqc v2.2.4 on conda?

I was wondering if version 2.4 was already on conda. It seems to be, but I cannot update my current version:

pycoqc 2.2.3.3 dev_0

summary of read stats per barcode

Hi,

Awesome tool, thanks!

What I would like to see, though, is the read length stats, N50, quality etc. split out per barcode. Currently this is done per summary file, but an option to recalculate all the read stats per barcode would be great.
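
To make the request concrete, here is a minimal sketch of an N50 computed per barcode. The `(barcode, read_length)` input layout and both function names are hypothetical and are not pycoQC's API:

```python
from collections import defaultdict


def n50(lengths):
    """Return the N50: the largest length L such that reads of length >= L
    contain at least half of all bases. Returns 0 for an empty input."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def n50_per_barcode(reads):
    """Group (barcode, read_length) tuples by barcode and compute each N50."""
    grouped = defaultdict(list)
    for barcode, length in reads:
        grouped[barcode].append(length)
    return {barcode: n50(lengths) for barcode, lengths in grouped.items()}
```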

Suggestions for pycoQC

Just tested pycoQC, here are a few thoughts:

I like the interactive plots, but the labels on two of the plots ended up being cluttered: "Distribution of read length" and "Output over experiment time".
(The "Distribution of read length" plot was cluttered because of the narrow distribution for the poly(A) run; this would probably not be an issue in most cases.)

I think you should include an option to generate a new, filtered sequencing_summary.txt file.
I ended up generating these when I was analysing the poly(A) data.

Also it would make more sense if the parameters for --verbose_level were 2,1,0 (0 = silent).

Split pass/fail/calibration instead of filtering out

After the runid filtering, it would be more interesting to split the df in 3 when pass_filtering and calibration information are available, rather than completely discarding the filtered rows.
This would make it possible to generate overall dataset metrics, including the number of pass/fail/calibration reads per runid.
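
A minimal sketch of such a three-way split, using plain dictionaries instead of a DataFrame so that it runs without dependencies. The keys `is_calibration` and `passes_filtering` and the function `split_reads` are hypothetical:

```python
def split_reads(rows):
    """Partition rows into calibration, pass and fail groups without discarding any.

    Each row is a dict with hypothetical boolean keys `is_calibration`
    and `passes_filtering`; missing keys are treated as False.
    """
    groups = {"calibration": [], "pass": [], "fail": []}
    for row in rows:
        if row.get("is_calibration"):
            groups["calibration"].append(row)
        elif row.get("passes_filtering"):
            groups["pass"].append(row)
        else:
            groups["fail"].append(row)
    return groups
```

Counting the rows in each group per runid would then give the overall metrics described above.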
