google / deeppolisher

Transformer-based sequence correction method for genome assembly polishing

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.10% Python 24.04% Jupyter Notebook 75.86%
assembly-polishing genome-assembly human-genome

deeppolisher's Introduction

DeepPolisher

DeepPolisher is a transformer-based sequence correction method similar to DeepConsensus. It is designed to identify errors in genome assemblies: it takes haplotype-specific reads aligned to phased assemblies and produces a VCF file containing potential errors in the assembly. Currently, DeepPolisher supports only PacBio HiFi-based assemblies and read alignments.

Please see the case study of DeepPolisher for step-by-step instructions on how to polish an assembly.

How it works

DeepPolisher works in two steps:

  • make_images: In the make_images step, DeepPolisher looks at the reads aligned to the assembly and creates tensor-like examples of regions with potential assembly errors. Each example encodes several features of the region, including read bases, base quality, mapping quality, and whether each base matches or mismatches the assembly.
  • inference: In the inference step, the examples from the previous step are passed to a transformer model that predicts a sequence for the region. The predicted sequence is compared against the assembly sequence, and any observed difference is recorded as a potential error and reported in the VCF.
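As a rough illustration of the make_images step, the sketch below packs the four features named above into one nested list per window. It is not DeepPolisher's actual encoding: the `AlignedRead` container, the `encode_window` function, the numeric base codes, and the channel order are all assumptions made for this example.

```python
from typing import List, NamedTuple

# Hypothetical base codes; 0 is reserved for a gap or an uncovered column.
BASE_CODES = {"A": 1, "C": 2, "G": 3, "T": 4}


class AlignedRead(NamedTuple):
    """A read already projected onto the window (hypothetical container)."""
    bases: str             # one character per window column, "-" where the read has no base
    base_quals: List[int]  # Phred base quality per window column
    mapq: int              # mapping quality of the whole read


def encode_window(assembly: str, reads: List[AlignedRead]) -> List[List[List[int]]]:
    """Return a [reads][columns][4] nested list of features.

    Channels (assumed order): base code, base quality, mapping quality,
    and 1 if the read base matches the assembly base, else 0.
    """
    example = []
    for read in reads:
        row = []
        for col, assembly_base in enumerate(assembly):
            base = read.bases[col]
            covered = base != "-"
            row.append([
                BASE_CODES.get(base, 0),
                read.base_quals[col] if covered else 0,
                read.mapq if covered else 0,
                int(covered and base == assembly_base),
            ])
        example.append(row)
    return example


assembly_window = "ACGTA"
reads = [
    AlignedRead("ACGTA", [40, 40, 40, 40, 40], 60),
    AlignedRead("ACCTA", [30, 30, 12, 30, 30], 60),
    AlignedRead("-CGT-", [0, 35, 35, 35, 0], 20),
]
example = encode_window(assembly_window, reads)
print(example[1][2])  # features of the mismatching base in the second read
```

The low-quality mismatch in the second read ends up as a single feature vector that a model could weigh against the agreeing reads in the same column.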

A diagram of how DeepPolisher works is presented here:

Input and output

Input: DeepPolisher takes two primary inputs:

  • A haplotype-specific genome assembly generated using PacBio HiFi reads. This can be generated with assemblers like Verkko or HiFiasm.
  • Haplotype-specific PacBio HiFi reads aligned to the corresponding input assembly. These alignments can be generated using HPRC's PHARAOH pipeline.

Output:

  • The output of DeepPolisher is a VCF file that contains potential errors in the assembly.
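To make the link between a predicted sequence and a VCF entry concrete, here is a minimal sketch that diffs an assembly window against a predicted sequence and emits VCF-style (CHROM, POS, REF, ALT) tuples. It uses Python's standard `difflib` as a stand-in for whatever comparison DeepPolisher performs internally; the function name `candidate_edits`, the contig name, and the window offset are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def candidate_edits(
    contig: str, window_start: int, assembly: str, predicted: str
) -> List[Tuple[str, int, str, str]]:
    """Return (CHROM, POS, REF, ALT) tuples; POS is 1-based as in VCF.

    window_start is the 0-based offset of the window on the contig.
    """
    edits = []
    matcher = SequenceMatcher(None, assembly, predicted, autojunk=False)
    for tag, a0, a1, p0, p1 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref, alt = assembly[a0:a1], predicted[p0:p1]
        if tag in ("insert", "delete"):
            if a0 == 0:
                # No anchor base inside the window; this sketch skips the edit.
                continue
            # VCF indels carry the preceding base as an anchor.
            a0 -= 1
            ref = assembly[a0] + ref
            alt = assembly[a0] + alt
        edits.append((contig, window_start + a0 + 1, ref, alt))
    return edits


records = candidate_edits("contig_1", 100, "GATTACAGGC", "GATCACATGGC")
for chrom, pos, ref, alt in records:
    # Only the first five VCF columns are shown; ID is left as ".".
    print(f"{chrom}\t{pos}\t.\t{ref}\t{alt}")
```

Each record reads as "replace REF with ALT at POS to obtain the polished sequence", which is how the reported potential errors can be applied to the assembly by a downstream tool.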

Currently, DeepPolisher only works on PacBio HiFi assemblies and reads.

How was the model trained?

The current release model of DeepPolisher was trained to polish the HG002 HiFiasm v0.19.5 diploid assembly generated with PacBio HiFi data. The reads were phased with the PHARAOH pipeline. We took the HG002 T2T v0.9 assembly as the truth for maternal and paternal contigs, projected the Genome in a Bottle high-confidence blocks, and trained only on regions that can be confidently mapped between the assemblies. The current model works only on PacBio HiFi assemblies and read data. We trained on contigs that map to chr1-chr19 of the truth, used chr21-chr22 for tuning, and completely held out chr20 for evaluation.
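The chromosome split described above can be written down as a small lookup. This is only a sketch of the stated split, not the training code; the function name `assign_split` and the "unused" label for chromosomes the text does not mention (such as chrX and chrY) are assumptions.

```python
def assign_split(chrom: str) -> str:
    """Map a truth chromosome name such as 'chr7' to its data split."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if not name.isdigit():
        return "unused"  # chrX, chrY, chrM are not mentioned in the text
    number = int(name)
    if 1 <= number <= 19:
        return "train"
    if number in (21, 22):
        return "tune"
    if number == 20:
        return "eval"  # completely held out
    return "unused"


for chrom in ("chr1", "chr19", "chr20", "chr21", "chrX"):
    print(chrom, assign_split(chrom))
```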

How to install DeepPolisher

You can either use the Docker container to run DeepPolisher or install it using pip.

How to use docker

The Docker image can be pulled with:

sudo docker pull google/deepconsensus:polisher_v0.1.0

How to install using pip

Setup pip:

sudo -H apt-get -qq -y update
sudo -H apt-get -y install python3-dev python3-pip
sudo -H apt-get -y update
python3 -m pip install --upgrade pip

Setup virtual environment:

sudo apt install python3-venv
python3 -m venv ~/workspace/polisher-venv/

# Activate the virtual env
source ~/workspace/polisher-venv/bin/activate

# Confirm that pip now points into the virtual env
echo "$(pip --version)"

Install requirements:

# For pysam
export HTSLIB_CONFIGURE_OPTIONS=--enable-plugins
pip install -r requirements.txt

Install DeepPolisher:

# Quoting keeps shells such as zsh from expanding the brackets
pip install ".[cpu]"

Check installation:

polisher --version
> 0.1.0

which polisher
> /path/to/workspace/polisher-venv/bin/polisher

Prerequisites

  • Unix-like operating system (cannot run on Windows)
  • Python 3.8
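A quick way to check these two prerequisites before installing is sketched below. The helper `check_prerequisites` is hypothetical and not part of DeepPolisher; it treats "Python 3.8" literally, which may be stricter than the package actually requires.

```python
import platform
import sys


def check_prerequisites(system: str, version: tuple) -> list:
    """Return a list of human-readable problems; empty means both checks pass."""
    problems = []
    if system == "Windows":
        problems.append("Windows is not supported; use a Unix-like system")
    if tuple(version[:2]) != (3, 8):
        problems.append(f"Python 3.8 is listed, found {version[0]}.{version[1]}")
    return problems


# Check the interpreter and operating system this script is running on.
for problem in check_prerequisites(platform.system(), sys.version_info):
    print("warning:", problem)
```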

Contribution Guidelines

Please open a pull request if you wish to contribute to DeepPolisher. Note, we have not set up the infrastructure to merge pull requests externally. If you agree, we will test and submit the changes internally and mention your contributions in our release notes. We apologize for any inconvenience.

If you have any difficulty using DeepPolisher, feel free to open an issue. If you have general questions not specific to DeepPolisher, we recommend that you post on a community discussion forum such as BioStars.

Acknowledgements

We thank Mira Mastoras, Mobin Asri and Benedict Paten from the Genomics Institute at the University of California, Santa Cruz for providing critical feedback and evaluation on the performance of DeepPolisher. We also thank Mira Mastoras for the image on this page describing how DeepPolisher works.

License

BSD-3-Clause license

Disclaimer

This is not an official Google product.

NOTE: the content of this research code repository (i) is not intended to be a medical device; and (ii) is not intended for clinical use of any kind, including but not limited to diagnosis or prognosis.

deeppolisher's People

Contributors

kishwarshafin

deeppolisher's Issues

Q: Model similarity between DeepVariant and DeepPolisher?

Hello,

Am I correctly interpreting DeepPolisher as a blend of DeepVariant and DeepConsensus? Are there any docs, existing or in progress, that illustrate the differences and similarities between the tensors/channels used in the DeepVariant model(s) and the DeepPolisher model? For example, describing how the model listed in the case study here (/opt/models/pacbio model/checkpoint) differs from the one listed under the PacBio case study for DeepVariant (--model_type PACBIO).

I have a custom DeepVariant model (WGS & WGS.AF) for my species that we plan to use for polishing Verkko assemblies through traditional variant calling. I assume an Illumina-based DeepVariant model is incompatible as a replacement for the listed DeepPolisher model, but I'd like to understand why. Thanks!

block_utils

Hi all,

When trying to execute find_homozygous_regions.py, I ran into a missing block_utils module.
I could only find up42-blockutils, but that does not seem to be the right package, as it contains nothing about Alignment.

Best,

Evgeny Leushkin
