bcgsc / chromeqc

ChromeQC: Summarize sequencing library quality of 10x Genomics Chromium linked reads

Home Page: https://bcgsc.github.io/chromeqc

License: MIT License

Makefile 0.35% Ruby 0.01% Awk 0.08% Python 1.84% TeX 0.02% Shell 0.02% HTML 97.68% CSS 0.01%
genome genome-sequencing linked-reads quality-control

chromeqc's Introduction

ChromeQC: Summarize library quality of 10x Genomics Chromium linked reads

This tool provides a quick report on the quality of a 10x Genomics Chromium linked reads library. The report summarizes the sizes of the molecules, the number of reads per molecule, the number of molecules per barcode, and the amount of DNA per barcode. The goal is a tool as fast as FastQC that reports the information shown on the Summary page of the 10x Genomics Loupe software. ChromeQC is developed in Python 3, R, AWK, RMarkdown, and Flexdashboard, and uses BWA-MEM for read alignment.

Usage

-w --whitelist     : default='whitelist_barcodes', type=str
-k --subsample_size: default=4000                , type=int
-i --in            : default='-'                 , type=str
-o --out           : default='stdout'            , type=str
-s --seed          : default=1334                , type=int
-m --max_read_pairs: default=-1                  , type=int  , note: -1 means all read pairs
-p --stats_out_path: default='.'                 , type=str  , note: the directory needs to be created already
-v --verbose       : default=False               , no value  , note: If supplied, will be set to true, else will be false.
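
The option table above maps naturally onto Python's argparse module. The following is a minimal sketch, assuming the script uses argparse; the attribute name `in_path` and the function name `build_parser` are hypothetical and are not taken from the project's source.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative sketch only)."""
    parser = argparse.ArgumentParser(
        description="Select reads from a random subset of whitelisted barcodes")
    parser.add_argument("-w", "--whitelist", type=str, default="whitelist_barcodes")
    parser.add_argument("-k", "--subsample_size", type=int, default=4000)
    # "in" is a Python keyword, so the value is stored under a hypothetical name.
    parser.add_argument("-i", "--in", dest="in_path", type=str, default="-")
    parser.add_argument("-o", "--out", type=str, default="stdout")
    parser.add_argument("-s", "--seed", type=int, default=1334)
    # -1 means all read pairs.
    parser.add_argument("-m", "--max_read_pairs", type=int, default=-1)
    # The directory must already exist.
    parser.add_argument("-p", "--stats_out_path", type=str, default=".")
    # A flag without a value: True if supplied, otherwise False.
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


args = build_parser().parse_args(["-v", "-k", "100"])
print(args.subsample_size, args.seed, args.in_path, args.verbose)
```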

Examples

# Install Homebrew on macOS, Linux, or Windows: https://brew.sh
which pigz || brew install pigz
python3 select_random_subset/random_sampling_from_whitelist.py -v -w data/4M-with-alts-february-2016.txt -i data/read-RA_si-GAGTTAGT_lane-001-chunk-0002.fastq.gz | pigz -p4 >data/subsampled.fq.gz

The pipeline starts with raw FASTQ files of interleaved paired-end reads produced by the 10x Chromium platform.

Dependencies

pip3 install -r requirements.txt
brew bundle
  • BWA or Minimap2
  • Pysam
  • Python 3
  • Samtools

Prerequisites

The analysis and report will be created using R, the Tidyverse, RMarkdown, and Flexdashboard. Familiarity with some of these tools is useful, but not necessary to participate in this project. Non-technical participants are welcome to design the aesthetics of the report, prepare and deliver the presentation, and coordinate writing a brief paper about the tool.

Team Lead: Shaun Jackman | @sjackman | Grad Student | BC Cancer Agency Genome Sciences Centre

chromeqc's People

Contributors

baraaorabi, cheny19, emreerhan, hyounesy, jakelever, justinchu, rylosqualo, sean-la, sjackman, sm30

chromeqc's Issues

Generate final Rmarkdown report

I'll first work on an RMarkdown template that includes the potential figures and statistics we want to report. When the different components of the pipeline are done, I'll glue them together and generate the final report.

Updates

Subscribe to this issue to receive updates.

Pipeline

  • bx @baraaorabi Extract the barcode from the read sequence with longranger basic or a new implementation
  • subsample @rylosqualo Select a random subset of barcodes (deterministic for a specified random seed), and extract the reads of those barcodes
  • map @emreerhan Map those reads to the reference genome using BWA-MEM, minimap2 sr, Lariat, or Longranger align
  • @emreerhan Convert SAM format to TSV format (if needed).
    Columns: Rname, Start, End, BX, Mapq, AS, NM
  • filter Filter by alignment quality (e.g. Mapq > 0, AS ≥ 100, NM < 5)
  • sort @sjackman Group alignments by barcode, chromosome, sort by position, with samtools sort -tBX
  • molecule @JustinChu Group reads into molecules, possibly with bxtools mol.
    Columns: Rname, Start, End, BX, Reads, MI
    Optional Columns: Mapq_median, AS_median, NM_median
  • stats @tmozgach Compute molecule and barcode statistics with R (see below)
  • report @sm30 Generate the report with RMarkdown notebook or dashboard
  • multiqc @cheny19 Aggregate multiple reports with MultiQC

Report statistics and plot distributions of…

  • size of the molecules
  • number of reads per molecule
  • number of molecules per barcode
  • number of reads per barcode
  • amount of DNA per barcode
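
The project plans to compute these statistics in R. To keep the examples in one language, here is a rough Python sketch of the per-barcode tallies, assuming molecule records with the columns Rname, Start, End, BX, Reads, MI. The function name `barcode_stats` and the dictionary keys are hypothetical.

```python
from collections import defaultdict
from statistics import median


def barcode_stats(molecules):
    """Tally molecules, reads, and bases of DNA for each barcode.

    molecules: iterable of (rname, start, end, bx, reads, mi) tuples.
    """
    per_barcode = defaultdict(lambda: {"molecules": 0, "reads": 0, "dna_bp": 0})
    for rname, start, end, bx, reads, mi in molecules:
        entry = per_barcode[bx]
        entry["molecules"] += 1
        entry["reads"] += reads
        entry["dna_bp"] += end - start  # molecule size in bases
    return dict(per_barcode)


molecule_rows = [
    ("chr1", 1000, 51000, "A-1", 6, 1),
    ("chr2", 0, 20000, "A-1", 4, 2),
    ("chr1", 500, 100500, "B-1", 10, 3),
]
stats = barcode_stats(molecule_rows)
molecule_sizes = [end - start for _, start, end, *_ in molecule_rows]
print(stats)
print("median molecule size:", median(molecule_sizes))
```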

Design goals

  • Fast
  • Streaming
  • One pass

Stretch goal

Generate the report de novo without using a reference genome.

  • assemble the reads, using perhaps BCALM
  • select the largest contigs of the assembly
  • determine the barcodes that map to those contigs
  • reassemble those barcodes to create a new assembly
  • run the above analysis using that assembly as the reference

Chat

👋 Welcome. Let's chat!

Extract the barcode from the read sequence

The input is a single raw FASTQ file. It will have this format:

Line 1: read name
Line 2: [16 bases of the barcode] [rest of the actual sequence]
Line 3: +
Line 4: [quality score of barcode][quality score of the actual sequence]
Line 5: read name
Line 6: [actual sequence]
Line 7: +
Line 8: [quality score of the actual sequence]

The output should be something like this:

Line 1: read name BX:Z:[16 bases for the barcode]-1
Line 2: [rest of the actual sequence]
Line 3: +
Line 4: [quality score of the rest of the actual sequence]
Line 5: read name BX:Z:[16 bases for the barcode]-1
Line 6: [actual sequence]
Line 7: +
Line 8: [quality score of the actual sequence]
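
The transformation described above can be sketched in Python as follows. This is a minimal illustration, not the longranger basic behaviour or the project's code; the function name `extract_barcode` is hypothetical. It assumes the quality string of the first read is trimmed together with the sequence, because a FASTQ record needs sequence and quality strings of equal length.

```python
def extract_barcode(record_pair, barcode_length=16):
    """Move the barcode from the first read's sequence into a BX:Z: tag.

    record_pair: the eight lines of one interleaved read pair,
    without trailing newlines.
    """
    name1, seq1, plus1, qual1, name2, seq2, plus2, qual2 = record_pair
    tag = f"BX:Z:{seq1[:barcode_length]}-1"
    return [
        f"{name1} {tag}",
        seq1[barcode_length:],   # rest of the actual sequence
        plus1,
        qual1[barcode_length:],  # qualities trimmed to match (assumption)
        f"{name2} {tag}",
        seq2,
        plus2,
        qual2,
    ]


pair = [
    "@read1", "ACGT" * 4 + "TTGGCC", "+", "I" * 16 + "J" * 6,
    "@read1", "GGCCAA", "+", "JJJJJJ",
]
print("\n".join(extract_barcode(pair)))
```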

De novo chromeqc

Generate the report de novo without using a reference genome.

  • assemble the reads, using perhaps BCALM
  • select the largest contigs of the assembly
  • determine the barcodes that map to those contigs
  • reassemble those barcodes to create a new assembly
  • run the above analysis using that assembly as the reference

Notes

My notes from our discussions (feel free to edit or ignore)

Data

  • There are about a million droplets (GEMs).
  • The barcode is 16 bp, so there are theoretically 4^16, or about 4.3 billion, combinations.
  • Out of those, there are 4 million barcodes. We have 1 million barcodes for this experiment.
  • Each droplet (GEM) has a capsule full of one barcode.
  • There are around 10 molecules per droplet.
  • Molecule size is on the order of 100 kb.
  • Fragment size is ~300 nt.
  • Each fragment: 16-6-128 - 150
  • The result is single-stranded DNA.
  • Afterward: wash, then add the second Illumina adapter, then amplify with PCR.

Statistical Questions

  • What is the DNA size?
  • How many GEMs are there?
  • How many molecules are there per GEM?
  • 10x Genomics' assumption at 60x depth: if the distance between two reads is less than 60 kb, infer that they come from the same molecule; if it is more, infer different molecules.
    Since our depth is lower, we may have to use a different distance.
  • Instead of 60x depth we are doing 2x, so we get around 6 reads (3 fragments) per molecule.
  • We will infer the size of each molecule assuming a uniform random distribution of reads along it. This is a known statistical problem: the German tank problem (estimation of a uniform distribution).
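
As a rough illustration of the German tank idea, the sketch below treats the read positions of one molecule as samples from a continuous uniform distribution. For k samples the expected observed span is (k - 1)/(k + 1) of the true length, so the span is scaled up by (k + 1)/(k - 1). Which estimator the project actually adopted is not stated in these notes, and the function name `estimate_molecule_size` is hypothetical.

```python
def estimate_molecule_size(positions):
    """Estimate molecule length from the read positions observed on it.

    Assumes positions are drawn uniformly at random along the molecule.
    """
    k = len(positions)
    if k < 2:
        raise ValueError("need at least two read positions")
    observed_span = max(positions) - min(positions)
    # Unbiased correction for the range of k continuous uniform samples.
    return observed_span * (k + 1) / (k - 1)


# Three fragments spanning 30 kb suggest a molecule of about 60 kb.
print(estimate_molecule_size([10_000, 25_000, 40_000]))
```

With only a few reads per molecule the correction factor is large (2 for three positions), which is why low depth makes the raw span a poor estimate of the molecule size.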

Process

We will process one FASTQ file.
Take 4,000 barcodes out of 4,000,000. Which of those to choose? Randomly from the whitelist?
Align to the reference.
Group and generate stats (reads per molecule, molecule size, number of GEMs, molecules per GEM).
Use RMarkdown (Rdashboard) to generate the report.
Error correction? Not in the first pass.
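
One way to choose the barcodes, sketched here under assumptions, is a seeded random sample from the whitelist, so that the same seed always yields the same subset. The function name `subsample_barcodes` is hypothetical; the defaults of 4000 barcodes and seed 1334 are taken from the usage table.

```python
import random


def subsample_barcodes(whitelist, subsample_size=4000, seed=1334):
    """Return a reproducible random subset of the whitelisted barcodes."""
    rng = random.Random(seed)
    # Sort first so the result does not depend on the input order.
    barcodes = sorted(set(whitelist))
    return set(rng.sample(barcodes, min(subsample_size, len(barcodes))))


toy_whitelist = [f"BC{i:04d}" for i in range(100)]  # stand-in for real barcodes
selected = subsample_barcodes(toy_whitelist, subsample_size=10)
print(sorted(selected))
```

Reads would then be kept only when their barcode is in the selected set.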

External tools:

Long Ranger is the default tool from 10x, but it is slow.
FastQC reports on one library.
MultiQC takes multiple FastQC reports and generates an aggregate report; our data has 3 individuals.
Miller is a command-line tool to manipulate TSV files.

Stretch goals:

Estimate molecule size using alignment

Python script that accepts a bam file to estimate molecule size.

  • Add parsing options for AS and NM
  • Add some IGV figures showing molecules using IGV 2 GB BAM file and TSV and actual BAM file
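
Parsing AS and NM could look like the sketch below, which reads the optional TAG:TYPE:VALUE fields of a SAM text line (for example the output of samtools view). It deliberately avoids any BAM library so that it stays self-contained; the function name `parse_sam_tags` is hypothetical and this is not the project's script.

```python
def parse_sam_tags(sam_line, wanted=("AS", "NM", "BX")):
    """Extract selected optional tags from one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 SAM columns are mandatory; optional tags follow.
    for field in fields[11:]:
        tag, value_type, value = field.split(":", 2)
        if tag in wanted:
            # Type "i" is an integer; other types are kept as strings here.
            tags[tag] = int(value) if value_type == "i" else value
    return tags


sam_line = "\t".join([
    "read1", "99", "chr1", "1000", "60", "4M", "=", "1300", "304",
    "ACGT", "IIII", "NM:i:1", "AS:i:126", "BX:Z:ACGTACGTACGTACGT-1",
])
print(parse_sam_tags(sam_line))
```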
