Compute molecule and barcode statistics with R

The output of this step may resemble this summary generated by Longranger wgs:
https://github.com/hackseq/2017_project_6/blob/master/hg002g1_lrwgsvc/outs/summary.csv

I'll first work on an R Markdown template that includes the figures and statistics we may want to report. When the different components of the pipeline are done, I'll glue them together and generate the final report.

Mapping extracted reads to reference

Tools to test: BWA-MEM, minimap2 sr, Lariat or Longranger align, and MHAP (or another tool that uses MinHash).

Install software on participants' laptops

Please install the following software on your laptop:

macOS

Homebrew
RStudio: brew cask install r rstudio
IGV: brew cask install igv

Linux

Linuxbrew
R: sudo apt-get install r-base or brew install homebrew/science/r
RStudio
IGV: brew install homebrew/science/igv

Windows (all versions)

Windows 10

Windows < 10

Generating summary report with Rmd

Updates

Subscribe to this issue to receive updates.

Pipeline

Report statistics and plot distributions of…

size of the molecules
number of reads per molecule
number of molecules per barcode
number of reads per barcode
amount of DNA per barcode
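
As a rough illustration of the per-barcode statistics listed above, here is a minimal Python sketch. It assumes the molecules have already been inferred and are available as (barcode, size, reads) tuples; that record layout and the function name `summarize_barcodes` are hypothetical, not part of the pipeline. It makes a single pass over the records.

```python
from collections import defaultdict


def summarize_barcodes(molecules):
    """Accumulate per-barcode totals in one pass.

    molecules: iterable of (barcode, molecule_size, read_count) tuples.
    This record layout is an assumption made for illustration.
    """
    stats = defaultdict(lambda: {"molecules": 0, "reads": 0, "dna": 0})
    for barcode, size, reads in molecules:
        entry = stats[barcode]
        entry["molecules"] += 1   # number of molecules per barcode
        entry["reads"] += reads   # number of reads per barcode
        entry["dna"] += size      # amount of DNA per barcode
    return dict(stats)


if __name__ == "__main__":
    example = [("AAAC", 50000, 4), ("AAAC", 30000, 2), ("GGTT", 80000, 6)]
    for barcode, totals in summarize_barcodes(example).items():
        print(barcode, totals)
```

The molecule sizes and reads per molecule are the input columns themselves, so their distributions can be plotted directly from the same records.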

Design goals

Fast
Streaming
One pass

Stretch goal

Generate the report de novo without using a reference genome.

assemble the reads, using perhaps BCALM
select the largest contigs of the assembly
determine the barcodes that map to those contigs
reassemble those barcodes to create a new assembly
run the above analysis using that assembly as the reference

Chat

👋 Welcome. Let's chat!

Extract the barcode from the read sequence

The input is a single raw FASTQ file with interleaved read pairs. Each pair will have this format:

Line 1: read name
Line 2: [16 bases of the barcode] [rest of the actual sequence]
Line 3: +
Line 4: [quality score of barcode][quality score of the actual sequence]
Line 5: read name
Line 6: [actual sequence]
Line 7: +
Line 8: [quality score of the actual sequence]

The output should be something like this:

Line 1: read name BX:Z:[16 bases for the barcode]-1
Line 2: [rest of the actual sequence]
Line 3: +
Line 4: [quality score of the rest of the actual sequence]
Line 5: read name BX:Z:[16 bases for the barcode]-1
Line 6: [actual sequence]
Line 7: +
Line 8: [quality score of the actual sequence]
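
The transformation above could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the input is interleaved (eight lines per read pair, read 1 first), the barcode is the first 16 bases of read 1, and the barcode's quality values are trimmed along with its bases so that sequence and quality stay the same length, as FASTQ requires. The function name `extract_barcodes` is hypothetical.

```python
import io

BARCODE_LENGTH = 16  # bases of barcode at the start of read 1


def extract_barcodes(fastq_in, fastq_out, barcode_length=BARCODE_LENGTH):
    """Move the barcode from the read 1 sequence into a BX:Z: tag on both read names.

    Assumes an interleaved FASTQ stream: eight lines per pair, read 1 then read 2.
    """
    while True:
        lines = [fastq_in.readline().rstrip("\n") for _ in range(8)]
        if not lines[0]:
            break  # end of input
        name1, seq1, plus1, qual1, name2, seq2, plus2, qual2 = lines
        tag = "BX:Z:" + seq1[:barcode_length] + "-1"
        # Trim the barcode from the sequence and from its quality string.
        fastq_out.write(
            f"{name1} {tag}\n{seq1[barcode_length:]}\n{plus1}\n{qual1[barcode_length:]}\n"
        )
        # Read 2 carries no barcode; only its name gains the tag.
        fastq_out.write(f"{name2} {tag}\n{seq2}\n{plus2}\n{qual2}\n")


if __name__ == "__main__":
    pair = "@r1\n" + "A" * 16 + "CGT\n+\n" + "I" * 16 + "FFF\n@r1\nTTTT\n+\nFFFF\n"
    out = io.StringIO()
    extract_barcodes(io.StringIO(pair), out)
    print(out.getvalue(), end="")
```

Because it reads and writes one pair at a time, the sketch works on a stream and never holds the whole file in memory.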

Map reads to the reference genome

Possible options

BWA-MEM
minimap2 sr
abyss-map

De novo chromeqc

Generate the report de novo without using a reference genome.

assemble the reads, using perhaps BCALM
select the largest contigs of the assembly
determine the barcodes that map to those contigs
reassemble those barcodes to create a new assembly
run the above analysis using that assembly as the reference

Notes

My notes from our discussions (feel free to edit or ignore)

Data

There are about a million droplets (GEMs)
The barcode is 16 bp, so theoretically there are 4^16, or about 4.3 billion, combinations.
Out of those, about 4 million barcodes are on the whitelist. We have 1 million barcodes for this experiment.
each droplet (GEM) will have a capsule full of copies of one barcode
around 10 molecules per droplet
molecule size is on the order of 100 kb
fragment size is ~300 nt
each fragment: 16-6-128, 150 in total
the result is single-stranded DNA
Afterward: wash, then add the second Illumina adapter, then PCR duplicate

Statistical Questions

What is the DNA size?
How many GEMs?
How many molecules per GEM?
10x Genomics assumption at 60x depth: if the distance between two reads is less than 60 kb, you infer they are from the same molecule; if more, from different molecules.
Since our depth is lower, we may have to use a different distance.
Instead of 60x depth we are doing 2x, so we get around 6 reads (3 fragments) per molecule.
We will infer the size of each molecule assuming its reads follow a uniform random distribution. This is a known statistical problem: the German tank problem (estimation of a uniform distribution).
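
To make the German tank idea concrete, here is a minimal Python sketch. It assumes the continuous version of the problem: if n read positions are drawn uniformly along a molecule of length L, the expected observed span (max minus min) is L(n - 1)/(n + 1), so scaling the span by (n + 1)/(n - 1) gives an unbiased estimate of L. The function name is hypothetical, read length is ignored, and the discrete German tank formula differs slightly.

```python
def estimate_molecule_size(positions):
    """Estimate molecule length from read start positions on one molecule.

    Uses the unbiased range estimator for a continuous uniform distribution:
    span * (n + 1) / (n - 1). Needs at least two reads.
    """
    n = len(positions)
    if n < 2:
        raise ValueError("need at least two read positions to estimate a size")
    span = max(positions) - min(positions)
    return span * (n + 1) / (n - 1)


if __name__ == "__main__":
    # Three reads spanning 50 kb suggest a molecule of about 100 kb.
    print(estimate_molecule_size([1000, 21000, 51000]))  # prints 100000.0
```

With only a handful of reads per molecule the correction factor is large (2 for three reads), so individual estimates will be noisy even if the average is unbiased.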

Process

we will process one FASTQ file
take 4,000 barcodes out of 4,000,000. Which of those to choose? Randomly from the whitelist?
align to the reference
group and generate stats (reads per molecule, molecule size, number of GEMs, molecules per GEM)
R Markdown (Rdashboard) to generate the report
error correction? Not in the first pass
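
The grouping step could look something like the Python sketch below. It assumes alignments are available as (barcode, chromosome, position) tuples and splits the reads of one barcode into separate molecules wherever two neighbouring reads are more than `max_gap` apart. The 60 kb default is only the threshold mentioned in the notes for 60x depth, and the tuple layout and function name are hypothetical. For simplicity it holds all positions in memory rather than streaming.

```python
from collections import defaultdict


def group_into_molecules(alignments, max_gap=60_000):
    """Group reads sharing a barcode and chromosome into molecules.

    alignments: iterable of (barcode, chrom, pos) tuples (assumed layout).
    A gap larger than max_gap between neighbouring reads starts a new molecule.
    """
    by_key = defaultdict(list)
    for barcode, chrom, pos in alignments:
        by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(by_key.items()):
        positions.sort()
        start = prev = positions[0]
        reads = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Close the current molecule and start a new one at this read.
                molecules.append({"barcode": barcode, "chrom": chrom,
                                  "start": start, "end": prev, "reads": reads})
                start = pos
                reads = 0
            prev = pos
            reads += 1
        molecules.append({"barcode": barcode, "chrom": chrom,
                          "start": start, "end": prev, "reads": reads})
    return molecules


if __name__ == "__main__":
    reads = [("A", "chr1", 100), ("A", "chr1", 30_000),
             ("A", "chr1", 200_000), ("B", "chr1", 500)]
    for molecule in group_into_molecules(reads):
        print(molecule)
```

Reads per molecule and the observed span (end minus start) come straight from these records, and counting records per barcode gives molecules per GEM.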

External tools:

Longranger is the default tool from 10x, but it is slow.
FastQC for one library
MultiQC takes multiple FastQC reports and generates an aggregate report; our data has 3 individuals.
Miller: a command-line tool to manipulate TSV files

Stretch goals:

reference free: use bcalm2 for genome assembly and create contigs.
create Mash signatures of each barcode: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x

broken link

The sample report link for ChromeQC appears to be broken: https://hackseq.github.io/2017_project_6/report/

Estimate molecule size using alignment

Python script that accepts a BAM file to estimate molecule size.

Add parsing options for AS and NM
Add some IGV figures showing molecules, using the 2 GB BAM file, the TSV file, and the actual BAM file

bcgsc / chromeqc

chromeqc's Introduction

ChromeQC: Summarize library quality of 10x Genomics Chromium linked reads

Usage

Examples

Dependencies

Prerequisites

chromeqc's Issues