Giter Site home page Giter Site logo

gotcha's Introduction

GoT-ChA: Genotyping of Targeted loci with single-cell Chromatin Accessibility

Installation

The Gotcha R package is currently in beta. To install the release version:

devtools::install_github(repo = "landau-lab/Gotcha")

Tutorials (more coming soon...)

How to run Gotcha in slurm clusters with parallel computing

How to perform noise correction and genotype labeling

Why use GoT-ChA?

Somatic mutations are crucial for cancer initiation and evolution, and have been identified across a number of healthy tissues in the human body. These mutations can disrupt normal cellular functions, leading to aberrant clonal expansions via acquired fitness advantages or skewed differentiation topologies. GoT-ChA and similar methods (e.g. GoT, TARGET-seq) aim to pair targeted genotyping with single-cell sequencing approaches in order to understand the impact of somatic mutations directly in human patient samples, in both malignant and non-malignant contexts.

GoT-ChA is unique in that it is a high-throughput single-cell method that pairs targeted genotyping with chromatin accessibility measurements based on the broadly utilized scATAC-seq platform from 10x Genomics. Previous single-cell genotyping approaches were largely based on scRNA-seq protocols, utilizing expressed and captured mRNA transcripts as sources for genotyping information. This results in a limiting dependence on gene expression and mutation location (due to 3' end bias of most scRNA-seq methods), precluding the usage of such technologies on lowly expressed or inconveniently located loci of interest. GoT-ChA surmounts these limitations via direct utilization of genomic DNA for genotyping, whilst simultaneously assaying chromatin accessibility.

How does GoT-ChA work?

image In order to capture genotypes within droplet-based scATAC-seq, two GoT-ChA primers are added into the cell barcoding PCR reaction that are designed to flank the locus of interest. One primer contains the partial Nextera Read 1N sequence in its handle, which allows for GoT-ChA amplicons to interact with the 10x Genomics gel bead oligonucleotide and obtain a unique cell barcode, just as tagmented chromatin fragments do. Further, the second GoT-ChA primer allows for exponential amplification of GoT-ChA amplicons while tagmented chromatin fragments are only linearly amplified.

image After the single-cell emulsion is broken, a small portion of the sample is taken for GoT-ChA library construction, comprised of a hemi-nested PCR, biotin-streptavidin pull-down, and an on-bead sample indexing PCR. The final GoT-ChA library can be pooled with scATAC-seq libraries and sequenced together using standard ATAC sequencing parameters.

Designing a GoT-ChA experiment

image To utilize GoT-ChA, three primers need to be designed that flank the genomic region of interest. GoT-ChA_R1 (containing the partial Nextera Read1 sequence in its handle) and GoT-ChA_Rev flank the loci of interest, ideally forming an amplicon between 200-500 bps in size. GoT-ChA_nested is utilized in the hemi-nested PCR during GoT-ChA library construction, and crucially needs to bind within 50bp (inclusive of binding site) of the mutation to be covered with standard scATAC-seq sequencing parameters.

Overview of the Gotcha R package

The Gotcha R package is designed to provide a pipeline for end-to-end processing of the sequecing libraries generated through the GoT-ChA method. It containes functions for both data pre-processing (i.e., filtering of fastq files by base quality, split of fastqs into chunks for parallel processing) as well as mutation calling and noise correction (estimation of noise from empty droplets) and genotype calling. In addition, we have included functions for downstream analysis that work seemingly with the ArchR (https://www.archrproject.com) single cell ATAC-seq processing package, and allow to calculate differential motif accessibility and gene accessibility scores, as well as calculation of co-accessibility for subgroups of cells and visualization of genome tracks.

Data pre-processing

In order to pre-process the fastq files generated through GoT-ChA genotyping, we included two functions within the Gotcha R package as follows:

1 - FastqFiltering: reads in the fastqs generated by gotcha, and filters them based on pre-determined base quality scores.

2 - FastqSplit: if parallelization is available through a slurm cluster, this function allows to split the fastq files into chunks for parallel computing of by the BatchMutationCalling function.

These two functions allow to set up the initial files for running the mutation calling step.

Mutation calling

To perform mutation calling, we provide two main functions. The core of the Gotcha R package is the ability to genotype each read and map it to the corresponding cell barcode. We apply pattern matching including the primer sequence, to guarantee the detection of wildtype or mutant reads containing the primer sequence ("primed reads"). Only those reads that were correctly primed are kept in downstream steps. Once the genotype for each read has been defined, we calculate the counts of reads per cell for each of the specified genotypes. As output, we provide a data frame containing the cell barcode, wildtype read counts, mutant reads counts, and fraction of mutant reads for each cell.

3 - MutationCalling: this is the core function of the Gotcha R package. It allows local parallelization via multiple cores.

4 - BatchMutationCalling: this function allows to speed up computation times in a slurm cluster. It allows to apply the MutationCalling function in the fastq files split by the FastqSplit function, and submit a job to the cluster for each file. Using this function, processing a GoT-ChA genotyping library of 40 million reads takes ~ 4h. As output, it generates a data frame within each folder of the fastq split files, that can be merged using the following function.

5 - MergeMutationCalling: this function allows to merge the outputs of the BatchMutationCalling present in each split folder into one single file. This is the final output of the mutation calling step, and can be merged to the metadata of scATAC-seq objects as generated by Signac or ArchR.

Defining genotype calls

Once we have measured the abundance of wildtype and mutant reads for each cell barcode present in the experiment, we need to define the genotype for each cell. This can be done via two different noise correction methods: Cluster-based genotyping or Empty droplet-based genotyping.

Cluster-based background noise correction and genotype assignment

This approach relies on the expectation of a population of cells for which no genotyping call can be made, due to experimental constraints on capture. These cells may still have non-zero read counts for each allele, representing the level of background signal for each allele. Thus, the data will typically follow a bimodal distribution of supporting reads for either the mutated or wildtype allele. One mode reflects cells with true capture of the mutated allele and the second mode reflects cells where reads reflect background noise. For more details, see Materials and Methods section in the GoT-ChA preprint.

This approach is currently available as a Python script in this repository (/Gotcha/Python/gotcha_labeling.py). Future versions of the Gotcha R package will implement a wrapper function to provide R to Python interface allowing to call the gotcha_labeling.py script from within the R environment using the reticulate R package.

Empty droplet-based noise correction and genotype assignment

The second approach to estimate the background noise present in the genotyping data, is based on leveraging the presence of empty droplets obtained in every 10X run, as has previously been done for noise correction in single cell protein expression. First, background noise is estimated for either wildtype or mutant reads independently. Given the zero-inflated distribution of genotyping reads present in empty droplets and to avoid the potential presence of outlier values (i.e. a droplet that contains a cell but was assigned as empty), we estimate the value of the background noise as that of the 99th percentile of the read number distribution for each genotype independently. Once the background noise is quantified, we proceed to subtract the value for each genotype read count from the barcodes containing real cells. In addition, cells are required to contain a minimum number of genotyping reads (>250 after background subtraction). For more details, see Materials and Methods section in the GoT-ChA preprint. This approach is currently available by appling two functions included in the Gotcha R package:

1 - AddGenotyping: this function allows to estimate and substract the background noise of the wildtype and mutant read counts, and add it directly into the metadata of an ArchR project.

2 - FilterGenotyping: this function generates an additional column in the ArchR project metadata containing a logical indicating if the cell barcode total genotyping read calls are above the specified threshold.

Additional functions for downstream analysis

We provide the functions used for downstream analysis and the corresponding documentation in the /Functions for downstream analysis/ folder. This includes DiffLMM, a function that allows intra-cluster comparisons between genotypes using linear mixture models; DiffCoAccess, a function that leverages ArchR co-accessibility calculations via Cicero to calculate per-genotype co-accessibility; and PlotDCA, a function that leverages ArchR for plotting genomic tracks displaying the co-accessibility loops in a given genomic region.

Testing GoT-ChA

Data availability for testing the Gotcha pipeline

Initial fastq files for the genotyping GoT-ChA library as well as metadata containing the final genotype calls will be publicly available at GEO (GSE204911) upon publication.

gotcha's People

Contributors

fizzo13 avatar theob0t avatar sanjaykottapalli avatar rbmke avatar

Stargazers

 avatar Min Lu avatar  avatar  avatar  avatar  avatar Yuzhe Li avatar Chang Y avatar  avatar  avatar  avatar

Watchers

 avatar  avatar Kostas Georgiou avatar  avatar

gotcha's Issues

zcat: can't stat ...

When running FastqSplit() on MacOS I get the error zcat: can't stat ....

Would you be able to change the zcat command to gunzip -c in FastqSplit()? That would fix this problem under MacOS.

system(paste0("zcat ",path,"/",x," | split - -l ",format(reads*4,scientific=F)," --filter='gzip -f - > $FILE.gz' ",out,"Split/Split_ --additional-suffix _", gsub(x, pattern = ".fastq.gz", replacement = ""), ".fastq"))

Thank you,

livius

Error while running GotchaLabeling()

I got the following error while running the noise correction step:

genotypes <- GotchaLabeling(path = file.path(path_to_output, "MergedOuts/"), 
                                                   infile = "metadata_cells.csv", gene_id = "JAK2",  
                                                   sample_id = "JAK2_single")
Reading in file.
Number of cells: 90147
Noise correcting WT read counts.

 Error in py_call_impl(callable, dots$args, dots$keywords) : 
  FileNotFoundError: [Errno 2] No such file or directory: '~/Split/Filtered/MergedOuts/JAK2_single/WT_kde_initial.pdf'

I've tried changing the format of the metadata_cells.csv just in case, with no luck.

Have you encountered a similar error? Any idea on how to solve this?
Any help would be very much appreciated!

> sessionInfo()
R version 4.1.3 (2022-03-10)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS/LAPACK: /fast/work/users/cofu10_c/miniconda/envs/gotcha/lib/libopenblasp-r0.3.21.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Gotcha_1.0.1        BiocManager_1.30.19 reticulate_1.26    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9                  stringdist_0.9.10          
 [3] here_1.0.1                  lattice_0.20-45            
 [5] deldir_1.0-6                png_0.1-7                  
 [7] Rsamtools_2.10.0            Biostrings_2.62.0          
 [9] rprojroot_2.0.3             assertthat_0.2.1           
[11] utf8_1.2.2                  rslurm_0.6.1               
[13] R6_2.5.1                    GenomeInfoDb_1.30.1        
[15] ShortRead_1.52.0            stats4_4.1.3               
[17] pillar_1.8.1                zlibbioc_1.40.0            
[19] rlang_1.0.6                 rstudioapi_0.14            
[21] minqa_1.2.5                 nloptr_2.0.3               
[23] S4Vectors_0.32.4            Matrix_1.5-3               
[25] splines_4.1.3               lme4_1.1-31                
[27] BiocParallel_1.28.3         readr_2.1.3                
[29] RCurl_1.98-1.9              DelayedArray_0.20.0        
[31] compiler_4.1.3              pkgconfig_2.0.3            
[33] BiocGenerics_0.40.0         tidyselect_1.2.0           
[35] SummarizedExperiment_1.24.0 tibble_3.1.8               
[37] GenomeInfoDbData_1.2.7      IRanges_2.28.0             
[39] matrixStats_0.63.0          fansi_1.0.3                
[41] crayon_1.5.2                dplyr_1.0.10               
[43] tzdb_0.3.0                  GenomicAlignments_1.30.0   
[45] MASS_7.3-58.1               bitops_1.0-7               
[47] grid_4.1.3                  nlme_3.1-160               
[49] jsonlite_1.8.3              lifecycle_1.0.3            
[51] DBI_1.1.3                   magrittr_2.0.3             
[53] cli_3.4.1                   XVector_0.34.0             
[55] hwriter_1.3.2.1             latticeExtra_0.6-30        
[57] ellipsis_0.3.2              generics_0.1.3             
[59] vctrs_0.5.1                 boot_1.3-28.1              
[61] RColorBrewer_1.1-3          tools_4.1.3                
[63] interp_1.1-3                Biobase_2.54.0             
[65] glue_1.6.2                  jpeg_0.1-9                 
[67] hms_1.1.2                   MatrixGenerics_1.6.0       
[69] parallel_4.1.3              GenomicRanges_1.46.1   

Error: "None of ['WhiteListMatch'] are in the columns"Traceback:

Hello the Gotcha authors,

I'm trying your genotyping algorithm and encounter the following error:

"None of ['WhiteListMatch'] are in the columns"

Traceback:

  1. GotchaLabeling(path = "/lab-share/Public/data/DNV/barcode3/",
    . infile = "Genemeta.csv", gene_id = "Gene", sample_id = "RM30")
  2. reticulate::import_from_path(module = "gotcha_labeling", path = find.package("Gotcha"))$GotchaLabeling(path = path,
    . infile = infile, gene_id = gene_id, sample_id = sample_id)
  3. py_call_impl(callable, call_args$unnamed, call_args$named)

My file has a format matching your demo:
image

Could you please tell me in which file should I 'WhiteListMatch'
Thank you,
Sunny

problem with barcode matching

I am having a problem with barcode matching when using the MutationCalling function:

MutationCalling(out = "/home/nathan.lawson-umw/scATACruns/2023_9_28_pilotATACgeno1/gotchaPilot/fastqOutSubset/Split/Filtered/aa/",

  •                  barcodes.file.path = "/home/nathan.lawson-umw/scATACruns/2023_9_28_pilotATACgeno1/scATACgeno1/outs/filtered_peak_bc_matrix/barcodes.tsv",
    
  •                  wt.max.mismatch = 0,
    
  •                  mut.max.mismatch = 0,
    
  •                  keep.raw.reads = F,
    
  •                  reverse.complement = T,
    
  •                  which.read = "R1",
    
  •                  testing = F,
    
  •                  primer.sequence = "CCTGCCTACGCAACAAGATGTGCTT",
    
  •                  primed.max.mismatch = 3,
    
  •                  wt.sequence = "CGA",
    
  •                  mut.sequence= "CAA",
    
  •                  mutation.start = 31,
    
  •                  mutation.end = 34
    
  • )
    ------- BEGIN MUTATION CALLING FUNCTION -------
    ------- FASTQ FILES LOADED -------
    ------- SEQUENCES OBTAINED FROM FASTQ FILES -------
    Data merged for each lane...
    ------- DATAFRAME FOR EACH SAMPLE GENERATED -------
    ------- PRIMED READS IDENTIFIED -------
    number of starting reads = 24177
    number of primed reads = 22574
    % of primed reads = 93.37
    ------- BARCODES IDENTIFIED -------
    ------- WARNING: Cell barcodes (CB) have been reverse complemented due to sequencer used -------
    ------- BARCODE MATCHING BEGINNING -------
    ------- BARCODE MATCHING DONE -------
    total number of primed barcodes = 22574
    total number of primed matched barcodes = 0
    % barcode matching = 0
    Error in $<-.data.frame(*tmp*, "ReadGenotype", value = NA) :
    replacement has 1 row, data has 0

This is using a subset of 25K reads out of the total to check if the package is working for me. Most of my reads have the primer and I can see both mutant and wild type sequences if I just browse through the fastq file. Also, if I pull out the barcodes from the R2 fastq.gz, reverse complement them and use the list as a text file to grep the barcodes.tsv file, it looks like about 260 are matching. So, there are definitely matching barcodes in my 25K subset. I tried using a reverse complemented R2 fastq.gz, as well as using the reverse complement for all reads as primary input (separate tries), but get the same error.

Session info:

sessionInfo()
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] Gotcha_1.0.1

loaded via a namespace (and not attached):
[1] Rcpp_1.0.11 stringdist_0.9.10 lattice_0.20-45 deldir_1.0-9
[5] png_0.1-8 Rsamtools_2.14.0 Biostrings_2.66.0 utf8_1.2.3
[9] rslurm_0.6.2 R6_2.5.1 GenomeInfoDb_1.34.9 ShortRead_1.56.1
[13] stats4_4.2.2 pillar_1.9.0 zlibbioc_1.44.0 rlang_1.1.1
[17] rstudioapi_0.14 minqa_1.2.6 nloptr_2.0.3 S4Vectors_0.36.2
[21] Matrix_1.6-1.1 reticulate_1.32.0 splines_4.2.2 lme4_1.1-34
[25] BiocParallel_1.32.6 readr_2.1.4 RCurl_1.98-1.12 DelayedArray_0.24.0
[29] compiler_4.2.2 rtracklayer_1.58.0 pkgconfig_2.0.3 BiocGenerics_0.44.0
[33] tidyselect_1.2.0 SummarizedExperiment_1.28.0 tibble_3.2.1 GenomeInfoDbData_1.2.9
[37] IRanges_2.32.0 codetools_0.2-19 matrixStats_1.0.0 XML_3.99-0.14
[41] fansi_1.0.4 crayon_1.5.2 dplyr_1.1.3 tzdb_0.4.0
[45] GenomicAlignments_1.34.1 MASS_7.3-58.2 bitops_1.0-7 grid_4.2.2
[49] nlme_3.1-162 jsonlite_1.8.7 lifecycle_1.0.3 magrittr_2.0.3
[53] cli_3.6.1 XVector_0.38.0 hwriter_1.3.2.1 latticeExtra_0.6-30
[57] generics_0.1.3 vctrs_0.6.3 boot_1.3-28 rjson_0.2.21
[61] restfulr_0.0.15 RColorBrewer_1.1-3 tools_4.2.2 interp_1.1-4
[65] Biobase_2.58.0 glue_1.6.2 hms_1.1.3 MatrixGenerics_1.10.0
[69] jpeg_0.1-10 yaml_2.3.7 parallel_4.2.2 GenomicRanges_1.50.2
[73] BiocIO_1.8.0

barcode matching very slow

Hello,

the barcode matching part in MutationCalling() from 737K-cratac-v1.txt or singlecell.csv is a real bottleneck - it takes hours and hours.

Have you thought about using just the filtered_peak_bc_matrix/barcodes.tsv as whitelist? This would narrow the search space considerably from 500k+ to 5-10k barcodes.

The other thought I have is whether it doesn't make sense to just create a dictionary of all possible mismatched cell barcodes plus their whitelisted counterparts and look up each entry from R2 in that dictionary.

For example if your whitelist cell barcode is AAA and you allow 1 mismatch, you would create the following dictionary:
AAA AAA
CAA AAA
ACA AAA
AAC AAA
TAA AAA
ATA AAA
AAT AAA
GAA AAA
AGA AAA
AAG AAA

Now you can just look up the whitelisted barcode from column 2 for each R2 read by using column 1 as index. If you don't find anything, you discard the read.

livius

lsf compatibility

Our cluster uses LSF so the BatchMutationCalling function will not work as in the provided example. Are there options for implementing parallelization using an LSF job manager?

Installation problem.

We are using module environment and python is installed in a shared file system (not the default location). at the moment of installation we get the following error:

  • installing source package ‘Gotcha’ ...
    ** using staged installation
    ** R
    ** byte-compile and prepare package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded from temporary location
    Failed to import the site module
    Traceback (most recent call last):
    File "/gpfs/share/apps/python/cpu/3.6.5/lib/python3.6/site-packages/site.py", line 73, in
    __boot()
    File "/gpfs/share/apps/python/cpu/3.6.5/lib/python3.6/site-packages/site.py", line 39, in __boot
    raise ImportError("Couldn't find the real 'site' module")
    ImportError: Couldn't find the real 'site' module
    ERROR: loading failed
  • removing ‘/gpfs/share/apps/R/4.1.2/lib64/R/library/Gotcha’
    Warning message:
    In i.p(...) :
    installation of package ‘/tmp/RtmpGaJ7Z6/file1594c225541a9/Gotcha_1.0.1.tar.gz’ had non-zero exit status

any help is appreciated.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.