scrna-seq_contamination's Introduction

The effect of background noise and its removal on the analysis of single-cell expression data

This repository contains the necessary code to reproduce the analysis for the manuscript "The effect of background noise and its removal on the analysis of single-cell expression data".

Snakemake

We process multiple single-cell and single-nucleus RNA-seq datasets of mouse kidney with a common pipeline to estimate levels of background RNA contamination and compare different methods for correction of background RNA. This includes:

  • cell calling
  • cell type assignment: Reference-based classification of single cells using a publicly available annotated dataset.
  • genotyping and strain assignment: Using a list of genetic variants between mouse strains downloaded from the Mouse Genomes Project we determine the strain identity of each cell by genotyping single cells with cellsnp-lite and demultiplexing with vireo. Importantly, this also provides matrices of allele counts per cell barcode.
  • genotype estimation: We identify cross-strain contamination based on genetic variants and estimate background RNA levels per single cell from this.
  • filtering and preprocessing: Basic filtering, processing and clustering steps to prepare the count matrix for further analysis.
  • applying different correction methods: We compare three methods designed to remove noise originating from ambient/background RNA (CellBender, SoupX and DecontX) across a range of parameter settings.
  • evaluation of correction performance: We evaluate the output of each method by comparing it to our genotype-based estimates and calculating metrics to assess denoising performance.
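
The idea behind the genotype-based estimation step can be illustrated with a small sketch. This is a deliberately simplified illustration and not the estimator implemented in this repository: in a cell assigned to one strain, reads carrying alleles of the other strains must come from background RNA, but the background also contains RNA from the cell's own strain, so the observed cross-strain fraction is scaled up by the fraction of foreign alleles expected in the background. The function name, its arguments and the single global `background_foreign_fraction` are assumptions made for this example.

```python
def estimate_background_fraction(foreign_allele_counts, total_allele_counts,
                                 background_foreign_fraction):
    """Simplified per-cell background estimate from cross-strain allele counts.

    foreign_allele_counts: per-cell counts of alleles from other strains.
    total_allele_counts: per-cell counts of all alleles at informative variants.
    background_foreign_fraction: assumed fraction of foreign alleles in the
        background RNA profile (hypothetical input, e.g. from empty droplets).
    """
    if not 0 < background_foreign_fraction <= 1:
        raise ValueError("background_foreign_fraction must be in (0, 1]")

    estimates = []
    for foreign, total in zip(foreign_allele_counts, total_allele_counts):
        if total == 0:
            # No informative coverage: no estimate is possible for this cell.
            estimates.append(None)
            continue
        observed = foreign / total
        # Scale up to account for undetectable same-strain contamination,
        # and cap at 1 because a fraction cannot exceed 100%.
        estimates.append(min(1.0, observed / background_foreign_fraction))
    return estimates
```

For example, a cell with 2 foreign alleles out of 100 covered alleles and an assumed background foreign fraction of 0.5 would be assigned a background level of roughly 4%.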

Snakemake_benchmark

Since you might not be interested in running the whole pipeline from start to finish, we provide a reduced version of the workflow here that only covers benchmarking: application of different methods & performance evaluation.
To complete the input for this pipeline, some larger files have to be downloaded from Zenodo (see the next section). For each dataset, copy the files seurat.RDS and seurat_CAST.RDS into the folder input/{dataset}, and the files filtered_feature_bc_matrix.h5 and raw_feature_bc_matrix.h5 into the subfolder input/{dataset}/cellranger.
The config.yml file can be modified to select benchmarking datasets, methods and parameter settings. Each method is applied to the selected benchmark datasets for background noise correction, and the outputs are evaluated with several evaluation metrics described in the manuscript:

[Figure: workflow]
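
Before starting the benchmark workflow, it can help to confirm that the downloaded files ended up in the expected places. The following is a minimal sketch that assumes only the folder layout described above (input/{dataset} and input/{dataset}/cellranger); the function name and the returned list format are illustrative and not part of the repository.

```python
from pathlib import Path

# File names as listed in this README.
SEURAT_FILES = ("seurat.RDS", "seurat_CAST.RDS")
CELLRANGER_FILES = ("filtered_feature_bc_matrix.h5", "raw_feature_bc_matrix.h5")


def missing_benchmark_inputs(input_root, dataset):
    """Return the expected input files that are missing for one dataset."""
    dataset_dir = Path(input_root) / dataset
    expected = [dataset_dir / name for name in SEURAT_FILES]
    expected += [dataset_dir / "cellranger" / name for name in CELLRANGER_FILES]
    return [str(path) for path in expected if not path.is_file()]
```

Calling `missing_benchmark_inputs("input", "rep1")` (where `rep1` is a placeholder dataset name) returns an empty list once all four files are in place.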

To add and evaluate a new method, add a new script and a corresponding rule to the Snakefile. The rule must produce two outputs, which are required for all evaluation steps: a denoised count matrix (benchmark/corrected/{method}/{dataset}/{parameter_setting}_cormat.RDS) and a table with estimated background noise levels per cell (benchmark/corrected/{method}/{dataset}/{parameter_setting}_contPerCell.RDS).
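
The two output paths follow a fixed pattern, which a small helper can make explicit. This is only a sketch of the naming convention stated above; the function name, the dictionary keys and the example values are hypothetical and do not appear in the repository.

```python
from pathlib import PurePosixPath


def expected_method_outputs(method, dataset, parameter_setting,
                            root="benchmark/corrected"):
    """Build the two output paths the evaluation steps expect for a method."""
    base = PurePosixPath(root) / method / dataset
    return {
        # Denoised count matrix
        "corrected_matrix": str(base / f"{parameter_setting}_cormat.RDS"),
        # Estimated background noise level per cell
        "contamination_per_cell": str(base / f"{parameter_setting}_contPerCell.RDS"),
    }
```

For a hypothetical method `myMethod` run on dataset `rep1` with setting `default`, the corrected matrix path would be `benchmark/corrected/myMethod/rep1/default_cormat.RDS`.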

Benchmark Data availability

We analysed 5 mouse 10X experiments. Each is a mix of kidney cells from 3 mouse strains (BL6, SVLMJ, CAST). The data can be downloaded from Zenodo.

We provide files with cell type, strain and contamination information as one zip archive per replicate, each containing 5 files:

  • filtered_feature_bc_matrix.h5 - CellRanger output, filtered count matrix
  • raw_feature_bc_matrix.h5 - CellRanger output, raw count matrix
  • seurat_CAST.RDS - Processed Seurat object with cell type annotations, M. m. castaneus cells only
  • seurat.RDS - Processed Seurat object with cell type annotations, all cells
  • perCell_noMito_CAST_binom.RDS - Estimated background noise levels per cell in M. m. castaneus cells

Analysis

Beyond the standardized pipeline, we perform further analysis to compare empty droplet, contamination and endogenous profiles (Deconvolution) and summarize evaluation metrics of the method benchmark (Benchmark).
This folder also contains some files that are necessary to reproduce the analysis and figures:

  • cell_metadata.RDS: cell-wise metadata information including replicate, celltype, Strain, background noise level (contPerCell_binom) and some cell QC metrics.
  • benchmark_metrics.RDS: For each combination of method (CellBender, SoupX, DecontX, raw), parameter setting and replicate this table contains a collection of metrics to evaluate the performance in estimating background noise levels and improving downstream analysis after correction.
  • Proximal tubule (PT) cell markers: 1) Markers downloaded from PanglaoDB that were detected (panglao_markers_Mm.RDS) and the top 10 markers with the highest average expression in PT cells (top10_PT_markers.RDS). 2) Genes that were detected as differentially expressed (DE) between PT and other cells after correction with CellBender/SoupX/DecontX (DE_seurat_sigUP.RDS).
  • Stats related to informative variants and their coverage per cell (per_cell_stats_CAST_variants.RDS, position_stats_summary.RDS)

Figure scripts

  • 01_dataset_description: Strain and cell type composition of input datasets, as well as some additional statistics about the genotype estimation strategy (related to Figure 1, Suppl. Figure S3).
  • 02_backgroundRNA_estimates: Visualization of background noise fractions per cell (related to Figure 2, Suppl. Figures S1,S2,S4).
  • 03_origin_of_background RNA: Comparison of endogenous expression, background noise contamination and empty droplet profiles (related to Figure 3, Suppl. Figures S5,S6).
  • 03_02_barcode_swapping: Identification and quantification of barcode swapping events originating from PCR chimeras (related to Suppl. Figure S7).
  • 04_effect_on_downstream_analysis: Impact of background noise on specificity and detectability of marker genes (related to Figure 4, Suppl. Figures S8,S9).
  • 05_benchmark_estimation: Comparison of background noise estimation accuracy of different computational methods (related to Figure 5, Suppl. Figure S10).
  • 06_benchmark_downstream_analysis: Summary of the method comparison results on downstream analysis effects, based on the Snakemake pipeline, across datasets and parameter settings (related to Figure 6, Suppl. Figures S11-15).

Contributors

phjanssen, ineshellmann
