sandberg-lab / spreading-correction

Supplementary information to "Computational correction of index switching in multiplexed sequencing libraries" (Larsson et al. 2018).

License: MIT License

single-cell-sequencing bioinformatics

spreading-correction's Introduction

Computational correction of index switching in multiplexed sequencing libraries

Supplementary information to the article "Computational correction of index switching in multiplexed sequencing libraries" by Anton JM Larsson, Geoff Stanley, Rahul Sinha, Irving L. Weissman, and Rickard Sandberg, available in Nature Methods (http://dx.doi.org/10.1038/nmeth.4666).

The Jupyter notebook (Correcting_Spreading_of_Signal_Notebook.ipynb) contains Python code for the analysis and correction of index-swapping, including the generation of Figures 1, 2A-C, S1 and S2 in the manuscript (written by Anton JM Larsson). The notebook (sandbergCorrection_analyzeClustering.ipynb) contains the R code to reproduce Figures 2D-E and S3 of the manuscript (written by Geoff Stanley).

unspread.py

unspread.py estimates the percentage of contaminating reads in the experiment, estimates the 'rate of spreading', and corrects the read counts if the experiment is affected to a sufficient degree. The unspread.py script requires a table of read counts supplied as a .csv file with added information regarding each cell's index barcodes.

System Requirements

unspread.py is a Python 3 script with the following dependencies:

pandas: 0.19.2
numpy: 1.9.0
matplotlib: 2.0
statsmodels: 0.6.1
scipy: 1.0.0
patsy: 0.4.1

No further installation is needed.
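
To compare a local environment against the versions listed above, a small check like the following may help. This is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and the listed versions are treated as the versions the script was tested with, not as verified minimums.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions as listed in the README (assumed to be the tested versions).
LISTED = {
    "pandas": "0.19.2",
    "numpy": "1.9.0",
    "matplotlib": "2.0",
    "statsmodels": "0.6.1",
    "scipy": "1.0.0",
    "patsy": "0.4.1",
}


def installed_versions(packages):
    """Return {package: installed version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, have in installed_versions(LISTED).items():
        print(f"{name}: listed {LISTED[name]}, installed {have or 'missing'}")
```

The sketch only reports what is installed; it does not decide whether a newer version is compatible with unspread.py.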

Usage

usage: unspread.py [-h] [--i5 STRING] [--i7 STRING] [--rows INTEGER] [--cols INTEGER] [--idx_col INTEGER] [--sep CHAR] [--h INTEGER] [--c INTEGER] [--t FLOAT] [--idx_in_id BOOLEAN] [--delim_idx CHAR] [--column BOOLEAN] filename

Unspread: Computational correction of barcode index spreading

positional arguments:

filename .csv file with counts

optional arguments:

-h, --help show this help message and exit

--i5 STRING Index name of i5 barcodes (default: 'i5.index.name')

--i7 STRING Index name of i7 barcodes (default: 'i7.index.name')

--rows INTEGER Number of rows in plate (default: 16)

--cols INTEGER Number of columns in plate (default: 24)

--idx_col INTEGER Which column serves as the index (default: 0)

--sep CHAR The separator in the .csv file (default ',')

--h INTEGER The number of reads to use to be considered highly expressed in only one cell (default: 30)

--c INTEGER Cutoff to remove additional false positives (default: 5)

--t FLOAT Threshold for acceptable fraction of spread counts (default: 0.05)

--idx_in_id BOOLEAN If the index is in the cell id (i.e. cellid_i5_i7) (Default: 0 (False), set to 1 otherwise (True))

--delim_idx CHAR If the index is in the cell id, the delimiting character (Default: '_')

--column BOOLEAN If each column represents a cell, otherwise each row. (default: 1 (True), set to 0 otherwise (False))
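
As an illustration of the `--idx_in_id` and `--delim_idx` options, the sketch below shows one way a cell id of the form cellid_i5_i7 could be split into its parts. It is an assumption about the intended layout, not the script's actual parsing code: the function name `split_cell_id` is hypothetical, and it assumes the last two delimited fields are the i5 and i7 index names, in that order.

```python
def split_cell_id(cell_id, delim="_"):
    """Split 'cellid_i5_i7' into (cell, i5, i7).

    Assumption: the i5 and i7 index names are the last two fields,
    so the cell name itself may contain the delimiter.
    """
    parts = cell_id.rsplit(delim, 2)
    if len(parts) != 3:
        raise ValueError(f"expected at least two '{delim}' in {cell_id!r}")
    cell, i5, i7 = parts
    return cell, i5, i7


# Hypothetical id built from names in the README example.
print(split_cell_id("HSC02_a_S522_N701"))
```

Splitting from the right keeps cell names such as `HSC02_a` intact even though they contain the delimiter.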

Output

unspread.py outputs a set of figures with diagnostic information comparable to the figures in the article. A log file is also saved. If the plate is affected, a corrected .csv file is also written.

Example

An example from the first plate in the manuscript:

cell.name	N.index.name	S.index.name	0610005C13Rik	0610007C21Rik	...
HSC02_a_p1c7r2_P01	N701	S522	0	117	...
HSC02_a_p1c5r5_P03	N702	S522	0	5	...

In this particular example, genes are arranged in columns and cells in rows, but the converse is also supported.
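
A table in this layout can be assembled with pandas by joining the index barcode annotation to the count matrix. The following is a minimal sketch using the two cells shown above as toy data; the variable names are illustrative, and the output here uses a comma separator (the script's documented default for `--sep`), which may differ from the separator of the example file.

```python
import pandas as pd

# Toy count matrix: cells as rows, genes as columns (values from the excerpt above).
cells = pd.Index(["HSC02_a_p1c7r2_P01", "HSC02_a_p1c5r5_P03"], name="cell.name")
counts = pd.DataFrame(
    {"0610005C13Rik": [0, 0], "0610007C21Rik": [117, 5]},
    index=cells,
)

# Index barcode names for each cell, in the same cell order.
barcodes = pd.DataFrame(
    {"N.index.name": ["N701", "N702"], "S.index.name": ["S522", "S522"]},
    index=cells,
)

# Barcode columns first, then the gene counts, as in the example table.
table = pd.concat([barcodes, counts], axis=1)
csv_text = table.to_csv()  # pass a path instead to write a file
print(csv_text)
```

With cells as rows, the index columns would be passed by name via `--i5` and `--i7`, together with `--column 0`.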

To run the correction of the first plate in the manuscript:

./unspread.py mHSC_plate1HiSeq_counts_IndexInfo_anon.csv --i5 'S.index.name' --i7 'N.index.name' --column 0 --sep ' '

This command should not take longer than a minute.

The expected command line output is:

Reading file: mHSC_plate1HiSeq_counts_IndexInfo_anon.csv

Estimating spreading from mHSC_plate1HiSeq_counts_IndexInfo_anon.csv

Found expression to be biased along a certain column and row combination 753 times out of 899

Estimated the median rate of spreading to be 0.0098

Estimated fraction of spread reads to be 0.14827 and variance explained R-squared = 0.8996

Saving figure from analysis to mHSC_plate1HiSeq_counts_IndexInfo_anon_figures.pdf

Saving log file from analysis to mHSC_plate1HiSeq_counts_IndexInfo_anon_unspread.log

Correcting spreading for each gene

Saving correction to mHSC_plate1HiSeq_counts_IndexInfo_anon_corrected.csv
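
The estimates printed above can be pulled out of captured output for record keeping. The sketch below is not part of unspread.py: `parse_unspread_output` is a hypothetical helper that assumes the wording of the two estimate lines matches the example exactly. The final comparison against 0.05 mirrors the documented default of `--t`; whether the script applies a strict comparison is an assumption.

```python
import re

# The two estimate lines from the expected output above.
EXAMPLE_OUTPUT = """\
Estimated the median rate of spreading to be 0.0098
Estimated fraction of spread reads to be 0.14827 and variance explained R-squared = 0.8996
"""


def parse_unspread_output(text):
    """Extract rate, fraction of spread reads, and R-squared as floats."""
    rate = re.search(r"median rate of spreading to be (\d+\.\d+)", text)
    frac = re.search(
        r"fraction of spread reads to be (\d+\.\d+) "
        r"and variance explained R-squared = (\d+\.\d+)",
        text,
    )
    if rate is None or frac is None:
        raise ValueError("expected estimate lines not found")
    return {
        "rate": float(rate.group(1)),
        "fraction": float(frac.group(1)),
        "r_squared": float(frac.group(2)),
    }


estimates = parse_unspread_output(EXAMPLE_OUTPUT)
# Assumed rule: a fraction above the --t default (0.05) triggers correction.
exceeds_default_threshold = estimates["fraction"] > 0.05
print(estimates, exceeds_default_threshold)
```

In the example run, the estimated fraction of spread reads (0.14827) is above the default threshold of 0.05, which is consistent with the correction step appearing in the output.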

The genes in the manuscript, Mki67 and Tacr, have ID 7963 and 12319 respectively.

spreading-correction's Issues

[--i5 STRING] [--i7 STRING]

Hi,

Could you please explain in more detail how to define the [--i5 STRING] and [--i7 STRING] arguments? I am using read counts from a 384-well plate with cell barcode names as the column names.

Thanks in advance!

Tips for applying "Spreading-Correction" on Single Cell Genome Assemblies?

Hello,
I want to try to apply this tool in order to rescue our bacterial single cell data that showed extremely high levels of cross-contamination when multiplexed on the HiSeq X Ten. Since our libraries are based on Multiple Displacement Amplification (MDA) products, they have highly uneven coverage, similar to transcriptome data, and simple coverage cutoffs are not enough to identify cross-contaminants.
Also, it is not enough to simply identify contigs occurring in multiple samples and attribute them to the one library where they have the highest coverage, since I may have a few single cells originating from the same species. So your tool seems to be a blessing here.

My plan to apply your workflow for contamination here is as follows:
1.) Assemble the data of each library separately.
2.) Cluster the contigs of all assemblies using strict identity cutoffs, to obtain representative, mappable consensus contigs for each (potential) contaminant.
3.) Map reads onto the clustered assemblies in order to obtain coverage values for each contig cluster in each library and create input files similar to your transcriptome input.
4.) Use your script to correct the coverage data with respect to the cross-contamination and then filter contaminating contigs from my datasets based on coverage cutoffs.

However, since the contigs are differently sized (and moreover are unlikely to be fully complete in all cross-contaminated samples), I do not want to use simple read counts per contig, but rather average coverage values (e.g. mean read coverage per base position). This would mostly result in decimal values.

Since your example input seems to consist exclusively of integer values: does your script also accept decimal/float coverage values in the input table? Or do I need to round such data? Or would you recommend a completely different approach here?

Incomplete sequencing plate / incomplete index values

Hello!
We would like to use your tool to check some of our data, as we (probably) had some cross-contamination. Our data is from 16S amplicon analysis to characterize the microbial community. Within the data, some samples have a very high count of certain ASVs, some have very few, and some (should) have none. We believe that the unspread.py script might help us solve this issue.

As far as we understand, we can specify the number of samples by giving the number of rows and columns. However, when the samples were sequenced, not the whole sequencing plate was used. Thus, some combinations of the two indices are empty and we get the error message: "number of cells in count file not same as specified". We don't know how to fix that, as filling up with 0's could mess up the regression. Do you have any suggestions on how to handle this situation? That would be much appreciated.
Thank you very much in advance!

Data availability

Hi, where can I find the data used for the correction, mHSC_plate1HiSeq_counts_IndexInfo.csv and mHSC_plate1NextSeq_counts_IndexInfo.csv?
Many thanks.

IndexInfo file

Dear all,

During a typical Illumina sequencing run, where do you retrieve the equivalent of the "mHSC_plate1HiSeq_counts_IndexInfo_anon.csv" file you are using as input?

Is it a raw file you retrieve at a given step?
Is the file created from the concatenation / parsing of many files?

I did read the Nature Methods article and the Git repo but I am still unsure about this.
My wet lab team is asking for more details to retrieve such a file, thus the questions above.

Best regards

Suggestion for using Spreading-Correction on whole genome resequencing?

Hello, Anton.
We would like to apply this tool to our genome resequencing data obtained with an Illumina NovaSeq 6000. Do you have any suggestions on how to calculate the read counts for each cell in WGS? What about read depth at certain positions along the chromosomes?