sandberg-lab / spreading-correction

Supplementary information to "Computational correction of index switching in multiplexed sequencing libraries" (Larsson et al. 2018).

License: MIT License

Jupyter Notebook 97.52% Python 2.48%

single-cell-sequencing bioinformatics

spreading-correction's Issues

[--i5 STRING] [--i7 STRING]

Hi,

Could you please explain in more detail how to define the [--i5 STRING] and [--i7 STRING] options? I am using read counts from a 384-well plate, with cell barcode names as the column names.

Thanks in advance!

Tips for applying "Spreading-Correction" on Single Cell Genome Assemblies?

Hello,
I want to try this tool to rescue our bacterial single-cell data, which showed extremely high levels of cross-contamination when multiplexed on the HiSeq X Ten. Since our libraries are based on Multiple Displacement Amplification (MDA) products, they have highly uneven coverage, similar to transcriptome data, and simple coverage cutoffs are not enough to identify cross-contaminants.
It is also not enough to identify contigs occurring in multiple samples and attribute them to the one library where they have the highest coverage, since I may have several single cells originating from the same species. So your tool seems to be a blessing here.

My plan to apply your workflow for contamination here is as follows:
1.) Assemble the data of each library separately.
2.) Cluster the contigs of all assemblies using strict identity cutoffs, to obtain representative, mappable consensus contigs for each (potential) contaminant.
3.) Map reads onto the clustered assemblies to obtain coverage values for each contig cluster in each library, and create input files similar to your transcriptome input.
4.) Use your script to correct the coverage data for cross-contamination, and then filter contaminating contigs from my datasets based on coverage cutoffs.

However, since the contigs differ in size (and are unlikely to be complete in all cross-contaminated samples), I do not want to use simple read counts per contig, but rather average coverage values (e.g. mean read coverage per base position). This would mostly result in decimal values.

Your example input seems to consist exclusively of integer values. Does your script also accept decimal/float coverage values in the input table? Or do I need to round such data? Or would you recommend a completely different approach here?
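The "mean read coverage per base position" mentioned above could be computed roughly as in the following minimal sketch. The input layout (tuples of contig, position, depth, plus a dictionary of contig lengths) and the function name `mean_coverage` are assumptions made for illustration; they are not part of the spreading-correction repository, and the sketch says nothing about whether the correction script accepts float values.

```python
from collections import defaultdict

def mean_coverage(depth_rows, contig_lengths):
    """Return the mean per-base coverage for each contig.

    depth_rows: iterable of (contig, position, depth) tuples; positions with
        zero depth may be absent (hypothetical input layout).
    contig_lengths: dict mapping contig name to its length in bases.
    """
    totals = defaultdict(int)
    for contig, _position, depth in depth_rows:
        totals[contig] += depth
    # Divide by the full contig length so uncovered positions count as zero.
    return {name: totals[name] / length for name, length in contig_lengths.items()}

# Toy example: contig c1 has 4 bases, of which only 2 are covered.
rows = [("c1", 1, 4), ("c1", 2, 6), ("c2", 1, 3)]
lengths = {"c1": 4, "c2": 1}
coverage = mean_coverage(rows, lengths)
print(coverage)
```

Dividing by the full contig length rather than by the number of covered positions keeps partially covered contigs from looking as deep as fully covered ones, which matters here because contigs are unlikely to be complete in the cross-contaminated samples.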

Incomplete sequencing plate / incomplete index values

Hello!
We would like to use your tool to check some of our data, as we (probably) had some cross-contamination. Our data come from a 16S amplicon analysis of a microbial community. Within the data, some samples have a very high count of certain ASVs, some have very few, and some should have none. We believe that the unspread.py script might help us solve this issue.

As far as we understand, we can specify the number of samples by giving the number of rows and columns. However, when the samples were sequenced, the sequencing plate was not fully used. Thus, some combinations of the two indices are empty, and we get the error message: "number of cells in count file not same as specified". We don't know how to fix this, as filling the empty combinations with zeros could distort the regression. Do you have any suggestions for handling this situation? That would be much appreciated.
Thank you very much in advance!
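To make the situation above concrete, the empty index combinations on a partially used plate can be enumerated as in this minimal sketch. The index labels and the function name `missing_index_pairs` are hypothetical, and the sketch does not address how unspread.py should treat these wells or whether zero-filling affects the regression; it only shows which combinations would account for the mismatch between the specified and observed number of cells.

```python
from itertools import product

def missing_index_pairs(i5_indices, i7_indices, observed_pairs):
    """List the (i5, i7) combinations of the full plate that have no sample."""
    observed = set(observed_pairs)
    return [pair for pair in product(i5_indices, i7_indices) if pair not in observed]

# Toy 2 x 3 plate on which only four of the six wells were used.
i5 = ["A", "B"]
i7 = ["1", "2", "3"]
used = [("A", "1"), ("A", "2"), ("B", "1"), ("B", "3")]
empty = missing_index_pairs(i5, i7, used)
print(empty)
```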

Data availability

Hi, where can I find the data used for the correction, mHSC_plate1HiSeq_counts_IndexInfo.csv and mHSC_plate1NextSeq_counts_IndexInfo.csv ?
Many thanks.

Suggestion for using Spreading-Correction on whole genome resequencing?

Hello, Anton.
We would like to apply this tool to our genome resequencing data, obtained on an Illumina NovaSeq 6000. Do you have any suggestions on how to calculate the read counts for each cell in WGS? Would read depth at certain positions along the chromosomes work?

IndexInfo file

Dear all,

During the process of a typical Illumina sequencing run, where do you retrieve the equivalent of the "mHSC_plate1HiSeq_counts_IndexInfo_anon.csv" you are using as input?

Is it a raw file you retrieve at a given step?
Is the file created from the concatenation / parsing of many files?

I did read the Nature Methods article and the Git repo but I am still unsure about this.
My wet lab team is asking for more details to retrieve such a file, thus the questions above.

Best regards
