paulranum11 / split-seq_demultiplexing
An unofficial demultiplexing strategy for SPLiT-seq RNA-Seq data
License: MIT License
Hello,
When I ran splitseqdemultiplex, I found that step 4 takes a long time.
It seems that step 4 has been running for a day:
Beginning STEP4: Extracting UMIs. Current time : 2021-05-03 06:45:57.
Do you think this is normal?
Hi, Paul
Thank you very much for developing this useful tool! I am new to SPLiT-seq data analysis and want to make sure I know how to prepare the roundXbarcode files from my primer list. Below are the first-round primers that I used in my experiments.
AGGTCAGAGCATTGAAACATCGTTTTTTTTTTTTTTTVN
AGGTCAGAGCATTGATGCCTAATTTTTTTTTTTTTTTVN
AGGTCAGAGCATTGAGTGGTCATTTTTTTTTTTTTTTVN
AGGTCAGAGCATTGATCATTCCNNNNNN
AGGTCAGAGCATTGATTGGCTCNNNNNN
AGGTCAGAGCATTGCAAGGAGCNNNNNN
So the corresponding file should be:
AGGTCAGAGCATTGAAACATCG
AGGTCAGAGCATTGATGCCTAA
AGGTCAGAGCATTGAGTGGTCA
AGGTCAGAGCATTGATCATTCC
AGGTCAGAGCATTGATTGGCTC
AGGTCAGAGCATTGCAAGGAGC
It looks like the barcodes in your example file are much shorter. Could you please explain a little more about how to prepare the file?
Thanks for your help!
Best,
Monica
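The trimming described in this post can be sketched in a few lines of Python. This is a minimal sketch, not the tool's documented procedure: it assumes the capture tail is either a run of T followed by VN (oligo-dT) or NNNNNN (random hexamer), and the names `TAIL` and `strip_capture_tail` are hypothetical. It also prints the part that differs between primers after removing their shared prefix, since the question notes that the example barcodes are shorter. Whether the tool expects the full trimmed primer or only that variable part is not confirmed here.

```python
import os
import re

# First-round primers copied from the post above.
primers = [
    "AGGTCAGAGCATTGAAACATCGTTTTTTTTTTTTTTTVN",
    "AGGTCAGAGCATTGATGCCTAATTTTTTTTTTTTTTTVN",
    "AGGTCAGAGCATTGAGTGGTCATTTTTTTTTTTTTTTVN",
    "AGGTCAGAGCATTGATCATTCCNNNNNN",
    "AGGTCAGAGCATTGATTGGCTCNNNNNN",
    "AGGTCAGAGCATTGCAAGGAGCNNNNNN",
]

# Assumption: the 3' capture tail is oligo-dT (T run + VN) or a random hexamer.
# Caveat: a barcode that itself ends in T would be over-trimmed by this pattern.
TAIL = re.compile(r"(T{10,}VN|N{6})$")


def strip_capture_tail(primer: str) -> str:
    """Remove the assumed oligo-dT or random-hexamer tail from a primer."""
    return TAIL.sub("", primer.strip().upper())


trimmed = [strip_capture_tail(p) for p in primers]

# The prefix shared by every trimmed primer, and what is left after removing it.
shared_prefix = os.path.commonprefix(trimmed)
variable_part = [t[len(shared_prefix):] for t in trimmed]

for full, variable in zip(trimmed, variable_part):
    print(full, variable)
```

The first column reproduces the list proposed in the post; the second column is the portion that actually varies between primers.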
Hi there,
I noticed that barcode 1 seems to be missing for a lot of our reads, which reduces the cell counts and the reads per cell. I'm wondering whether this is because we are using v3 of SPLiT-seq, and whether this pipeline reflects that version. The positioning of barcode 1 changed in v3.
If the UMI is bases 1-10 of the barcode read, then according to this schematic, the barcode attached to the oligo-dT/random hexamer primers is the round 1 barcode, but it should be the third barcode sequenced, since sequencing proceeds outside-in (i.e. UMI-BC3-BC2-BC1). If I'm understanding the collapse script (both versions) correctly, it looks like we're collapsing based on the first barcode sequenced rather than the third.
Maybe we're misunderstanding something about the amplification or the direction of sequencing, or perhaps the demultiplexing Python script is (correctly) reading the barcodes from right to left? Either way, can you please confirm or clarify this?
Thanks!
Hi-
We finally generated some full-scale SPLiT-seq data, and the runtime for this tool is quite long. We have ~850M reads; the initial demultiplexing step and the collapsing of ODT/random hexamers each take days to complete.
Are there any performance improvements you might be able to make to get this tool to scale better with data input size? zUMIs by comparison can do most of this in hours or less...
Thanks!
It would be helpful to describe which steps are recommended once the FASTQs are separated by cell. The SPLiT-seq paper uses one of the Drop-seq tools (TagReadWithGeneExon), followed by Starcode to collapse UMIs of aligned reads that are within 1 nt mismatch of another UMI. The authors do not describe how they generated their final cell x gene expression matrix, but I assume they used the DigitalExpression tool from the Drop-seq toolkit as well.
I'm trying to figure out how best to connect the dots from the output of your method to those steps, or if some other approach is better.
Any advice would be appreciated!
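To make the UMI-collapsing step mentioned above concrete, here is a minimal sketch of merging UMIs that lie within 1 nt mismatch of a more abundant UMI. It is a simplified greedy stand-in, not Starcode's algorithm and not part of the Drop-seq toolkit. It assumes the UMIs passed in already belong to one cell and one gene, and the function names and toy UMIs are purely illustrative.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis, max_mismatch=1):
    """Greedily merge each UMI into the first more abundant UMI within max_mismatch.

    Returns a dict mapping each representative UMI to its merged read count.
    """
    counts = Counter(umis)
    centers = {}
    # Visit the most abundant UMIs first; break ties alphabetically for determinism.
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        for center in centers:
            if len(center) == len(umi) and hamming(center, umi) <= max_mismatch:
                centers[center] += n
                break
        else:
            # No existing representative is close enough: start a new one.
            centers[umi] = n
    return centers


# Toy UMIs for one hypothetical cell/gene pair (illustrative only).
toy = (
    ["AAAAAAAAAA"] * 3
    + ["AAAAAAAAAT"]      # 1 mismatch from AAAAAAAAAA, gets merged
    + ["CCCCCCCCCC"] * 2
    + ["CCCCCCCCGG"]      # 2 mismatches from CCCCCCCCCC, kept separate
)
collapsed = collapse_umis(toy)
print(collapsed)
print("distinct molecules:", len(collapsed))
```

The number of representatives left per cell and gene is what would feed a cell x gene count matrix; a dedicated tool is the better choice for real data because a quadratic comparison like this one does not scale.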
Hi,
Thank you for this good work.
I am testing your latest version, SPLiT-Seq_demultiplexing0.1.1.
As I understand from the SPLiT-seq protocol, Read 1 and Read 2 of the FASTQ files are laid out as follows:
Read 1 (66 nt) = transcript
Read 2 (94 nt) = UMI + BC3 + spacer + BC2 + spacer + BC1
where the UMI starts at nt 1, BC3 starts at nt 11, BC2 starts at nt 48, and BC1 starts at nt 86.
However, when I look at the merged FASTQ files produced by SPLiT-Seq_demultiplexing0.1.1 (using the small test FASTQ files that are downloaded along with the software), the UMI is always the first 10 nt of Read 1, not the first 10 nt of Read 2.
For example, for this Read 1 and Read 2:
Read1:
@SRR6750041.1 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
Read2:
@SRR6750041.1 1/2
NNTACTAAAGGCTAACGAGTGGCCGCTGTTTCGCATCGGCGTACGACTATTGAGGAATCCACGTGCTTGAGAGGCCAGAGCATTCGAACGCTTA
+
##AAAEEEEEEEEEEEEEE/6E/EE6/EAE<66666AEAAAE66<<</<<EEEEEEEEAAAAE666666AAE<<EEEEEEEE<AEAEEEEEEE/
SPLiT-Seq_demultiplexing0.1.1 produces this output:
@SRR6750041.1_AGCATTCGAACGCTTAATTGAGGAATCCAGCTAACGAGTGGCC_CTGGANAAGT 1/1
CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEA<AAEEEEE<6
In the header, the last characters, _CTGGANAAGT, are the UMI, which is the first 10 nt of Read 1, not of Read 2.
Can you please tell me whether this is correct, or whether I am the one who is wrong?
I hope I have explained this clearly.
Thank you in advance for your reply,
With best wishes,
Duma
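The comparison made in this post can be checked with a short script. This is a minimal sketch under stated assumptions: the 0-based offsets (0, 10, 48, 86) and the 8-nt barcode length in `LAYOUT` are assumptions chosen because they line up with the barcode string in the header of the example above. They are not taken from the tool's source, and `LAYOUT` and `slice_read2` are hypothetical names.

```python
# Sequences and header copied from the example in the post above.
read1_seq = "CTGGANAAGTGAAATAATATAAATTTTTCCACTATTGAATAAAAGCAACTTAAATTTTCTAAGTCG"
read2_seq = (
    "NNTACTAAAGGCTAACGAGTGGCCGCTGTTTCGCATCGGCGTACGACTATTGAGGAATCC"
    "ACGTGCTTGAGAGGCCAGAGCATTCGAACGCTTA"
)
header = "@SRR6750041.1_AGCATTCGAACGCTTAATTGAGGAATCCAGCTAACGAGTGGCC_CTGGANAAGT 1/1"

# Assumed layout of Read 2 as (0-based start, length); not confirmed by the tool.
LAYOUT = {"UMI": (0, 10), "BC3": (10, 8), "BC2": (48, 8), "BC1": (86, 8)}


def slice_read2(seq: str) -> dict:
    """Cut Read 2 into UMI and barcode segments using the assumed layout."""
    return {name: seq[start:start + length] for name, (start, length) in LAYOUT.items()}


parts = slice_read2(read2_seq)

# The header has the form @<read id>_<barcode string>_<UMI> <mate>.
read_id, barcodes, header_umi = header.split()[0].split("_")

print("Read 2 segments:", parts)
print("UMI in header:", header_umi)
print("header UMI == first 10 nt of Read 1:", header_umi == read1_seq[:10])
print("header UMI == first 10 nt of Read 2:", header_umi == parts["UMI"])
for name in ("BC1", "BC2", "BC3"):
    print(name, "found in header barcode string:", parts[name] in barcodes)
```

For this one record, the barcode segments sliced from Read 2 all occur in the header's barcode string, while the UMI in the header matches the start of Read 1 rather than the start of Read 2, which is the discrepancy the post describes.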