Comments (21)
This is now published in simpleaf 0.4.0, which was just merged in bioconda.
from scrnaseq.
Hi @fmalmeida,
Thanks for pointing this out. I updated the docs when we released 0.4.0, but forgot to bump the version info in the docs. It should be consistent with 0.4.0, but I’ll update the RTD later. It would be nice to figure out how to have our release action bump this too!
from scrnaseq.
Hi @fmalmeida,
For 1.
> I will not require "transcripts_fasta" anymore, right? Actually simpleaf will only use genome fasta and transcript gtf. Right?
For standard single-cell and single-nucleus analyses, absolutely. The only reasons the user may want/need to pass in the reference "transcript" set directly are (a) they are working in a non-model organism without a good genome assembly and instead have a de novo transcriptome assembly or (b) they are indexing e.g. feature barcode sequences in the process of doing e.g. a CITE-seq analysis (which may not be supported by the pipeline yet).
For the --rlen parameter, we studied its effect a bit in the alevin-fry paper. The effect is, in general, very small. So if the user cannot provide it explicitly, I think we can just go with a common 10x biological sequence read length (e.g. 91). However, I do think we should add an extra parameter to allow the user to specify it if they want.
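The fallback described above can be sketched in a few lines of Python. This is only an illustration: the helper `rlen_args` and the constant `DEFAULT_RLEN` are hypothetical names, and only the `--rlen` flag and the example value 91 come from the discussion.

```python
from typing import List, Optional

# Fallback read length taken from the "e.g. 91" suggestion above.
# This is an assumption for the sketch, not a documented simpleaf default.
DEFAULT_RLEN = 91


def rlen_args(user_rlen: Optional[int] = None) -> List[str]:
    """Return the --rlen arguments, preferring a user-supplied value (hypothetical helper)."""
    rlen = DEFAULT_RLEN if user_rlen is None else user_rlen
    if rlen <= 0:
        raise ValueError(f"read length must be positive, got {rlen}")
    return ["--rlen", str(rlen)]


if __name__ == "__main__":
    print(rlen_args())     # no user value: falls back to DEFAULT_RLEN
    print(rlen_args(150))  # user value: overrides the fallback
```

In a pipeline, the returned list would simply be appended to the index command, so a user-facing parameter stays optional while the command always receives a value.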
from scrnaseq.
https://alevin-fry.readthedocs.io/en/latest/getting_started.html
from scrnaseq.
I was planning to start the implementations for this issue. But I have some doubts:
- Where to start? First add / update modules in nf-core/modules, right?
- What actually needs to be added? I was checking, and it seems that the first step is pretty much what we already have with salmon in the current workflow, right? With salmon alevin?
- Then, what's required would be to just go on adding the subsequent modules?
  alevin-fry generate-permit-list
  alevin-fry collate
  alevin-fry quant
Is that what is expected?
from scrnaseq.
Hi @fmalmeida,
The first step is to run alevin, but the parameters change somewhat. Specifically, in the alevin-fry pipeline, we run alevin only to obtain the mappings that are used for quantification. So, many of the different parameters for e.g. "whitelist" filtering and whatnot no longer need to be passed to alevin. Further, alevin doesn't need a transcript-to-gene map, but the quant step of alevin-fry does.
When running alevin you need to either pass the --sketch flag (what I would use by default unless the user requests selective alignment) or the --rad flag.
The other big difference is that I would recommend preparing a splici reference sequence in the case the user provides a genome + annotation. This can be done easily using our pyroe python package. You can take a look at our quantaf workflow to see how we're currently handling this in nextflow.
@DongzeHE and I are happy to answer any questions you might have!
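To make the sequence of steps concrete, here is a rough Python sketch that only assembles the command lines and runs nothing. The function `alevin_fry_commands`, the directory names, and the `--chromiumV3` chemistry choice are hypothetical. The flag spellings are assumptions based on the alevin-fry tutorial and should be checked against the installed salmon and alevin-fry versions. The splici index and its transcript-to-gene map are assumed to exist already (e.g. prepared with pyroe).

```python
from typing import List


def alevin_fry_commands(index: str, r1: str, r2: str, t2g: str,
                        threads: int = 16, sketch: bool = True) -> List[List[str]]:
    """Build (but do not run) the four commands of the pipeline described above."""
    # Hypothetical output directories for mapping, permit-list/collate, and quant.
    map_dir, permit_dir, res_dir = "map", "permit", "quant_res"
    # --sketch by default; --rad when the user asks for selective alignment.
    mode = "--sketch" if sketch else "--rad"
    return [
        # 1. Mapping only: no whitelist filtering and no transcript-to-gene map here.
        ["salmon", "alevin", "-i", index, "-l", "ISR", "--chromiumV3", mode,
         "-1", r1, "-2", r2, "-p", str(threads), "-o", map_dir],
        # 2. Permit-list generation (knee-based filtering in this sketch).
        ["alevin-fry", "generate-permit-list", "-d", "fw", "-k",
         "-i", map_dir, "-o", permit_dir],
        # 3. Collate the mapping records by corrected barcode.
        ["alevin-fry", "collate", "-t", str(threads), "-i", permit_dir, "-r", map_dir],
        # 4. Quantification: this is the step that needs the transcript-to-gene map.
        ["alevin-fry", "quant", "-t", str(threads), "-i", permit_dir, "-o", res_dir,
         "--tg-map", t2g, "--resolution", "cr-like", "--use-mtx"],
    ]


if __name__ == "__main__":
    for cmd in alevin_fry_commands("splici_idx", "R1.fastq.gz", "R2.fastq.gz", "t2g_3col.tsv"):
        print(" ".join(cmd))
```

Keeping the commands as separate lists mirrors the option of one Nextflow process per step; wrapping them in a single tool such as simpleaf collapses the same sequence into fewer calls.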
from scrnaseq.
Hi @rob-p,
Thank you for this first explanation. I will do some reading on this later this week to organize my thoughts and the steps required and, while doing so, I will keep in touch to make sure it follows the standards 👍🏼 😄
from scrnaseq.
I have not started yet, but I have a question.
Do you think it is better to implement all the processes separately, or to use https://simpleaf.readthedocs.io/en/latest/ ?
from scrnaseq.
That's a good question @fmalmeida. I certainly think that simpleaf will make getting things up and running ... simpler ;P. Right now it's pretty flexible, and it's also easy for us to modify and add functionality if it's deemed useful. I would say that right now, the primary flexibility present using the raw commands that's missing from simpleaf is:
1. It is possible but awkward to construct a non-splici reference index, i.e. if one is quantifying in a de novo reference without an annotation of spliced transcripts and introns. This can be accomplished by just passing the spliced transcriptome file as the extra spliced parameter to the index constructor, but it is less straightforward than just not making a non-splici reference in the first place. Of course, this is a rather niche issue because neither CellRanger nor StarSolo afaik can even do this at all.
2. The ability to control the k parameter used for index creation. This one is really easy to add support for. This will be important when folks want to do e.g. feature barcode processing, where an index should be constructed over the set of tags rather than the reference genome, typically with a much smaller k.
3. The ability to provide an unfiltered permit-list for technologies other than 10xv2 and 10xv3. Right now, the only technologies for which a known list of possible barcodes is readily available are chromium v2 and v3. In fact, if you don't have those "whitelists" in your ALEVIN_FRY_HOME directory, simpleaf will even go and fetch them for you automatically. However, for other chemistries, one has to use a different filtering approach since the "whitelists" are not common knowledge. It may be useful to provide the user the ability to provide their own "whitelist" for non-10xv2 and 10xv3 chemistries.
4. The ability to select the non-default "selective alignment" option when mapping the reads during the quantification phase. Right now, the mapping through simpleaf is always done using --sketch mode, which uses pseudoalignment with structural constraints. We should probably add an optional flag to allow the user to select the other alignment mode if they want. Again, this is easy to do and not a big deal because the standard recommendation would be to default to --sketch mode anyway.
So, right now, the above are the only major differences in flexibility I can think of between using simpleaf versus using all of the raw commands directly. As I mentioned, none of them are fundamental, and those capabilities can all be exposed in simpleaf if we deem it useful. The benefit of simpleaf over the raw commands, of course, is that setting up the initial pipeline should be much easier, and future releases will benefit from capabilities added to simpleaf without having to re-express/re-implement those capabilities in terms of the raw commands.
I'm happy to have a deeper conversation about the potential pros and cons and what you think makes the most sense in the context of nfcore-scrnaseq.
from scrnaseq.
Thanks for the extensive response! Most of it does indeed sound less important for now and straightforward to expose through simpleaf in the future.
The most critical point for me would be (3), because I think at least at some point in the future scrnaseq should be able to support non-10x chemistries.
Would there be a performance advantage of having different alevin-fry stages run in different processes? E.g. allocating more cpu/memory to the quant stage than for collate/generate-permit-list?
This is currently a limitation of running cellranger through nextflow: it reserves a lot of cpus, but only uses them in one stage.
from scrnaseq.
Hi @grst,
So to be clear, it's already possible to run other chemistries through simpleaf. The set of already-known chemistries is here and the user can also provide a "custom" geometry to deal with barcode, umi, read layouts that don't match any of the pre-specified chemistries (see here).
The only limitation is on how cell filtering is done in 10x versus other chemistries. Specifically, for 10x chemistries, a "whitelist" of possible/expected cell barcodes is known (e.g. for 10xv3 ~6M out of the 4.2B possible length 16 barcodes). This enables alevin-fry to do unfiltered permit-list generation, where each barcode is checked against the known list of possible barcodes. When such an external list is not available, then another filtering strategy (e.g. filtering based on the knee method) must be used. My comment was intended to point out that there may be other chemistries in the future where a "whitelist" is available, but right now a "whitelist" can only be used with the chromium chemistries when run through simpleaf.
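As a toy illustration of the idea (not alevin-fry's actual implementation), the snippet below checks the size of the barcode space quoted above and keeps only observed barcodes that appear in a known list. The function `filter_by_permit_list` and the example barcodes are made up for this sketch, and it uses exact matching only.

```python
from typing import Dict, Set

# Each barcode position is one of A, C, G, T, so length-16 barcodes
# span 4**16 = 4,294,967,296 possible sequences.
BARCODE_SPACE = 4 ** 16


def filter_by_permit_list(counts: Dict[str, int], permit_list: Set[str]) -> Dict[str, int]:
    """Keep only barcodes that occur in the external list (exact match, toy version)."""
    return {bc: n for bc, n in counts.items() if bc in permit_list}


if __name__ == "__main__":
    # Fraction of the barcode space covered by a list of roughly 6M entries.
    print(f"{BARCODE_SPACE:,} possible barcodes")
    print(f"{6_000_000 / BARCODE_SPACE:.4%} of them are on the list")

    observed = {"AAACCCAAGAAACACT": 120, "TTTTTTTTTTTTTTTT": 3}
    known = {"AAACCCAAGAAACACT"}
    print(filter_by_permit_list(observed, known))
```

A list of about 6M barcodes covers only around 0.14% of the space, which is why membership in the list is a strong filter. Without such a list, a count-based strategy such as the knee method has to stand in for it.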
> Would there be a performance advantage of having different alevin-fry stages run in different processes? E.g. allocating more cpu/memory to the quant stage than for collate/generate-permit-list?
This is a great point! So the biggest memory distinction is already exposed in simpleaf, which is that building the index requires more memory than every other step. In the quant phase, it is true that the mapping step probably requires the most memory. However, a primary design consideration of alevin-fry anyway is that memory usage is quite low. For example, if you use the --sparse index, all mapping and quantification should be possible in <8G of RAM, and even with the dense index it's not too much more (e.g. for mouse/human sized organisms). That being said, I think in terms of memory usage one generally has the order of index construction > mapping > quantification, while simpleaf groups mapping and quantification together. In terms of thread usage, again, the mapping step is the one that can effectively make the most use of many threads. However, in quantification, every step apart from permit-list generation is highly multithreaded (and quite fast in absolute terms). So, the only penalty thread-wise is probably that some potential thread resources would be wasted during the permit-list generation step if simpleaf is used and the quant phase is executed with many threads. Whether or not the difference / requirements are big enough to warrant splitting the steps up is really a judgement call.
from scrnaseq.
> My comment was intended to point out that there may be other chemistries in the future where a "whitelist" is available, but right now a "whitelist" can only be used with the chromium chemistries when run through simpleaf.
I see! That sounds indeed like a very niche use-case.
> Whether or not the difference / requirements are big enough to warrant splitting the steps up is really a judgement call.
To me, this sounds like it doesn't. After all, there's also some overhead for creating a nextflow process, especially if data needs to be transferred between cloud instances.
Happy to hear other opinions, but overall it seems simpleaf makes things easier for us without a lot of disadvantages.
from scrnaseq.
Since I claimed that the things I said here were easy to do, I wanted to back up the claim! In my latest round of pushes to the dev branch of simpleaf, I have now addressed all of these limitations. That is, all of these things are now possible to express in simpleaf (I have not pushed these to main and cut a release yet though). I think this then removes any of these points as potential objections to using simpleaf.
from scrnaseq.
Docs https://simpleaf.readthedocs.io/en/latest/ still point to 0.3.0, where can I find the one for 0.4.0 so I can check during implementation?
from scrnaseq.
Hi @rob-p,
I am trying to get access to the docker and singularity images of simpleaf on biocontainers, since it is in bioconda, but I am not able to find them.
https://bioconda.github.io/recipes/simpleaf/README.html
Could you take a look just to see if everything is all right?
😄
from scrnaseq.
Hi @fmalmeida,
So I found this, but when I try to actually pull it down from docker I get :
❯ docker pull ghcr.io/channel-mirrors/bioconda/linux-64/simpleaf:0.4.0-h9f5acd7_0
0.4.0-h9f5acd7_0: Pulling from channel-mirrors/bioconda/linux-64/simpleaf
533580ef6314: Pulling fs layer
4083c173038d: Pulling fs layer
30fb06b2b06a: Pulling fs layer
invalid rootfs in image configuration
Any idea what this means? This is all automatically done by conda, but I guess we could push our own docker image if the conda one is problematic.
from scrnaseq.
tbh, I never came across the ghcr.io registry before in the context of biocontainers. I thought the official registry is quay.io, but for some reason simpleaf never ended up there.
I asked on the bioconda gitter if someone has an idea what's going on.
from scrnaseq.
Thanks to @grst's inquiry, the brilliant bioconda folks have "unhidden" these images on quay.io (e.g. you should be able to do docker pull quay.io/biocontainers/simpleaf:0.4.0--h9f5acd7_0).
from scrnaseq.
Awesome! I'll get back to this! Thanks.
from scrnaseq.
Hi @rob-p,
I am now trying to change the INDEX module. And I have a couple of questions:
- I will not require "transcripts_fasta" anymore, right? Actually simpleaf will only use genome fasta and transcript gtf. Right?
Open to see example!
process SIMPLEAF_INDEX {
    tag "$transcript_gtf"
    label "process_medium"

    conda (params.enable_conda ? 'bioconda::simpleaf=0.4.0' : null)
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/simpleaf:0.4.0--h9f5acd7_0' :
        'quay.io/biocontainers/simpleaf:0.4.0--h9f5acd7_0' }"

    input:
    path genome_fasta
    path transcript_gtf

    output:
    path "salmon"      , emit: index
    path "versions.yml", emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    """
    # export required var
    export ALEVIN_FRY_HOME=.

    # prep simpleaf
    simpleaf set-paths

    # run simpleaf index
    simpleaf \\
        index \\
        --threads $task.cpus \\
        --fasta $genome_fasta \\
        --gtf $transcript_gtf \\
        $args \\
        -o salmon

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        simpleaf: 0.4.0
    END_VERSIONS
    """
}
- When running, I get this:
error: The following required arguments were not provided: --rlen <RLEN>
Is there a way to set this automatically, or to have it calculated? If not, users will have to set it, right? Maybe through a new parameter? But my question is: how do I determine a good length for this parameter?
😄
from scrnaseq.
Alright, we just merged #139 so this should be fine now. Remaining points need to be checked in separate issues so thanks everyone for the discussions :-)
from scrnaseq.