Switch from alevin to alevin-fry, about nf-core/scrnaseq

Comments (21)

rob-p commented on May 23, 2024

This is now published in simpleaf 0.4.0, which was just merged into bioconda.

from scrnaseq.

rob-p commented on May 23, 2024

Hi @fmalmeida,

Thanks for pointing this out. I updated the docs when we released 0.4.0, but forgot to bump the version info in the docs. It should be consistent with 0.4.0, but I’ll update the RTD later. It would be nice to figure out how to have our release action bump this too!


rob-p commented on May 23, 2024

Hi @fmalmeida,

For 1.

I will not require "transcripts_fasta" anymore, right? Actually simpleaf will only use genome fasta and transcript gtf. Right?

For standard single-cell and single-nucleus analyses, absolutely. The only reasons the user may want/need to pass in the reference "transcript" set directly are (a) they are working in a non-model organism without a good genome assembly and instead have a de novo transcriptome assembly or (b) they are indexing e.g. feature barcode sequences in the process of doing e.g. a CITE-seq analysis (which may not be supported by the pipeline yet).

For the --rlen parameter, we studied the effect of this parameter a bit in the alevin-fry paper. The effect is, in general, very small. So if the user cannot provide it explicitly, I think we can just go with a common 10x biological sequence read length (e.g. 91). However, I do think we should add an extra parameter to allow the user to specify it if they want.
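As a rough sketch of that fallback (not the pipeline's actual implementation), a wrapper could pass the user's value when one is given and otherwise fall back to 91. The helper name `resolve_rlen` is hypothetical.

```shell
# Hypothetical helper: pick the value to pass to `simpleaf index --rlen`.
# $1 is the user-supplied read length (may be empty or missing);
# 91 is the common 10x biological read length suggested above.
resolve_rlen() {
    echo "${1:-91}"
}

resolve_rlen        # no user value given: falls back to the default
resolve_rlen 150    # user value given: it takes precedence
```

Keeping the default in one place means the extra user-facing parameter only has to override it.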


apeltzer commented on May 23, 2024

https://alevin-fry.readthedocs.io/en/latest/getting_started.html


fmalmeida commented on May 23, 2024

I was planning to start the implementation for this issue, but I have some questions:

Where to start? First add / update modules in nf-core/modules, right?
What actually needs to be added? I was checking, and it seems that the first step is pretty much what we already have with salmon alevin in the current workflow, right?
- Then the remaining work would just be adding the subsequent modules?
  - alevin-fry generate-permit-list
  - alevin-fry collate
  - alevin-fry quant

Is that what is expected?


rob-p commented on May 23, 2024

Hi @fmalmeida,

The first step is to run alevin, but the parameters change somewhat. Specifically, in the alevin-fry pipeline, we run alevin only to obtain the mappings that are used for quantification. So, many of the different parameters for e.g. "whitelist" filtering and whatnot, no longer need to be passed to alevin. Further, alevin doesn't need a transcript-to-gene map, but the quant step of alevin-fry does.

When running alevin you need to either pass the --sketch flag (what I would use by default unless the user requests selective alignment) or the --rad flag.

The other big difference is that I would recommend preparing a splici reference sequence in the case the user provides a genome + annotation. This can be done easily using our pyroe python package. You can take a look at our quantaf workflow to see how we're currently handling this in nextflow.

@DongzeHE and I are happy to answer any questions you might have!
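The sequence of raw commands described above can be sketched as a dry run that only prints what would be executed. This is an illustration under assumptions, not the workflow's actual code: the function name, directory names, thread default, chemistry flag (`--chromiumV3`), library type (`-l ISR`), knee-based filtering (`-k`) and `cr-like` resolution are example choices, and flag spellings may differ between releases.

```shell
# Dry-run sketch: prints the commands instead of executing them.
# Arguments: index dir, R1 fastq, R2 fastq, transcript-to-gene map, threads.
# All paths are placeholders, not values taken from this thread.
print_fry_commands() {
    index="$1"; r1="$2"; r2="$3"; t2g="$4"; threads="${5:-8}"

    # 1. Mapping only; --sketch is the suggested default (use --rad instead
    #    if the user requests selective alignment). No t2g map is needed here.
    echo "salmon alevin -i $index -l ISR --chromiumV3 --sketch -p $threads -1 $r1 -2 $r2 -o map_out"

    # 2. Decide which barcodes to keep (knee-based filtering in this example).
    echo "alevin-fry generate-permit-list -d fw -k -i map_out -o permit_out"

    # 3. Group the mapping records by corrected barcode.
    echo "alevin-fry collate -t $threads -i permit_out -r map_out"

    # 4. Quantify; this is the step that needs the transcript-to-gene map.
    echo "alevin-fry quant -t $threads -i permit_out -o quant_out --tg-map $t2g --resolution cr-like --use-mtx"
}

print_fry_commands splici_index sample_R1.fastq.gz sample_R2.fastq.gz t2g_3col.tsv 16
```

Printing the commands first makes it easy to see where each nf-core module boundary would fall before anything is wired into the pipeline.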


fmalmeida commented on May 23, 2024

Hi @rob-p,
Thank you for this first explanation. I will do some reading on this later this week to organize my thoughts and the steps required, and as I work on it I will keep in touch to make sure it follows the standards 👍🏼 😄


fmalmeida commented on May 23, 2024

Did not start yet, but have a question.

Do you think it is better to implement all the processes individually, or to use https://simpleaf.readthedocs.io/en/latest/ ?


rob-p commented on May 23, 2024

That's a good question @fmalmeida. I certainly think that simpleaf will make getting things up and running ... simpler ;P. Right now it's pretty flexible, and it's also easy for us to modify and add functionality if it's deemed useful. I would say that right now, the primary flexibility present using the raw commands that's missing from simpleaf is:

1. It is possible but awkward to construct a non-splici reference index, i.e. if one is quantifying in a de novo reference without an annotation of spliced transcripts and introns. This can be accomplished by just passing the spliced transcriptome file as the extra spliced parameter to the index constructor, but it is less straightforward than just building a non-splici reference directly. Of course, this is a rather niche issue because neither CellRanger nor StarSolo afaik can even do this at all.
2. The ability to control the k parameter used for index creation. This one is really easy to add support for. This will be important when folks want to do e.g. feature barcode processing, where an index should be constructed over the set of tags rather than the reference genome, typically with a much smaller k.
3. The ability to provide an unfiltered permit-list for technologies other than 10xv2 and 10xv3. Right now, the only technologies for which a known list of possible barcodes is readily available are chromium v2 and v3. In fact, if you don't have those "whitelists" in your ALEVIN_FRY_HOME directory, simpleaf will even go and fetch them for you automatically. However, for other chemistries, one has to use a different filtering approach since the "whitelists" are not common knowledge. It may be useful to give the user the ability to provide their own "whitelist" for chemistries other than 10xv2 and 10xv3.
4. The ability to select the non-default "selective alignment" option when mapping the reads during the quantification phase. Right now, the mapping through simpleaf is always done using --sketch mode, which uses pseudoalignment with structural constraints. We should probably add an optional flag to allow the user to select the other alignment mode if they want. Again, this is easy to do and not a big deal because the standard recommendation would be to default to --sketch mode anyway.

So, right now, the above are the only major differences in flexibility I can think of between using simpleaf versus using all of the raw commands directly. As I mentioned, none of them are fundamental, and those capabilities can all be exposed in simpleaf if we deem it useful. The benefit of simpleaf over the raw commands, of course, is that setting up the initial pipeline should be much easier, and future releases will benefit from capabilities added to simpleaf without having to re-express/re-implement those capabilities in terms of the raw commands.

I'm happy to have a deeper conversation about the potential pros and cons and what you think makes the most sense in the context of nfcore-scrnaseq.


grst commented on May 23, 2024

Thanks for the extensive response! Most of it does indeed sound less important for now and straightforward to expose through simpleaf in the future.

The most critical point for me would be (3), because I think at least at some point in the future scrnaseq should be able to support non-10x chemistries.

Would there be a performance advantage of having different alevin-fry stages run in different processes? E.g. allocating more cpu/memory to the quant stage than for collate/generate-permit-list?

This is currently a limitation of running cellranger through nextflow: it reserves a lot of cpus, but only uses them in one stage.


rob-p commented on May 23, 2024

Hi @grst,

So to be clear, it's already possible to run other chemistries through simpleaf. The set of already-known chemistries is here and the user can also provide a "custom" geometry to deal with barcode, umi, read layouts that don't match any of the pre-specified chemistries (see here).

The only limitation is on how cell filtering is done in 10x versus other chemistries. Specifically, for 10x chemistries, a "whitelist" of possible/expected cell barcodes is known (e.g. for 10xv3 ~6M out of the ~4.3B possible length 16 barcodes). This enables alevin-fry to do unfiltered permit-list generation, where each barcode is checked against the known list of possible barcodes. When such an external list is not available, then another filtering strategy (e.g. filtering based on the knee method) must be used. My comment was intended to point out that there may be other chemistries in the future where a "whitelist" is available, but right now a "whitelist" can only be used with the chromium chemistries when run through simpleaf.
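To make the numbers concrete: a length-16 barcode over the four nucleotides has 4^16 = 4,294,967,296 possible values, so a list of a few million expected barcodes is a tiny fraction of that space. The sketch below computes the size of the space and shows the basic idea of checking an observed barcode against a known list. Both function names are hypothetical, and this is only an illustration of exact membership testing, not how alevin-fry implements permit-list generation.

```shell
# Number of possible DNA barcodes of length $1 (4 choices per position).
barcode_space() {
    len="$1"; total=1; i=0
    while [ "$i" -lt "$len" ]; do
        total=$((total * 4))
        i=$((i + 1))
    done
    echo "$total"
}

# Succeeds if barcode $1 appears as a whole line in the list file $2
# (one barcode per line); exact string match, no error correction.
in_permit_list() {
    grep -Fxq -- "$1" "$2"
}

barcode_space 16
```

When no such list exists for a chemistry, this kind of lookup is not available, which is why a data-driven strategy such as knee filtering is needed instead.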

Would there be a performance advantage of having different alevin-fry stages run in different processes? E.g. allocating more cpu/memory to the quant stage than for collate/generate-permit-list?

This is a great point! So the biggest memory distinction is already exposed in simpleaf, which is that building the index requires more memory than every other step. In the quant phase, it is true that the mapping step probably requires the most memory. However, a primary design consideration of alevin-fry anyway is that memory usage is quite low. For example, if you use the --sparse index, all mapping and quantification should be possible in <8G of RAM, and even with the dense index it's not too much more (e.g. for mouse/human sized organisms). That being said, I think in terms of memory usage one generally has the order of index construction > mapping > quantification, while simpleaf groups mapping and quantification together. In terms of thread usage, again, the mapping step is the one that can effectively make the most use of many threads. However, in quantification, every step apart from permit-list generation is highly multithreaded (and quite fast in absolute terms). So, the only penalty thread-wise is probably that some potential thread resources would be wasted during the permit-list generation step if simpleaf is used and the quant phase is executed with many threads. Whether or not the difference / requirements are big enough to warrant splitting the steps up is really a judgement call.


grst commented on May 23, 2024

My comment was intended to point out that there may be other chemistries in the future where a "whitelist" is available, but right now a "whitelist" can only be used with the chromium chemistries when run through simpleaf.

I see! That sounds indeed like a very niche use-case.

Whether or not the difference / requirements are big enough to warrant splitting the steps up is really a judgement call.

To me, this sounds like it doesn't. After all, there's also some overhead for creating a nextflow process, especially if data needs to be transferred between cloud instances.

Happy to hear other opinions, but overall it seems simpleaf makes things easier for us without a lot of disadvantages.


rob-p commented on May 23, 2024

Since I claimed that the things I said here were easy to do, I wanted to back up the claim! In my latest round of pushes to the dev branch of simpleaf, I have now addressed all of these limitations. That is, all of these things are now possible to express in simpleaf (I have not pushed these to main and cut a release yet though). I think this then removes any of these points as potential objections to using simpleaf.


fmalmeida commented on May 23, 2024

The docs at https://simpleaf.readthedocs.io/en/latest/ still point to 0.3.0. Where can I find the docs for 0.4.0 so I can check them during implementation?


fmalmeida commented on May 23, 2024

Hi @rob-p,
I am trying to get access to the docker and singularity images of simpleaf on biocontainers, since it is in bioconda, but I am unable to find them.

https://bioconda.github.io/recipes/simpleaf/README.html

Could you take a look just to see if everything is all right?
😄


rob-p commented on May 23, 2024

Hi @fmalmeida,

So I found this, but when I try to actually pull it down from docker I get:

❯ docker pull ghcr.io/channel-mirrors/bioconda/linux-64/simpleaf:0.4.0-h9f5acd7_0
0.4.0-h9f5acd7_0: Pulling from channel-mirrors/bioconda/linux-64/simpleaf
533580ef6314: Pulling fs layer
4083c173038d: Pulling fs layer
30fb06b2b06a: Pulling fs layer
invalid rootfs in image configuration

Any idea what this means? This is all automatically done by conda, but I guess we could push our own docker image if the conda one is problematic.


grst commented on May 23, 2024

tbh, I never came across the ghcr.io registry before in the context of biocontainers. I thought the official registry is quay.io, but for some reason simpleaf never ended up there.

I asked on the bioconda gitter if someone has an idea what's going on.


rob-p commented on May 23, 2024

@fmalmeida,

Thanks to @grst's inquiry, the brilliant bioconda folks have "unhidden" these images on quay.io (e.g. you should be able to do docker pull quay.io/biocontainers/simpleaf:0.4.0--h9f5acd7_0).


fmalmeida commented on May 23, 2024

Awesome! I'll get back to this! Thanks.


fmalmeida commented on May 23, 2024

Hi @rob-p,
I am now trying to change the INDEX module. And I have a couple of questions:

I will not require "transcripts_fasta" anymore, right? Actually simpleaf will only use genome fasta and transcript gtf. Right?

Open to see example!

process SIMPLEAF_INDEX {
    tag "$transcript_gtf"
    label "process_medium"

    conda (params.enable_conda ? 'bioconda::simpleaf=0.4.0' : null)
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/simpleaf:0.4.0--h9f5acd7_0' :
        'quay.io/biocontainers/simpleaf:0.4.0--h9f5acd7_0' }"

    input:
    path genome_fasta
    path transcript_gtf

    output:
    path "salmon"       , emit: index
    path "versions.yml" , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    """
    # export required var
    export ALEVIN_FRY_HOME=.

    # prep simpleaf
    simpleaf set-paths

    # run simpleaf index
    simpleaf \\
        index \\
        --threads $task.cpus \\
        --fasta $genome_fasta \\
        --gtf $transcript_gtf \\
        $args \\
        -o salmon
    
    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        simpleaf: 0.4.0
    END_VERSIONS
    """
}

When running it, I get this:

error: The following required arguments were not provided:
 --rlen <RLEN>

Is there a way to set this automatically, or to have it calculated? If not, users will have to set it themselves, maybe through a new parameter. But then my question is: how do we determine a good length for the parameter?

😄
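One possible way to pick a value is to measure the most common length of the biological reads. The sketch below assumes an uncompressed, four-lines-per-record FASTQ stream on standard input. The helper name `modal_read_length` and the file name in the usage comment are hypothetical, and nothing here is part of simpleaf itself.

```shell
# Hypothetical helper: print the most frequent read length in a FASTQ stream.
# Assumes exactly four lines per record; sequence lines are lines 2, 6, 10, ...
# Ties are broken in favour of the longer length; empty input prints 0.
modal_read_length() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END {
             best = 0; c = 0
             for (l in n) {
                 if (n[l] > c || (n[l] == c && l + 0 > best)) { c = n[l]; best = l + 0 }
             }
             print best
         }'
}

# Example usage (placeholder file name), sampling the first 100,000 records:
#   gzip -dc sample_R2.fastq.gz | head -n 400000 | modal_read_length
```

The result could then be passed to `--rlen`, or ignored in favour of a fixed default if the reads are not available at index-building time.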


apeltzer commented on May 23, 2024

Alright, we just merged #139 so this should be fine now. Remaining points need to be checked in separate issues so thanks everyone for the discussions :-)
