Giter Site home page Giter Site logo

Comments (6)

biOliver avatar biOliver commented on June 3, 2024 1

Ah, got it. Thank you.

You have answered all the questions I had regarding this heuristic. And so I will close this issue.

Thank you a lot Rob, again. :)

Kind regards
Oliver

from alevin-fry.

rob-p avatar rob-p commented on June 3, 2024

Hi @biOliver,

Thank you again for the brilliantly detailed analysis here. So, let me try to be a bit more clear about what I think are the main things that changed between these versions. There are of course several updates here, so there are also other minor changes and those may have some effect as well. The two main things that come to mind are :

  1. The "prefer-spliced" heuristic. This is the one I mentioned in my e-mail. This was present since, I believe 0.4.0 (which corresponds to when the paper was first published), but was absent in 0.3.0. Consider that I have a mapping like the following: { (UMI1, G1-spliced), (UMI1, G2—unspliced) }. That is, this is a gene ambiguous UMI. It maps both to a spliced molecule of G1 and an unspliced molecule of G2. How this gets resolved depends on the UMI resolution mode one is using. If you are using the cr-like resolution mode though (e.g. the one we used throughout the paper), then in v0.3.0 those UMIs are being discarded. They are gene ambiguous, and so they are not assigned to any gene (i.e. we expect the total UMI count in the output matrix will be lower for not including these). However, under the "prefer-spliced" heuristic, this UMI is assigned to G1. Specifically, the rule is that if the UMI maps to only a single spliced molecule (say, belonging to G1) then we ask:
    • Is it splicing ambiguous in G1 (i.e. does it also map to G1-unspliced)?; in that case it is counted as A
    • Does it map to "too many" other unspliced molecules $< \tau$ (I believe the default is 10). If it maps in a spliced manner only to G1 and maps to $< \tau$ other molecules of different genes, it is assigned to G1, otherwise if it maps to $\ge \tau$ other molecules of different genes, it is discarded.

So, this may explain how you see extra counts for this gene appearing from "nowhere" with respect to the 0.3.0 result. Previously, those were being discarded, but in the new version (and back to 0.4) they are being assigned as spliced because they align to the spliced version of this gene and not to too many other molecules. Further, as I mentioned above, the differences you observe here can depend on the resolution mode you are using. While cr-like discards gene-ambiguous reads (initially all of them, and then eventually those that aren't rescued by the "prefer-spliced" heuristic), the cr-like-em algorithm instead considers probabilistically allocating those UMIs among the different places where they map. The confidence of those assignments may not be high if there is not a lot of other evidence to prefer one gene over the others, but at least in that case you'd expect the mass to be more conserved.

  1. The other relevant change is more on the salmon side (and I suspect that this may relate to changes you are seeing in the A matrix). Specifically, between 1.5.2 and 1.9.0 several changes were made to improve sensitivity. Salmon places a cutoff on how many occurrences a k-mer can have and still be used for mapping, as well as the maximum number of mappings an entire read can have and still be reported. This is primarily to prevent super-repetitive regions from slowing down the entire analysis. However, the nature of repetitive sequence in non-coding regions is different than that in coding regions, and so one may want to be a bit more permissive in what they allow. To address this, we implemented a "recovery" procedure. Essentially, we have a default threshold $n_1$ such that if seeds occur $> n_1$ times they are not used. Previously, if all seeds occurred $> n_1$ times, then the read would go unmapped. The recovery procedure instead says, if all seeds occur $> n_1$ times, then let $m = \min( \text{occ}_1, \text{occ}_2, \dots, \text{occ}_j )$ where $\text{occ}_j$ is the number of occurrences of seed $j$. Then, if $m < n_2$ (where $n_2$ is some number much larger than $n_1$) we collect all seeds that occur $m$ times and perform the mapping based on these seeds. That's a lot of technical detail, but the long-and-short of it is that this procedure recovers mappings for reads that map in highly repetitive regions. Instead of those reads simply appearing as unmapped (in the older version) because their seeds matched to too many places, they may map in the newer version. This can also explain why the newer pipeline has more reads to work with.

Together, I think these differences could explain the effects you are seeing. However, I'm happy to discuss any followup questions or comments you may have here. Again, thanks for the wonderfully detailed issue!

from alevin-fry.

biOliver avatar biOliver commented on June 3, 2024

Thank you so much for this explanation!

I see now, that I didn't understand the "prefer-spliced" heuristic the first time around. We now agree that it probably is the main reason for the 'emerging' counts that we see.
So far so good. :)

I believe that I saw somewhere [unfortunately, I cannot find where I originally saw it] that the rational for "prefer spliced" was that 'origin of reads are mainly spliced'. But that doesn't hold as an argument for our data, which is mainly (~80%) unspliced.

Have I understood the rationale right? And if so, would it be wrong of us to use "prefer-spliced" when our data is mainly unspliced?

Kind regards
Oliver

from alevin-fry.

rob-p avatar rob-p commented on June 3, 2024

Hi @biOliver,

Thanks for the follow-up. I have one more clarification, that may shift the understanding a bit. But first, I think I mentioned the prefer-spliced logic in an e-mail. However, as I mentioned, this idea goes back further and also exists in other workflows / tools. Alex Dobin describes the basic idea in this tweet (https://twitter.com/a_dobin/status/1417921407204438017?s=20&t=tK8iFy8qJoKXKY5399f5gA), with reference to the Ding et al. paper ('20). That rule was later implemented in STARSolo as well.

Now, to the clarification ;P. The rule is actually implemented here : https://github.com/COMBINE-lab/alevin-fry/blob/master/src/utils.rs#L296. Specifically, the code the implements the "prefer spliced" rule is this:

alevin-fry/src/utils.rs

Lines 343 to 373 in 2ebfaa9

3..=10 => {
// if we don't have *too* many distinct genes matching this UMI
// then apply the prefer-spliced rule.
// See if there is precisely 1 spliced gene, and if so take it
// but assign the read as ambiguous if it is for this gene
let mut iter = labels.iter();
// search for the first spliced index
if let Some(sidx) = iter.position(|&x| is_spliced(x)) {
// if we found a spliced gene, check if there are any more
if let Some(_sidx2) = iter.position(|&x| is_spliced(x)) {
// in this case we had 2 spliced genes, so this is
// gene ambiguous and we just drop it.
} else {
// we only had one spliced gene. Check to see if the
// index following the spliced gene we found is its
// unspliced variant or not. If so, add it as ambiguous
// otherwise, add it as spliced
if let Some(sg) = labels.get(sidx) {
if let Some(ng) = labels.get(sidx + 1) {
if same_gene(*sg, *ng, true) {
let idx = ambig_offset + (*sg >> 1) as usize;
counts[idx] += *count as f32;
continue;
}
}
counts[(*sg >> 1) as usize] += *count as f32;
}
}
}
}

I recognize you may not be a rust native, but hopefully the comments clarify a bit more what is happening. Specifically, though I've called the heuristic "prefer spliced", it will look for the gene that has a spliced transcript mapping this UMI. However, if the UMI is ambiguous within this gene (i.e. it maps to both a spliced and unspliced transcript of this gene), then the UMI itself will be assigned as ambiguous for this gene. So the UMI has been "rescued" (it's no longer discarded), but it is assigned as an ambiguous count for the gene for which it is rescued. Of course, if it is not ambiguous in this gene (maps only to the spliced transcript), then it will be assigned as spliced.

The "prefer-spliced" heuristic is, of course, still a heuristic. And so, I agree, it may not be optimal in all cases. However, in cases such as this, where the UMI is otherwise gene-ambiguous, such UMIs would instead simply be discarded under e.g. the cr-like UMI resolution policy. The other way to deal with such cases is to instead use a resolution policy like cr-like-em or parsimony-em that will allow UMIs to map to multiple genes, but then assign them probabilistically after all other UMIs in the cell have been allocated. However, it is worth mentioning that a good probabilistic model for tagged-end single-cell RNA-seq is still a topic of active research, since the very low coverage per-cell means that the EM algorithm has relatively little other information from which to make an informed decision — much of the time this means that the multimapping UMIs may just be split evenly among the multi-mapping locations. This, of course, isn't just an challenge for alevin-fry, but for all existing single-cell quantification methods.

In such cases, which method to prefer (how to handle such UMIs) is really up to the end user based on their expectations and the analysis they wish to perform. However, I'll mention that these cases are exactly the types of things we're interested in, since it's one of the areas where there is the most room to make substantial progress with improved methods. So, we'd be happy to follow up further or help dig in to this data in more detail; perhaps it will help suggest a fundamentally better approach to such cases.

Best,
Rob

from alevin-fry.

biOliver avatar biOliver commented on June 3, 2024

Thank you a lot for the detailed explanation. :)

I have looked at the tweet and through the Ding et al. paper (and eventually the STARsolo preprint) and I am not able to find any mention of any prefer spliced heuristic/rational.
Would you maybe be able to clarify where in e.g. Ding et al. I could find any mention of and/or justification for it?

Kind regards
Oliver

from alevin-fry.

rob-p avatar rob-p commented on June 3, 2024

Hi @biOliver,

Sure; the relevant text in the paper is:

Annotating each alignment with a gene tag
We use featureCounts40 from the Subread package, v.1.6.2, to add a gene tag to each alignment. To count reads overlapping with introns for single-nucleus RNA-seq data, we used a two-step approach to first count the reads overlapping with exons. In the second step, the reads not overlapping with exons were recounted if they overlapped with introns. We only included reads aligning in the sense orientation with the genome annotation, except for Smart-seq2, which does not generate strand-specific data.

They don’t specifically call it the “prefer spliced” heuristic, but this is what Alex is referring to and that is what this rule is explicitly doing. So they first look to assign a UMI to any exon it overlaps. Only in the case that it fails to overlap an exon (i.e. if there is no “spliced” alignment), do they consider annotating it with an intron.

Best,
Rob

from alevin-fry.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.