Hello again. I originally posted about a "Difference in Agrp express

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Difference in expression of spliced/ambigous transcripts, between old and new versions of Salmon/Alevin-fry about alevin-fry HOT 6 CLOSED

combine-lab commented on June 3, 2024

Difference in expression of spliced/ambigous transcripts, between old and new versions of Salmon/Alevin-fry

from alevin-fry.

Comments (6)

biOliver commented on June 3, 2024 1

Ah, got it. Thank you.

You have answered all the questions I had regarding this heuristic. And so I will close this issue.

Thank you a lot Rob, again. :)

Kind regards
Oliver

from alevin-fry.

rob-p commented on June 3, 2024

Hi @biOliver,

Thank you again for the brilliantly detailed analysis here. So, let me try to be a bit more clear about what I think are the main things that changed between these versions. There are of course several updates here, so there are also other minor changes and those may have some effect as well. The two main things that come to mind are :

The "prefer-spliced" heuristic. This is the one I mentioned in my e-mail. This was present since, I believe 0.4.0 (which corresponds to when the paper was first published), but was absent in 0.3.0. Consider that I have a mapping like the following: { (UMI1, G1-spliced), (UMI1, G2—unspliced) }. That is, this is a gene ambiguous UMI. It maps both to a spliced molecule of G1 and an unspliced molecule of G2. How this gets resolved depends on the UMI resolution mode one is using. If you are using the cr-like resolution mode though (e.g. the one we used throughout the paper), then in v0.3.0 those UMIs are being discarded. They are gene ambiguous, and so they are not assigned to any gene (i.e. we expect the total UMI count in the output matrix will be lower for not including these). However, under the "prefer-spliced" heuristic, this UMI is assigned to G1. Specifically, the rule is that if the UMI maps to only a single spliced molecule (say, belonging to G1) then we ask:
- Is it splicing ambiguous in G1 (i.e. does it also map to G1-unspliced)?; in that case it is counted as A
- Does it map to "too many" other unspliced molecules $< \tau$ (I believe the default is 10). If it maps in a spliced manner only to G1 and maps to $< \tau$ other molecules of different genes, it is assigned to G1, otherwise if it maps to $\ge \tau$ other molecules of different genes, it is discarded.

So, this may explain how you see extra counts for this gene appearing from "nowhere" with respect to the 0.3.0 result. Previously, those were being discarded, but in the new version (and back to 0.4) they are being assigned as spliced because they align to the spliced version of this gene and not to too many other molecules. Further, as I mentioned above, the differences you observe here can depend on the resolution mode you are using. While cr-like discards gene-ambiguous reads (initially all of them, and then eventually those that aren't rescued by the "prefer-spliced" heuristic), the cr-like-em algorithm instead considers probabilistically allocating those UMIs among the different places where they map. The confidence of those assignments may not be high if there is not a lot of other evidence to prefer one gene over the others, but at least in that case you'd expect the mass to be more conserved.

The other relevant change is more on the salmon side (and I suspect that this may relate to changes you are seeing in the A matrix). Specifically, between 1.5.2 and 1.9.0 several changes were made to improve sensitivity. Salmon places a cutoff on how many occurrences a k-mer can have and still be used for mapping, as well as the maximum number of mappings an entire read can have and still be reported. This is primarily to prevent super-repetitive regions from slowing down the entire analysis. However, the nature of repetitive sequence in non-coding regions is different than that in coding regions, and so one may want to be a bit more permissive in what they allow. To address this, we implemented a "recovery" procedure. Essentially, we have a default threshold $n_1$ such that if seeds occur $> n_1$ times they are not used. Previously, if all seeds occurred $> n_1$ times, then the read would go unmapped. The recovery procedure instead says, if all seeds occur $> n_1$ times, then let $m = \min( \text{occ}_1, \text{occ}_2, \dots, \text{occ}_j )$ where $\text{occ}_j$ is the number of occurrences of seed $j$. Then, if $m < n_2$ (where $n_2$ is some number much larger than $n_1$) we collect all seeds that occur $m$ times and perform the mapping based on these seeds. That's a lot of technical detail, but the long-and-short of it is that this procedure recovers mappings for reads that map in highly repetitive regions. Instead of those reads simply appearing as unmapped (in the older version) because their seeds matched to too many places, they may map in the newer version. This can also explain why the newer pipeline has more reads to work with.

Together, I think these differences could explain the effects you are seeing. However, I'm happy to discuss any followup questions or comments you may have here. Again, thanks for the wonderfully detailed issue!

from alevin-fry.

biOliver commented on June 3, 2024

Thank you so much for this explanation!

I see now, that I didn't understand the "prefer-spliced" heuristic the first time around. We now agree that it probably is the main reason for the 'emerging' counts that we see.
So far so good. :)

I believe that I saw somewhere [unfortunately, I cannot find where I originally saw it] that the rational for "prefer spliced" was that 'origin of reads are mainly spliced'. But that doesn't hold as an argument for our data, which is mainly (~80%) unspliced.

Have I understood the rationale right? And if so, would it be wrong of us to use "prefer-spliced" when our data is mainly unspliced?

Kind regards
Oliver

from alevin-fry.

rob-p commented on June 3, 2024

Hi @biOliver,

Thanks for the follow-up. I have one more clarification, that may shift the understanding a bit. But first, I think I mentioned the prefer-spliced logic in an e-mail. However, as I mentioned, this idea goes back further and also exists in other workflows / tools. Alex Dobin describes the basic idea in this tweet (https://twitter.com/a_dobin/status/1417921407204438017?s=20&t=tK8iFy8qJoKXKY5399f5gA), with reference to the Ding et al. paper ('20). That rule was later implemented in STARSolo as well.

Now, to the clarification ;P. The rule is actually implemented here : https://github.com/COMBINE-lab/alevin-fry/blob/master/src/utils.rs#L296. Specifically, the code the implements the "prefer spliced" rule is this:

alevin-fry/src/utils.rs

Lines 343 to 373 in 2ebfaa9

    
           3..=10 => { 
        
               // if we don't have *too* many distinct genes matching this UMI 
        
               // then apply the prefer-spliced rule. 
        
               // See if there is precisely 1 spliced gene, and if so take it 
        
               // but assign the read as ambiguous if it is for this gene 
        
               let mut iter = labels.iter(); 
        
               // search for the first spliced index 
        
               if let Some(sidx) = iter.position(|&x| is_spliced(x)) { 
        
                   // if we found a spliced gene, check if there are any more 
        
                   if let Some(_sidx2) = iter.position(|&x| is_spliced(x)) { 
        
                       // in this case we had 2 spliced genes, so this is 
        
                       // gene ambiguous and we just drop it. 
        
                   } else { 
        
                       // we only had one spliced gene.  Check to see if the 
        
                       // index following the spliced gene we found is its 
        
                       // unspliced variant or not.  If so, add it as ambiguous 
        
                       // otherwise, add it as spliced 
        
                       if let Some(sg) = labels.get(sidx) { 
        
                           if let Some(ng) = labels.get(sidx + 1) { 
        
                               if same_gene(*sg, *ng, true) { 
        
                                   let idx = ambig_offset + (*sg >> 1) as usize; 
        
                                   counts[idx] += *count as f32; 
        
                                   continue; 
        
                               } 
        
                           } 
        
                           counts[(*sg >> 1) as usize] += *count as f32; 
        
                       } 
        
                   } 
        
               } 
        
           }

I recognize you may not be a rust native, but hopefully the comments clarify a bit more what is happening. Specifically, though I've called the heuristic "prefer spliced", it will look for the gene that has a spliced transcript mapping this UMI. However, if the UMI is ambiguous within this gene (i.e. it maps to both a spliced and unspliced transcript of this gene), then the UMI itself will be assigned as ambiguous for this gene. So the UMI has been "rescued" (it's no longer discarded), but it is assigned as an ambiguous count for the gene for which it is rescued. Of course, if it is not ambiguous in this gene (maps only to the spliced transcript), then it will be assigned as spliced.

The "prefer-spliced" heuristic is, of course, still a heuristic. And so, I agree, it may not be optimal in all cases. However, in cases such as this, where the UMI is otherwise gene-ambiguous, such UMIs would instead simply be discarded under e.g. the cr-like UMI resolution policy. The other way to deal with such cases is to instead use a resolution policy like cr-like-em or parsimony-em that will allow UMIs to map to multiple genes, but then assign them probabilistically after all other UMIs in the cell have been allocated. However, it is worth mentioning that a good probabilistic model for tagged-end single-cell RNA-seq is still a topic of active research, since the very low coverage per-cell means that the EM algorithm has relatively little other information from which to make an informed decision — much of the time this means that the multimapping UMIs may just be split evenly among the multi-mapping locations. This, of course, isn't just an challenge for alevin-fry, but for all existing single-cell quantification methods.

In such cases, which method to prefer (how to handle such UMIs) is really up to the end user based on their expectations and the analysis they wish to perform. However, I'll mention that these cases are exactly the types of things we're interested in, since it's one of the areas where there is the most room to make substantial progress with improved methods. So, we'd be happy to follow up further or help dig in to this data in more detail; perhaps it will help suggest a fundamentally better approach to such cases.

Best,
Rob

from alevin-fry.

biOliver commented on June 3, 2024

Thank you a lot for the detailed explanation. :)

I have looked at the tweet and through the Ding et al. paper (and eventually the STARsolo preprint) and I am not able to find any mention of any prefer spliced heuristic/rational.
Would you maybe be able to clarify where in e.g. Ding et al. I could find any mention of and/or justification for it?

Kind regards
Oliver

from alevin-fry.

rob-p commented on June 3, 2024

Hi @biOliver,

Sure; the relevant text in the paper is:

Annotating each alignment with a gene tag
We use featureCounts40 from the Subread package, v.1.6.2, to add a gene tag to each alignment. To count reads overlapping with introns for single-nucleus RNA-seq data, we used a two-step approach to first count the reads overlapping with exons. In the second step, the reads not overlapping with exons were recounted if they overlapped with introns. We only included reads aligning in the sense orientation with the genome annotation, except for Smart-seq2, which does not generate strand-specific data.

They don’t specifically call it the “prefer spliced” heuristic, but this is what Alex is referring to and that is what this rule is explicitly doing. So they first look to assign a UMI to any exon it overlaps. Only in the case that it fails to overlap an exon (i.e. if there is no “spliced” alignment), do they consider annotating it with an intron.

Best,
Rob

from alevin-fry.

Difference in expression of spliced/ambigous transcripts, between old and new versions of Salmon/Alevin-fry about alevin-fry HOT 6 CLOSED

Comments (6)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

	3..=10 => {
	// if we don't have too many distinct genes matching this UMI
	// then apply the prefer-spliced rule.

	// See if there is precisely 1 spliced gene, and if so take it
	// but assign the read as ambiguous if it is for this gene
	let mut iter = labels.iter();
	// search for the first spliced index
	if let Some(sidx) = iter.position(\|&x\| is_spliced(x)) {
	// if we found a spliced gene, check if there are any more
	if let Some(_sidx2) = iter.position(\|&x\| is_spliced(x)) {
	// in this case we had 2 spliced genes, so this is
	// gene ambiguous and we just drop it.
	} else {
	// we only had one spliced gene. Check to see if the
	// index following the spliced gene we found is its
	// unspliced variant or not. If so, add it as ambiguous
	// otherwise, add it as spliced
	if let Some(sg) = labels.get(sidx) {
	if let Some(ng) = labels.get(sidx + 1) {
	if same_gene(sg, ng, true) {
	let idx = ambig_offset + (*sg >> 1) as usize;
	counts[idx] += *count as f32;
	continue;
	}
	}
	counts[(sg >> 1) as usize] += count as f32;
	}
	}
	}
	}