
combine-lab / alevin-fry


🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.

Home Page: https://alevin-fry.readthedocs.io

License: BSD 3-Clause "New" or "Revised" License

Rust 98.31% Shell 0.68% Python 1.01%
single-cell rust transcriptomics rna-seq single-cell-rna-seq quantification 10x

alevin-fry's Introduction


alevin-fry

alevin-fry is a suite of tools for the rapid, accurate, and memory-frugal processing of single-cell and single-nucleus sequencing data. It consumes RAD files generated by piscem or salmon alevin, and performs common operations like generating permit lists and estimating the number of distinct molecules from each gene within each cell. The focus of alevin-fry is on safety, accuracy, and efficiency (in terms of both time and memory usage).
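At a glance, a typical run maps reads with salmon alevin (or piscem) to produce a RAD file, then runs three alevin-fry steps. A minimal sketch, assuming a 10x Chromium v3 sample and illustrative file names (the exact flags for your chemistry and resolution strategy are covered in the documentation and tutorials):

$ salmon alevin -l ISR --chromiumV3 --sketch --rad -i idx -p 16 -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -o map --tgMap t2g.tsv
$ alevin-fry generate-permit-list -d fw -k -i map -o quant
$ alevin-fry collate -t 16 -i quant -r map
$ alevin-fry quant -t 16 -i quant -o quant_res --resolution cr-like --use-mtx --tg-map t2g.tsv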

You can read the paper describing alevin-fry, "Alevin-fry unlocks rapid, accurate, and memory-frugal quantification of single-cell RNA-seq data", here, and the pre-print on bioRxiv.

Note: We recommend using piscem as the back-end mapper, rather than salmon, as it is substantially more resource-frugal, faster, and is a larger focus of current and future development.

Getting started with alevin-fry and dedicated documentation

While this README contains some useful information and pointers to get started, alevin-fry has its own dedicated documentation site, hosted on ReadTheDocs.

More information

  • Quickstart guide using the simpleaf wrapper

  • Relationship to alevin: Alevin-fry has been designed as the successor to alevin. It subsumes the core features of alevin, while also providing important new capabilities and considerably improving the performance profile. We anticipate that new method development and feature additions will take place primarily within the alevin-fry codebase. Thus, we encourage users of alevin to migrate to alevin-fry when feasible. That being said, alevin is still actively-maintained and supported, so if you are using it and not ready to migrate you can continue to ask questions and post issues in the salmon repository.

FAQs

Are you curious about processing details like whether to use a sparse or dense index? Do you have a question that isn't necessarily a bug report or feature request, and that isn't readily answered by the documentation or tutorials? Then please feel free to ask over in the Q&A.

Sister repositories

  • The generation of the reduced alignment data (RAD) files processed by alevin-fry is done by either piscem or salmon. The latest version of both are available on GitHub and via bioconda.

  • The simpleaf repository contains a dedicated wrapper / workflow runner for processing data with alevin-fry that vastly simplifies both the creation of extended references and the subsequent quantification of samples. If you find that simpleaf is missing a feature that you'd like to have, please consider submitting a feature request in the simpleaf repository.

  • The pyroe repository provides tools to help easily construct an enhanced (spliced + intronic or spliced + unspliced) transcriptome from a reference genome and GTF file.

  • The fishpond package β€” maintained by @mikelove and his lab β€” contains the recommended relevant functions for reading alevin-fry output (particularly USA-mode output) into the R ecosystem, in the form of a singleCellExperiment object.

  • The alevinqc package β€” maintained by @csoneson β€” provides tool and functions for performing quality control and assessment downstream of alevin-fry.

Installing from bioconda

Alevin-fry is available for both x86-64 Linux and macOS platforms via bioconda. On Apple silicon, you can either build (easily) from source (see below) or run alevin-fry under the Rosetta 2 emulation layer.

With bioconda in the appropriate place in your channel list, you should simply be able to install via:

$ conda install -c bioconda alevin-fry
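If bioconda is not yet in your channel list, the standard bioconda channel configuration is:

$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge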

Installing from crates.io

Alevin-fry can also be installed from crates.io using cargo. This can be done with the following command:

$ cargo install alevin-fry

Building from source

If you want to use features or fixes that may only be available in the latest develop branch (or want to build for a different architecture), then you have to build from source. Luckily, cargo makes that easy; see below.

Alevin-fry is built and tested with the latest (major & minor) stable version of Rust. While it will likely compile fine with slightly older versions of Rust, this is not a guarantee and is not a support priority. Unlike with C++, Rust has a frequent and stable release cadence, is designed to be installed and updated from user space, and is easy to keep up to date with rustup. Thanks to cargo, building should be as easy as:

$ cargo build --release

Subsequent commands below will assume that the executable is in your path. Temporarily, this can be done (in bash-like shells) using:

$ export PATH=`pwd`/target/release/:$PATH

Citing alevin-fry

If you use alevin-fry in your work, please cite:

He, D., Zakeri, M., Sarkar, H., Soneson, C., Srivastava, A., and Patro, R. Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 19, 316–322 (2022). https://doi.org/10.1038/s41592-022-01408-3

BibTeX:

@Article{He2022,
author={He, Dongze and Zakeri, Mohsen and Sarkar, Hirak and Soneson, Charlotte and Srivastava, Avi and Patro, Rob},
title={Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data},
journal={Nature Methods},
year={2022},
month={Mar},
day={01},
volume={19},
number={3},
pages={316-322},
issn={1548-7105},
doi={10.1038/s41592-022-01408-3},
url={https://doi.org/10.1038/s41592-022-01408-3}
}

alevin-fry's People

Contributors

dongzehe, gaura, github-actions[bot], hiraksarkar, jaclyn-taroni, k3yavi, liujamin, rob-p, uzaaft


alevin-fry's Issues

Error in reduce_impl() when following the `alevin-fry-tutorials`

I followed the tutorial at https://combine-lab.github.io/alevin-fry-tutorials/2021/alevin-fry-velocity/,
but it produced an error:

============processing gtf to get spliced and introns============
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
'select()' returned 1:1 mapping between keys and columns
Extracting spliced transcript features
Extracting introns using the separate approach
Error in reduce_impl(.x, .f, ..., .init = .init, .dir = .dir) : 
  argument ".f" is missing, with no default

Here is my .fa and .gtf:

head .fa
>1 dna_sm:primary_assembly primary_assembly:ASM223467v1:1:1:37713152:1 REF
CAAcccccccccccACCACCCCGGGGAAAAAAGCCGAATGTAGGAATCCGTTATTTGTTT
TGGTTTTATTTAGGGTCAAACTTGTGCATTCTTAGAGGCGTGTCTTCTTGATTGACAGGT
GGGCGGACCCATGGGTCAACTTAATTCGCTTTGACCCACGCCATTATTATTATTCTTAGA
ACGCAGAATGTTCTCGCCTAGTCACTGACTTTCTGTCCAGTTTATACATCCATAGTAACA
AGTGAACATTCTGTAAATTTATTGAAGATAGATTTTCTGAAAAGAAGTCATTGTGGTTCT
GTGAGTTGTGTTCAGTTTTTTCTTGTATGAAGTAAGGTTTGAGTCAGGTTTGTGGCAGAG
TCTGAATCTCCTCCAATTTCTATTTCtttttttAATCTCTGCAAAGGAATATTTGAATTT
TTCCATCCGCACAAACGTTTGGTAGCACTGGTCTGCATTTGCATGATTTTCCAGTTTACA
TTTCCGTCACCTTCCTCCTGAAGCCAGAAGATGCTGCAGAAGTTCAGCAGAGCAGAAAAC

head .gtf
#!genome-build ASM223467v1
#!genome-version ASM223467v1
#!genome-date 2017-07
#!genome-build-accession NCBI:GCA_002234675.1
#!genebuild-last-updated 2018-07
3       ensembl gene    98189   100231  .       +       .       gene_id "ENSORLG00000018315"; gene_version "2"; gene_source "ensembl"; gene_biotype "protein_coding";
3       ensembl transcript      98189   100231  .       +       .       gene_id "ENSORLG00000018315"; gene_version "2"; transcript_id "ENSORLT00000022934"; transcript_version "2"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";
3       ensembl exon    98189   100231  .       +       .       gene_id "ENSORLG00000018315"; gene_version "2"; transcript_id "ENSORLT00000022934"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; exon_id "ENSORLE00000206643"; exon_version "2";
3       ensembl CDS     98189   100228  .       +       0       gene_id "ENSORLG00000018315"; gene_version "2"; transcript_id "ENSORLT00000022934"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; protein_id "ENSORLP00000022933"; protein_version "2";
3       ensembl start_codon     98189   98191   .       +       0       gene_id "ENSORLG00000018315"; gene_version "2"; transcript_id "ENSORLT00000022934"; transcript_version "2"; exon_number "1"; gene_source "ensembl"; gene_biotype "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding";


Thank you for your help!

Can we set barcode and UMI ranges manually?

Hi,
Can we set barcode and UMI ranges to support a custom single-cell protocol? I can't find a way in the tutorial.
If I missed anything in the tutorial, I apologize.
Best,
Di

Old vs New Velocity Tutorial Questions

Hello @rob-p and all,

I have a couple questions about the old vs new RNA-velocity with Alevin tutorials.

After reading through them several times, and having fully implemented the old tutorial into a pipeline, I have a couple questions about converting.

The major benefit of the new tutorial seems to be that alevin-fry produces the "USA mode" result when fed a three-column tx2gene map file, which is then simple to parse into the h5ad spec expected by scanpy/scvelo. With the previous tutorial, there was instead a two-column tx2gene map with -I suffixes to indicate the unspliced sequences. Is it possible to get standard alevin to produce the "USA mode" outputs without fully converting to alevin-fry?
(If not, I'd like to propose this as a feature request)

Additionally, it appears that the flank length has been changed from read length minus 1 in the old tutorial to read length minus 5 in the new tutorial. Is there a particular significance to this change?

The new tutorial uses a simple "splici reference" and no longer adds the full genome as a decoy sequence. Is it no longer advisable to use the full decoy or partial decoy indices when computing with either alevin or alevin-fry (assuming that speed/memory are not the highest concern)? If adding the decoys is not contraindicated, is it possible to run the new tutorial pipeline with an index which has decoy sequences added to it, as in the old tutorial?

Edit: I see that decoys are not compatible with alevin-fry mode?

Automatically detect and exit if alevin is run with an index including decoy sequences when using the --rad and/or --sketch flags. This functionality is not currently supported, and mapping against such an index can cause (cryptic) errors in downstream processing. Now, if such an index is passed when using these flags, an informative error message is printed and the program will exit with a return code of 1.

Parsing the USA mode file combines the Spliced and Ambiguous reads into a single layer rather than assigning the ambiguous reads to the expected ambiguous layer. Would it not be better to assign the ambiguous reads to the ambiguous layer (which tools like scvelo expect) and then let scvelo handle the interpretation of this layer? Combining S+A "early" seems to negate the benefit of the new USA mode over the old tutorial where alevin was allowed to assign these ambiguous reads as it saw fit (learning the expected proportions from the data). Assuming my understanding is correct here would it be possible to get instructions for extracting the three layers separately so that they can be assembled into the expected three layer h5ad file?

Since the previous velocity tutorial explicitly requested a decoy-aware index and the new alevin-fry mode doesn't support this, it seems like an option to produce the USA-mode result file from standard alevin would provide a "best of both worlds" situation.

I'm sure I'll have some more questions as I work through the new velocity tutorial but that's all I've got for now, thanks!

Originally posted by @ACastanza in COMBINE-lab/salmon#719

Questions regarding barcodes with zero reads after deduplication

Hi

I've been reading and working through various tutorials and bits of documentation, and have found the alevinQC package very helpful for understanding what is going on with the generate-permit-list step.

I'm a bit confused about the mapping step after the permit list step. It feels like there's a missing QC step here. I'm describing my understanding of the process, both to check I've got it right and in case it's helpful to someone else.

In the (very nice) alevinQC report, we get something like this for the white list, telling us which of the barcodes are most likely "real cells" rather than empty droplets:
[screenshot: alevinQC whitelist knee plot]

These reads are then mapped, and some of them end up not mapping to any genes at all. This gives us the actual counts for each white-listed barcode, and looks like this:
[screenshot: genes detected per whitelisted barcode]

I have a couple of questions:

  • The barcodes with 0 / very few genes assigned to them are useless, right? Shouldn't we also have a step to exclude them? For example a second knee step based on number of genes detected?
  • More out of curiosity, what is going on with these barcodes?? The plot below shows that we get many barcodes whose frequencies are similar to "real" cells, and even ones with super high frequencies (e.g. the highest frequency in this sample), where the reads do not map to any genes at all. Is this expected?

Thanks!
Will

[screenshot: barcode frequency vs. number of genes detected]

quality-control metrics in R: extremely low counts of genes per cell

Hello,

I am testing alevin-fry on some public scRNA-seq data (GEO accession: GSE81076). The issue is that when I process the gene-by-cell count matrix in R and compute per-cell metrics, I get extremely low counts of genes per cell. I repeated the analysis several times, changing some parameters, but I can't seem to find the error.

The data has 18 samples, I am now processing only one of them to try to understand where the problem lies. Here is a detailed explanation of what I am doing:

  • Characteristics of the data that I'm using: paired-end, forward stranded, read 1 is composed of 8 first bases corresponding to the cell-barcode followed by 4 bases of the UMI sequence. Read 2 is just the transcript sequence.

  • First, I built the salmon/alevin splici index following the tutorial "Resolving the splicing origins of UMIs to improve the specificity of single-cell RNA-seq transcriptome mapping with alevin-fry"

  • Mapping the reads:
    salmon alevin -i ../alevin_index/grch38_splici_idx -p 10 -l ISF --bc-geometry 1[1-8] --umi-geometry 1[9-12] --read-geometry 2[1-end] --rad -1 73_1.fastq.gz -2 73_2.fastq.gz -o 73_SAM_pbmc_10k_v3_map --tgMap t2g.tsv --forceCells 96

  • Permit-list:
    alevin-fry generate-permit-list -d fw --force-cells 96 -i 73_SAM_pbmc_10k_v3_map -o 73_SAM_pbmc_10k_v3_quant

  • Collating the file:
    alevin-fry collate -t 10 -i 73_SAM_pbmc_10k_v3_quant -r 73_SAM_pbmc_10k_v3_map

  • UMI quantification:
    alevin-fry quant -t 10 -i 73_SAM_pbmc_10k_v3_quant -o 73_SAM_pbmc_10k_v3_quant_res --tg-map ../tg2_3col.tsv --resolution cr-like-em --use-mtx

  • Getting the SA gene-by-cell matrix:
    I used the RScript provided in the tutorial to generate the SingleCellExperiment object.

Then, I extract the counts assay from the generated SCE object and take a look at it by converting it into a matrix. This is what it looks like:
[screenshot: head of the gene-by-cell counts matrix]

As you can see, the gene counts are not whole numbers. Afterwards, I compute some quality-control metrics, also in R, obtaining the following results for the gene counts per cell:

summary(per.cell$detected)

Min. 1st Qu. Median Mean 3rd Qu. Max.

3.00 31.00 61.00 63.12 92.00 154.00

I tried with different samples but all of them yield very low count results. The percentage of mapped reads is 42%, which is not too high, but I obtained a similar percentage with other mapping tools, so I suspect that the error lies in the downstream steps. Please let me know if there is something wrong with my code.

Thanks in advance,

Aintzane.

No "usa_mode" flag

Hello~ I'm following the RNA-velocity tutorial using alevin-fry. Everything works fine without any errors, but I don't have the usa_mode flag in the meta_info.json file. The num_genes value is strange too, as I only have about 35k. I wonder what the cause could be. I'm using salmon 1.6.0 and alevin-fry 0.1.0 and working on human data. Thank you for your time.

`quickersort` dependency no longer available

Upon cloning the main branch and attempting to build alevin-fry with cargo 1.60.0 (d1fd9fe 2022-03-01) (rustc 1.60.0 (7737e0b5c 2022-04-04)), I get the following output:

Updating crates.io index
Updating git repository `https://github.com/parazodiac/SingleCellExperiment`
error: no matching package named `quickersort` found
location searched: registry `crates-io`
required by package `alevin-fry v0.5.0 (/scratch/bclab/james/2022.05.10.3T3_libGli2-pool1_10X_spike-in/alevin-fry)`

Indeed, quickersort has been yanked from crates.io, and the project page notes:

There's really no point in using this library any more. Everything good about it has been incorporated into std::sort_unstable in the Rust standard library, which even uses a better algorithm. Just use that.

A quick review suggests that quickersort is mainly used in pugutils.rs, rad_types.rs, eq_class.rs, and collate.rs. sort_unstable seems like it ought to be a drop-in replacement for these.
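For illustration, the swap is essentially mechanical; a minimal, self-contained Rust sketch (the record type and comparator here are hypothetical, not taken from the alevin-fry sources):

fn main() {
    // records of (key, payload), to be sorted by key
    let mut recs: Vec<(u64, u32)> = vec![(3, 10), (1, 42), (2, 7)];

    // before: quickersort::sort_by(&mut recs, &|a, b| a.0.cmp(&b.0));
    // after: the standard library's unstable sort, same comparator shape
    recs.sort_unstable_by(|a, b| a.0.cmp(&b.0));

    assert_eq!(recs, vec![(1, 42), (2, 7), (3, 10)]);
}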

barcode correction of concatenated barcodes does not find any correctable barcodes

Hi,

I am working with scifi-RNA-seq data, which has both a 13bp RT barcode and a 16bp 10X bead barcode. The full R1 structure is: 8bp UMI + 16bp bead barcode + 13bp RT barcode. The method produces a lot of background barcodes, so I initially started using the -u flag for generate-permit-list to pass a file with all possible concatenated combinations of barcodes. However, I get a warning that zero barcodes had exactly one single-edit neighbor. I figured that maybe the two barcodes combined are too noisy, so I used another program to preprocess and correct all RT barcodes, only retaining reads with correct/correctable RT barcodes. I then generated new R1 FASTQs with the corrected RT barcodes.

However, I still run into the same warning:

salmon alevin -i ../Combined_Ref/H38M39_index -p 32 -l IU --read-geometry 2[1-end] --bc-geometry 1[9-37] --umi-geometry 1[1-8] --sketch -1 1R1_765subcr.fastq -2 1R2_765subcr.fastq -o 765subcr_map --tgMap ../Combined_Ref/t2g.tsv --minScoreFraction .6

alevin-fry generate-permit-list -d fw -i 765subcr_map -o 765subcr_quant -u ../Combined_Ref/scifiwhitelist.txt
2021-11-03 15:53:31 INFO number of unfiltered bcs read = 283,115,520
2021-11-03 15:53:31 INFO paired : false, ref_count : 376,950, num_chunks : 2,113
2021-11-03 15:53:31 INFO read 2 file-level tags
2021-11-03 15:53:31 INFO read 2 read-level tags
2021-11-03 15:53:31 INFO read 1 alignemnt-level tags
2021-11-03 15:53:31 INFO File-level tag values FileTags { bclen: 29, umilen: 8 }
2021-11-03 15:53:33 INFO observed 10,502,772 reads (10,030,355 orientation consistent) in 2,113 chunks --- max ambiguity read occurs in 2,262 refs
2021-11-03 15:53:33 INFO minimum num reads for barcode pass = 10
2021-11-03 15:53:37 INFO num_passing = 217,821
2021-11-03 15:53:39 INFO found 217,821 cells with non-trivial number of reads by exact barcode match
2021-11-03 15:53:40 INFO There were 2161815 distinct unmatched barcodes, and 0 that can be recovered
2021-11-03 15:53:40 INFO Matching unmatched barcodes to retained barcodes took 855.5647ms
2021-11-03 15:53:40 INFO Of the unmatched barcodes
============
2021-11-03 15:53:40 INFO        0 had exactly 1 single-edit neighbor in the retained list
2021-11-03 15:53:40 INFO        0 had >1 single-edit neighbor in the retained list
2021-11-03 15:53:40 INFO        5,701,852 had no neighbor in the retained list
2021-11-03 15:53:40 INFO total number of distinct corrected barcodes : 0
2021-11-03 15:53:41 WARN found 0 corrected barcodes; please check the input.

Since I only pre-corrected RT barcodes, I ran the pipeline twice with either only the uncorrected bead barcode segment selected or the corrected RT barcode segment selected. Both performed as expected, with the RT barcode run showing 384 barcodes and 100% match rate. The bead barcode run showed some correctable barcodes.

RT barcode only:

 salmon alevin -i ../Combined_Ref/H38M39_index -p 32 -l IU --read-geometry 2[1-end] --bc-geometry 1[25-37] --umi-geometry 1[1-8] --sketch -1 1R1_765subcr.fastq -2 1R2_765subcr.fastq -o 765subcr_map --tgMap ../Combined_Ref/t2g.tsv --minScoreFraction .6

alevin-fry generate-permit-list -d fw -i 765subcr_map -o 765subcr_quant -u ../Combined_Ref/scifibarcodelist.txt
2021-11-03 15:50:13 INFO number of unfiltered bcs read = 384
2021-11-03 15:50:13 INFO paired : false, ref_count : 376,950, num_chunks : 2,116
2021-11-03 15:50:13 INFO read 2 file-level tags
2021-11-03 15:50:13 INFO read 2 read-level tags
2021-11-03 15:50:13 INFO read 1 alignemnt-level tags
2021-11-03 15:50:13 INFO File-level tag values FileTags { bclen: 13, umilen: 8 }
2021-11-03 15:50:14 INFO observed 10,502,776 reads (10,030,359 orientation consistent) in 2,116 chunks --- max ambiguity read occurs in 2,262 refs
2021-11-03 15:50:14 INFO minimum num reads for barcode pass = 10
2021-11-03 15:50:14 INFO num_passing = 384
2021-11-03 15:50:14 INFO found 384 cells with non-trivial number of reads by exact barcode match
2021-11-03 15:50:14 INFO There were 0 distinct unmatched barcodes, and 0 that can be recovered
2021-11-03 15:50:14 INFO Matching unmatched barcodes to retained barcodes took 2.2µs
2021-11-03 15:50:14 INFO Of the unmatched barcodes
============
2021-11-03 15:50:14 INFO        0 had exactly 1 single-edit neighbor in the retained list
2021-11-03 15:50:14 INFO        0 had >1 single-edit neighbor in the retained list
2021-11-03 15:50:14 INFO        0 had no neighbor in the retained list
2021-11-03 15:50:14 INFO total number of distinct corrected barcodes : 0
2021-11-03 15:50:14 WARN found 0 corrected barcodes; please check the input.

Bead barcode only:

salmon alevin -i ../Combined_Ref/H38M39_index -p 32 -l IU --read-geometry 2[1-end] --bc-geometry 1[9-24] --umi-geometry 1[1-8] --sketch -1 1R1_765subcr.fastq -2 1R2_765subcr.fastq -o 765subcr_map --tgMap ../Combined_Ref/t2g.tsv --minScoreFraction .6

alevin-fry generate-permit-list -d fw -i 765subcr_map -o 765subcr_quant -u ../Combined_Ref/737K-cratac-v1.txt
2021-11-03 15:47:57 INFO number of unfiltered bcs read = 737,280
2021-11-03 15:47:57 INFO paired : false, ref_count : 376,950, num_chunks : 2,117
2021-11-03 15:47:57 INFO read 2 file-level tags
2021-11-03 15:47:57 INFO read 2 read-level tags
2021-11-03 15:47:57 INFO read 1 alignemnt-level tags
2021-11-03 15:47:57 INFO File-level tag values FileTags { bclen: 16, umilen: 8 }
2021-11-03 15:47:57 INFO observed 10,502,772 reads (10,030,355 orientation consistent) in 2,117 chunks --- max ambiguity read occurs in 2,262 refs
2021-11-03 15:47:57 INFO minimum num reads for barcode pass = 10
2021-11-03 15:47:57 INFO num_passing = 52,060
2021-11-03 15:47:57 INFO found 52,060 cells with non-trivial number of reads by exact barcode match
2021-11-03 15:47:57 INFO There were 368621 distinct unmatched barcodes, and 186640 that can be recovered
2021-11-03 15:47:57 INFO Matching unmatched barcodes to retained barcodes took 98.4977ms
2021-11-03 15:47:57 INFO Of the unmatched barcodes
============
2021-11-03 15:47:57 INFO        255,553 had exactly 1 single-edit neighbor in the retained list
2021-11-03 15:47:57 INFO        10,991 had >1 single-edit neighbor in the retained list
2021-11-03 15:47:57 INFO        330,518 had no neighbor in the retained list
2021-11-03 15:47:58 INFO total number of distinct corrected barcodes : 186,640

Given the results above, I would have expected that the concatenated bead + corrected RT barcode run would detect some single-edit neighbors attributable to the bead barcode portion and be able to correct them. However, this doesn't seem to be the case, and I don't understand why. Please let me know if I am approaching this the wrong way.

I'm attaching all the necessary files to reproduce this here.

Thank you!
Kai

Cannot update libradicl

I tried to install alevin-fry version 0.6.0 with Rust 1.52.1, but the installation fails:

$ cargo install --path .
  Installing alevin-fry v0.6.0 (/var/vhosts/vitsoft-home/sib-easyconfigs.git/easybuild/easyconfigs/a/alevin-fry/alevin-fry-0.6.0)
    Updating crates.io index
    Updating git repository `https://github.com/COMBINE-lab/libradicl`
error: failed to compile `alevin-fry v0.6.0 (/var/vhosts/vitsoft-home/sib-easyconfigs.git/easybuild/easyconfigs/a/alevin-fry/alevin-fry-0.6.0)`, intermediate artifacts can be found at `/var/vhosts/vitsoft-home/sib-easyconfigs.git/easybuild/easyconfigs/a/alevin-fry/alevin-fry-0.6.0/target`

Caused by:
  failed to get `libradicl` as a dependency of package `alevin-fry v0.6.0 (/var/vhosts/vitsoft-home/sib-easyconfigs.git/easybuild/easyconfigs/a/alevin-fry/alevin-fry-0.6.0)`

Caused by:
  failed to load source for dependency `libradicl`

Caused by:
  Unable to update https://github.com/COMBINE-lab/libradicl

Caused by:
  failed to find branch `master`

Caused by:
  cannot locate remote-tracking branch 'origin/master'; class=Reference (4); code=NotFound (-3)

I think it comes from the master branch of libradicl having been renamed to main.

Error in alevin-fry quant

Hi alevin-fry Developers,

I am getting the following error when running alevin-fry quant:

thread 'main' panicked at 'could not quantify rad file.: Error(Deserialize { pos: Some(Position { byte: 0, line: 1, record: 0 }), err: DeserializeError { field: None, kind: Message("invalid length 1, expected a tuple of size 2") } })', src/main.rs:286:10
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

salmon 1.4.0
alevin-fry 0.1.0 from bioconda

I am following the alevin-fry tutorial with the following commands:

salmon alevin -l ISR --sketch --rad --chromiumV3 -p 20 -o ./map -i ./index/ --tgMap ./transcript_to_gene.2col.txt -1 ./pbmc_10k_v3_S1_L001_R1_001.fastq.gz -2 ./pbmc_10k_v3_S1_L001_R2_001.fastq.gz
alevin-fry generate-permit-list -d fw -k -i ./map -o ./quant
alevin-fry collate -t 20 -i ./quant -r ./map
alevin-fry quant   -t 20 -i ./quant -o ./quant --resolution cr-like --use-mtx --tg-map ./transcript_to_gene.2col.txt

The first 3 commands complete successfully. Full stdout is attached:
log.txt

Thanks!
Alex

[splici Reference] Mixed R2 read length

Hi there,
I am very excited about your tutorial about using a splici reference with alevin-fry and would love to test it out.
I have one question about the read_length parameter that is used to define the flank length added to each intron:

In my single-cell data, I have some samples with 90bp R2 reads and some with 150bp R2 reads. As far as I understand, it should be safe to go for a read_length of 90bp, right? Or do you advise something different?

Thanks a lot.

Best,
Tobi

Rounding the matrix?

Hello,

I was wondering whether rounding the quantifications would be justifiable with regard to the "correctness/accuracy" of the data? Specifically, I'm running a pipeline that requires integers; from reading the new preprint and the initial alevin paper, it seems reasonable/OK to round the matrix without too much negative effect?

Best,
Chang

read length on the splici index

Hello,

I tried to map some reads with variable lengths and obtained this output:
'processed 277 Million fragments
hits: 0, hits per frag: 0
Found 277568862 reads with N in the UMI sequence and ignored the reads.
Please report on github if this number is too large'

I was wondering if the read length I specified when building the splici index had something to do with the fact that none of the reads mapped. I originally built that index to map a different set of files, specifying the length of those previous sequences (and that mapping went fine), which does not match the read length of the files I'm currently trying to map. Besides, the current files have variable read lengths. Do you think the read-length parameter could be the problem here? Should I build a different index for the files I'm trying to map now? But again, they have reads of different lengths.

Thanks in advance,

Aintzane

Consider organization of metadata files to avoid name clashes with salmon

As pointed out by @jashapiro in issue 688 in the salmon repo, there is a name collision between the cmd_info.json file written by alevin-fry and the one written by salmon alevin if the quantifications are placed in the same directory as the input RAD file. Further, both tools produce their own meta_info.json file, though a clash is avoided because one writes this to an aux subdirectory while the other doesn't.

We should design a better scheme for naming and placement of metadata files, so that names don't clash even if the quantification output is written to the same directory as the initial RAD file.

Difference in expression of spliced/ambiguous transcripts between old and new versions of Salmon/Alevin-fry

Hello again.

I originally posted about a "Difference in Agrp expression depending on alignment strategy" (#81)

I closed the issue, since the issue never had anything to do with alignment strategy (Thanks a ton, @rob-p, for the work leading to this conclusion!). And the issue was ultimately solved by upgrading Salmon and Alevin-fry to the newest versions.

But we are still left wondering why we see the discrepancy between spliced counts (depending on which version of Alevin-fry/Salmon is used).


So, I am going to try and pick this up where we left it as good as I can. But first some nomenclature:

Original:
versions: Salmon v1.5.0, Alevin-fry v0.3.0
Updated:
versions: Salmon v1.9.0, Alevin-fry v0.7.0
U-, S- & A-counts:
unspliced, spliced & ambiguous counts


In an e-mail exchange you, @rob-p, proposed the following:

...I believe 0.4.0 is where we added a β€œprefer spliced gene” heuristic. If a read maps to more than one gene (but not too many), and all but one of the mappings are to unspliced sequence, then we allocate the read to the gene where it mapped to the spliced sequence....

After thinking about it, I do not think that this is the case. E.g. if you plot the total Agrp count across the same cells in the original and updated object you get this:
[screenshot: total Agrp counts across the same cells, original vs. updated]
Here you can see that the extra abundance of Agrp counts doesn't come from the unspliced assay, but rather it just appears from nothing.

For scale, here is the same plot, but for accumulated gene counts:
Screenshot 2022-09-20 at 13 10 59
Here, it is not as apparent, but it is still clear that we get a portion of S-counts (when going from original to updated) that is too big to be 'just' coming from the U-counts.


Another thing we tested was to plot all individual gene counts against each other (between original and updated object).
We did it for three tissues types (brain, adipose & liver), and distinguished between U-, S- & A-counts.
[screenshot: per-gene U-, S- and A-counts compared between original and updated objects, for brain, adipose and liver]
Here, you can see that there is high agreement in the U-counts, but there seems to be a significant 'bleeding' of expression towards the updated object in the A-counts and, in particular, in the S-counts.


Given this info, would you have any idea why upgrading Salmon/Alevin-fry in the described manner could cause this discrepancy?
Let me know what you think and if you want me to send more plots (or code) along.

Kind regards
Oliver

collate step runs very slowly

Hi! I recently switched over to alevin-fry from alevin. I am testing it out on a HEK293T/NIH3T3 mixing experiment dataset, but it seems like the alevin-fry collate step progresses really slowly.

Here's the console:

alevin-fry collate -t 16 -i 383sub1_6_quant -r 383sub1_6_map
2021-10-25 15:43:12 INFO filter_type = Filtered
2021-10-25 15:43:12 INFO collated rad file will not be compressed
2021-10-25 15:43:13 INFO paired : false, ref_count : 376,950, num_chunks : 27,800, expected_ori : Forward
2021-10-25 15:43:13 INFO read 2 file-level tags
2021-10-25 15:43:13 INFO read 2 read-level tags
2021-10-25 15:43:13 INFO read 1 alignemnt-level tags
2021-10-25 15:43:13 INFO File-level tag values FileTags { bclen: 29, umilen: 8 }
2021-10-25 18:07:01 INFO deserialized correction map of length : 28,232,141
2021-10-25 18:07:06 INFO Generated 49 temporary buckets.
  [00:00:22] [========================================]   27800/27800   partitioned records into temporary files.
  [00:00:27] [========================================]      49/49      gathered all temp files.
2021-10-25 18:07:56 INFO writing num output chunks (120,796) to header
2021-10-25 18:07:56 INFO expected number of output chunks 120,796
2021-10-25 18:07:56 INFO finished collating input rad file "383sub1_6_map/map.rad".

As you can see from the console output, it initializes, then runs for about 2 hours 30 minutes. During this time it's only using 15% of one CPU core on a 16-core 3950X. RAM and disk usage also appear to be minimal. There are about 139 million mapped reads from earlier in the pipeline going into this step.

Does this behavior seem normal? Is there anything I can do to speed it up?

Thank you!

Question: how to improve cell barcode correction?

Hi,
I have been having issues recovering the same number of cell barcodes post-correction in salmon alevin compared to Cell Ranger. Maybe alevin-fry can help with this task?

What I have

  • scRNA-seq reads from the 10X kit v3.1 from interspecific hybrid embryos
  • A diploid reference transcriptome
  • a preliminary alignment of the scRNA-seq to just one species with CellRanger

What I need
I have been running salmon alevin to get alignments and do some allele-specific analysis in a single-cell setting. I already use a whitelist containing only the cell barcodes that are passed as true cells by CellRanger (~80% of the total). This is the generic line I usually run:
salmon alevin -lISR -1 $Read1 -2 $Read2 --chromiumV3 -i $index -p 12 -o $outDir --tgMap $tsv --whitelist $whitelist --numCellBootstraps 20 --dumpFeatures

What the issue is
This only gives me a 37% mapping rate (expected is ~70% from running salmon pseudobulk on the data), and troubleshooting shows that a large number of reads are discarded because of "noisy barcodes". From alevin_meta_info.json:
"total_reads": 82603614, "reads_with_N": 0, "noisy_cb_reads": 35104396, "noisy_umi_reads": 7532

Question
From my understanding, alevin-fry could help with the cell barcode correcting, so I tried following the docs for "generate-permit-list" but I still have some questions.

First, I am actually having trouble generating the RAD directory that the --rad and --sketch options should produce. Running salmon alevin as above with these two options added (either individually or together) generated a map.rad file that alevin-fry doesn't take as input.

$ alevin-fry generate-permit-list --input map.rad --output-dir $outDir --expected-ori either --valid-bc $whitelist
error: Invalid value "map.rad" for '--input <INPUT>': No valid directory was found at this path.

For more information try --help
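For reference, --input expects the mapping output directory, i.e. the directory containing map.rad, rather than the RAD file itself; a sketch reusing the placeholder paths from above ($permitDir is a hypothetical output directory):

alevin-fry generate-permit-list --input $outDir --output-dir $permitDir --expected-ori either --valid-bc $whitelist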

This brings me to my second question: the above code would take my whitelist that contains barcodes I already know are true cells and correct against it, right? But in theory salmon alevin also does this in my quantification step. I know the other option is to use a list of all available barcodes and change the --min-reads threshold, but is this actually better than knowing which barcodes are true cells? Why not set this true whitelist as --unfiltered-pl and then --min-reads 10?

I hope I was clear enough but I would obviously be happy to elaborate on any unclear part or anything I might have left out :)

Cheers,
Anna

How to keep all the cellbarcodes?

Hi Alevin-fry developers,

I am currently working on a combinatorial indexing scRNA-seq dataset. Given that my dataset will have a lot of ambient RNA, I would like to use the empty droplets for background estimation. So I am wondering if there is a way to keep all the cell barcodes, with no filtering, something like the raw_feature_bc_matrix generated by 10x Cell Ranger?

I initially thought that setting the naiveEqclass and keepCBFraction 1.0 parameters would do the trick, but somehow the outputs are the same. I am wondering if there is something I missed when running alevin-fry?

The codes I used:

salmon alevin -i /alevin_index_human_mouse/salmon_idx_humanv31_mouseM25_wo_decoy -l A -1 R1_fastq.gz -2 R2_fastq.gz \
        --naiveEqclass --keepCBFraction 1.0 \
        --read-geometry 2[1-end] \
        --umi-geometry 1[xx-xx] \
        --bc-geometry 1[xx-xx,xx-xx, xx-xx] \
        -p 20 -o ./alevin_RNA_fry --rad

alevin-fry generate-permit-list -d both -i ./alevin_RNA_fry --output-dir ./alevin_out_permit_knee -k
alevin-fry collate -r ./alevin_RNA_fry -t 16 -i ./alevin_out_permit_knee
alevin-fry quant -m /alevin_index_human_mouse/GRCh38-and-mm10_tsvtogene_all.tsv -i ./alevin_out_permit_knee -o ./alevin_counts -t 16 --resolution trivial --use-mtx
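For reference, generate-permit-list also has an unfiltered route that retains every barcode appearing in a provided list; a sketch with the paths above (full_barcode_list.txt is a hypothetical file of all valid barcodes, and --min-reads sets the minimum read count for a barcode to be kept):

alevin-fry generate-permit-list -d both -i ./alevin_RNA_fry -o ./alevin_out_permit_unfilt --unfiltered-pl full_barcode_list.txt --min-reads 1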

Thanks a lot for your help in advance.

Best,
Bingjie

`alevin-fry view` panicked at the prospect of providing help (or working)

I am trying to determine what all can be done to/with the RAD file output via a typical alevin-fry quant run, but I am struggling to find much info about the view command. Ultimately, I was hoping to convert back to BAM to spotcheck induction of a mutation in our model system.

I was hoping the command line help would yield some insight, but it throws an error:

alevin-fry view -h
thread 'main' panicked at '`help`s `-h` conflicts with `header`.

To change `help`s short, call `cmd.arg(Arg::new("help")...)`.', /opt/conda/conda-bld/alevin-fry_1645548081272/work/.cargo/registry/src/github.com-1ecc6299db9ec823/clap-3.1.1/src/build/command.rs:4249:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I looked through the repo code to find the proper args (-r, etc.), but it still throws the same error, so it doesn't seem this subcommand is functional yet.

I'm guessing you're well aware of this, and it's uncommon for people to want to interact with the RAD files directly. But it may be a good idea to remove it from the CLI help subcommand list until it's usable.
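For context, the conflict arises because the subcommand defines a -h (header) flag while clap auto-generates -h for help; one way to resolve it in clap 3.x is to move help's short flag, roughly as follows (arg names hypothetical, not the actual alevin-fry CLI definition):

use clap::{Arg, Command};

fn main() {
    let matches = Command::new("view")
        // move the auto-generated help flag's short off of -h ...
        .mut_arg("help", |h| h.short('?'))
        // ... so a user-defined header flag can claim it
        .arg(Arg::new("header").short('h').long("header").help("print the RAD header"))
        .get_matches();
    println!("header requested: {}", matches.is_present("header"));
}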

simpleaf singularity container won't run

Hi COMBINE-lab, I'm trying out the tutorial here, and I've run into an issue that you may want to know about, because it could inhibit uptake of your tools by other new users. When I run singularity exec usefulaf_latest.sif, or a more complex command similar to the tutorial's index-creation step, I get this error.

ERROR  : Unknown image format/type: usefulaf_latest.sif
ABORT  : Retval = 255

I'm using singularity 2.4.2-dist on Ubuntu 18.04. I didn't expect this type of issue because I'm using AWS, and it's pretty much a blank slate with little else installed.

The tutorial also provides the option to pull from DockerHub, which could help in case I got a corrupted version of the .sif or something. But that hasn't worked either; here's what errs.

$ singularity pull docker://combinelab/usefulaf:latest
WARNING: pull for Docker Hub is not guaranteed to produce the
WARNING: same image on repeated pull. Use Singularity Registry
WARNING: (shub://) to pull exactly equivalent images.
Docker image path: index.docker.io/combinelab/usefulaf:latest
Cache folder set to /home/ubuntu/.singularity/docker
[16/16] |===================================| 100.0% 
Importing: base Singularity environment
tar: ./.exec: implausibly old time stamp -9223372036854775808
tar: ./.run: implausibly old time stamp -9223372036854775808
tar: ./.shell: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/actions/exec: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/actions/run: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/actions/shell: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/actions/start: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/actions/test: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/actions: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/env/01-base.sh: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/env/90-environment.sh: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/env/95-apps.sh: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/env/99-base.sh: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/env: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/libs: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/runscript: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d/startscript: implausibly old time stamp -9223372036854775808
tar: ./.singularity.d: implausibly old time stamp -9223372036854775808
tar: ./.test: implausibly old time stamp -9223372036854775808
tar: ./dev: implausibly old time stamp -9223372036854775808
tar: ./environment: implausibly old time stamp -9223372036854775808
tar: ./etc/hosts: implausibly old time stamp -9223372036854775808
tar: ./etc/resolv.conf: implausibly old time stamp -9223372036854775808
tar: ./etc: implausibly old time stamp -9223372036854775808
tar: ./home: implausibly old time stamp -9223372036854775808
tar: ./proc: implausibly old time stamp -9223372036854775808
tar: ./root: implausibly old time stamp -9223372036854775808
tar: ./singularity: implausibly old time stamp -9223372036854775808
tar: ./sys: implausibly old time stamp -9223372036854775808
tar: ./tmp: implausibly old time stamp -9223372036854775808
tar: ./var/tmp: implausibly old time stamp -9223372036854775808
tar: ./var: implausibly old time stamp -9223372036854775808
tar: .: implausibly old time stamp -9223372036854775808
Importing: /home/ubuntu/.singularity/docker/sha256:7b1a6ab2e44dbac178598dabe7cff59bd67233dba0b27e4fbd1f9d4b3c877a54.tar.gz

gzip: /home/ubuntu/.singularity/docker/sha256:7b1a6ab2e44dbac178598dabe7cff59bd67233dba0b27e4fbd1f9d4b3c877a54.tar.gz: not in gzip format
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
Cleaning up...
ERROR: pulling container failed!

Could there be something wrong with the image that the tutorial points to?

Calculating the spliced/unspliced ratio and what to do with the ambiguous counts

Dear authors,

I want to calculate the spliced/unspliced gene ratio, but I am not sure what to do with the ambiguous count table. Should I just remove it, or combine it with the spliced or unspliced counts?

I am a beginner in this area, so I'd like to apologize in advance for the naive question.

Best,
Mikhael

no spike-in data in the gene-by-cell count matrices after UMI quantification

Hello,

I am working with data that has spike-ins and processing it in USA mode. I noticed that after generating the transcript-to-gene file using the code provided in the tutorial 'Resolving the splicing origins of UMIs to improve the specificity of single-cell RNA-seq transcriptome mapping with alevin-fry', the 'transcriptome_splici_fl71_t2g_3col.tsv' file did not contain the proper gene ID annotation for the spike-in data:
[screenshot: spike-in entries in the 3-column t2g file]

Interestingly, the 'transcriptome_splici_fl71_t2g.tsv' file had the correct annotation:
[screenshot: spike-in entries in the 2-column t2g file]

As a result, after the UMI quantification step (where I provided the 3-column file, since I am using USA mode), the matrices I obtained did not have spike-in data. I think the issue might be the '-' that the spike-ins have in their names, since alevin-fry was able to detect the rest of the regular endogenous genes, which have no '-', without any problems.

Although I was able to manually fix the 'transcriptome_splici_fl71_t2g_3col.tsv' file, I just wanted to report the issue in case you wanted to consider it for later releases.

Thank you!

Aintzane

Question(dev): Benchmarking script

Hi,

Is it possible to add some kind of benchmarking tests to the test suite, so that one can test for performance regressions while implementing changes to the code? I'm not a biologist/gene guy (just a random HPC enthusiast), so I don't have the brainpower to do so myself.

Fewer true cells detected with alevin-fry.

Hi, to whom it may concern,
I was using alevin-fry with the splici index on a 10x v3 PBMC 5k dataset with simpleaf.
Here is my script,
singularity exec --cleanenv
--bind $singular_p:/workdir
--pwd /usefulaf/bash usefulaf_latest.sif
./simpleaf quant
-1 /workdir/$fastq1
-2 /workdir/$fastq2
-i /workdir/human_splici/index
-o /workdir/quants/$project
-f u -c v3 -r cr-like
-m /workdir/human_splici/ref/transcriptome_splici_fl86_t2g_3col.tsv
-t 16

However, I get fewer true cells: only about 3,000 are returned.

I used the old 10x quantile strategy to check.
Here is what I have,

af_all = load_fry(splici_path)
counts(af_all)->m
libs <- colSums(m)
o <- order(libs, decreasing = TRUE)
top <- libs[head(o, n = 5000)]
summary(top)
Min. 1st Qu. Median Mean 3rd Qu. Max.
79 156 6352 6028 8877 51043
quantile(top, 0.99)
99%
24923.19
table(libs>quantile(top, 0.99)*0.1)

FALSE TRUE
171451 3171

Any suggestions for solving this?
Cheers,
Yue

Too few genes being read into adata

Hi,

Why does the adata only have 9k genes after running load_fry? Does the data itself only have 9k genes, or did something go wrong in the upstream process?

Thanks!
Best

alevin-fry doesn't compile with rand 0.8.3.

Hi, I was trying to build alevin-fry and encountered this error.

   Compiling alevin-fry v0.1.0 (/home/diane/proj/delaney/alevin-fry)
error[E0061]: this function takes 1 argument but 2 arguments were supplied
  --> src/main.rs:33:27
   |
33 |             let idx = rng.gen_range(0, CHARSET.len());
   |                           ^^^^^^^^^ -  ------------- supplied 2 arguments
   |                           |
   |                           expected 1 argument

error: aborting due to previous error

For more information about this error, try `rustc --explain E0061`.
error: could not compile `alevin-fry`.

I looked through the documentation for rng.gen_range, and it seems to now want a range, low..high.
https://docs.rs/rand/0.8.3/rand/trait.Rng.html
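A minimal sketch of the corresponding fix (CHARSET abbreviated here; assumes rand = "0.8" in Cargo.toml):

use rand::Rng;

const CHARSET: &[u8] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

fn main() {
    let mut rng = rand::thread_rng();
    // rand 0.7: rng.gen_range(0, CHARSET.len())
    // rand 0.8: a single Range argument
    let idx = rng.gen_range(0..CHARSET.len());
    println!("{}", CHARSET[idx] as char);
}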

alevin-fry quant fails bc of mismatch in number of tx present vs number of tx in tg-map

Hi, I was testing out alevin-fry and got up to the quant step, and got this error message

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `140992`,
 right: `141131`: The tg-map must contain a gene mapping for all transcripts in the header', libradicl/src/quant.rs:467:5

I'm using the same file for --tg-map that I used for --tgMap in my salmon alevin command; the transcripts in this file also match the fasta I used for my salmon index command. I commented out the assertion causing the failure, rebuilt, and it ran okay. If I had to guess, the few transcripts that are removed for various reasons (too short?) during salmon index will be in my tg-map but not in the downstream data. Maybe change the assertion to fail only if there are transcripts in the RAD file that are not in the tg-map file.
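As a stopgap until the assertion is relaxed, one workaround is to subset the t2g map to the transcripts the index actually retained (salmon logs which sequences it drops at indexing time); a sketch with hypothetical file names, where retained_tx.txt holds one retained transcript ID per line:

awk 'NR==FNR { keep[$1]; next } $1 in keep' retained_tx.txt t2g.tsv > t2g.filtered.tsv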

processing trimmed or un-trimmed reads with alevin-fry

Hello,

I would like to know whether trimming the reads (removing adaptors and low-quality bases/reads) prior to mapping with alevin-fry makes a difference, since I wasn't able to find any mention of this in the tutorials. Does alevin-fry run some sort of trimming as part of the mapping algorithm prior to aligning the reads, or do you consider it unnecessary?

Thanks in advance,

Aintzane

How to make alevin-fry --expect cells to be better at guessing the number of cells?

Discussed in #101

Originally posted by ljudevitluka January 31, 2023
Hi all,
I am working on a scRNA-seq dataset generated with the 10x Genomics platform. We aimed to collect 10,000 cells per experiment, and we mapped the reads to the transcriptome (since no genomic reference is available at the moment). Everything seems to be working more or less OK with alevin-fry, but in my samples I always get an overestimation of the cell number (as per the graph below). I am aware we could use --force-cells, but I would be more comfortable if the knee were detected via the --expect-cells flag. I tried calling --expect-cells with 5000, 8000, and 10000 cells, but the results were always overestimated.
My question is: Is there a way to make --expect-cells more sensitive? If I use --force-cells, how will this affect my downstream analysis if some of the samples carry noise? If there is no automated way, is it better to take a bit of noise and hope it gets filtered out by the subsequent cell QC (mitochondrial gene content, ribosomal content, etc.), or to be on the safe side and take a few fewer cells, at the risk of losing cells that have a lower number of genes expressed?
Thank you very much for your help and opinions :) Hopefully, this question is not out of the scope of discussion.
All the best, Luka
[screenshot: knee plot showing the overestimated cell number]

Documentation "quick-start"

Hello

I have some questions regarding the documentation in https://github.com/COMBINE-lab/alevin-fry#a-quick-start-run-through-on-sample-data

I read and worked through the quick-start and pulled the latest singularity image: singularity pull docker://combinelab/usefulaf:latest

There are some discrepancies between that version and the documentation:

  1. "Building the splici reference and index": there is no parameter -l 91, but there is a parameter -r.
  2. "Quantifying the sample": error: Invalid value 'u' for '--forced-cells <FORCED_CELLS>': invalid digit found in string

So, I am not sure whether the documentation matches the latest (Docker/Singularity) version?

Thank you and best regards,
R.

PS: and when I run it with, for example, the --knee option, I get an unknown error

singularity exec --cleanenv --bind $AF_SAMPLE_DIR:/workdir --pwd /usefulaf/bash usefulaf_latest.sif simpleaf quant -1 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz -2 /workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,/workdir/data/pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz -i /workdir/human_CR_3.0_splici/index -o /workdir/quants/pbmc1k_v3 --knee -c v3 -r cr-like -m /workdir/human_CR_3.0_splici/ref/transcriptome_splici_fl86_t2g_3col.tsv -t 8
2023-02-02T12:28:23.288555Z  INFO simpleaf: deserializing from File { fd: 3, path: "/afhome/simpleaf_info.json", read: true, write: false }
2023-02-02T12:28:23.288601Z  INFO simpleaf: prog info = ReqProgs { salmon: Some(ProgInfo { exe_path: "/opt/conda/bin/salmon", version: "1.9.0" }), alevin_fry: Some(ProgInfo { exe_path: "/opt/conda/bin/alevin-fry", version: "0.8.1" }), pyroe: Some(ProgInfo { exe_path: "/opt/conda/bin/pyroe", version: "0.7.1" }) }
Error: custom geometry string doesn't contain ';' character

'main' panicked, could not open input rad file

Hey all, I'm following this tutorial, https://combine-lab.github.io/alevin-fry-tutorials/2021/alevin-fry-velocity/, after someone at the BioC conference recommended it as a good way to do pre-processing for velocity analysis. While salmon is my go-to tool for bulk RNA-seq, I must admit I have very little experience with its sister tool alevin for single cell, as most of our data is 10X, so I have stayed in the STAR/Cell Ranger ecosystem.

Everything works in the tutorial until this line:

alevin-fry generate-permit-list -d fw -k -i sample_v2_map -o sample_v2_quant

and then I get this error:

thread 'main' panicked at 'could not open input rad file: Os { code: 2, kind: NotFound, message: "No such file or directory" }', libradicl/src/cellfilter.rs:194:52

A quick Google search didn't find any hits; I'm hoping it's a simple configuration issue.

alevin-fry 0.1.0
Python 3.8.5
Ubuntu 20.04 LTS

Feature request : Demultiplexing to process 10x Fixed RNA Profiling kit

The 10x Fixed RNA Profiling kit has some benefits with respect to the standard kit. For example, cells can be stored before library prep, there is a gene targeting mechanism based on specific targeted probes rather than standard polyA capture. Finally, and relevant to this feature request, samples can be efficiently multiplexed with the use of an additional barcode the pCS1 probe described here.

One could, of course, demultiplex the initial reads and process each pCS1 barcode separately with alevin -> alevin-fry, which is what the current recommendation would be. However, it would be much more efficient (and easier for the user) if this could be handled internally for samples prepared with this chemistry.

The purpose of this issue is to discuss how best to handle this feature. Currently, there are 3 different ideas (but I'm open to more).

  1. Have salmon "handle" this, such that when this chemistry is processed, many different "quant" directories are created, one for each pCS1 barcode. The benefit here is that each barcode can then be handled separately without any modifications required to alevin-fry. The downsides are that we will create many potential output directories, we don't know how many there will be or what their names are a priori, and this will add non-trivial complexity on the salmon side.

  2. Have salmon "handle" this by outputting many distinct RAD files in the output quant directory, one for each pCS1 barcode. This is cleaner than 1 in my opinion, but still not ideal. Specifically, we don't know a priori what barcodes exist in the sample before processing, so we don't know what RAD files will be created, how many there will be, or how they should be named. This gets tricky, as RAD files will need to be created dynamically, and that's non-trivial from the highly multithreaded context that exists during read mapping.

  3. So far, my favorite proposed solution. When processing this type of chemistry, have salmon output an augmented RAD file. In addition to the UMI and barcode, this RAD file will include a field for the pCS1 barcode of each processed read. However, all records will be routed to the same initial RAD file. Subsequently, we will add a command to alevin-fry (e.g. alevin-fry demux) that will be capable of demultiplexing this into many distinct "standard" RAD files, which can then be processed as normal. This has the complication that we need to add the ability to parse and demultiplex the augmented RAD format, and we need to determine how this should affect the downstream steps (e.g. it may no longer be true that for such a sample, after demultiplexing, the quant directory contains a single map.rad file). However, this solution currently seems the cleanest to me, and I feel like it is generally a good principle to push such processing off to alevin-fry rather than require it upstream in salmon where possible.

Anyway, thanks to @ATpoint for raising this request. I'm also pinging @DongzeHE and @k3yavi for their input here.

Output summary QC statistics

I am working on a MultiQC module for alevin-fry and noticed there is no QC statistics file similar to alevin_meta_info.json in the output. Would it be possible to add such a file so that I can generate a nice report from it?

Difference in Agrp expression depending on alignment strategy

Hello Alevin-Fry team, and thank you for your cool tool!

I have a question about Alevin-Fry, which I use in my daily work.

There is a discrepancy in the expression of some genes depending on whether you run Alevin-Fry in regular pseudoalignment mode or in selective alignment mode.

We saw this, most notably, with the gene Agrp in snRNAseq mouse brain tissue.

Agrp cells usually cluster together [in UMAP space], but we noticed that Agrp expression seemed quite dispersed when running selective alignment, compared to regular pseudoalignment, on the exact same data:

Selective alignment:
[Figures: Agrp feature plot (UMAP) and violin plot, selective alignment]

Pseudoalignment:
[Figures: Agrp feature plot (UMAP) and violin plot, pseudoalignment]

From reading your docs/tutorials, I have not been able to make sense of where this deficit in Agrp expression would arise, so I really hope that you can help shed some light on this for/with me.

Selective alignment was run according to https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/ and pseudoalignment was run according to https://combine-lab.github.io/alevin-fry-tutorials/2021/running-alevin-fry-fast/

Both tutorials used Cell Ranger's refdata-gex-mm10-2020-A genome.

The only changes to the above tutorials are that both pipestances used unfiltered quant with 10x's whitelist, the library type was changed to ISR, and the read length was corrected to 90 in the make_splici_txome part of selective alignment, to reflect our sequencing output.

I hope that I was able to explain this, and I look forward to hearing your take on it. :)

alevin-fry quant --resolution cr-like-em hangs in rare cases

We have started running many samples through the alevin-fry pipeline, and as we began testing the cr-like-em resolution strategy we encountered the following error in one case (so far!):

thread '<unnamed>' panicked at 'Alpha Sum too small', libradicl/src/em.rs:218:5
stack backtrace:
   0: std::panicking::begin_panic
   1: libradicl::em::em_optimize_subset
   2: libradicl::quant::do_quantify::{{closure}}

This was followed by alevin-fry appearing to hang rather than exiting with an error (CPU usage remained high for at least one thread).

The command used (with directories redacted) was:

alevin-fry quant --input-dir {in} --tg-map index.tx2gene_3col.tsv --resolution cr-like-em  -o {out} --use-mtx -t 2

I tried again with only a single thread and got the same error. Increasing the number of threads resulted in more chunks being processed (and continuing after the panic), but the process still hung before completion.

--resolution cr-like worked as expected.

This error occurred using alevin-fry version 0.4.1, both running directly on MacOS 11.6 and using the BioContainers Docker image (on an Ubuntu 20.04 system).

If there are other tests that I can help with, I would be happy to do so.

An example full command output with a full backtrace is included below:

2021-10-12 14:28:03 INFO quantifying from uncompressed, collated RAD file File { fd: 4, path: "[redacted]/map.collated.rad", read: true, write: false }
2021-10-12 14:28:03 INFO paired : false, ref_count : 625,428, num_chunks : 94,306
2021-10-12 14:28:03 INFO tg-map contained 60,319 genes mapping to 625,428 transcripts.
2021-10-12 14:28:03 INFO read 2 file-level tags
2021-10-12 14:28:03 INFO read 2 read-level tags
2021-10-12 14:28:03 INFO read 1 alignemnt-level tags
2021-10-12 14:28:03 INFO File-level tag values FileTags { bclen: 16, umilen: 12 }
⠀ [00:00:26] [╒╒╒╒╒╒░╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟╟] 14462/94306
thread '<unnamed>' panicked at 'Alpha Sum too small', libradicl/src/em.rs:218:5
stack backtrace:
   0:        0x100f38933 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b5b7de1c79b4b45
   1:        0x100de2f8e - core::fmt::write::h65ec8c8a6aa7b549
   2:        0x100f3637a - std::io::Write::write_fmt::hc52e560ab9e57db4
   3:        0x100f3caff - std::panicking::default_hook::{{closure}}::h0974624c30a6e2b5
   4:        0x100f3d360 - std::panicking::rust_panic_with_hook::h1cdaa4b48d8ea4a1
   5:        0x100e1c32d - std::panicking::begin_panic::{{closure}}::h5fb64ee7ea289a2a
   6:        0x100e199c8 - std::sys_common::backtrace::__rust_end_short_backtrace::ha5d99a8c5039d801
   7:        0x100fd357d - std::panicking::begin_panic::hd38a6e223ccbb64a
   8:        0x100e57ccb - libradicl::em::em_optimize_subset::h536b75956bd94226
   9:        0x100e22b7b - libradicl::quant::do_quantify::{{closure}}::h91f63fc336437966
  10:        0x100e199fb - std::sys_common::backtrace::__rust_begin_short_backtrace::h266e2e2ef561fccd
  11:        0x100e1c728 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h2e45eb7f25374de8
  12:        0x100f409bd - std::sys::unix::thread::Thread::new::thread_start::h59fdfe326de12d08
  13:     0x7fff204f78fc - __pthread_start
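
For what it's worth, a hedged guess at the hang itself (a general Rust failure mode, not verified against the alevin-fry source): if one worker thread panics before sending its result while other workers keep a channel sender alive, the receiving thread blocks forever. This toy program hangs by design to illustrate the mechanism:

use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn main() {
    let (tx, rx) = mpsc::channel::<u32>();

    // A healthy worker that holds a sender but never finishes its chunk.
    let tx_busy = tx.clone();
    thread::spawn(move || {
        let _keep_alive = tx_busy;
        loop { thread::sleep(Duration::from_secs(1)); }
    });

    // A worker that dies before sending anything, like the EM panic above.
    thread::spawn(move || {
        let _tx = tx;
        panic!("Alpha Sum too small");
    });

    // recv() never returns: no value will ever arrive, but a live sender
    // still exists, so the channel is never considered disconnected.
    let _ = rx.recv();
}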

cr-like-em causes thread panic (Alpha Sum too small) w/ high duplicates

summary:
cr-like-em appears to hang alevin-fry quant if the FASTQ being processed has a high duplication rate (roughly >71%?), e.g.:

data:
https://www.10xgenomics.com/resources/datasets/4-k-pbm-cs-from-a-healthy-donor-2-standard-2-1-0

cmd:

singularity exec --cleanenv --bind $PWD:/workdir --pwd /usefulaf/bash usefulaf_latest.sif ./simpleaf quant \
-1 /data/pbmc4k_S1_L001_R1_001.fastq.gz -2 /data/pbmc4k_S1_L001_R2_001.fastq.gz \
-i /workdir/human_CR_3.0_splici/index \
-m /workdir/human_CR_3.0_splici/ref/transcriptome_splici_fl93_t2g_3col.tsv \
-o /workdir/quants/test \
-f u -c v2 -r cr-like-em -t 16

error:

thread '<unnamed>' panicked at 'Alpha Sum too small', libradicl/src/em.rs:218:5
stack backtrace:
   0:     0x55d630004cc2 - std::backtrace_rs::backtrace::libunwind::trace::ha5edb8ba5c6b7a6c
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/../../backtrace/src/backtrace/libunwind.rs:90:5
   1:     0x55d630004cc2 - std::backtrace_rs::backtrace::trace_unsynchronized::h0de86d320a827db2
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x55d630004cc2 - std::sys_common::backtrace::_print_fmt::h97b9ad6f0a1380ff
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x55d630004cc2 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h14be7eb08f97fe80
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x55d62feb321f - core::fmt::write::h2ca8877d3e0e52de
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/fmt/mod.rs:1094:17
   5:     0x55d6300028e5 - std::io::Write::write_fmt::h64f5987220b618f4
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/io/mod.rs:1584:15
   6:     0x55d6300068fb - std::sys_common::backtrace::_print::h7f1a4097308f2e0a
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x55d6300068fb - std::sys_common::backtrace::print::h1f799fc2ca7f5035
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x55d6300068fb - std::panicking::default_hook::{{closure}}::hf38436e8a3ce1071
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:208:50
   9:     0x55d63000717f - std::panicking::default_hook::he2f8f3fae11ed1dd
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:225:9
  10:     0x55d63000717f - std::panicking::rust_panic_with_hook::h79a18548bd90c7d4
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:591:17
  11:     0x55d62feec754 - std::panicking::begin_panic::{{closure}}::hd1fbb432e5b7fa61
  12:     0x55d62fee95dc - std::sys_common::backtrace::__rust_end_short_backtrace::h73000f4d2864a3cf
  13:     0x55d62fe21f6a - std::panicking::begin_panic::h89d0de012afc9e31
  14:     0x55d62ff29cb7 - libradicl::em::em_optimize_subset::h8d61b85ad05da835
  15:     0x55d62fef6849 - libradicl::quant::do_quantify::{{closure}}::h16178b4066ec1cd8
  16:     0x55d62fee9c14 - std::sys_common::backtrace::__rust_begin_short_backtrace::h6f9c3711ffe16ea1
  17:     0x55d62feecbbc - core::ops::function::FnOnce::call_once{{vtable.shim}}::h15826baaec173458
  18:     0x55d63000a79a - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h75c2ca1daad47228
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/alloc/src/boxed.rs:1546:9
  19:     0x55d63000a79a - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hdf9f8afc9d34e311
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/alloc/src/boxed.rs:1546:9
  20:     0x55d63000a79a - std::sys::unix::thread::Thread::new::thread_start::hc238bac7748b195d
                               at /rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/sys/unix/thread.rs:71:17
  21:     0x7fba13604609 - start_thread
  22:     0x7fba133d6293 - clone

workaround:
reducing the duplication rate (e.g., by splitting the FASTQ) allows the command to run to completion.
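
Since the panic fires when the per-cell abundances collapse toward zero (plausible under very high duplication), one possible mitigation is sketched below. This is an assumption for illustration, not the libradicl implementation; the threshold and the uniform fallback are both placeholders:

// Placeholder threshold; the real code panics at libradicl/src/em.rs:218.
const MIN_ALPHA_SUM: f64 = 1e-24;

// Normalize EM abundances in place; if they have collapsed to ~0, spread
// the mass uniformly instead of calling panic!("Alpha Sum too small").
fn normalize_or_uniform(alphas: &mut [f64]) {
    if alphas.is_empty() {
        return;
    }
    let sum: f64 = alphas.iter().sum();
    if sum < MIN_ALPHA_SUM {
        let uniform = 1.0 / alphas.len() as f64;
        for a in alphas.iter_mut() { *a = uniform; }
    } else {
        for a in alphas.iter_mut() { *a /= sum; }
    }
}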

failed to run custom build command for `libz-sys v1.1.5`

Hi,
I am quite new to using cargo, so I am not sure what I am doing wrong. When installing alevin-fry using
cargo install alevin-fry
I got the error message below. Any idea what is going wrong? For your information, I had to install cargo and Rust at a non-default location by setting the respective environment variables RUSTUP_HOME and CARGO_HOME. The CMake version I am running is 3.11.1.
Thanks in advance!
Pieter

--------- ERROR MESSAGE -----------

Compiling csv v1.1.6
error: failed to run custom build command for libz-sys v1.1.5

Caused by:
process didn't exit successfully: /tmp/cargo-installvK6HDe/release/build/libz-sys-2215a56e4c018d2c/build-script-build (exit status: 101)
--- stdout
cargo:rerun-if-env-changed=LIBZ_SYS_STATIC
cargo:rerun-if-changed=build.rs
CMAKE_TOOLCHAIN_FILE_x86_64-unknown-linux-gnu = None
CMAKE_TOOLCHAIN_FILE_x86_64_unknown_linux_gnu = None
HOST_CMAKE_TOOLCHAIN_FILE = None
CMAKE_TOOLCHAIN_FILE = None
CMAKE_GENERATOR_x86_64-unknown-linux-gnu = None
CMAKE_GENERATOR_x86_64_unknown_linux_gnu = None
HOST_CMAKE_GENERATOR = None
CMAKE_GENERATOR = None
CMAKE_PREFIX_PATH_x86_64-unknown-linux-gnu = None
CMAKE_PREFIX_PATH_x86_64_unknown_linux_gnu = None
HOST_CMAKE_PREFIX_PATH = None
CMAKE_PREFIX_PATH = Some("/apps/antwerpen/broadwell/centos7/Python/3.8.3-intel-2020a:/apps/antwerpen/broadwell/centos7/Tk/8.6.10-intel-2020a:/apps/antwerpen/broadwell/centos7/X11/2020a-GCCcore-9.3.0:/apps/antwerpen/broadwell/centos7/baselibs/2020a-GCCcore-9.3.0")
CMAKE_x86_64-unknown-linux-gnu = None
CMAKE_x86_64_unknown_linux_gnu = None
HOST_CMAKE = None
CMAKE = None
running: "cmake" "/data/antwerpen/205/vsc20587/software/cargo/registry/src/github.com-1ecc6299db9ec823/libz-sys-1.1.5/src/zlib-ng" "-DBUILD_SHARED_LIBS=OFF" "-DZLIB_COMPAT=ON" "-DWITH_GZFILEOP=ON" "-DCMAKE_INSTALL_PREFIX=/tmp/cargo-installvK6HDe/release/build/libz-sys-b0b3954e73377225/out" "-DCMAKE_C_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_C_COMPILER=/apps/antwerpen/broadwell/centos7/GCCcore/9.3.0/bin/cc" "-DCMAKE_CXX_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_CXX_COMPILER=/apps/antwerpen/broadwell/centos7/GCCcore/9.3.0/bin/c++" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC -m64" "-DCMAKE_ASM_COMPILER=/apps/antwerpen/broadwell/centos7/GCCcore/9.3.0/bin/cc" "-DCMAKE_BUILD_TYPE=Release"
-- Using CMake version 3.11.1
-- ZLIB_HEADER_VERSION: 1.2.11
-- ZLIBNG_HEADER_VERSION: 2.1.0.devel
-- The C compiler identification is GNU 9.3.0
-- Check for working C compiler: /apps/antwerpen/broadwell/centos7/GCCcore/9.3.0/bin/cc
-- Check for working C compiler: /apps/antwerpen/broadwell/centos7/GCCcore/9.3.0/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Arch detected: 'x86_64'
-- Basearch of 'x86_64' has been detected as: 'x86'
-- Performing Test FNO_LTO_AVAILABLE
-- Performing Test FNO_LTO_AVAILABLE - Success
-- Architecture supports unaligned reads
-- Architecture supports unaligned reads of > 4 bytes
-- Looking for sys/sdt.h
-- Looking for sys/sdt.h - found
-- Looking for unistd.h
-- Looking for unistd.h - found
-- Looking for sys/types.h
-- Looking for sys/types.h - found
-- Looking for stdint.h
-- Looking for stdint.h - found
-- Looking for stddef.h
-- Looking for stddef.h - found
-- Check size of off64_t
-- Check size of off64_t - done
-- Looking for fseeko
-- Looking for fseeko - found
-- Looking for strerror
-- Looking for strerror - found
-- Looking for posix_memalign
-- Looking for posix_memalign - found
-- Performing Test HAVE_NO_INTERPOSITION
-- Performing Test HAVE_NO_INTERPOSITION - Success
-- Performing Test HAVE_ATTRIBUTE_VISIBILITY_HIDDEN
-- Performing Test HAVE_ATTRIBUTE_VISIBILITY_HIDDEN - Success
-- Performing Test HAVE_ATTRIBUTE_VISIBILITY_INTERNAL
-- Performing Test HAVE_ATTRIBUTE_VISIBILITY_INTERNAL - Success
-- Performing Test HAVE_BUILTIN_CTZ
-- Performing Test HAVE_BUILTIN_CTZ - Success
-- Performing Test HAVE_BUILTIN_CTZLL
-- Performing Test HAVE_BUILTIN_CTZLL - Success
-- Performing Test HAVE_PTRDIFF_T
-- Performing Test HAVE_PTRDIFF_T - Success
-- Performing Test HAVE_AVX2_INTRIN
-- Performing Test HAVE_AVX2_INTRIN - Success
-- Performing Test HAVE_AVX512_INTRIN
-- Performing Test HAVE_AVX512_INTRIN - Success
-- Performing Test HAVE_MASK_INTRIN
-- Performing Test HAVE_MASK_INTRIN - Success
-- Performing Test HAVE_AVX512VNNI_INTRIN
-- Performing Test HAVE_AVX512VNNI_INTRIN - Success
-- Performing Test HAVE_SSE41_INTRIN
-- Performing Test HAVE_SSE41_INTRIN - Success
-- Performing Test HAVE_SSE42CRC_INLINE_ASM
-- Performing Test HAVE_SSE42CRC_INLINE_ASM - Success
-- Performing Test HAVE_SSE42CRC_INTRIN
-- Performing Test HAVE_SSE42CRC_INTRIN - Success
-- Performing Test HAVE_SSE42CMPSTR_INTRIN
-- Performing Test HAVE_SSE42CMPSTR_INTRIN - Success
-- Performing Test HAVE_SSE2_INTRIN
-- Performing Test HAVE_SSE2_INTRIN - Success
-- Performing Test HAVE_SSSE3_INTRIN
-- Performing Test HAVE_SSSE3_INTRIN - Success
-- Performing Test HAVE_PCLMULQDQ_INTRIN
-- Performing Test HAVE_PCLMULQDQ_INTRIN - Success
-- Performing Test HAVE_VPCLMULQDQ_INTRIN
-- Performing Test HAVE_VPCLMULQDQ_INTRIN - Failed
-- Architecture-specific source files: arch/x86/x86_features.c;arch/x86/slide_hash_avx2.c;arch/x86/chunkset_avx.c;arch/x86/compare256_avx2.c;arch/x86/adler32_avx2.c;arch/x86/adler32_avx512.c;arch/x86/adler32_avx512_vnni.c;arch/x86/adler32_sse41.c;arch/x86/insert_string_sse42.c;arch/x86/chunkset_sse2.c;arch/x86/compare256_sse2.c;arch/x86/slide_hash_sse2.c;arch/x86/adler32_ssse3.c;arch/x86/crc32_fold_pclmulqdq.c
-- The following features have been enabled:

  • CMAKE_BUILD_TYPE, Build type: Release (selected)
  • AVX2_SLIDEHASH, Support AVX2 optimized slide_hash, using "-mavx2"
  • AVX_CHUNKSET, Support AVX optimized chunkset, using "-mavx2"
  • AVX2_COMPARE256, Support AVX2 optimized compare256, using "-mavx2"
  • AVX2_ADLER32, Support AVX2-accelerated adler32, using "-mavx2"
  • AVX512_ADLER32, Support AVX512-accelerated adler32, using "-mavx512f -mavx512dq -mavx512bw -mavx512vl -mtune=cascadelake"
  • AVX512VNNI_ADLER32, Support AVX512VNNI adler32, using "-mavx512f -mavx512dq -mavx512bw -mavx512vl -mavx512vnni -mtune=cascadelake"
  • SSE4_ADLER32, Support SSE41-accelerated adler32, using "-msse4.1"
  • SSE42_CRC, Support SSE4.2 optimized CRC hash generation, using "-msse4.2"
  • SSSE3_ADLER32, Support SSSE3-accelerated adler32, using "-mssse3"
  • PCLMUL_CRC, Support CRC hash generation using PCLMULQDQ, using "-mssse3 -msse4.2 -mpclmul"
  • WITH_GZFILEOP, Compile with support for gzFile related functions
  • ZLIB_COMPAT, Compile with zlib compatible API
  • ZLIB_ENABLE_TESTS, Build test binaries
  • WITH_SANITIZER, Enable sanitizer support
  • WITH_OPTIM, Build with optimisation
  • WITH_NEW_STRATEGIES, Use new strategies
  • WITH_UNALIGNED, Support unaligned reads on platforms that support it
  • WITH_UNALIGNED64, Support unaligned 64-bit reads on platforms that support it
  • WITH_AVX2, Build with AVX2
  • WITH_AVX512, Build with AVX512
  • WITH_AVX512VNNI, Build with AVX512 VNNI
  • WITH_SSE2, Build with SSE2
  • WITH_SSSE3, Build with SSSE3
  • WITH_SSE41, Build with SSE41
  • WITH_SSE42, Build with SSE42
  • WITH_PCLMULQDQ, Build with PCLMULQDQ

-- The following features have been disabled:

  • ZLIB_SYMBOL_PREFIX, Publicly exported symbols DO NOT have a custom prefix
  • ZLIB_DUAL_LINK, Dual link tests against system zlib
  • WITH_FUZZERS, Build test/fuzz
  • WITH_BENCHMARKS, Build test/benchmarks
  • WITH_NATIVE_INSTRUCTIONS, Instruct the compiler to use the full instruction set on this host (gcc/clang -march=native)
  • WITH_MAINTAINER_WARNINGS, Build with project maintainer warnings
  • WITH_CODE_COVERAGE, Enable code coverage reporting
  • WITH_INFLATE_STRICT, Build with strict inflate distance checking
  • WITH_INFLATE_ALLOW_INVALID_DIST, Build with zero fill for inflate invalid distances
  • WITH_VPCLMULQDQ, Build with VPCLMULQDQ
  • INSTALL_UTILS, Copy minigzip and minideflate during install

-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/cargo-installvK6HDe/release/build/libz-sys-b0b3954e73377225/out/build
running: "cmake" "--build" "." "--target" "install" "--config" "Release" "--parallel" "56"

--- stderr
CMake Warning:
Manually-specified variables were not used by the project:

  CMAKE_ASM_COMPILER
  CMAKE_ASM_FLAGS
  CMAKE_CXX_COMPILER
  CMAKE_CXX_FLAGS

Unknown argument --parallel
Unknown argument 56
Usage: cmake --build <dir> [options] [-- [native-options]]
Options:
  <dir>          = Project binary directory to be built.
  --target <tgt> = Build <tgt> instead of default targets.
                   May only be specified once.
  --config <cfg> = For multi-configuration tools, choose <cfg>.
  --clean-first  = Build target 'clean' first, then build.
                   (To clean only, use --target 'clean'.)
  --use-stderr   = Ignored. Behavior is default in CMake >= 3.0.
  --             = Pass remaining options to the native tool.
thread 'main' panicked at '
command did not execute successfully, got: exit status: 1

build script failed, must exit now', /user/antwerpen/205/vsc20587/data/software/cargo/registry/src/github.com-1ecc6299db9ec823/cmake-0.1.48/src/lib.rs:975:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
warning: build failed, waiting for other jobs to finish...
error: failed to compile alevin-fry v0.5.0, intermediate artifacts can be found at /tmp/cargo-installvK6HDe

Caused by:
build failed

Error in install.packages("BiocManager") : unable to install packages

Hi,
I was trying to run this on an Ubuntu server. Both the directly downloaded image and one converted from Docker gave the same error:

$ singularity exec --cleanenv --bind $AF_SAMPLE_DIR:/workdir --pwd /usefulaf/bash usefulaf_latest.sif ./simpleaf index -f /workdir/human_CR_3.0/fasta/genome.fa -g /workdir/human_CR_3.0/genes/genes.gtf -l 91 -t 16 -o /workdir/human_CR_3.0_splici
ALEVIN_FRY_HOME=/workdir/.afhome
salmon 1.8.0
salmon version 1.8.0 is sufficiently new.
alevin-fry 0.6.0
alevin-fry version 0.6.0 is sufficiently new.
/opt/conda/bin/time command appears to execute a valid GNU time

Extracting the splici reference using command

 Rscript build_splici_ref.R /workdir/human_CR_3.0/fasta/genome.fa /workdir/human_CR_3.0/genes/genes.gtf 91 /workdir/human_CR_3.0_splici/ref/    --filename-prefix splici

Warning in install.packages("BiocManager") :
  'lib = "/opt/conda/lib/R/library"' is not writable
Error in install.packages("BiocManager") : unable to install packages
Execution halted

Feature request - Rhapsody whole genome sequences

Hi, with the help of Rob I managed to get a Rust BD Rhapsody analysis tool to work (https://github.com/stela2502/Rustody).
It supports all versions of BD's beads at the moment, but not the (new) whole genome part.

Rhapsody data consists of a variety of read types: antibody tags encoded in R2, as well as sample tags encoded there too. In a new version they also have TCR and BCR sequencing options. I have coded the targeted sequence matching using 32 bp k-mers from the provided FASTA files (only ~500 FASTA sequences), but I am not very interested in blowing this up to whole-genome analysis.
Hence my question: is there any way you could implement the BD cell ids in alevin-fry? It could be as simple as using my cellids class. Or could I use your library to map all the sequences my tool cannot handle?
The second option would of course be my favorite: I would use only your genome representation and the mapper. It would be absolutely epic if you could help me here!

Thank you!
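
As a purely hypothetical sketch of that second option (none of these traits exist in alevin-fry today, and the offsets below are placeholders, not BD's real bead geometry), one could imagine a pluggable barcode extractor that a chemistry like Rhapsody implements while reusing the host tool's index and mapper:

// Hypothetical plug-in point: the chemistry supplies cell-barcode/UMI
// parsing, while the host supplies the genome representation and mapper.
pub trait BarcodeExtractor {
    /// Parse (cell barcode, UMI) from the technical read; None if unparseable.
    fn extract(&self, read1: &[u8]) -> Option<(Vec<u8>, Vec<u8>)>;
}

struct RhapsodyBeads; // stand-in for Rustody's cell-id logic

impl BarcodeExtractor for RhapsodyBeads {
    fn extract(&self, read1: &[u8]) -> Option<(Vec<u8>, Vec<u8>)> {
        // Placeholder offsets only: real Rhapsody beads use three barcode
        // segments separated by linkers, followed by a UMI.
        if read1.len() < 60 {
            return None;
        }
        Some((read1[0..27].to_vec(), read1[52..60].to_vec()))
    }
}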

Generate splici; error in R script

Heya,

I've been trying to generate the splici index using the R function given in your tutorial (https://combine-lab.github.io/alevin-fry-tutorials/2021/improving-txome-specificity/)

Unfortunately, after displaying "extracting introns using the separate approach", I receive the error "Error in (function (x : attempt to apply a non-function".

It then proceeds to write the outputs, but I want to make sure everything is alright after this. Do you have any guidance?

Thanks a million

Ollie

cargo failed to parse manifest at Cargo.toml

Hi

I tried to install alevin-fry version 0.6.0 with Rust 1.52.1, but the installation fails:

$ cargo install --path .
error: failed to parse manifest at `.../Cargo.toml`

Caused by:
  feature `edition2021` is required

  consider adding `cargo-features = ["edition2021"]` to the manifest

It works if I add cargo-features = ["edition2021"] at the top of the Cargo.toml file.

Refresh documentation for alevin-fry ecosystem

Hi

I've been using alevin-fry and friends to do my first ever mapping of FASTQ files, and I'm really appreciating the speed of the package, and the functionality for handling spliced + unspliced reads very naturally.

However, I'm finding myself frustrated by the documentation. Some observations / questions / frustrations:

  • The tutorials for the alevin-fry package do not refer to various useful additional packages. The only references to alevinQC and usefulaf in the alevin-fry codebase are in the README. (I only discovered them just now, so having followed the tutorial I think I have to rerun my pipeline adding --dumpFeatures to the salmon alevin step.)
  • The tutorials are currently in the form of blog posts, which makes it hard for users to know which is most relevant to them.
  • I think that some of the tutorials are not up-to-date (e.g. referring to a roll-your-own loadFry function, instead of the one implemented in fishpond and / or usefulaf).
  • I would find it useful to have something like a landing page, with a diagram showing all the relevant packages, their functionalities and how they interact.
  • One absolutely essential task is loading the counts matrices into either sce or anndata objects, and this currently feels suboptimal. Both usefulaf and fishpond have quite a bit of other functionality, and usefulaf has a cool but non-obvious name. I wonder if something like an AlevinToSCE package which includes some of the AlevinQC functionality could be an option? [sce import package] could include: create an sce object from expected outputs; perhaps allow the sce creation function to have different options, e.g. knee / all / min bcs (I don't know how that would fit with the rest of the pipeline); allow me to view knee plots (I know that alevin-fry has the -knee option, but I will almost always want to check this myself); extract the profile of the empty droplets.

I'm a pretty experienced user of R packages for single cell analysis, trying out mapping for the first time with alevin-fry. It could well be that I've misunderstood something here, or missed some functionality; in that case, consider this feedback about the documentation :) I've been able to find most things, but I think finding your way around could definitely be made easier.

I appreciate that you have far too many things to do, but if / when you get around to tweaking documentation, please take these points into consideration!

Thanks for all your hard work
Will

Avoid any barcode filtering

Hello,

Is there an option to run alevin-fry generate-permit-list without any barcode filtering?
(Like --keepCBFraction 1 for alevin.)
We would like to run alevin-fry and then run EmptyDrops() afterwards.

As I understand it, the two options are:

  1. Use an --unfiltered-pl whitelist with --min-reads 0
  2. Set --force-cells really high

Thank you!

Best,
Nadja

singularity running error

Dear all,

I have installed alevin-fry and Singularity via conda.

conda install alevin-fry
conda install -c conda-forge singularity
wget -v -O usefulaf.sif https://umd.box.com/shared/static/bcd8io9fbjc321pfgcomues5oe2a12cz

However, when I try to run simpleaf, there are still some errors, like:

singularity exec --cleanenv --bind /gpfs03/home/jingjing/project/test/alevin-fry/output/ usefulaf.sif ./simpleaf index -f human_CR_3.0/fasta/genome.fa -g human_CR_3.0/genes/genes.gtf -l 91 -t 40 -o output/human_CR_3.0_splici

ERROR  : Failed invoking the NEWUSER namespace runtime: Invalid argument
ABORT  : Retval = 255

I am new to Singularity; can anyone give me some suggestions for this error?

Thanks a lot!

Jingjing

Error when running `./simpleaf index`: dependency 'fishpond' is not available

Dear Authors,

Thank you for providing this wonderful tool.

I'm having an issue when running ./simpleaf index, which complains that the dependency 'fishpond' is not available. It looks like the fishpond package was left out of the /usefulaf/R/build_splici_ref.R script.

Could you please help look into this?

Thanks very much!

Command error:                                                               
  Downloading GitHub repo COMBINE-lab/roe@HEAD
  Skipping 11 packages not available: SingleCellExperiment, fishpond, S4Vectors, GenomicRanges, GenomicFeatures, GenomeInfoDb, eisaR, BSgenome, Biostrings, BiocGenerics, Biobase
  ERROR: dependency 'fishpond' is not available for package 'roe'
  * removing '/opt/conda/lib/R/library/roe'
  Warning message:
  In i.p(...) :
    installation of package '/tmp/RtmpaVg9Dn/file1d18af58bf/roe_0.2.1.tar.gz' had non-zero exit status
  Error in library(roe) : there is no package called 'roe'
  Calls: suppressPackageStartupMessages -> withCallingHandlers -> library
  Execution halted

Installation via cargo failing, issue with mimalloc/stdatomic

This very well could be environment specific, but installation of alevin-fry via Cargo on our compute node is failing with the following errors/warnings:

$ cargo install alevin-fry

The following warnings were emitted during compilation:

warning: In file included from c_src/mimalloc/include/mimalloc-types.h:13:0,
warning:                  from c_src/mimalloc/include/mimalloc-internal.h:11,
warning:                  from c_src/mimalloc/src/static.c:17:
warning: c_src/mimalloc/include/mimalloc-atomic.h:34:23: fatal error: stdatomic.h: No such file or directory
warning:  #include <stdatomic.h>
warning:                        ^
warning: compilation terminated.

error: failed to run custom build command for `libmimalloc-sys v0.1.23`

Caused by:
  process didn't exit successfully: `/tmp/cargo-install1XEavd/release/build/libmimalloc-sys-48dc093a1c32b9e6/build-script-build` (exit status: 1)
error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-m64" "-I" "c_src/mimalloc/include" "-I" "c_src/mimalloc/src" "-Wall" "-Wextra" "-ftls-model=initial-exec" "-DMI_DEBUG=0" "-o" "/tmp/cargo-install1XEavd/release/build/libmimalloc-sys-4cb97a8c24518d43/out/c_src/mimalloc/src/static.o" "-c" "c_src/mimalloc/src/static.c" with args "cc" did not execute successfully (status code exit status: 1).

Happy to reach out to our computing folks if it's something system-side that needs to be addressed!

quant not exiting with correct error code

Hi Rob,

I think there is a std::process::exit(1); missing here. I just had a run with this error, but it still exited with error code 0 rather than 1. Version 0.8.0.

alevin-fry quant -t 1 -i sample1 -o quant_fry --tg-map splici_t2g_3col.tsv --resolution cr-like --use-mtx && echo "OK" || echo "FAIL"

...returned this error (because I messed up providing the permit files properly) but also the "OK" echo. Can you double-check? I think this should be an exit code of 1.
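
A minimal sketch of the requested behavior (run() and its error are stand-ins for the real quantification path, not the actual alevin-fry code):

use std::process;

fn run() -> Result<(), String> {
    // ... quantification would happen here ...
    Err("error reading the permit list".to_string()) // stand-in failure
}

fn main() {
    if let Err(e) = run() {
        eprintln!("ERROR: {}", e);
        // Without this, shells and workflow managers see exit code 0, so
        // `cmd && echo OK` prints OK even though quantification failed.
        process::exit(1);
    }
}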

doc on read geometry

Hi,
I cannot find clear help on read_geometry. There are a couple of examples, but they do not seem to work for me and I am struggling to adapt them.
Specifically I have several questions:

  • what does "read_geometry" refer to? I saw that it is usually set to "2[1-end]", but why only read 2, and when would the whole read not be the biological read?
  • what is the inclusion/exclusion convention? Does 1-10 start at the first nucleotide and include the ninth, or the tenth too?
  • How does alevin-fry deal with inexact positions? For instance, in my case the cell tag can start anywhere between positions 85 and 115 because of a variable-length polyA stretch before it.

thanks
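
For what it's worth, a hedged illustration of the interval convention, assuming (as the salmon documentation presents geometry strings such as "1[1-16]" and "2[1-end]") that coordinates are 1-based and inclusive, so "1-10" covers the first ten nucleotides. Note that a fixed interval like this cannot, by itself, describe a barcode whose start position varies:

// Slice a read according to a 1-based, inclusive range; `None` for the end
// coordinate plays the role of "end" in specs like "2[1-end]".
fn geometry_slice(read: &[u8], start: usize, end: Option<usize>) -> &[u8] {
    let s = start - 1; // 1-based -> 0-based
    let e = end.unwrap_or(read.len()).min(read.len());
    &read[s..e] // e is exclusive here, so position `end` itself is included
}

fn main() {
    let read2 = b"ACGTACGTACGTACGTACGT";
    assert_eq!(geometry_slice(read2, 1, None), &read2[..]); // "2[1-end]": whole read
    assert_eq!(geometry_slice(read2, 1, Some(10)).len(), 10); // "1-10": ten bases
    println!("geometry checks passed");
}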
