broadinstitute / cellbender

CellBender is a software package for eliminating technical artifacts from high-throughput single-cell RNA sequencing (scRNA-seq) data.

Home Page: https://cellbender.rtfd.io

License: BSD 3-Clause "New" or "Revised" License

Dockerfile 0.21% Python 96.68% WDL 2.19% Shell 0.43% Jupyter Notebook 0.49%

cellbender's Issues

Error during inference

I'm getting a value error during the inference step.

Traceback (most recent call last):
  File "/home/chang/miniconda3/envs/solo/bin/cellbender", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "/media/chang/HDD-3/chang/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "/media/chang/HDD-3/chang/CellBender/cellbender/remove_background/cli.py", line 92, in run
    main(args)
  File "/media/chang/HDD-3/chang/CellBender/cellbender/remove_background/cli.py", line 185, in main
    run_remove_background(args)
  File "/media/chang/HDD-3/chang/CellBender/cellbender/remove_background/cli.py", line 143, in run_remove_background
    args.low_count_threshold)
  File "/media/chang/HDD-3/chang/CellBender/cellbender/remove_background/data/dataset.py", line 90, in __init__
    gene_blacklist=gene_blacklist)
  File "/media/chang/HDD-3/chang/CellBender/cellbender/remove_background/data/dataset.py", line 205, in _trim_dataset_for_analysis
    get_d_priors_from_dataset(self)  # After gene trimming
  File "/media/chang/HDD-3/chang/CellBender/cellbender/remove_background/data/dataset.py", line 1057, in get_d_priors_from_dataset
    empty_counts = int(np.expm1(empty_log_counts).item())
ValueError: can only convert an array of size 1 to a Python scalar

CellBender affecting count matrix data

Hi,

I was comparing the output of the CellBender (v2 branch) (CB) and CellRanger (CR) pipelines to understand the quality of the cells that are being filtered. I used two single-nucleus RNA-seq datasets.

I performed a simple merge analysis to see whether I get the majority of cells from both pipelines, and to see what the cells (clusters) unique to either approach look like.
However, I observe that the CB and CR outputs don't merge properly. So the CB pipeline seems to be affecting the raw matrix data from CR and introducing some changes in counts (I didn't specifically check that, but that's how the analysis looks).

Cells in approaches after initial filtering:
CB = 8310 (more cells)
CR = 8121

In the above UMAP you can see that the same cells don't overlap in the merge analysis (I didn't do any batch correction for this).
Also, when I look at the clusters, a lot of them are unique to either approach and only a few overlap (the reverse of my expectations).

Finally, I also checked gene count, UMI count and MT gene proportion (pct.mt). (red: CB; green: CR)

A lot of the clusters don't have similar features (because the same cells are not assigned to a particular cluster, and maybe the gene counts are different?).

My question is: is CellBender filtering or otherwise affecting gene counts in addition to predicting cells and debris? If it only affects the number of true cells called, then the cells identified by both approaches (the majority of them high-quality cells) should have similar expression profiles and should merge without any batch correction, as it is the same sample.

feature_type Key Error

@sjfleming Running current v2 branch gives the following error for me:
Traceback (most recent call last):
  File "/wynton/home/ye/mschmitz1/utils/miniconda3/envs/cellbender/bin/cellbender", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "/wynton/home/ye/mschmitz1/utils/CellbenderCurrent/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "/wynton/home/ye/mschmitz1/utils/CellbenderCurrent/CellBender/cellbender/remove_background/cli.py", line 102, in run
    main(args)
  File "/wynton/home/ye/mschmitz1/utils/CellbenderCurrent/CellBender/cellbender/remove_background/cli.py", line 195, in main
    run_remove_background(args)
  File "/wynton/home/ye/mschmitz1/utils/CellbenderCurrent/CellBender/cellbender/remove_background/cli.py", line 165, in run_remove_background
  File "/wynton/home/ye/mschmitz1/utils/CellbenderCurrent/CellBender/cellbender/remove_background/data/dataset.py", line 551, in save_to_output_file
    feature_types=self.data['feature_type'],
KeyError: 'feature_type'

Refactor SingleCellRNACountsDataset

The class SingleCellRNACountsDataset is currently specific to its usage by remove-background. As the suite of CellBender tools develops, this class should be more general, and should be easily called by other tools. Consider refactoring to an abstract class.

remove-background WDL: workflow can fail if some outputs are not created

If there is any error that prevents an output file from being created, for example the output PDF, this is handled gracefully by cellbender. However, it is not handled gracefully by WDL, which only allows "required" outputs. If one of these outputs does not exist, the workflow fails.

Proposed solution:
In the WDL, ensure each file exists before the task ends. This might amount to just a couple of bash

touch filename

commands.
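
The same idea can be sketched in Python for anyone wrapping the task in a script rather than bash. This is only an illustration of the proposal above, not part of the CellBender WDL; the function name ensure_outputs_exist and the file names in the usage note are hypothetical.

```python
from pathlib import Path


def ensure_outputs_exist(paths):
    """Create an empty placeholder for every expected output that is missing.

    Returns the list of paths that had to be created, so the caller can
    log which outputs were not actually produced by the tool.
    """
    created = []
    for path in map(Path, paths):
        if not path.exists():
            # Make sure the parent directory is there before touching the file.
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
            created.append(path)
    return created
```

Calling ensure_outputs_exist(["sample_out.h5", "sample_out.pdf"]) at the end of the task would leave existing files untouched and create zero-byte placeholders for the rest. Downstream steps would then need to treat an empty file as "output missing".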

Error writing output files

Hi,

I tried to run CellBender with the following command:
cellbender remove-background \
     --input raw_feature_bc_matrix \
     --output pool_38-2.h5 \
     --expected-cells 5000 \
     --total-droplets-included 15000 \
     --epochs 300

However, I got the following error in the log file:
Encountered an error writing output to file pool_38-2.h5. Output may be incomplete.
Encountered an error writing output to file pool_38-2_filtered.h5. Output may be incomplete.

No .h5 output is produced, only the log file, the PDF, and a CSV file with cell barcodes, which is not empty.

I have attached log file and output pdf here.
pool_38-2.log
pool_38-2.pdf

Thanks.

Batch size, CUDA out of memory

Hi,

Great package! I am currently using cellbender V2.1. I ran into an issue that appears to be caused by excessive GPU memory allocation.

[....]
cellbender:remove-background: [epoch 198]  average training loss: 1790.0774
cellbender:remove-background: [epoch 199]  average training loss: 1787.5904
cellbender:remove-background: [epoch 200]  average training loss: 1792.2732
cellbender:remove-background: [epoch 200] average test loss: 1773.5361
cellbender:remove-background: Inference procedure complete.
cellbender:remove-background: 2020-08-06 23:06:51
cellbender:remove-background: Preparing to write outputs to file...
cell counts tensor([ 8096.,  6134.,  1805.,  2324.,  5410.,  5546.,  5092.,  1724.,  5301.,
         1329.,  3143.,  5382.,   618.,  3833.,  6279.,  5066.,  2166.,  7982.,
         7920.,  3160.,  3907., 12285.,  3919.,  7285.,  1576.,  2011.,  1805.,
         5842.,  2688.,  8696.,  7202.,  7752.,  6153.,  4572.,  2058.,  7318.,
         3196.,  3786.,  7375.,  2877.,  2555.,  4179.,  1650.,  1776.,  4262.,
         4624.,  5314.,  5727.,  5470.,   693.,  4088.,  2078.,  1429.,  2127.,
         5265.,   649.,  4733.,  9864., 19365.,  7845.,  5621.,   699.,  3006.,
         3918.,  1308.,  6071.,  5948.,  1816.,  7495.,  3055.,  2016., 11080.,
         1845.,  1077., 14801.,  8278.,  2293.,  1718.,  1436.,  7260.,  1655.,
        13636.,  8505.,  1307.,  2211.,  7010.,  4465.,  1496.,  3346.,  8285.,
         1948.,  1978.,  2007.,  1693., 16839.,  6170.,  4675., 12212.,  1955.,
         1499.], device='cuda:0')
Traceback (most recent call last):
  File "path/to/bin/cellbender", line 33, in <module>
    sys.exit(load_entry_point('cellbender', 'console_scripts', 'cellbender')())
  File "path/to/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "path/to/cellbender/remove_background/cli.py", line 103, in run
    main(args)
  File "path/to/cellbender/remove_background/cli.py", line 196, in main
    run_remove_background(args)
  File "path/to/cellbender/remove_background/cli.py", line 166, in run_remove_background
    save_plots=True)
  File "path/to/cellbender/remove_background/data/dataset.py", line 524, in save_to_output_file
    inferred_count_matrix = self.posterior.mean
  File "path/to/cellbender/remove_background/infer.py", line 56, in mean
    self._get_mean()
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 402, in _get_mean
    alpha_est=map_est['alpha'])
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 1005, in _lambda_binary_search_given_fpr
    alpha_est=alpha_est)
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 809, in _calculate_expected_fpr_given_lambda_mult
    alpha_est=alpha_est)
  File "path/to/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "path/to/cellbender/remove_background/infer.py", line 604, in _true_counts_from_params
    .log_prob(noise_count_tensor)
  File "path/to/lib/python3.7/site-packages/torch/distributions/poisson.py", line 63, in log_prob
    return (rate.log() * value) - rate - (value + 1).lgamma()
RuntimeError: CUDA out of memory. Tried to allocate 1016.00 MiB (GPU 0; 3.97 GiB total capacity; 2.48 GiB already allocated; 378.79 MiB free; 2.58 GiB reserved in total by PyTorch)

Would you suggest changing environment settings or adjusting the batch size? Changing "empty-drop-training-fraction" did not solve the issue. Thanks for your thoughts!
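
As a rough illustration of why chunking helps with this kind of error: the failing line evaluates a Poisson log-probability over a large tensor in one shot, and the same quantity can be accumulated over smaller slices so that only one slice is held at a time. The sketch below uses plain Python scalars and the same expression as the torch line in the traceback. It is not CellBender's implementation, the names chunked and total_log_prob are made up for this example, and whether CellBender exposes a setting that controls this step is not stated in the issue.

```python
import math
from itertools import islice


def poisson_log_prob(rate, value):
    # Scalar version of the expression in the traceback:
    # (rate.log() * value) - rate - (value + 1).lgamma()
    return math.log(rate) * value - rate - math.lgamma(value + 1)


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def total_log_prob(rates, values, chunk_size):
    """Sum Poisson log-probabilities, materialising one chunk at a time."""
    total = 0.0
    for pairs in chunked(zip(rates, values), chunk_size):
        total += sum(poisson_log_prob(r, v) for r, v in pairs)
    return total
```

The result does not depend on chunk_size (up to floating-point rounding); only the peak amount of data held at once changes.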

Basic readthedocs

Adding in basic Sphinx to the codebase.

High level of background: misassigned droplets

Dear coders of CellBender,

First of all, thank you for providing such a useful tool! I actually found out that it can easily be run on a free GPU-powered Google Colab instance, which is nice for those who don't have access to a GPU!

While the tool works flawlessly for several of my datasets, some particular 10X runs coming from the same lab show some issues: in the log PDF, a lot of droplets from the empty-droplet plateau are assigned to cells, whereas I am inclined to believe that they are empty.

One particularity of these datasets is that they all share a very high level of background (in the following example, the plateau is around 2000 UMIs!):

The log at the start of the run is the following:

cellbender remove-background --input drive/My Drive/ML10_raw_feature_bc_matrix.h5 --output ML10_150_output.h5 --cuda --expected-cells 21000 --total-droplets-included 50000 --epochs 150
cellbender:remove-background: 2020-02-12 12:44:41
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file drive/My Drive/ML10_raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 1807
cellbender:remove-background: Prior on counts for cells is 14313
cellbender:remove-background: Excluding barcodes with counts below 1445
cellbender:remove-background: Using 21000 probable cell barcodes, plus an additional 29000 barcodes, and 28217 empty droplets.
cellbender:remove-background: Running inference...
...

The prior on counts in empty droplets seems reasonable to me, or should I choose a higher one?

The output log PDF is as follows:

Following the documentation, I decided to rerun the analysis with more z-dims, z-layers and epochs, using the following command:

cellbender remove-background --input drive/My Drive/ML10_raw_feature_bc_matrix.h5 --output ML10_highbckgrd_output.h5 --cuda --expected-cells 21000 --total-droplets-included 50000 --epochs 300 --z-dim 200 --z-layers 1000

But this did not improve anything, and training actually shows weird behaviour, probably because the parameters are too high:

Am I missing something? Is there a parameter that could influence the misassignment?

Filtered counts from remove-background higher than raw input counts

I am seeing counts filtered by remove-background that are actually higher than the raw counts I feed into remove-background. Is this to be expected given the algorithms involved? Is the idea that relative counts are what matter, so increasing counts is okay and maybe even expected?

The overall gene counts per cell tend to go up by factors around 1.5 to 2.5. The unique gene counts per cell mostly decrease, but even some of these increase.

Droplet Time Machine

TODO list for the Droplet Time Machine prototype:

Support for non-H5 inputs

Greetings,

I am very excited to try this approach, but I can't seem to get our data into it. Our data come from the inDrops method. I went to the trouble of passing the data through Seurat/LOOM to generate .h5 files, which unfortunately do not seem to be compatible with CellBender (see the ValueError: blocks must be 2-D error below).

Is there any chance that you could introduce a more generic/accessible format that could be used as CellBender input? Ultimately, we all start with barcodes and genes. A sparse matrix would be convenient, for example.

Alternatively, if you know of a good way to load inDrops data into CellBender, then that would really make my day!

cellbender:remove-background: Command:
cellbender remove-background --input data.KO_Gene_new.cells.h5ad --output output.h5 --cuda --expected-cells 500 --total-droplets-included 1000 --epochs 100
cellbender:remove-background: 2019-10-22 11:59:42
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data.KO_Gene_new.cells.h5ad
cellbender:remove-background: CellRanger v2 format
Traceback (most recent call last):
  File "C:\Users\c\AppData\Local\Continuum\miniconda3\envs\CellBender\Scripts\cellbender-script.py", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "c:\users\c\cellbender\cellbender\base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "c:\users\c\cellbender\cellbender\remove_background\cli.py", line 92, in run
    main(args)
  File "c:\users\c\cellbender\cellbender\remove_background\cli.py", line 185, in main
    run_remove_background(args)
  File "c:\users\c\cellbender\cellbender\remove_background\cli.py", line 143, in run_remove_background
    args.low_count_threshold)
  File "c:\users\c\cellbender\cellbender\remove_background\data\dataset.py", line 82, in __init__
    self._load_data()
  File "c:\users\c\cellbender\cellbender\remove_background\data\dataset.py", line 125, in _load_data
    self.data = get_matrix_from_cellranger_h5(self.input_file)
  File "c:\users\c\cellbender\cellbender\remove_background\data\dataset.py", line 874, in get_matrix_from_cellranger_h5
    count_matrix = sp.vstack(csc_list, format='csc')
  File "C:\Users\c\AppData\Local\Continuum\miniconda3\envs\CellBender\lib\site-packages\scipy\sparse\construct.py", line 499, in vstack
    return bmat([[b] for b in blocks], format=format, dtype=dtype)
  File "C:\Users\c\AppData\Local\Continuum\miniconda3\envs\CellBender\lib\site-packages\scipy\sparse\construct.py", line 548, in bmat
    raise ValueError('blocks must be 2-D')
ValueError: blocks must be 2-D

High percentage of MT genes

Hi,

I am running CellBender on UMI-tagged scRNA-seq data. The problem I am facing is that cells are left with a high percentage of MT genes after running cellbender remove-background. After filtering the cells (post-cellbender) using the criteria nFeature_RNA > 200 & nFeature_RNA < 8000 & percent.mt < 20, I ended up with 25 cells out of my original cell count of 8034.

Here is what the Violin plot looks like

Below are the commands I used:

cellbender remove-background \
     --input "${projDir[i]}/outs/raw_feature_bc_matrix" \
     --output "$plotDir/cellbender_feature_bc_matrix.h5" \
     --expected-cells ${expectedCellNum[i]} \
     --total-droplets-included ${totalDroplet[i]}

Expected cells: 5000
Total Droplets included: 110000 - which is the number of barcodes derived from 2nd plunge in UMI counts (see below)
No. of epochs: 150

The log file output looks a little concerning, as it reports 0 empty droplets on line 11 (see below).

I am not sure whether this is because I fed too much background RNA into the algorithm or because there is a limit to the number of droplets that can be included. But the sample is definitely not supposed to have this high a percentage of MT genes, as shown by the CellRanger analysis done previously.

Using output from STARsolo instead of CellRanger: features.tsv should be renamed to genes.tsv

I'm running CellBender using the output from STARsolo (Solo.out folder) and I ran into the cellbender:remove-background: OSError: Unable to open file error. After some digging, I found that the issue was that STARsolo outputs a features.tsv file by default. This clashes with the automatic V2 V3 detection implemented in CellBender. For future reference, folks using STARsolo instead of CellRanger should rename the features.tsv output to genes.tsv. A quick note about this issue could be helpful in the --input description in the docs.

Cheers
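
For anyone scripting this workaround, a small Python sketch follows. It copies features.tsv to genes.tsv instead of renaming it, so the original STARsolo output stays intact; that choice, and the function name make_genes_tsv, are assumptions of this example and not something CellBender or STARsolo provides.

```python
import shutil
from pathlib import Path


def make_genes_tsv(matrix_dir):
    """Ensure matrix_dir contains genes.tsv, copying it from features.tsv.

    Returns the path to genes.tsv. Raises FileNotFoundError if neither
    file is present in the directory.
    """
    matrix_dir = Path(matrix_dir)
    features = matrix_dir / "features.tsv"
    genes = matrix_dir / "genes.tsv"
    if genes.exists():
        # Nothing to do: an existing genes.tsv is never overwritten.
        return genes
    if not features.exists():
        raise FileNotFoundError(f"no features.tsv or genes.tsv in {matrix_dir}")
    shutil.copyfile(features, genes)
    return genes
```

The directory passed in would be the STARsolo matrix folder that is later given to --input.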

Issue installing/running cellbender: No module named 'cellbender.remove_background'

Dear cellbender team!

I installed cellbender without error in a conda environment following the instructions, but when I try to run any command with "cellbender", I get the error below. I have tried running this on a server with GPUs. Any help would be greatly appreciated!

Traceback (most recent call last):
  File "/home/chanj3/anaconda3/envs/cellbender/bin/cellbender", line 8, in <module>
    sys.exit(main())
  File "/home/chanj3/anaconda3/envs/cellbender/lib/python3.7/site-packages/cellbender/base_cli.py", line 90, in main
    parser = get_populated_argparser()
  File "/home/chanj3/anaconda3/envs/cellbender/lib/python3.7/site-packages/cellbender/base_cli.py", line 76, in get_populated_argparser
    module_argparse = importlib.import_module('.'.join(module_argparse_str_list))
  File "/home/chanj3/anaconda3/envs/cellbender/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 965, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'cellbender.remove_background'

Cellbender V2 always hits NaN loss, crashes

Hi there, I was super pumped to try out version 2, so I pulled that branch. Unfortunately, when I run

cellbender remove-background --input ./spliced/ --output s_cellbended_ambient_200_1000_1000e_V2/s_cellbended.h5 --cuda --expected-cells 1598 --total-droplets-included 11598 --epochs 1000 --z-dim 200 --z-layers 1000 --learning-rate .001 --model ambient

it always crashes after <100 epochs, reporting a NaN training loss. Any idea why? I thought it might be helpful to report, anything to get v2 running sooner!

cellbender:remove-background: [epoch 034] average training loss: 1529.0032
/opt/conda/conda-bld/pytorch_1579022060824/work/aten/src/ATen/native/cuda/Distributions.cu:290: lambda [](int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto::operator()(int, float &, float &, float &, float &, const float &, const float &, const float &, const float &)->auto: block: [0,0,0], thread: [96,0,0] Assertion 0 <= p4 && p4 <= 1 failed.

and

/utils/newminiconda3/envs/cellbenderV2/lib/python3.7/site-packages/pyro/infer/traceenum_elbo.py:419: UserWarning: Encountered NaN: loss
log_p(c=2000 | full) = -23.890047073364258
log_p(c=2000 | empty) = -34.5873908996582
cell log_sum.mean() is 8.426836013793945
~cell log_sum.mean() is 5.024422645568848
cell log_nnz.mean() is 7.62549352645874
~cell log_nnz.mean() is 4.873218536376953
cell cosine_overlap.mean() is 0.8181769251823425
~cell cosine_overlap.mean() is 0.4179271459579468
x.mean() is 2.4015369490371086e-05
x.std() is 0.0004832973063457757

RuntimeError: CUDA error: device-side assert triggered
Trace Shapes:
 Param Sites:
  encoder_z$$$linears.0.weight 1000 41640
  encoder_z$$$linears.0.bias 1000
  encoder_z$$$loc_out.weight 200 1000
  encoder_z$$$loc_out.bias 200
  encoder_z$$$sig_out.weight 200 1000
  encoder_z$$$sig_out.bias 200
  encoder_other$$$linears.0.weight 50 41643
  encoder_other$$$linears.0.bias 50
  encoder_other$$$linears.1.weight 10 50
  encoder_other$$$linears.1.bias 10
  encoder_other$$$output.weight 4 10
  encoder_other$$$output.bias 4
  d_cell_scale
  alpha0_scale
  d_empty_loc
  d_empty_scale
  chi_ambient 41640
 Sample Sites:
  data dist |
    value 500 |
  d_empty dist 500 |
    value 500 |
  p_passback dist 500 |
    value 500 |
  y dist 500 |
    value 500 |

Make a v0.1 Docker image + update Terra workflow

cannot convert float NaN to integer

To whom it may concern,

I ran cellbender on a scRNA-Seq+CRISPR dataset derived from iPSCs.
I used the cellranger output in the folder filtered_feature_bc_matrix as an input for cellbender.
The cellranger output includes both Gene Expression and CRISPR Guide Capture; so, the features.tsv file looks as follows:

...
ENSG00000275063	AC233755.1	Gene Expression
ENSG00000271254	AC240274.1	Gene Expression
ENSG00000277475	AC213203.1	Gene Expression
ENSG00000268674	FAM231C	Gene Expression
TREM2-1	TREM2-1	CRISPR Guide Capture
TREM2-2	TREM2-2	CRISPR Guide Capture
TREM2-3	TREM2-3	CRISPR Guide Capture
NEG_CTRL-1	NEG_CTRL-1	CRISPR Guide Capture
NEG_CTRL-2	NEG_CTRL-2	CRISPR Guide Capture
NEG_CTRL-3	NEG_CTRL-3	CRISPR Guide Capture

I renamed features.tsv as gene.tsv, to maintain the format reported in the documentation:

cellbender doc

I then ran the command:

cellbender remove-background \
     --input ./filtered_feature_bc_matrix \
     --output ./erica_ipcs.h5

This led to the output:

(CellBender) MacBook-Pro-4:Miseq_10x_iPSC_082019 daniele$ cat out.run_cellbender 
cellbender:remove-background: Command:
cellbender remove-background --input ./filtered_feature_bc_matrix --output ./erica_ipcs.h5
cellbender:remove-background: 2020-06-25 12:31:58
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from directory ./filtered_feature_bc_matrix
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
/Users/daniele/anaconda3/envs/CellBender/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning: Mean of empty slice.
  out=out, **kwargs)
/Users/daniele/anaconda3/envs/CellBender/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Traceback (most recent call last):
  File "/Users/daniele/anaconda3/envs/CellBender/bin/cellbender", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/cli.py", line 92, in run
    main(args)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/cli.py", line 185, in main
    run_remove_background(args)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/cli.py", line 143, in run_remove_background
    args.low_count_threshold)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/data/dataset.py", line 90, in __init__
    gene_blacklist=gene_blacklist)
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/data/dataset.py", line 205, in _trim_dataset_for_analysis
    get_d_priors_from_dataset(self)  # After gene trimming
  File "/Users/daniele/Desktop/PARTS/Test_Software/CellBender/cellbender/remove_background/data/dataset.py", line 1064, in get_d_priors_from_dataset
    cell_counts = int(np.expm1(cell_log_counts).item())
ValueError: cannot convert float NaN to integer

Could you please help me understand what the problem is?

Thank you for your attention.

With best wishes,

Daniele Muraro
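
The final ValueError can be reproduced without CellBender: a NaN that reaches int() raises exactly this message. The sketch below is only an illustration. It assumes, based on the "Mean of empty slice" warning in the log, that the NaN comes from averaging an empty selection; how CellBender actually selects barcodes for the prior is not shown in the traceback, and counts_prior_from_log is a hypothetical name.

```python
import math


def counts_prior_from_log(log_counts):
    """Turn a list of log1p-scale counts into an integer count estimate.

    An empty selection is reported explicitly instead of letting a NaN
    propagate to int(), which would raise a less informative ValueError.
    """
    if not log_counts:
        raise ValueError("no barcodes selected; cannot estimate a count prior")
    mean_log = sum(log_counts) / len(log_counts)
    return int(math.expm1(mean_log))


def convert_nan():
    """Reproduce the error message from the traceback above."""
    try:
        int(math.expm1(float("nan")))
    except ValueError as err:
        return str(err)
    return None
```

Here convert_nan() returns the text of the ValueError raised by int() on NaN, while counts_prior_from_log([]) fails earlier with a message that points at the empty selection.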

Fails with "not enough empty droplets" error even though --low-count-threshold 2

CellBender fails on my data with "not enough empty droplets" error even though I set --low-count-threshold to 2. The error & log itself are a bit weird (look at the scale of nUMI filtered). I get this exact error (with different UMI numbers) on 2 very different datasets.

cellbender:remove-background: 2019-11-06 00:23:12
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from directory ./rawdata/filtered_feature_bc_matrix/folder
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 6001
cellbender:remove-background: Prior on counts for cells is 49252
cellbender:remove-background: Excluding barcodes with counts below 4800
Traceback (most recent call last):
  File "/path/to/software/miniconda3/envs/mypyro/bin/cellbender", line 11, in <module>
    load_entry_point('cellbender', 'console_scripts', 'cellbender')()
  File "~/CellBender/cellbender/base_cli.py", line 101, in main
    cli_dict[args.tool].run(args)
  File "~/CellBender/cellbender/remove_background/cli.py", line 92, in run
    main(args)
  File "~/CellBender/cellbender/remove_background/cli.py", line 185, in main
    run_remove_background(args)
  File "~/CellBender/cellbender/remove_background/cli.py", line 143, in run_remove_background
    args.low_count_threshold)
  File "~/CellBender/cellbender/remove_background/data/dataset.py", line 90, in __init__
    gene_blacklist=gene_blacklist)
  File "~/CellBender/cellbender/remove_background/data/dataset.py", line 252, in _trim_dataset_for_analysis
    f"There are no empty droplets with UMI counts over the lower " \
AssertionError: There are no empty droplets with UMI counts over the lower cutoff of 4800.  Some empty droplets are necessary for the analysis.Reduce the --low-count-threshold parameter.

ERROR reading h5 output in R

Really excited to see the results of this program! Thanks for the awesome work!

I would like to work in R, ideally within the SimpleSingleCell-verse. I can successfully read the output files with read10xCounts() [DropletUtils], but the count matrix is a "DelayedMatrix" as opposed to the usual "dgCMatrix" and causes downstream errors (not to mention I cannot view the matrix).

When I follow the code to import into Seurat I get the following error:

Error in H5File.open(filename, mode, file_create_pl, file_access_pl) :
HDF5-API Errors:
error #000: ../../src/hdf5-1.10.0-1/src/H5F.c in H5Fopen(): line 579: unable to open file
class: HDF5
major: File accessibilty
minor: Unable to open file

error #001: ../../src/hdf5-1.10.0-1/src/H5Fint.c in H5F_open(): line 1168: unable to lock the file or initialize file structure
    class: HDF5
    major: File accessibilty
    minor: Unable to open file

error #002: ../../src/hdf5-1.10.0-1/src/H5FD.c in H5FD_lock(): line 1821: driver lock request failed
    class: HDF5
    major: Virtual File Layer
    minor: Can't update object

error #003: ../../src/hdf5-1.10.0-1/src/H5FDsec2.c in H5FD_sec2_lock(): line 939: unable to flock file, errno = 13, error message = 'Permission denied'
    class: HDF5
    major: File accessibilty
    minor: Bad file ID accessed

Any help is appreciated!

Add LICENSE

Training doesn't converge with 300 epochs

Hello,

I've been using CellBender for my datasets, and I noticed that the training loss for one of my datasets exhibits weird behavior. Should I try training more? I noticed the cell calls look clean...
I've attached the plots. The params are 300 epochs, 1000 layer dim, and 300 latent dim, with 10000 expected cells and 40000 total.

Thanks
avm049_2_out.pdf

The main CLI must show args help when run without any commands

10x H5 Format Error

On the most recent commit "c051c44" on v2, the output .h5 files can't be read by scanpy. It seems to hit a KeyError on 'genome'?

`KeyError Traceback (most recent call last)
~/utils/miniconda3/envs/scanpy/lib/python3.7/site-packages/scanpy/readwrite.py in _read_v3_10x_h5(filename, start)
253 feature_types=dsets['feature_type'].astype(str),
--> 254 genome=dsets['genome'].astype(str),
255 ),

KeyError: 'genome'

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
in

~/utils/miniconda3/envs/scanpy/lib/python3.7/site-packages/scanpy/readwrite.py in read_10x_h5(filename, genome, gex_only)
159 v3 = '/matrix' in f
160 if v3:
--> 161 adata = _read_v3_10x_h5(filename, start=start)
162 if genome:
163 if genome not in adata.var['genome'].values:

~/utils/miniconda3/envs/scanpy/lib/python3.7/site-packages/scanpy/readwrite.py in _read_v3_10x_h5(filename, start)
258 return adata
259 except KeyError:
--> 260 raise Exception('File is missing one or more required datasets.')
261
262

Exception: File is missing one or more required datasets.`

please exit with non zero value on error

Hello,

Currently (commit a348255), cellbender exits with 0 when it is unable to open the input file.
See:

bigmess:CellBender/a348255 > cellbender remove-background --input nope --output ./test.h5 --expected-cells 500 --total-droplets-included 5000
cellbender:remove-background: Command:
cellbender remove-background --input nope --output ./test.h5 --expected-cells 500 --total-droplets-included 5000
cellbender:remove-background: 2019-11-21 10:37:56
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file nope
cellbender:remove-background: OSError: Unable to open file nope.
bigmess:CellBender/a348255 > echo $?
0

This will break some of our pipelines, and it breaks my test suite ;-)

regards
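
A minimal sketch of the requested behaviour, assuming a console entry point that wraps the tool in a main() function: the error is still logged, but the exit status becomes non-zero so shell pipelines and test suites can detect the failure. The functions run and main below are hypothetical stand-ins, not CellBender's actual CLI code.

```python
import sys


def run(input_path):
    """Hypothetical tool body: raises OSError if the input cannot be opened."""
    with open(input_path, "rb"):
        pass


def main(argv):
    """Return a process exit status: 0 on success, non-zero on failure."""
    if not argv:
        print("usage: tool INPUT", file=sys.stderr)
        return 2
    try:
        run(argv[0])
    except OSError as err:
        # Report the problem and signal failure to the calling shell.
        print(f"OSError: Unable to open file {argv[0]}. ({err})", file=sys.stderr)
        return 1
    return 0


# A console entry point would finish with:
#     sys.exit(main(sys.argv[1:]))
```

With this pattern, echo $? after a failed run prints 1 rather than 0.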

FEATURE REQUEST: compatibility with cellranger aggr

It would be a nice feature to maintain compatibility with cellRanger aggr. For example, I have run CellBender on several scRNA-seq libraries and would like to use the cellRanger aggr function to downsample libraries to similar sizes.

With version 3 of cellRanger I get the following error:
[error] The molecule info HDF5 file (/pasteur/projets/policy01/evo_immuno_pop/Mary/scRNAseq_MixedLane/cellBender/L1_defaults.h5) was produced by an older version of Cell Ranger. Reading these files is unsupported.

With version 2 of cellRanger I get the following error:
AttributeError: Illegal column: ambient_expression

CellRanger-like .mtx output of the background-corrected count matrix

... for easy ingestion by downstream analysis tools. The CSC representation in the H5 file is not accessible enough for typical users.
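
To make the request concrete, here is a rough sketch of how the three CSC arrays (data, indices, indptr) map onto a Matrix Market coordinate file, which is the plain-text format behind .mtx files. It uses only the standard library. The function name write_mtx_from_csc is made up for this example, and which axis holds genes versus barcodes is left to the caller, since the issue does not specify the layout of the H5 file.

```python
import io


def write_mtx_from_csc(data, indices, indptr, shape, handle):
    """Write a CSC matrix of integer counts as Matrix Market coordinate text.

    data    -- non-zero values, column by column
    indices -- 0-based row index of each value
    indptr  -- indptr[j]:indptr[j + 1] is the slice of column j
    shape   -- (n_rows, n_cols)
    handle  -- any text file-like object with a write() method
    """
    n_rows, n_cols = shape
    handle.write("%%MatrixMarket matrix coordinate integer general\n")
    handle.write(f"{n_rows} {n_cols} {len(data)}\n")
    for col in range(n_cols):
        for k in range(indptr[col], indptr[col + 1]):
            # Matrix Market indices are 1-based.
            handle.write(f"{indices[k] + 1} {col + 1} {data[k]}\n")


def demo():
    """Serialise the 3x2 matrix [[1, 0], [0, 2], [3, 0]] to a string."""
    buffer = io.StringIO()
    write_mtx_from_csc([1, 3, 2], [0, 2, 1], [0, 2, 3], (3, 2), buffer)
    return buffer.getvalue()
```

A full CellRanger-style output directory would also need the accompanying barcode and feature tables, which this sketch does not cover.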

Cannot import cycler

Hi,

I'm trying to run a cellbender job for the first time on a server and I'm getting the following error:

from cycler import Cycler, cycler as ccycler
ImportError: cannot import name 'Cycler' from 'cycler' (/global/home/users/aralbright/.local/lib/python3.7/site-packages/cycler.py)

I ran cellbender -h, as suggested in the other GitHub issue with a similar error, and I get the same error. I'm using a conda environment on the server, and I followed the manual installation instructions in the docs. I'm not sure how to diagnose the issue; could someone help me out?

Thanks!

Ashley

Can't read output h5 file

Hi there,
I am very excited to try cellbender, but I haven't gotten very far. I ran it on a small study with 18 samples (10x 3' v3 chemistry); the cellbender run goes smoothly, and the log and PDF look good. However, I am unable to read the h5 output file and get the following error:

h5ls("cellbender_out_filtered.h5")
Error in H5Fopen(file, "H5F_ACC_RDONLY", native = native) : 
  HDF5. File accessibilty. Unable to open file.

What is the problem there? I am able to read other files fine (for instance, the default raw_feature_bc_matrix.h5 that was the input to cellbender).

I am on Ubuntu 18.04
Linux DESKTOP-01 5.4.0-47-generic #51~18.04.1-Ubuntu

and this is the listing of my CellBender environment:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
alabaster                 0.7.12                   pypi_0    pypi
babel                     2.8.0                    pypi_0    pypi
blas                      1.0                         mkl    anaconda
blosc                     1.20.0               hd408876_0    anaconda
bzip2                     1.0.8                h7b6447c_0    anaconda
ca-certificates           2020.7.22                     0    anaconda
cellbender                0.1                       dev_0    <develop>
certifi                   2020.6.20                py37_0    anaconda
chardet                   3.0.4                    pypi_0    pypi
cudatoolkit               10.2.89              hfd86e86_1  
cycler                    0.10.0                   pypi_0    pypi
docutils                  0.16                     pypi_0    pypi
freetype                  2.10.2               h5ab3b9f_0  
future                    0.18.2                   pypi_0    pypi
hdf5                      1.10.4               hb1b8bf9_0    anaconda
idna                      2.10                     pypi_0    pypi
imagesize                 1.2.0                    pypi_0    pypi
intel-openmp              2020.2                      254    anaconda
jinja2                    2.11.2                   pypi_0    pypi
joblib                    0.16.0                   pypi_0    pypi
jpeg                      9b                   h024ee3a_2  
kiwisolver                1.2.0                    pypi_0    pypi
lcms2                     2.11                 h396b838_0  
ld_impl_linux-64          2.33.1               h53a641e_7  
libedit                   3.1.20191231         h14c3975_1  
libffi                    3.3                  he6710b0_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libpng                    1.6.37               hbc83047_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                h2733197_1  
lz4-c                     1.9.2                he6710b0_1    anaconda
lzo                       2.10                 h7b6447c_2    anaconda
markupsafe                1.1.1                    pypi_0    pypi
matplotlib                3.3.1                    pypi_0    pypi
mkl                       2020.2                      256    anaconda
mkl-service               2.3.0            py37he904b0f_0  
mkl_fft                   1.1.0            py37h23d657b_0  
mkl_random                1.1.1            py37h0573a6f_0    anaconda
mock                      4.0.2                      py_0    anaconda
ncurses                   6.2                  he6710b0_1  
ninja                     1.10.1           py37hfd86e86_0  
numexpr                   2.7.1            py37h423224d_0  
numpy                     1.19.1           py37hbc911f0_0  
numpy-base                1.19.1           py37hfa32c7d_0  
olefile                   0.46                     py37_0  
openssl                   1.1.1g               h7b6447c_0    anaconda
opt-einsum                3.3.0                    pypi_0    pypi
packaging                 20.4                     pypi_0    pypi
pandas                    1.1.2                    pypi_0    pypi
pillow                    7.2.0            py37hb39fc2d_0  
pip                       20.2.2                   py37_0  
pygments                  2.7.0                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyro-api                  0.1.2                    pypi_0    pypi
pyro-ppl                  1.4.0                    pypi_0    pypi
pytables                  3.6.1            py37h71ec239_0    anaconda
python                    3.7.9                h7579374_0  
python-dateutil           2.8.1                    pypi_0    pypi
pytorch                   1.6.0           py3.7_cuda10.2.89_cudnn7.6.5_0    pytorch
pytz                      2020.1                   pypi_0    pypi
readline                  8.0                  h7b6447c_0  
requests                  2.24.0                   pypi_0    pypi
scikit-learn              0.23.2                   pypi_0    pypi
scipy                     1.5.2                    pypi_0    pypi
setuptools                49.6.0                   py37_0  
six                       1.15.0                     py_0    anaconda
snappy                    1.1.8                he6710b0_0    anaconda
snowballstemmer           2.0.0                    pypi_0    pypi
sphinx                    3.2.1                    pypi_0    pypi
sphinx-argparse           0.2.5                    pypi_0    pypi
sphinx-autodoc-typehints  1.11.0                   pypi_0    pypi
sphinx-rtd-theme          0.5.0                    pypi_0    pypi
sphinxcontrib-applehelp   1.0.2                    pypi_0    pypi
sphinxcontrib-devhelp     1.0.2                    pypi_0    pypi
sphinxcontrib-htmlhelp    1.0.3                    pypi_0    pypi
sphinxcontrib-jsmath      1.0.1                    pypi_0    pypi
sphinxcontrib-programoutput 0.16                     pypi_0    pypi
sphinxcontrib-qthelp      1.0.3                    pypi_0    pypi
sphinxcontrib-serializinghtml 1.1.4                    pypi_0    pypi
sqlite                    3.33.0               h62c20be_0  
threadpoolctl             2.1.0                    pypi_0    pypi
tk                        8.6.10               hbc83047_0  
torchvision               0.7.0                py37_cu102    pytorch
tqdm                      4.49.0                   pypi_0    pypi
urllib3                   1.25.10                  pypi_0    pypi
wheel                     0.35.1                     py_0  
xz                        5.2.5                h7b6447c_0  
zlib                      1.2.11               h7b6447c_3  
zstd                      1.4.5                h9ceee32_0    anaconda
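
A quick first check for an "Unable to open file" error is whether the file carries the HDF5 format signature at all. The sketch below looks for the 8-byte signature at byte 0 and at offsets 512, 1024, 2048, and so on, where the format allows it. `has_hdf5_signature` is a hypothetical helper; a missing signature suggests the file is not HDF5 or was not written completely, while a present signature does not guarantee that the rest of the file is intact.

```python
import os

# The 8-byte signature that starts every HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path: str) -> bool:
    """Return True if the HDF5 signature is found at an allowed offset."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        offset = 0
        while offset + len(HDF5_SIGNATURE) <= size:
            handle.seek(offset)
            if handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Allowed offsets are 0, 512, 1024, 2048, ...
            offset = 512 if offset == 0 else offset * 2
    return False
```

Running this on both the readable input file and the unreadable output file would show whether the output is recognizable as HDF5 at all.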

Strange results on 10X hgmm10k_v3 dataset

Hi,

While CellBender works as expected on 10X hgmm12k (v2), on 10X hgmm10k (v3), it strangely removes large mouse gene counts and adds large human gene counts to mouse cells. 10X hgmm5k (v3) gives similar unexpected results as hgmm10k (v3). Please see logs and plots (hgmm12k and hgmm10k only) below:

hgmm12k, v2

Log:

cellbender:remove-background: Command:
cellbender remove-background --input data/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5 --output data/cellbender/hgmm_12k_raw_gene_bc_matrices_h5.cellbender.h5 --expected-cells 12000 --total-droplets-included 22000 --epochs 150 --cuda
cellbender:remove-background: 2020-01-29 12:36:14
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data/hgmm_12k/hgmm_12k_raw_gene_bc_matrices_h5.h5
cellbender:remove-background: CellRanger v2 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 199
cellbender:remove-background: Prior on counts for cells is 13864
cellbender:remove-background: Excluding barcodes with counts below 159
cellbender:remove-background: Using 12000 probable cell barcodes, plus an additional 10000 barcodes, and 48062 empty droplets.

Elbow plot, vertical line marks --expected-cells and --total-droplets-included:
Before correction (called cells):
After correction (called cells):
Convergence:

hgmm10k, v3

Log:

cellbender:remove-background: Command:
cellbender remove-background --input data/hgmm_10k/hgmm_10k_v3_raw_feature_bc_matrix.h5 --output data/cellbender/hgmm_10k_v3_raw_feature_bc_matrix.cellbender.h5 --expected-cells 10000 --total-droplets-included 20000 --epochs 150 --cuda
cellbender:remove-background: 2020-01-29 09:31:14
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from file data/hgmm_10k/hgmm_10k_v3_raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Trimming dataset for inference.
cellbender:remove-background: Prior on counts in empty droplets is 444
cellbender:remove-background: Prior on counts for cells is 19036
cellbender:remove-background: Excluding barcodes with counts below 355
cellbender:remove-background: Using 10000 probable cell barcodes, plus an additional 10000 barcodes, and 56957 empty droplets.

Elbow plot, vertical line marks --expected-cells and --total-droplets-included:
Before correction (called cells):
After correction (called cells):
Convergence:

understanding Cellbender output

Hi Stephen,
I was just wondering, is it possible to have a detailed description of what the output.h5 file contains in the various groups? I cannot seem to find it in the readthedocs.

For example, there is a group called contamination_fraction, which has 2 sets of values. What do those mean? Are they the lower and upper limits for the contamination fraction in the whole dataset?

The ambient expression group is a list the same length as the number of genes considered in the dataset, but it has only 1 column. So is that the fraction of contamination, or the ambient expression across all cells, per gene?

How can I best evaluate how much ambient RNA or contamination was identified in my sample?

Many thanks
Devika
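
One rough way to gauge how much signal was removed, independent of the fields in the output file, is to compare total counts per gene before and after correction over the same set of barcodes. The sketch below works on plain dictionaries of per-gene totals. The function names are hypothetical, and this is not an official CellBender metric.

```python
from typing import Dict


def fraction_removed(raw: Dict[str, int], corrected: Dict[str, int]) -> Dict[str, float]:
    """Per-gene fraction of counts removed, from per-gene totals."""
    result = {}
    for gene, raw_total in raw.items():
        if raw_total == 0:
            continue  # nothing to remove for this gene
        result[gene] = 1.0 - corrected.get(gene, 0) / raw_total
    return result


def overall_fraction_removed(raw: Dict[str, int], corrected: Dict[str, int]) -> float:
    """Fraction of all counts removed across genes."""
    raw_sum = sum(raw.values())
    if raw_sum == 0:
        return 0.0
    return 1.0 - sum(corrected.get(gene, 0) for gene in raw) / raw_sum
```

Genes with a high removed fraction are the ones most affected by the background correction in that sample.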

Save gene Ensembl IDs in output h5 file

The new version of Scanpy will not load the current output of cellbender remove-background due to the lack of a 'genes' field in the h5 file. That field should be included anyway, as it is typically used to contain the Ensembl ID of each gene.

Tiny pbmc4k example for remove-background

We need to generate a basic example for running pbmc4k. This test dataset must be small enough to finish running on CPU in a matter of minutes.

Distribute through Conda?

Wondering if authors have considered distributing CellBender through Conda as well? That'd be quite handy. Thanks!

Remove JIT option

For v0.1, remove the JIT option and run without JIT compiler. It is not currently providing the desired speedup.

Refactor requirements from setup.py to a separate REQUIREMENTS file

Tutorial addition: loading output in scanpy and Seurat

Add a section to the tiny pbmc4k dataset tutorial about how to load the .h5 output of remove-background into scanpy and Seurat for downstream analyses.

Data simulation tool

Refactor data simulation into an improved tool that generates simulated data based on a real dataset.

QUESTION: Future of CellBender?

Hi there,
Love the model and the interface of cellbender! Works great as a step two in my kallisto pipeline! I've been using V2 for a while and I'm hoping to publish a scRNAseq study in the near future, though there are still issues with NaNs during training or no cells detected on a few samples that I've been working around. I'm wondering how much you expect to change in v2 or V2.1+ in the near future, approximately when that might be, and maybe what a roadmap for the future of the project is!
Thanks so much for making such a nice tool for the community!
Matthew

Low-dimensional representation?

Hi there,

Is there a way to retrieve the low-dimensional representation, z, from the output of remove-background?

Thank you!

Low quality cells being filtered by cellbender

Hi,

I am comparing the cellbender approach with cellranger and another method, Diem (https://doi.org/10.1101/786285), as the latter also claims to work better for single-nuclei data than Cellranger and EmptyDrops.

I used it on two datasets to understand the output. I filtered out low-quality cells (those expressing fewer than 200 genes, and those with more than 2% MT gene fraction). I performed batch correction using FastMNN to compare the filtered cells from the three approaches. Briefly, I see that cellbender (CB) calls more cells than Cellranger (CR) and Diem (D), with the latter calling the fewest. Then I clustered the cells to see the identity of the extra cells called by CB. In both of the datasets that I analyzed, the extra cells called by CB and CR fall into one or two clusters, rather than being distributed across the other clusters.

I then analyzed the UMI distribution, the number of genes detected, and the MT gene %. The extra cluster called in CB and CR shows a higher MT % and lower UMI/gene counts in the CR data. The same cluster in CB has a better UMI/gene distribution but still a relatively high density of cells with higher MT %.

Calling differentially expressed genes for the extra cluster showed that it expresses ribosomal, tubulin, and mitochondrial genes. As I see similar cells with the CR approach, the only significant gain I see from the approach is background subtraction.

Could you suggest what I could be doing wrong?
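
For reference, the QC filter described above can be sketched as follows, so that the same thresholds are applied to the output of each method. The per-cell dictionary layout, the function name, and the "MT-" prefix for mitochondrial genes (human gene symbols) are assumptions made for illustration.

```python
from typing import Dict


def passes_qc(cell_counts: Dict[str, int], min_genes: int = 200,
              max_mt_fraction: float = 0.02, mt_prefix: str = "MT-") -> bool:
    """Keep a cell with at least `min_genes` detected genes and a low MT fraction."""
    n_genes = sum(1 for count in cell_counts.values() if count > 0)
    total = sum(cell_counts.values())
    if total == 0:
        return False  # a barcode with no counts cannot pass QC
    mt_counts = sum(count for gene, count in cell_counts.items()
                    if gene.startswith(mt_prefix))
    return n_genes >= min_genes and mt_counts / total <= max_mt_fraction
```

Applying one function to the CB, CR, and D outputs makes it easier to tell whether the extra cluster survives because of cell calling or because of the QC thresholds.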

Failed reading 10x h5 file - CellBender v2.1

Last week I ran CellBender v1 over a CellRanger V3 library I got, and then successfully loaded the h5 output into a Seurat object.

CreateSeuratObject(Read10X_h5("the_cell_bender_outout_filtered.h5"))

Now, when I tried using CellBender V2.1 instead of V1 over the same CellRanger library, I got the following exception:

Error in `[[.H5File`(infile, paste0(genome, "/", feature_slot)) :
  An object with name matrix/gene_names does not exist in this group

Comparing the output h5 files of both versions, they seem to have the same format, with the difference that the V2.1 output has the "PYTABLES_FORMAT_VERSION" attribute. Is it flagged like that on purpose? This is the flag that causes the Seurat Read10X_h5 function to fail (part of the Read10X_h5 code):

  infile <- hdf5r::H5File$new(filename = filename, mode = "r")
  genomes <- names(x = infile)
  output <- list()
  if (!infile$attr_exists("PYTABLES_FORMAT_VERSION")) {
    if (use.names) {
      feature_slot <- "features/name"
    }
    else {
      feature_slot <- "features/id"
    }
  }
  else {
    if (use.names) {
      feature_slot <- "gene_names"
    }
    else {
      feature_slot <- "genes"
    }
  }
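
To make the failure mode concrete, the branch in the excerpt above can be mirrored in a few lines of Python. This is only a restatement of the quoted logic, not Seurat's code. `feature_slot` and `missing_dataset` are hypothetical names, and the set of datasets in the usage note is an assumed example.

```python
from typing import Optional, Set


def feature_slot(file_attrs: Set[str], use_names: bool = True) -> str:
    """Pick the feature dataset name the way the quoted excerpt does."""
    if "PYTABLES_FORMAT_VERSION" not in file_attrs:
        return "features/name" if use_names else "features/id"
    return "gene_names" if use_names else "genes"


def missing_dataset(genome: str, file_attrs: Set[str], datasets: Set[str],
                    use_names: bool = True) -> Optional[str]:
    """Return the path the reader would request if it is absent, else None."""
    path = f"{genome}/{feature_slot(file_attrs, use_names)}"
    return None if path in datasets else path
```

If a file has the PYTABLES_FORMAT_VERSION attribute but stores its features under `matrix/features/name`, `missing_dataset("matrix", {"PYTABLES_FORMAT_VERSION"}, {"matrix/features/name"})` returns `matrix/gene_names`, which matches the path named in the error message.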

QC Failure advice?

I noticed in the docs that a plot that looks like a QC failure would be problematic for empty droplet identification. I was wondering if you have any tips for getting the most out of these datasets? I have a decent number of datasets that look like QC failures, and I can't regenerate them since the samples are very rare.

HDF5 file output is not compatible with cellranger aggr v3

Hi, I am trying to test out cellbender on my dataset and ran into a problem when aggregating the output files (.h5). I am wondering whether they are incompatible with cellranger v3 or with cellranger aggr.

Here's the error message that came out:
The molecule info HDF5 file (/mnt/data/20190923_R317_Run2_CRV3/R317_r2_Basal.h5) was produced by an older version of Cell Ranger. Reading these files is unsupported.

Output contains droplets with 0 counts

Hi,

I am running CellBender on UMI-tagged scRNA data. To run it, I am using the following command:

cellbender remove-background \
            --input ${input_files} \
            --output ${outfile} \
            --cuda \
            --expected-cells ${expected_cells} \
            --total-droplets-included ${total_droplets_include} \
            --model full \
            --z-dim 200 \
            --z-layers 1000 \
            --low-count-threshold 10 \
            --epochs ${epochs} \
            --empty-drop-training-fraction 0.5 \
            --learning-rate ${learning_rate}

The output generally looks good. Below I have a few plots describing the model and cell probabilities.

However, there are 2 'cells' included that have 0 counts in the filtered output. It's very easy to deal with, but seems like weird behavior given the application.
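
As a workaround, zero-count barcodes can be detected directly from the sparse representation. The sketch below assumes a CSC layout in which each column is a barcode. `empty_columns` is a hypothetical helper, not part of CellBender.

```python
from typing import List, Sequence


def empty_columns(data: Sequence[int], indptr: Sequence[int]) -> List[int]:
    """Return indices of columns (barcodes) whose counts sum to zero."""
    empty = []
    for col in range(len(indptr) - 1):
        # Covers columns with no stored entries and columns storing only zeros.
        if sum(data[indptr[col]:indptr[col + 1]]) == 0:
            empty.append(col)
    return empty
```

The returned indices can then be used to drop those barcodes before downstream analysis.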

Error in running remove background

Hello,
I successfully ran CellBender on my sample the first time with default z-dim and z-layers and 200 epochs. But on inspecting the PDF, it looked like the training hadn't converged. I assumed that it needed more epochs. However, when I ran CellBender on the same sample with 300 epochs instead of 200, I got an error: the job stopped a quarter of the way through training after encountering a NaN loss. This didn't happen when I ran it previously.

After reading a post suggesting that this may be due to over-parameterisation, I wonder whether I need smaller z-dim and z-layers values. What would you suggest based on the PDF output of my first run?

See the attached first-run PDF (MaleBrainRep1.pdf) and the log and error report for the second run (MaleBrainRep1_v2.log and MBR1_v2_error.txt).
MaleBrainRep1.pdf

MBR1_v2_error.txt

on the same sample with 300 epochs.
MaleBrainRep1_v2.log

Best,
Devika

Comprehensive test suite

Create a comprehensive test suite for CellBender.

Differential gene analysis after CellBender

Dear team,
Thank you very much for developing CellBender! I have a question regarding the downstream analysis after CellBender. My scRNA-seq design includes three control samples plus three disease samples. My goal in this study was to identify the cell type specific disease markers. My question is: If I run CellBender on each dataset separately, can I perform differential gene analysis directly on the gene expression values generated by CellBender to compare the disease versus control?
Your comments will be greatly appreciated!
Best,
Nelson

Wildly variable retained barcode list

Hello, I just tried out CellBender on 21 datasets of 10x v2 single nuclei data that should be generally comparable (Cell Ranger retains ~1500-3500 barcodes in each) but I got wildly variable results with CellBender.

I used the following parameters: expected-cells = 5000 , total-droplets-included = 15000, epochs = 1500.

The numbers of barcodes retained by CellBender (from the output.csv file) are below. I'm surprised to see so many datasets with the maximum number of 15k, as well as 2 datasets with 0. Could this be a result of overfitting because I ran 1500 epochs? In the output.pdf I noticed that only a train score is provided, so it's hard to diagnose overfitting...

Here are the knee plots in case they help diagnose things. All datasets are pretty similar.

Parallel enumeration over the masking operation

Implement parallel enumeration over masking of the encoded latent z for empty droplets as a poutine.mask() context manager, rather than taking a single sample and using that as a mask.