broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments

License: BSD 3-Clause "New" or "Revised" License

Language: Python

Topics: carpenter-lab, cell-painting, in-situ-sequencing, pooled-screen, data-science, recipe, pooled-cell-painting

pooled-cell-painting-profiling-recipe's Introduction

Pooled Cell Painting - Image-based Profiling Pipeline Recipe 👩‍🍳 👨‍🍳

A step-by-step data processing pipeline for Pooled Cell Painting data.

Ingredients

Data are the primary ingredients of science. Here, our data come from a Pooled Cell Painting experiment.

In these experiments, the data are thousands of .csv files storing metadata and morphology measurements from millions of single cells.

There are two fundamental kinds of data ingredients:

  1. Cells
  2. Spots

The Cells ingredients represent morphology measurements for various cellular compartments for each segmented single cell. The Spots ingredients represent in situ sequencing (ISS) results used for "cell calling". Cell calling is the procedure that assigns a specific CRISPR perturbation to an individual cell using barcode sequences read within Spots in the cell by ISS. Because the experiment is "pooled", there are thousands of CRISPR barcodes present in a single well.
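As a toy illustration (a sketch only; the column names Parent_Cells and Barcode_BarcodeCalled are assumptions, not the recipe's actual schema), cell calling amounts to assigning each cell the barcode read most often among its spots:

import pandas as pd

# Toy spots table: each row is one ISS spot, with the cell it belongs to
# and the barcode sequence read at that spot.
spots = pd.DataFrame({
    "Parent_Cells": [1, 1, 1, 2, 2],
    "Barcode_BarcodeCalled": ["ACGT", "ACGT", "TTGA", "GGCA", "GGCA"],
})

# Assign each cell the barcode read most often among its spots.
calls = (
    spots.groupby("Parent_Cells")["Barcode_BarcodeCalled"]
    .agg(lambda reads: reads.mode().iloc[0])
    .rename("Assigned_Barcode")
    .reset_index()
)
print(calls)  # cell 1 -> ACGT, cell 2 -> GGCA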

These measurements for both data ingredients are currently made by CellProfiler software (using customized Pooled Cell Painting plugins).

Recipe Steps

All cookbooks also include specific instructions, or steps, for each recipe.

Our recipe includes two modules:

  1. Preprocessing
  2. Profile generation

The output data are structured to include measurements from many individual "sites" across a single plate. Each site can be thought of as a single field of view consisting of many different images: the five Cell Painting channels, plus the four ISS channels across n cycles. The number of cycles is determined as part of experimental design and is typically selected to ensure zero collisions between CRISPR barcodes; since each cycle reads one of four bases, n cycles can distinguish up to 4^n distinct barcodes.

The recipe steps first preprocess spots and cells, output quality control (QC) metrics, and perform filtering. Next, in profile generation, single cell profiles are merged, aggregated, normalized, and feature selected. The final outputs of the pipeline are QC metrics, summary figures, and morphology profiles for each CRISPR guide. These profiles are used in downstream analyses for biological discovery.
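In rough outline, the aggregate and normalize steps amount to the following (a minimal pandas sketch; the column name Metadata_Foci_Barcode and the helper function are assumptions, not the recipe's actual implementation):

import pandas as pd

def aggregate_and_normalize(single_cells: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Median-aggregate single cells per guide, then z-score features across guides."""
    # Aggregate: one profile per CRISPR guide barcode.
    profiles = single_cells.groupby("Metadata_Foci_Barcode")[feature_cols].median()
    # Normalize: standardize each feature across the aggregated profiles.
    normalized = (profiles - profiles.mean()) / profiles.std()
    # Crude stand-in for feature selection: drop features that were constant.
    return normalized.dropna(axis="columns")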

Usage

This recipe is designed to be used as a critical component of a Data Pipeline Welding procedure.

More specifically, this recipe will be linked together, via a GitHub submodule, to a Pooled Cell Painting data repository. The data repositories will be derived from the Pooled Cell Painting template.

More usage instructions can be found in the template repo linked above. Briefly, the goal of the weld is to tightly couple the Pooled Cell Painting processed data to versioned code that performed the processing. This recipe is the versioned code and a GitHub submodule links the recipe by commit hash.

Users interact with the recipe via a series of configuration yaml files defined in the data repository.
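For example, a recipe step might begin by loading its options like this (a sketch; the file path is hypothetical, though core: compartments: mirrors the config examples discussed in the issues below):

import yaml

# Load the step's options from the data repository's config.
with open("config/options.yaml") as handle:
    options = yaml.safe_load(handle)

compartments = options["core"]["compartments"]  # e.g. ["Cells", "Nuclei", "Cytoplasm"]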

Logging

The recipe includes creation of a log file for each step. The file is named after the step (e.g. 0.prefilter-features.log) and is saved to a log/ folder within each module. The file logs progress information, warnings, and uncaught exceptions.
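A minimal sketch of that pattern (the exact handler and excepthook wiring here are assumptions, not the recipe's code):

import logging
import pathlib
import sys

# One log file per step, saved in the module's log/ folder.
pathlib.Path("log").mkdir(exist_ok=True)
logging.basicConfig(
    filename="log/0.prefilter-features.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

# Route uncaught exceptions into the same log file.
def log_uncaught(exc_type, exc_value, exc_tb):
    logging.critical("Uncaught Exception", exc_info=(exc_type, exc_value, exc_tb))

sys.excepthook = log_uncaught

logging.info("Step started")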

pooled-cell-painting-profiling-recipe's People

Contributors

bethac07, erinweisbart, gwaybio, hillsbury


pooled-cell-painting-profiling-recipe's Issues

Maintaining different versions of recipe

Generally, we want to avoid incorporating experiment/dataset-specific steps into the recipe. However, this became necessary when handling the slightly different column names between CP074 and CP151.

Specifically, CP151 has columns Metadata_Well, Metadata_Site, and Metadata_Plate; CP074 has columns Metadata_Site and Metadata_TopFolder, which contains both plate and well information.

Metadata in 7.visualize-cell-summary.py

In these lines we load metadata and append it to a metadata_list. However, the list is never used in this script.

I did not see this when I reviewed #18, but I ran into it now when I came across this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-8-235105add032> in <module>
     10     metadata_df = (
     11         pd.read_csv(metadata_file, sep="\t")
---> 12         .loc[:, metadata_col_list]
     13         .reset_index(drop=True)
     14     )

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1759                 except (KeyError, IndexError, AttributeError):
   1760                     pass
-> 1761             return self._getitem_tuple(key)
   1762         else:
   1763             # we by definition only have the 0th axis

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1286                 continue
   1287 
-> 1288             retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1289 
   1290         return retval

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1951                     raise ValueError("Cannot index with multidimensional key")
   1952 
-> 1953                 return self._getitem_iterable(key, axis=axis)
   1954 
   1955             # nested tuple slicing

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
   1592         else:
   1593             # A collection of keys
-> 1594             keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
   1595             return self.obj._reindex_with_indexers(
   1596                 {axis: [keyarr, indexer]}, copy=True, allow_dups=True

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
   1550 
   1551         self._validate_read_indexer(
-> 1552             keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
   1553         )
   1554         return keyarr, indexer

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
   1652             if not (ax.is_categorical() or ax.is_interval()):
   1653                 raise KeyError(
-> 1654                     "Passing list-likes to .loc or [] with any missing labels "
   1655                     "is no longer supported, see "
   1656                     "https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"  # noqa:E501

KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'

@ErinWeisbart - is it safe to remove metadata references in this script?

QC plots: DAPI correlation in Nuclei

Plot the per-site mean DAPI correlation between BC and CP images within nuclei as a clean heuristic for how well aligned the images are post-alignment, without needing to account for well edges affecting per-image measurements.

Error in 4.image-and-segmentation-qc.py

@ErinWeisbart - I am trying to rerun this step on the recent pooled dataset. It was working smoothly until line 471. I paste the error statement at the end of this issue (file paths intentionally obscured).

If you look at the "blame", line 471 is my doing. However, in #72 you modified how cp_sat_df is constructed, which likely changed how it should be processed downstream. ("Blame" is a bad technical term... but it is at least descriptive!)

Do you know what's going on? Maybe this is an easy fix 🤷

XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_cells_count_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_ratios_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_Cells_FinalThreshold_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_Nuclei_FinalThreshold_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_PercentConfluent_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/results/sites_with_confluent_regions.csv exists, overwriting
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/plotnine/layer.py:401: PlotnineWarning: geom_text : Removed 720 rows containing missing values.
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'level_3'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "recipe/0.preprocess-sites/4.image-and-segmentation-qc.py", line 471, in <module>
    cp_sat_df[["cat", "type", "Ch"]] = cp_sat_df["level_3"].str.split(
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'level_3'

Aggregate fails when output_single_file_only option set to False

In an experiment with >1,000 sites, the aggregate recipe step fails quietly. We do not observe any errors, but the next recipe step is nevertheless performed and, not surprisingly, fails.

This may be a compute size issue, but the silent failure is still concerning and we should address it.

One option is to aggregate each site independently and then, using the number of single cells per perturbation, weight each aggregated contribution proportionally to cell count. I describe this option in #57 - time to revisit!

Add command line option to reprocess files without overwriting existing data

This is an advanced option, meant to be invoked via the command line, and will override the force: true option in the config file. This is a fairly dangerous operation that should only be used by experts. Using the flag may compromise the weld by processing the data more than once with a possibly different recipe.

We discussed adding this feature in #24 (comment). The enhancement is not part of the version 0.1 milestone.
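A sketch of what such a flag could look like (the flag name and wiring are hypothetical; the exact semantics are still to be decided):

import argparse

config_force = True  # in practice, read from the config's force: option

parser = argparse.ArgumentParser()
parser.add_argument(
    "--force-reprocess",
    action="store_true",
    help="Expert-only: reprocess files even when outputs already exist. "
    "Overrides the config's force option and may compromise the weld.",
)
args = parser.parse_args()

# The command line flag, when given, wins over the config file.
force = True if args.force_reprocess else config_force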

Add output file specifically for comparison to NGS data.

I think the current consensus is that we don't want to add NGS comparison to the recipe itself because it requires additional data input. However, it is an experiment we are likely to want to do on many datasets.

Can we add an output file from the recipe specifically formatted for easy comparison to NGS data?
I will get exact requirements from Maria and report back.

Add optional extra barcode preprocessing

Currently, the barcodes called, the quality of the barcodes called, and the sgRNA assigned to each barcode are all output by CellProfiler with no option to modify them in-recipe.
It would be helpful to be able to troubleshoot and optimize barcode calling in-recipe by enabling an additional barcode-processing step that reads the called barcodes but overwrites the quality assignment and sgRNA assignment to allow for:

  • ignoring cycle n from all reads (e.g. we know Cycle 4 had a reagent mistake; don't use Cycle 4 when determining call quality)
  • using only N cycles (e.g. we gathered and read all 12 cycles but know the last two perform worst and want to consider call quality only from cycles 1-10)

Make folders only when necessary

I'm running the recipe and it errored during 0.preprocess-sites/1.process-spots, but it looks like all the folders for 1.profiles are present (but empty). I know it's a fussy request, but browsing extant folders is a helpful and easy way to see what has/has not been completed, and it would be nice if folder creation happened just before a file is saved into it rather than en masse ahead of time.

Project scaffold vs workflow module

We need two distinct functionalities:

  1. A workflow submodule, which contains all the logic, i.e., code + configurations + instructions for reproducing a workflow
  2. A project scaffold, which defines other aspects, excluding logic, that will help standardize projects. This could include license files, README, directory structures, etc.

Because we cannot at present pull changes from a template repository, GitHub templates are OK for (2) but not for (1).

So instead we propose to create a regular repo to define the workflow submodule, which will then be a submodule of future project repositories.

Project scaffolds, however, can be created using GitHub templates, which are very useful for this purpose because we don't need/want the functionality to pull from the template.

@gwaygenomics Does this sound right?

Single cell normalization enhancement option

@hillsbury and I chatted about adding a single cell normalization option in two different scenarios:

  1. When using a single file (output_one_single_file_only set to true)
  2. When using single cell files saved across different site folders (output_one_single_file_only set to false)

#15 describes the need for a method to perform single cell normalization in general, but this issue can be used to document the need to implement both single cell normalization scenarios.

For example, @hillsbury noticed that weld.py will fail when output_one_single_file_only = True and single_cell is set as a normalize level in the options config. Hillary will paste the error message she received below :)

Convert quality_col to list

New error in running weld:

Traceback (most recent call last):
  File "recipe/0.preprocess-sites/3.visualize-cell-summary.py", line 259, in <module>
    .groupby(gene_cols + barcode_cols + quality_col)["Cell_Count_Per_Guide"]
TypeError: can only concatenate list (not "str") to list

To fix, convert it to a list as [quality_col].
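For illustration (the surrounding variable values here are made up):

gene_cols = ["Metadata_Foci_Barcode_MatchedTo_GeneCode"]  # illustrative
barcode_cols = ["Metadata_Foci_Barcode_MatchedTo_Barcode"]  # illustrative
quality_col = "Cell_Class"  # a single string, not a list

# Before: gene_cols + barcode_cols + quality_col raises the TypeError above.
# After: wrap the string so the list concatenation succeeds.
groupby_cols = gene_cols + barcode_cols + [quality_col]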

Pandas-2-ize the recipe

The following lines at least are barfing:

Uncaught Exception:   File "recipe/0.preprocess-sites/1.process-spots.py", line 289, in <module>
    spot_count_score_jointplot(

  File "/home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/M059K-SABER/recipe/0.preprocess-sites/scripts/spot_utils.py", line 30, in spot_count_score_jointplot
    pd.DataFrame(df.groupby(parent_col)[score_col].mean())

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/frame.py", line 9843, in merge
    return merge(

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 148, in merge
    op = _MergeOperation(

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 737, in __init__
    ) = self._get_merge_keys()

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1203, in _get_merge_keys
    right_keys.append(right._get_label_or_level_values(rk))

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/generic.py", line 1778, in _get_label_or_level_values
    raise KeyError(key)

and,if you disable that qc plot,

Uncaught Exception:   File "recipe/0.preprocess-sites/1.process-spots.py", line 340, in <module>
    cell_quality_summary_df = cell_quality.summarize_cell_quality_counts(

  File "/home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/M059K-SABER/recipe/scripts/cell_quality_utils.py", line 107, in summarize_cell_quality_counts
    quality_df.drop_duplicates(dup_cols)

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/frame.py", line 9843, in merge
    return merge(

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 148, in merge
    op = _MergeOperation(

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 737, in __init__
    ) = self._get_merge_keys()

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1221, in _get_merge_keys
    left_keys.append(left._get_label_or_level_values(lk))

  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/generic.py", line 1778, in _get_label_or_level_values
    raise KeyError(key)

Those are both downstream of a .value_counts() operation on a df, which was one of the breaking changes in pandas 2 (in 2, the column produced by that operation is always named "count"). There are currently 3 functions using value_counts.

I very much hope those are the only changes that need to be made, but we should recommend pandas 1.5.3 until someone goes through and actually runs a >pandas 2 version successfully.
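For reference, a minimal repro of the behavior change and one version-agnostic workaround (a sketch):

import pandas as pd

df = pd.DataFrame({"Cell_Class": ["Perfect", "Perfect", "Imperfect"]})

# pandas 1.x: the resulting counts column is unnamed (0 after reset_index).
# pandas 2.x: the counts column is always named "count".
# Naming it explicitly yields the same column name under both versions.
counts = df.value_counts().rename("count").reset_index()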

Add multi-plate option

Currently, the config assumes one plate per experiment. This was the case for early experiments, but it will not be the case for large-scale experiments. We should add options to save QC and profiles into separate folders in both modules, and then implement @ErinWeisbart's suggestion in #20

Change Skip Site/Overwrite Behavior in 0.preprocess-sites

I want to better handle the situation where you have processed only some of your data through 0.preprocess-sites/1.process-spots.py (e.g. an uncaught exception stops processing half way). The existing config options either process or skip the whole module, while force_overwrite only acts at the file level (i.e. all of the data processing happens and then it decides whether to save the files), so your only option is to run the whole thing over again if you're missing sites.

I'm thinking it needs another config option that allows you to skip completed sites. If this new option is True, then inside the for site in split_sites: loop it should check whether the output folder exists for that site and, if so, skip processing that site (see the sketch below).

Might also want to do this for 2.process-cells as well.
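A sketch of the proposed check (the config key, variable names, and folder layout are hypothetical):

import pathlib

skip_completed_sites = True  # the proposed new config option
split_sites = ["151B2-B1-2", "151B2-B1-5"]  # illustrative
output_spotdir = pathlib.Path("data/0.site-qc/spots")  # illustrative

for site in split_sites:
    site_output_dir = output_spotdir / site
    if skip_completed_sites and site_output_dir.exists():
        print(f"Skipping {site}; output folder already exists")
        continue
    # ... process the site as usual ...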

Create 0.preprocess-sites/n.determine-site-quality

The CellProfiler analysis pipeline now contains measurements that address the quality of images/data at a given site. It would be nice to have a pipeline step that outputs a summary that can be viewed at a glance, such as:

  • Confluent regions

    • list all sites with confluent regions and their % confluent
    • map all sites with confluent regions
  • MeasureImageQuality

    • plot power log log slope as a proxy for focus
    • plot saturation
  • Thresholds

    • map segmentation thresholds

Revert `site_full` back to `site`

In #23 (specifically #23 (comment)) I suggested a column name change within the 0.preprocess recipe module. This was a bad idea!

It was a bad idea b/c it breaks 0.merge-single-cells.py at: https://github.com/broadinstitute/pooled-cell-painting-profiling-recipe/blob/1be45bb33e1da71050dbb102352690cf8a08fb7c/1.generate-profiles/0.merge-single-cells.py#L143L146

We need to revert this change (and address some corresponding breaks after updating) before we can produce profiles.

0./1.process-spots duplicate graphs

When I run CP151A1 through 1.process-spots, the outputs gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are identical.

We need to track down why.

Normalize Single Cells by Site

For example, in reduced-representation experiments (230 genes; 2-4 wells; ~150,000 cells) we are able to normalize single cells in a single file using standard approaches. Two important notes:

  • this approach will not adjust for technical artifacts induced by site and well
  • this approach will not scale particularly easily to the millions of cells in whole genome experiments

We need to find a solution to normalize single cells within each site. This will also likely help us a bunch with our gene- and guide-level aggregated perturbation profiles.

Write a combine batches module

Since a single experiment is often done across multiple plates/batches, it would probably be good to add a module that combines data from a list of batches so we can get an overview of the whole experiment.
We certainly want a cell quality overview. Will think about what else would be helpful to have.

Hardcoding Cells and Nuclei for threshold QC

In #75, @ErinWeisbart writes about our options to handle the hardcoding issue I commented on in #75 (comment):

In a Cell Painting workflow, Cells and Nuclei are the two compartments that are always segmented. Segmenting fewer compartments is impossible because we need to identify individual cells and we must use Nuclei to determine Cells. It's possible to segment more compartments, but we don't currently do that (even in our workflow where we have many more labels) and are unlikely to do so because 1) it seems to be unnecessary and 2) it's prone to mistakes/variability and therefore requires significant hands on time.

Therefore, the thresholds we want to plot here are likely to always be Cells and Nuclei. The question is therefore whether we should remove hardcoding in case the Cells and Nuclei compartments are ever labeled in a different way (e.g. cells, Cell, etc.). To do that, it looks like the options are:

  1. Create a new entry in options.yaml to specify the segmented compartment names. Not ideal because it's one more thing to have to enter, but it's the least prone to breaking.
  2. Use core: compartments: and remove Cytoplasm from the list, because it's the only tertiary compartment (mathematically determined by subtracting one compartment from another) and any other compartment in the list will have been created by segmentation. Not ideal because then we have to find a way to account for labelling of Cytoplasm.
  3. Use core: cell_match_cols: cytoplasm: and then strip the Parent_ from the front of the compartment string. Parent is hardcoded into CellProfiler, so it's fine from our workflow perspective, but this wouldn't work with data not coming from CellProfiler (which I thought was a goal?).


We may decide to tackle this at a later date.

Add print statements

In #43, @ErinWeisbart notes:

Doesn't really fit with the theme of this PR, but since it touches print() statements it made me think, it would be nice to have a print() at the end of every step saying it's done.
0./1. and 0./2. already end with print("All sites complete.") but something like print("Step 0./0. processing complete.") would be nice to have at the end of other steps.

I agree that this would be a nice feature to keep consistent. Perhaps we can bundle this fix with #42

Skip corrupted site files

In a recent run, we observed the following error:

Now processing spots for XXXX-Well2-15...part of set ALLBATCHES___ALLPLATES___ALLWELLS
Now processing spots for XXXX-Well2-16...part of set ALLBATCHES___ALLPLATES___ALLWELLS
Traceback (most recent call last):
  File "recipe/0.preprocess-sites/1.process-spots.py", line 156, in <module>
    foci_df = pd.read_csv(foci_file)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 468, in _read
    return parser.read(nrows)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1057, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 2036, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1011 fields in line 3, saw 1021

We need to add code to skip these sites instead of erroring out completely.
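One way to do that is to wrap the failing read (a sketch; the file list is illustrative):

import pandas as pd

foci_files = ["siteA/foci.csv", "siteB/foci.csv"]  # illustrative per-site paths

for foci_file in foci_files:
    try:
        foci_df = pd.read_csv(foci_file)
    except pd.errors.ParserError as err:
        print(f"Skipping corrupted site file {foci_file}: {err}")
        continue
    # ... process foci_df for this site ...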

4.image-and-segmentation-qc.py should produce figures on a per-plate basis

@gwaygenomics I'm happy to address this.

I remember you saying you were going to do some re-factoring to address handling our current experiment structure, where a single batch can contain many plates. Will that affect the structure of inputs for this step (i.e. image_metadata.tsv and cell_count.tsv)? (They already have Metadata_Plate columns that include both plates in the current batch we're working on, so I would assume not...?)

Sites erroring in 2.process-cells

Currently 10 sites error in 2.process-cells with a "151B2-B1-87 data not found" message.
We need to dig a bit to figure out why those particular sites are erroring. If possible, a more descriptive error message would be nice to implement to assist in tracking down why those sites error.
151B2-B1-83
151B2-B1-85
151B2-B1-87
151B2-B1-88
151B2-B1-89
151B2-B2-11
151B2-B2-16
151B2-B2-19
151B2-B2-22
151B2-B2-26

Add load_features import

In running weld.py, @hillsbury and I saw this error:

Traceback (most recent call last):
  File "recipe/0.preprocess-sites/0.prefilter-features.py", line 77, in <module>
    features_df = load_features(core_option_args, example_site_dir)
NameError: name 'load_features' is not defined

To fix, add load_features to the import statement on line 17.
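i.e. something like the following (the module path here is illustrative; apply it to whatever line 17 already imports from):

# Before (illustrative):
from scripts.io_utils import check_if_write

# After: bring load_features in alongside the existing imports.
from scripts.io_utils import check_if_write, load_features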

process_cells config mismatch

In the config for process_cells we have prefilter-features as a Bool, but 2.process-cells.py requires running 0.preprocess-sites.py first so that you have a prefilter file as input.

@gwaygenomics What is the goal of this config Bool?
Do we want to update 2.process-cells.py to allow for not using a prefilter file? Or do we want to remove the input from the config?

Dig into alignment flag triggers in 4.image-and-segmentation-qc.py

Look carefully at alignment in a couple of batches of data and determine qc thresholds.
Compare to the alignment qc thresholds Beth determined with WG screen images.
Either:

  • Develop a method for simply setting a cutoff and build it into the recipe step
  • Have the recipe output .csvs for a few different qc thresholds so that the correct one can be chosen post hoc

Splitting full site annotation in 4.image-and-segmentation-qc

This code block is giving me some trouble:

Plate=[x[0] for x in image_df.site.str.split("-")],
Well=[x[1] for x in image_df.site.str.split("-")],
Site=[x[2] for x in image_df.site.str.split("-")],

A couple notes:

  • Is site_full always going to be coded in this way, i.e. Plate, then Well, then Site?
  • I've already encountered cases where site_full is not delimited by - (in a separate experiment, we delimit by underscore)

So, this is a very fragile way of handling the split. A couple of solutions:

  • We always output site_full with the same convention - this is still pretty fragile, since most scientists will forget the standard
  • We somehow propagate plate, well, and site number information earlier and do not rely on this split at all. Is this info already in the Image.csv file in a separate metadata column?

@ErinWeisbart - we should either resolve this or drop this before version 0.1
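For what it's worth, a less fragile version of the split might look like this (a sketch; the delimiter would come from config rather than being hardcoded, and the example values are made up):

import pandas as pd

image_df = pd.DataFrame({"site": ["CP151-A1-2", "CP151-A1-5"]})  # illustrative
site_delimiter = "-"  # read from config; some experiments delimit by "_"

# Split once with expand=True instead of three separate list comprehensions.
site_parts = image_df["site"].str.split(site_delimiter, expand=True)
site_parts.columns = ["Plate", "Well", "Site"]
image_df = image_df.join(site_parts)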

enhancement/fix needed in saturated_sites.csv

I am going through 4.image-and-segmentation-qc.py in much more detail in #39 - the following code block is giving me some trouble:

for col in cp_sat_cols:
    cp_sat_df = image_df[image_df[col] > 1]
for col in bc_sat_cols:
    bc_sat_df = image_df[image_df[col] > 0.25]
sat_df_cols = cp_sat_cols + bc_sat_cols
sat_df_cols.append("site")
sat_df = cp_sat_df.append(bc_sat_df).drop_duplicates(subset="site")
if len(sat_df.index) > 0:
    sat_output_file = pathlib.Path(results_output, "saturated_sites.csv")
    if check_if_write(output_file, force, throw_warning=True):
        sat_df.to_csv(sat_output_file)

I am trying to think through what this is doing (so please bear with me!)

There are two different ImageQuality_PercentMaximal types (spots and cells), and these columns in Image.csv indicate whether that image (either spots or cells) is oversaturated? We also want these sites to be output to a file named saturated_sites.csv, but only if any saturated sites exist. Is this correct?

If so, then I think we may need to tinker with the code slightly to make sure we extract all saturated sites. Sorry I didn't flag this sooner!
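If that reading is right, one possible fix (a sketch reusing image_df, cp_sat_cols, and bc_sat_cols from the block above) flags a site when any saturation column exceeds its threshold, instead of keeping only the last column iterated:

# Accumulate rows exceeding the threshold on ANY saturation column,
# rather than overwriting cp_sat_df / bc_sat_df on each loop iteration.
cp_sat_mask = (image_df[cp_sat_cols] > 1).any(axis=1)
bc_sat_mask = (image_df[bc_sat_cols] > 0.25).any(axis=1)
sat_df = image_df[cp_sat_mask | bc_sat_mask].drop_duplicates(subset="site")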

Error message for missing control_barcodes in 1.process-spots

In 1.process-spots at this point:

    # Number of non-targetting controls
    num_nt = passed_gene_df.query(
        f"{gene_cols[0]} in @control_barcodes"
    ).Cell_Count_Per_Gene.values[0]

it will error if there are no non-targeting controls at a site. This can happen because 1) biologically, the site actually doesn't contain control_barcodes, or 2) the barcodes you have listed in your config file don't match what is used in the experiment for control_barcodes.

Add an error message to explain this, ideally distinguishing between the two conditions.
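For example (a sketch reusing the names from the snippet above):

nt_counts = passed_gene_df.query(f"{gene_cols[0]} in @control_barcodes")
if nt_counts.empty:
    # Fail with a message that names both possible causes.
    raise ValueError(
        "No control barcodes found at this site. Either the site truly lacks "
        "non-targeting controls, or the control_barcodes listed in the config "
        f"({control_barcodes}) do not match what the experiment used."
    )
num_nt = nt_counts.Cell_Count_Per_Gene.values[0]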

Sanitize gene column

In one recent experiment, the gene column had the format "GENENAME_GUIDEID". Our previous recipe expected the column to contain just the GENENAME. The guide ID could be useful for linking between different resources (i.e. if the guide ID is used in different places in a single experiment, retaining this info in a separate column would be helpful). A simple solution parsing by underscores will not work, since the "control_barcodes" entries in this column will break it.

For now, I will add the following solution:

I will scan these columns, check for inconsistencies, auto-detect the anomalies, and then parse with the control_barcodes ingredient concern in mind.

Overwrite existing files True/False

It would be nice (though not necessary) to add an additional setting that toggles True/False for overwriting existing files.
Default behavior in the .yaml should be set to true. But if, for example, you got most of the way through a module before it errored or got shut down, or a single site was missing its input file and so failed to run, it would be nice not to have to re-process everything.

Level of information in config.yaml files

I'm wondering if the config files should be as streamlined as possible, should include as much user-friendly information as possible, or something in between.
If streamlined, will detailed information live elsewhere? If so, where? And does it make sense to write it, or at least take notes there, as we go?

e.g.
streamlined:

core:
  compartments:
      - Cells
      - Nuclei
      - Cytoplasm

verbose:

core:
  compartments:
  #list all compartments with profiling features measured in experiment
  #standard cell painting uses Cells, Nuclei, Cytoplasm
  #experiments with SABER also can include Mito, Golgi
  #names can be derived from .csv files found in workspace/analysis
      - Cells
      - Nuclei
      - Cytoplasm

in-between:

core:
  compartments:
  #list all compartments with profiling features measured in experiment
      - Cells
      - Nuclei
      - Cytoplasm

Include summary step for guide + guide abundances (cell count per perturbation)

Right now, only one file indicating perturbation abundances is output per site. We should make retrieving per-plate perturbation abundances easier by summarizing perturbation counts in an additional script.

@jbauman214 - unfortunately, the info you requested is not super readily available. We do calculate this at a per-site level, so it is possible to retrieve. The file you are looking for is:

EXPERIMENT_LABEL/data/0.site-qc/PLATE_NAME/spots/SITE_NAME/cell_perturbation_category_summary_counts.tsv

These are available on GitHub in a private repository for the EXPERIMENT_LABEL per PLATE_NAME. I am intentionally obscuring experimental details as this issue is in a public repo.

Different overwrite warning behavior between steps

When files already exist:
Step 0./1.process-spots throws a warning for every file:

Now processing spots for 151B2-B1-2...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:51: UserWarning: Output files likely exist, now overwriting...
Now processing spots for 151B2-B1-5...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:51: UserWarning: Output files likely exist, now overwriting...
Now processing spots for 151B2-B1-4...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:51: UserWarning: Output files likely exist, now overwriting...

Step 0./2.process-cells throws one warning at the beginning:

Now processing cells for 151B2-B1-2...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:122: UserWarning: Output files likely exist, now overwriting...
Now processing cells for 151B2-B1-5...
Now processing cells for 151B2-B1-4...
Now processing cells for 151B2-B1-3...

I think they should have the same behavior to prevent confusion.
I prefer a warning for every file.

Prepare repository for production

We will need to rethink several decisions before the recipe is production-ready. We will use this issue to document these items.

Checklist

  • Determine where documentation lives (i.e. does it stay as READMEs or does it move to the wiki?)
  • Define clear user interaction via config or command line flags (likely means we'll migrate command line flags to config)
  • Indicate in documentation when a step touches a file made by another step so that the user knows to update all appropriate config fields.
  • Add descriptive error messages when a step touches a file made by another step so that the user knows why the file is missing.
  • Rethink approach to populating Metadata_Foci_Cell_Category during 2.process-cells for when we move beyond simple cell quality categorizations.

Confusing code block in 4.image-and-segmentation-qc.py

The following code block is giving me problems to reproduce 👇

# Create list of questionable channel correlations (alignments)
corr_df_cols = ["Plate", "Well", "Site", "site"]
corr_cols = []
for col in image_df.columns:
    if "Correlation_Correlation_" in col:
        corr_cols.append(col)
        corr_df_cols.append(col)
image_corr_df = image_df[corr_df_cols]
image_corr_list = []
for col in corr_cols:
    image_corr_list.append(
        image_corr_df.loc[image_corr_df[col] < correlation_threshold]
    )
image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
for col in corr_cols:
    image_corr_df.loc[(image_corr_df[col] >= correlation_threshold), col] = "pass"

if len(image_corr_df.index) > 0:
    corr_output_file = pathlib.Path(results_output, "flagged_correlations.csv")
    if check_if_write(corr_output_file, force, throw_warning=True):
        image_corr_df.to_csv(corr_output_file)

In #39 I move the image.csv processing away from 4.image-and-segmentation-qc.py into an earlier step. In this way we are able to propagate important column metadata through earlier files. This makes things way less fragile.

Anyways, in the new image.csv processing, I am not finding any columns containing the string "Correlation_Correlation_". Because we're missing that string, this code block fails.


Edit: including the error message (my bad for not including it in the first place):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-79-1da2ddc0ea9c> in <module>
     15         image_corr_df.loc[image_corr_df[col] < correlation_threshold]
     16     )
---> 17 image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
     18 for col in corr_cols:
     19     image_corr_df.loc[(image_corr_df[col] >= correlation_threshold), col] = "pass"

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    279         verify_integrity=verify_integrity,
    280         copy=copy,
--> 281         sort=sort,
    282     )
    283 

~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    327 
    328         if len(objs) == 0:
--> 329             raise ValueError("No objects to concatenate")
    330 
    331         if keys is None:

ValueError: No objects to concatenate
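One possible mitigation (a sketch reusing the names from the block above) is to guard the concat so the flagged-correlations output is skipped when nothing was collected:

# Only build and write flagged_correlations.csv when at least one
# correlation column existed and at least one site fell below threshold.
if image_corr_list:
    image_corr_df = (
        pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
    )
    for col in corr_cols:
        image_corr_df.loc[image_corr_df[col] >= correlation_threshold, col] = "pass"
    corr_output_file = pathlib.Path(results_output, "flagged_correlations.csv")
    if check_if_write(corr_output_file, force, throw_warning=True):
        image_corr_df.to_csv(corr_output_file)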

Codify new contribution protocol

Now that Greg has left the Imaging Platform (😢) and started his own lab (😄), we need to figure out how to handle contributions/changes/updates (i.e. PRs) to the Recipe (and Template).

I'm taking over internal ownership of the repos, but we need to codify how PRs are handled now. To that end, @gwaygenomics we need to know how much you would still like to be involved in these repos. I see three options, though feel free to add nuance or suggest a different breakdown:

  1. Greg has the time and desire to be involved to the same level as before leaving - reviewing every PR in a timely manner.
  2. Greg has some time to give and will review major PRs that touch a lot of code or change something fundamental about the recipe, but doesn't have time to give to the little stuff.
  3. Greg will keep an eye on the repo as he has time and might chime in if tagged, but is too busy to commit to any guaranteed level of involvement in the repos.

Bugs in 4.image-and-segmentation-qc.py

Noting two things to be fixed in a future pull request:

force not defined

(pooled-cp) wm962-fdf:0.preprocess-sites gway$ python 4.image-and-segmentation-qc.py
Traceback (most recent call last):
  File "4.image-and-segmentation-qc.py", line 113, in <module>
    if check_if_write(output_file, force, throw_warning=True):
NameError: name 'force' is not defined

Error concatenating

Done concatenating image files
Traceback (most recent call last):
  File "4.image-and-segmentation-qc.py", line 478, in <module>
    image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
  File "/Users/gway/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 281, in concat
    sort=sort,
  File "/Users/gway/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 329, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

These should be quick fixes
