broadinstitute / pooled-cell-painting-profiling-recipe
:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
License: BSD 3-Clause "New" or "Revised" License
Generally, we want to avoid incorporating experiment/dataset-specific steps into the recipe. However, this became necessary when handling the slightly different column names between CP074 and CP151.
Specifically, CP151 has columns Metadata_Well, Metadata_Site, and Metadata_Plate; CP074 has columns Metadata_Site and Metadata_TopFolder, which contains both plate and well information.
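As a sketch of how that handling could be contained in one place, a small helper could normalize a CP074-style frame to the CP151 schema; the separator inside Metadata_TopFolder and the helper name are assumptions, not the recipe's actual code.

import pandas as pd

def harmonize_metadata(df: pd.DataFrame, sep: str = "-") -> pd.DataFrame:
    """Ensure Metadata_Plate and Metadata_Well exist (CP151-style columns).

    CP074-style frames only carry Metadata_TopFolder; we assume here that it
    encodes "<plate><sep><well>" and split it. The separator is a guess.
    """
    if {"Metadata_Plate", "Metadata_Well"}.issubset(df.columns):
        return df  # already CP151-style
    if "Metadata_TopFolder" in df.columns:
        split = df["Metadata_TopFolder"].str.split(sep, n=1, expand=True)
        return df.assign(Metadata_Plate=split[0], Metadata_Well=split[1])
    raise KeyError("No plate/well metadata columns found in input dataframe")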
When I run CP151A1 through 1.process-spots the output of gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are the same.
We need to track down why.
It would be nice (though not necessary) to add an additional setting that toggles True/False for overwriting existing files.
Default behavior in the .yaml should be set to true. But if, for example, you got most of the way through a module before it errored or got shut down, or a single site was missing its input file and so failed to run, it would be nice not to have to re-process everything.
The issue is in pooled-cell-painting-profiling-recipe/0.preprocess-sites/2.process-cells.py, lines 204 to 208 in 886a601.
Somewhere along the line, we introduced missing values. I need to track down where.
Currently, 10 sites error in 2.process-cells with a "151B2-B1-87 data not found" message.
We need to dig a bit to figure out why those particular sites are erroring. If possible, a more descriptive error message would help track down the cause (see the sketch after the site list below).
151B2-B1-83
151B2-B1-85
151B2-B1-87
151B2-B1-88
151B2-B1-89
151B2-B2-11
151B2-B2-16
151B2-B2-19
151B2-B2-22
151B2-B2-26
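A hedged sketch of what a more descriptive message could look like; the expected per-site file names here are assumptions, not the recipe's actual inputs.

import pathlib

def check_site_inputs(site_dir: pathlib.Path, site: str, required_files=("Cells.csv", "Nuclei.csv")):
    """Raise an error naming exactly which inputs are missing, instead of a bare 'data not found'."""
    missing = [name for name in required_files if not (site_dir / name).exists()]
    if missing:
        raise FileNotFoundError(
            f"{site} data not found: missing {missing} in {site_dir}. "
            "Check that upstream CellProfiler output completed for this site."
        )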
Currently, the config assumes one plate per experiment. This was the case for early experiments, but it will not be the case for large-scale experiments. We should add options to save QC and profiles into separate folders in both modules, and then implement @ErinWeisbart's suggestion in #20:
Plot the per-site mean DAPI correlation between BC and CP within nuclei as a clean heuristic for how well aligned the images are post-alignment, without needing to account for well-edge effects on per-image measurements.
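A rough sketch of that plot, assuming a per-object nuclei table with a site column and a DAPI BC-vs-CP correlation feature; the column names below are placeholders, not the actual CellProfiler feature names.

import matplotlib.pyplot as plt

def plot_site_dapi_correlation(nuclei_df, corr_col="Correlation_Correlation_DAPI_BC_DAPI_CP", site_col="site"):
    """Per-site mean DAPI correlation between barcoding and Cell Painting images within nuclei."""
    site_means = nuclei_df.groupby(site_col)[corr_col].mean().sort_values()
    fig, ax = plt.subplots(figsize=(10, 3))
    site_means.plot(kind="bar", ax=ax)
    ax.set_xlabel("Site")
    ax.set_ylabel("Mean DAPI correlation (BC vs CP)")
    fig.tight_layout()
    return fig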
In a recent run, we observed the following error:
Now processing spots for XXXX-Well2-15...part of set ALLBATCHES___ALLPLATES___ALLWELLS
Now processing spots for XXXX-Well2-16...part of set ALLBATCHES___ALLPLATES___ALLWELLS
Traceback (most recent call last):
File "recipe/0.preprocess-sites/1.process-spots.py", line 156, in <module>
foci_df = pd.read_csv(foci_file)
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 610, in read_csv
return _read(filepath_or_buffer, kwds)
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 468, in _read
return parser.read(nrows)
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 1057, in read
index, columns, col_dict = self._engine.read(nrows)
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/io/parsers.py", line 2036, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 771, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1011 fields in line 3, saw 1021
We need to add code to skip these sites, instead of completely erroring out.
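A minimal sketch of that skip, wrapping the existing read so a malformed foci file produces a warning instead of killing the run (the helper name is illustrative):

import warnings
import pandas as pd

def read_foci_or_skip(foci_file):
    """Return the foci dataframe, or None when the CSV is malformed so the site can be skipped."""
    try:
        return pd.read_csv(foci_file)
    except pd.errors.ParserError as err:
        warnings.warn(f"Skipping {foci_file}: could not parse foci data ({err})")
        return None

# In the per-site loop of 1.process-spots.py (sketch):
# foci_df = read_foci_or_skip(foci_file)
# if foci_df is None:
#     continue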
In the config for process_cells we have prefilter-features
as a Bool, but 2.process-cells.py requires running 0.preprocess-sites.py first so that you have a prefilter file as input.
@gwaygenomics What is the goal of this config Bool?
Do we want to update 2.process-cells.py to allow for not using a prefilter file? Or do we want to remove the input from the config?
Currently, the barcodes called, quality of barcodes called, and sgRNA assigned to barcode are all output by CellProfiler with no option to modify in-recipe.
It would be helpful, for troubleshooting and optimizing barcode calling in the recipe, to enable an additional barcode-processing step that reads the called barcodes but overwrites the quality and sgRNA assignments to allow for:
We will need to rethink several decisions before the recipe is production-ready. We will use this issue to document these items.
When we update the config files (as part of #47) we should remove the requirement for the user to specify an example site. This is annoying and we can very easily randomly sample a folder name.
The following lines at least are barfing:
Uncaught Exception: File "recipe/0.preprocess-sites/1.process-spots.py", line 289, in <module>
spot_count_score_jointplot(
File "/home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/M059K-SABER/recipe/0.preprocess-sites/scripts/spot_utils.py", line 30, in spot_count_score_jointplot
pd.DataFrame(df.groupby(parent_col)[score_col].mean())
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/frame.py", line 9843, in merge
return merge(
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 148, in merge
op = _MergeOperation(
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 737, in __init__
) = self._get_merge_keys()
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1203, in _get_merge_keys
right_keys.append(right._get_label_or_level_values(rk))
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/generic.py", line 1778, in _get_label_or_level_values
raise KeyError(key)
and, if you disable that qc plot,
Uncaught Exception: File "recipe/0.preprocess-sites/1.process-spots.py", line 340, in <module>
cell_quality_summary_df = cell_quality.summarize_cell_quality_counts(
File "/home/ubuntu/efs/2018_11_20_Periscope_Calico/workspace/software/M059K-SABER/recipe/scripts/cell_quality_utils.py", line 107, in summarize_cell_quality_counts
quality_df.drop_duplicates(dup_cols)
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/frame.py", line 9843, in merge
return merge(
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 148, in merge
op = _MergeOperation(
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 737, in __init__
) = self._get_merge_keys()
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/reshape/merge.py", line 1221, in _get_merge_keys
left_keys.append(left._get_label_or_level_values(lk))
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.8/site-packages/pandas/core/generic.py", line 1778, in _get_label_or_level_values
raise KeyError(key)
Those are both downstream of a .value_counts() operation on a df, which was one of the breaking changes in pandas 2 (in pandas 2, the name of the column coming from such an operation is always set to "count"). There are currently 3 functions using value_counts.
I very much hope those are the only changes that need to be made, but we should recommend pandas 1.5.3 until someone goes through and actually successfully runs a >pandas 2 version.
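One way to pin the names regardless of pandas version is a small wrapper like this (a sketch, not the recipe's current code):

import pandas as pd

def value_counts_df(series: pd.Series, value_name: str, count_name: str = "counts") -> pd.DataFrame:
    """Value counts with stable column names under both pandas 1.x and 2.x.

    In pandas >= 2.0 the counts Series is always named "count", which breaks
    downstream merges that expect the original column name.
    """
    out = series.value_counts().reset_index()
    out.columns = [value_name, count_name]  # positional rename works in both versions
    return out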
@gwaygenomics can you clarify something for me?
Both the recipe and template currently have configs but they no longer match.
My understanding is that the recipe is not meant to be used without the template. Therefore, can we remove the site_processing_config.yaml and profiling_config.yaml from the recipe repo?
Or am I missing something?
In these lines we load metadata and append to a metadata_list. However, this is never used in this script.
I did not see this when I reviewed #18, but I ran into it now when I came across this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-8-235105add032> in <module>
10 metadata_df = (
11 pd.read_csv(metadata_file, sep="\t")
---> 12 .loc[:, metadata_col_list]
13 .reset_index(drop=True)
14 )
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
1759 except (KeyError, IndexError, AttributeError):
1760 pass
-> 1761 return self._getitem_tuple(key)
1762 else:
1763 # we by definition only have the 0th axis
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1286 continue
1287
-> 1288 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
1289
1290 return retval
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
1951 raise ValueError("Cannot index with multidimensional key")
1952
-> 1953 return self._getitem_iterable(key, axis=axis)
1954
1955 # nested tuple slicing
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_iterable(self, key, axis)
1592 else:
1593 # A collection of keys
-> 1594 keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
1595 return self.obj._reindex_with_indexers(
1596 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1550
1551 self._validate_read_indexer(
-> 1552 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1553 )
1554 return keyarr, indexer
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1652 if not (ax.is_categorical() or ax.is_interval()):
1653 raise KeyError(
-> 1654 "Passing list-likes to .loc or [] with any missing labels "
1655 "is no longer supported, see "
1656 "https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike" # noqa:E501
KeyError: 'Passing list-likes to .loc or [] with any missing labels is no longer supported, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike'
@ErinWeisbart - is it safe to remove metadata references in this script?
I think current consensus is that we don't want to add NGS comparison to the recipe itself because it requires additional data input. However it is an experiment we are likely to want to do on many datasets.
Can we add an output from the recipe of a file specifically formatted for easy comparison to NGS data?
Will get exact requirements from Maria and report back.
I'm running the recipe and it errored during 0.preprocess-sites/1.process-spots, but it looks like all the folders for 1.profiles are present (but empty). I know it's a fussy request, but browsing extant folders is a helpful and easy way to see what has/has not been completed, and it would be nice if folder creation was done just before a file is saved into it rather than en masse ahead of time.
We need two distinct functionalities:
Because we cannot at present pull from a template repository, GitHub templates are ok for 2. but not for 1.
So instead we propose to create a regular repo to define the workflow submodule, which will then be a submodule of future project repositories.
Project scaffolds can however be created using GitHub templates, and are very useful for this purpose, because we don't need / want the functionality to pull from the template.
@gwaygenomics Does this sound right?
Right now, only one file indicating perturbation abundances is output per site. We should make retrieving a per-plate perturbation abundance easier, by summarizing perturbation counts in an additional script.
@jbauman214 - unfortunately, the info you requested is not super readily available. We do calculate this at a per-site level, so it is possible to retrieve. The file name you are looking for is:
EXPERIMENT_LABEL/data/0.site-qc/PLATE_NAME/spots/SITE_NAME/cell_perturbation_category_summary_counts.tsv
These are available on github in a private repository for the EXPERIMENT_LABEL per PLATE_NAME. I am intentionally obscuring experimental details as this issue is in a public repo.
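A sketch of the kind of summary script this would need, built from the per-site files at the path above; the glob pattern and output naming are assumptions.

import pathlib
import pandas as pd

def summarize_plate_perturbations(plate_dir: pathlib.Path) -> pd.DataFrame:
    """Concatenate per-site perturbation category counts into one per-plate table."""
    site_files = sorted(plate_dir.glob("spots/*/cell_perturbation_category_summary_counts.tsv"))
    per_site = [pd.read_csv(f, sep="\t").assign(site=f.parent.name) for f in site_files]
    return pd.concat(per_site, ignore_index=True)

# Hypothetical usage:
# plate_df = summarize_plate_perturbations(pathlib.Path("EXPERIMENT_LABEL/data/0.site-qc/PLATE_NAME"))
# plate_df.to_csv("PLATE_NAME_perturbation_counts.tsv", sep="\t", index=False)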
In one recent experiment, the gene column was of the following format: "GENENAME_GUIDEID". Our previous recipe expected a column to just contain the GENENAME. The guide ID could be useful information used to link between different resources (i.e. if the guide id is used in different places in a single experiment, including this info in a separate column would be helpful to retain). A simple solution parsing by underscores will not work since the "control_barcodes" entries in this column will break.
For now, I will add the following solution:
I will scan these columns, check for inconsistencies, auto-detect the anomalies, and then parse with the control_barcodes concern above in mind.
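A hedged sketch of that parsing step; the control barcode labels come from the config's control_barcodes list, and everything else here is illustrative rather than the final implementation.

import pandas as pd

def split_gene_column(df: pd.DataFrame, gene_col: str, control_barcodes) -> pd.DataFrame:
    """Split "GENENAME_GUIDEID" entries into gene and guide_id columns.

    Entries matching control_barcodes are left untouched, since a naive split
    on underscores would mangle them.
    """
    is_control = df[gene_col].isin(control_barcodes)
    split = df[gene_col].str.rsplit("_", n=1, expand=True)
    return df.assign(
        gene=df[gene_col].where(is_control, split[0]),
        guide_id=split[1].where(~is_control, other=pd.NA),
    )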
In running weld.py, @hillsbury and I saw this error:
Traceback (most recent call last):
File "recipe/0.preprocess-sites/0.prefilter-features.py", line 77, in <module>
features_df = load_features(core_option_args, example_site_dir)
NameError: name 'load_features' is not defined
To fix, add load_features to the import statement on line 17.
Currently, control barcodes (sg_nt and NT) are specifically removed in generating summary counts and visualizations in 3.visualize-cell-summary.py, most likely so that NT doesn't dominate in visualizations.
Solutions: update step in weld -or- fix in a separate script
This is an advanced option, meant to be invoked via the command line, and will override the force: true
option in the config file. This is a fairly dangerous operation that should only be used by experts. Using the flag may compromise the weld by processing the data more than once with a possibly different recipe.
We discussed adding this feature in #24 (comment). The enhancement is not part of the version 0.1 milestone.
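A sketch of how the override could be wired up; the flag name and config key handling are assumptions, not the current weld interface.

import argparse
import yaml

parser = argparse.ArgumentParser()
parser.add_argument("--config", required=True, help="path to the options yaml")
parser.add_argument(
    "--force-rerun",
    action="store_true",
    help="expert-only: reprocess data even when outputs already exist",
)
args = parser.parse_args()

with open(args.config) as f:
    options = yaml.safe_load(f)

# The command-line flag, when given, takes precedence over the yaml value.
force = True if args.force_rerun else options.get("force", False)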
in #43 @ErinWeisbart notes:
Doesn't really fit with the theme of this PR, but since it touches print() statements it made me think, it would be nice to have a print() at the end of every step saying it's done.
0./1. and 0./2. already end with print("All sites complete.") but something like print("Step 0./0. processing complete.") would be nice to have at the end of other steps.
I agree that this would be a nice feature to keep consistent. Perhaps we can bundle this fix with #42
The following code block is giving me problems (copied in full below as well).
In #39 I move the image.csv processing away from 4.image-and-segmentation-qc.py into an earlier step. In this way we are able to propagate important column metadata through in earlier files. This makes things way less fragile.
Anyway, in the new image.csv processing, I am not finding any columns containing the string "Correlation_Correlation_". Because we're missing that string, this code block fails.
# Create list of questionable channel correlations (alignments)
corr_df_cols = ["Plate", "Well", "Site", "site"]
corr_cols = []
for col in image_df.columns:
if "Correlation_Correlation_" in col:
corr_cols.append(col)
corr_df_cols.append(col)
image_corr_df = image_df[corr_df_cols]
image_corr_list = []
for col in corr_cols:
image_corr_list.append(
image_corr_df.loc[image_corr_df[col] < correlation_threshold]
)
image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
for col in corr_cols:
image_corr_df.loc[(image_corr_df[col] >= correlation_threshold), col] = "pass"
if len(image_corr_df.index) > 0:
corr_output_file = pathlib.Path(results_output, "flagged_correlations.csv")
if check_if_write(corr_output_file, force, throw_warning=True):
image_corr_df.to_csv(corr_output_file)
Edit to include error message (my bad for not including it in the first place):
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-79-1da2ddc0ea9c> in <module>
15 image_corr_df.loc[image_corr_df[col] < correlation_threshold]
16 )
---> 17 image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
18 for col in corr_cols:
19 image_corr_df.loc[(image_corr_df[col] >= correlation_threshold), col] = "pass"
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
279 verify_integrity=verify_integrity,
280 copy=copy,
--> 281 sort=sort,
282 )
283
~/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
327
328 if len(objs) == 0:
--> 329 raise ValueError("No objects to concatenate")
330
331 if keys is None:
ValueError: No objects to concatenate
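A hedged fix sketch for the block above: because no "Correlation_Correlation_" columns were found, corr_cols (and therefore the list passed to pd.concat) is empty, so guarding the empty case avoids the crash and simply skips the output.

import pandas as pd

# Collect only the rows flagged below threshold, then guard the empty case.
flagged = [image_corr_df.loc[image_corr_df[col] < correlation_threshold] for col in corr_cols]
flagged = [df for df in flagged if not df.empty]

if flagged:
    image_corr_df = pd.concat(flagged).drop_duplicates(subset="site").reset_index()
    for col in corr_cols:
        image_corr_df.loc[image_corr_df[col] >= correlation_threshold, col] = "pass"
    # ...write flagged_correlations.csv as in the block above
else:
    print("No sites flagged for low channel correlations; skipping flagged_correlations.csv")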
In an experiment with >1,000 sites, the aggregate recipe step fails quietly. We do not observe any errors, but the next recipe step is nevertheless performed and, not surprisingly, fails.
This may be a compute size issue, but the silent failure is still concerning and we should address it.
One option is to aggregate each site independently, and then, with the number of single cells per perturbation, weight the aggregated contribution proportionally to cell count. I describe this option in #57 - time to revisit!
In #75, @ErinWeisbart writes about our options to handle the hardcoding issue I commented on in #75 (comment):
In a Cell Painting workflow, Cells and Nuclei are the two compartments that are always segmented. Segmenting fewer compartments is impossible because we need to identify individual cells and we must use Nuclei to determine Cells. It's possible to segment more compartments, but we don't currently do that (even in our workflow where we have many more labels) and are unlikely to do so because 1) it seems to be unnecessary and 2) it's prone to mistakes/variability and therefore requires significant hands on time.
Therefore, the thresholds we want to plot here are likely to always be Cells and Nuclei. The question is therefore whether we should remove hardcoding in case the Cells and Nuclei compartments are ever labeled in a different way (e.g. cells, Cell, etc.). To do that, it looks like the options are:
Create a new entry in options.yaml to specify the segmented compartment names. Not ideal because it's one more thing to have to enter, but it's the least prone to breaking.
Use core: compartments: and remove Cytoplasm from the list because it's the only tertiary compartment (mathematically determined by subtracting one compartment from another) and any other compartment in the list will have been created by segmentation. Not ideal because then we have to find a way to account for labelling of Cytoplasm.
Use core: cell_match_cols: cytoplasm: and then strip the Parent_ from the front of the compartment string. Parent is hardcoded into CellProfiler, so it's fine from our workflow perspective, but it would mean this wouldn't work with data not coming from CellProfiler (which I thought was a goal?).
We may decide to tackle this at a later date
Look carefully at alignment in a couple batches of data and determine qc thresholds.
Compare to alignment qc thresholds Beth determined with WG screen images.
Either:
I want to better handle a situation where you have processed only some of your data through 0.preprocess-sites/1.process-spots.py (e.g. an uncaught exception stops processing halfway). For the config options, perform either processes or skips the whole module, while force_overwrite only operates at the file level (i.e. all of the data processing happens and then it decides whether or not to save the files), so your only option is to run the whole thing over again if you're missing sites.
I'm thinking it needs another config option that allows you to skip complete sites. If this new option is True, at for site in split_sites: it should check whether the output folder exists for that site and, if so, skip processing that site (see the sketch below).
Might also want to do this for 2.process-cells as well.
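A sketch of that check, assuming a hypothetical skip_completed_sites config key; the output-folder layout is also an assumption.

import pathlib

def site_already_processed(output_dir: pathlib.Path) -> bool:
    """True when a site's output folder exists and is non-empty."""
    return output_dir.exists() and any(output_dir.iterdir())

# At the top of the per-site loop in 1.process-spots.py (sketch):
# if skip_completed_sites and site_already_processed(pathlib.Path(output_spotdir, site)):
#     print(f"Skipping {site}; output already exists")
#     continue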
The CellProfiler analysis pipeline now contains measurements that address the quality of images/data at a given site. It would be nice to have a pipeline that outputs a summary so that it can be viewed at a glance. Such as:
Confluent regions
MeasureImageQuality
Thresholds
When files already exist:
Step 0./1.process-spots throws an error for every file
Now processing spots for 151B2-B1-2...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:51: UserWarning: Output files likely exist, now overwriting...
Now processing spots for 151B2-B1-5...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:51: UserWarning: Output files likely exist, now overwriting...
Now processing spots for 151B2-B1-4...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:51: UserWarning: Output files likely exist, now overwriting...
Step 0./2.process-cells throws one error at the beginning:
Now processing cells for 151B2-B1-2...
/Users/eweisbar/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/ipykernel_launcher.py:122: UserWarning: Output files likely exist, now overwriting...
Now processing cells for 151B2-B1-5...
Now processing cells for 151B2-B1-4...
Now processing cells for 151B2-B1-3...
I think they should have the same behavior to prevent confusion.
I prefer an error for every file.
Create a QC yaml file for 4.image-and-segmentation-qc.
See #23 for discussion of approach.
I'm wondering if the config files should be as streamlined as possible or if they should include as much user-friendly information as possible or somewhere in between.
If streamlined, will detailed information live elsewhere? If so, where? And does it make sense to write it or at least take notes there as we go?
e.g.
streamlined:
core:
compartments:
- Cells
- Nuclei
- Cytoplasm
verbose:
core:
compartments:
#list all compartments with profiling features measured in experiment
#standard cell painting uses Cells, Nuclei, Cytoplasm
#experiments with SABER also can include Mito, Golgi
#names can be derived from .csv files found in workspace/analysis
- Cells
- Nuclei
- Cytoplasm
in-between:
core:
compartments:
#list all compartments with profiling features measured in experiment
- Cells
- Nuclei
- Cytoplasm
@hillsbury and I chatted about adding a single cell normalization option in two different scenarios:
#15 describes the need for a method to perform single cell normalization in general, but this issue can be used to document the need to implement both single cell normalization scenarios.
For example, @hillsbury noticed that weld.py will fail when output_one_single_file_only = True and single_cell is set as a normalize level in the options config. Hillary will paste the error message that she received below :)
Since a single experiment is often done in multiple plates/batches, it would probably be good to add in a module where we can combine data from a list of batches so we can get an overview of the whole experiment.
We certainly want cell quality overview. Will think about what else would be helpful to have.
For example, in reduced representation experiments (230 genes; 2-4 wells; ~150,000 cells) we are able to normalize single cells in a single file using standard approaches. Two important notes:
We need to find a solution to normalize single cells within each site. This will also likely help us a bunch in our gene- and guide-level aggregated perturbation profiles.
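A minimal per-site standardization sketch in plain pandas, as a placeholder for whichever normalization method we settle on; the Metadata_ prefix convention and site column name are assumptions.

import pandas as pd

def normalize_within_site(sc_df: pd.DataFrame, site_col: str = "Metadata_Site") -> pd.DataFrame:
    """Z-score every feature within its own site; metadata columns are left untouched."""
    feature_cols = [c for c in sc_df.columns if not c.startswith("Metadata_")]
    grouped = sc_df.groupby(site_col)[feature_cols]
    normalized = sc_df.copy()
    normalized[feature_cols] = (sc_df[feature_cols] - grouped.transform("mean")) / grouped.transform("std")
    return normalized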
Noting two things to be fixed in a future pull request:
force not defined:
(pooled-cp) wm962-fdf:0.preprocess-sites gway$ python 4.image-and-segmentation-qc.py
Traceback (most recent call last):
File "4.image-and-segmentation-qc.py", line 113, in <module>
if check_if_write(output_file, force, throw_warning=True):
NameError: name 'force' is not defined
Done concatenating image files
Traceback (most recent call last):
File "4.image-and-segmentation-qc.py", line 478, in <module>
image_corr_df = pd.concat(image_corr_list).drop_duplicates(subset="site").reset_index()
File "/Users/gway/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 281, in concat
sort=sort,
File "/Users/gway/miniconda3/envs/pooled-cp/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 329, in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
These should be quick fixes
Necessary for automated welding
I am going through 4.image-and-segmentation.py in much more detail in #39 - the following code block is giving me some trouble:
I am trying to think through what this is doing (so please bear with me!)
There are two different ImageQuality_PercentMaximal types (spots and cells), and this column in image.csv indicates whether or not that image (either spots or cells) is oversaturated? We also want these sites to be output to a file named saturated_sites.csv, but only if any saturated sites exist. Is this correct?
If so, then I think we may need to tinker with the code slightly to make sure we extract all saturated sites. Sorry I didn't flag this sooner!
New error in running weld:
Traceback (most recent call last):
File "recipe/0.preprocess-sites/3.visualize-cell-summary.py", line 259, in
.groupby(gene_cols + barcode_cols + quality_col)["Cell_Count_Per_Guide"]
TypeError: can only concatenate list (not "str") to list
To fix, convert quality_col to a list, i.e. .groupby(gene_cols + barcode_cols + [quality_col]).
This code block is giving me some trouble:
A couple notes:
- Is the order always Plate, then Well, then Site?
- We split on - (in a separate experiment, we delimit by underscore)
So, this is a very fragile way of handling this split. A couple of solutions:
- Extract the plate, well, and site number information earlier and do not rely on this split at all. Is this info in the Image.csv file in a separate metadata column already?
@ErinWeisbart - we should either resolve this or drop this before version 0.1
In 1.process-spots at this point:
# Number of non-targetting controls
num_nt = passed_gene_df.query(
f"{gene_cols[0]} in @control_barcodes"
).Cell_Count_Per_Gene.values[0]
it will error if there are no non-targeting controls at a site. This can happen because 1) biologically, the site actually doesn't contain control_barcodes, or 2) the barcodes you have listed in your config file don't match what is used in the experiment for control_barcodes.
Add an error message to explain this, ideally distinguishing between the two conditions (a sketch follows below).
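A hedged sketch of such a message, wrapped around the query quoted above; the site variable name is an assumption.

# Sketch: replace the bare .values[0] lookup with an explicit check that
# distinguishes "no controls at this site" from "config/experiment mismatch".
nt_counts = passed_gene_df.query(f"{gene_cols[0]} in @control_barcodes").Cell_Count_Per_Gene.values

if len(nt_counts) == 0:
    observed = list(passed_gene_df[gene_cols[0]].unique()[:10])
    raise ValueError(
        f"No non-targeting controls found at site {site}. Either this site truly "
        f"contains no control barcodes, or the control_barcodes entries in the config "
        f"({control_barcodes}) do not match the experiment. Example gene labels "
        f"observed at this site: {observed}"
    )
num_nt = nt_counts[0]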
@ErinWeisbart - I am trying to rerun this step in the recent pooled dataset. It was working smoothly until line 471. I paste the error statement at the end of this issue (file paths intentionally obscured).
If you look at the "blame", line 471 is my doing. However, in #72 you modified how cp_sat_df is constructed, which likely changed how it should be processed downstream. ("blame" is a bad technical term... but it is at least descriptive!)
Do you know what's going on? Maybe this is an easy fix.
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_cells_count_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_ratios_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_Cells_FinalThreshold_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_Nuclei_FinalThreshold_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/figures/plate_layout_PercentConfluent_per_well.png exists, overwriting
XXX/recipe/scripts/io_utils.py:9: UserWarning: data/0.site-qc/XXX/results/sites_with_confluent_regions.csv exists, overwriting
/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/plotnine/layer.py:401: PlotnineWarning: geom_text : Removed 720 rows containing missing values.
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'level_3'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "recipe/0.preprocess-sites/4.image-and-segmentation-qc.py", line 471, in <module>
cp_sat_df[["cat", "type", "Ch"]] = cp_sat_df["level_3"].str.split(
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
indexer = self.columns.get_loc(key)
File "/home/ubuntu/miniconda3/envs/pooled-cell-painting/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 'level_3'
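Not a confirmed fix, but a sketch of a diagnostic that would at least fail informatively: the "level_3" name comes from reset_index() on a multi-indexed frame, so it shifts whenever the groupby/stack that builds cp_sat_df changes depth (as it may have in #72).

# Sketch: make the failure explicit instead of a bare KeyError, then split as before.
if "level_3" not in cp_sat_df.columns:
    raise KeyError(
        "Expected a 'level_3' column holding the saturation measurement names after "
        f"reset_index(); got columns {list(cp_sat_df.columns)}. "
        "The construction of cp_sat_df likely changed in #72."
    )
# ...then the original split of cp_sat_df["level_3"] into ["cat", "type", "Ch"] as before.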
In #23 (specifically #23 (comment)) I suggested a column name change within the 0.preprocess recipe module. This was a bad idea!
It was a bad idea b/c it breaks 0.merge-single-cells.py at: https://github.com/broadinstitute/pooled-cell-painting-profiling-recipe/blob/1be45bb33e1da71050dbb102352690cf8a08fb7c/1.generate-profiles/0.merge-single-cells.py#L143L146
We need to revert this change (and address some corresponding breaks after updating) before we can produce profiles.
@gwaygenomics I'm happy to address this.
I remember you saying you were going to do some re-factoring to address handling our current experiment structure where a single batch can contain many plates. Will that affect the structure of inputs for this step? (i.e. image_metadata.tsv and cell_count.tsv)? (They already have Metadata_Plate columns that include both plates in the current batch we're working on, so I would assume not...?)
Now that Greg has left the Imaging Platform and started his own lab, we need to figure out how to handle contributions/changes/updates (i.e. PRs) to the Recipe (and Template).
I'm taking over internal ownership of the repos, but we need to codify how PRs are handled now. To that end, @gwaygenomics we need to know how much you would still like to be involved in these repos. I see three options, though feel free to add nuance or suggest a different breakdown: