broadinstitute / position-effect-correction

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 3.71% HTML 96.29% Python 0.01%

position-effect-correction's Introduction

position-effect-correction

Setup

Install Python
Install Poetry
Install Poetry Environment: poetry install --no-root
Activate Poetry Environment: poetry shell
Install Pycytominer: poetry run pip install <path_to_pycytominer>, where <path_to_pycytominer> is a GitHub link or a locally cloned repository.

For Linux, see

python-poetry/poetry#1917 (comment) if installing six fails
https://stackoverflow.com/a/75435100 if you get "does not contain any element" warning when running poetry install

To update a poetry environment, run poetry update and poetry install --no-root again.

position-effect-correction's People

Contributors

Watchers

Forkers

alxndrkalinin zitong-chen-16

position-effect-correction's Issues

Regressing out cell count

1. Motivation

Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that (1) cell count varies in patterns across well positions, plates, and batches, and (2) this variation is related to the ability to retrieve ORF replicates: ORFs with high cell count variability tend to have lower mAP values.

2. Approach

To address this, we explored regressing out cell count from the other features and measuring the effect of this correction on retrievability metrics. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model to predict that feature from cell count and replace the actual feature values with the residuals from this model.
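A minimal sketch of the regression step, assuming an ordinary least-squares fit of each feature on cell count with an intercept. The function name `regress_out_cell_count` and the synthetic data are hypothetical and are not taken from the repository.

```python
import numpy as np


def regress_out_cell_count(features, cell_count):
    """Replace every feature column with the residuals of a linear fit on cell count."""
    features = np.asarray(features, dtype=float)
    cell_count = np.asarray(cell_count, dtype=float)
    # Design matrix: intercept plus cell count.
    design = np.column_stack([np.ones_like(cell_count), cell_count])
    # One least-squares solve fits an independent intercept and slope per feature column.
    coef, *_ = np.linalg.lstsq(design, features, rcond=None)
    return features - design @ coef


# Synthetic example (assumed data): the first feature depends on cell count, the second does not.
rng = np.random.default_rng(0)
cell_count = rng.integers(500, 3000, size=200)
features = np.column_stack([
    0.01 * cell_count + rng.normal(size=200),
    rng.normal(size=200),
])
residuals = regress_out_cell_count(features, cell_count)

corr_before = np.corrcoef(features[:, 0], cell_count)[0, 1]
corr_after = np.corrcoef(residuals[:, 0], cell_count)[0, 1]
print(f"correlation with cell count: before={corr_before:.3f}, after={corr_after:.1e}")
```

By construction, the residuals are linearly uncorrelated with cell count, so any remaining relationship would have to be nonlinear.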

2.1 Constant and low count features

Because plate effect correction is the first step in the preprocessing pipeline, all features are still present in the dataset, including those that have constant values across all samples (e.g. min/max intensity values can be 0/65535). When fitting a linear model to these features, the resulting residuals are not exactly zero due to rounding. Instead, they are small nonzero numbers, which can correlate well with cell count, producing the opposite of the desired effect.

Effects of regressing out cell count on a constant feature

Before	After

We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with fewer than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it would help to skip regressing out cell count from features that are not highly correlated with cell count in the first place.
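The unique-value filter described above could be sketched as follows. The helper name `features_to_adjust` and the example columns are hypothetical; only the threshold of 100 unique values comes from the text.

```python
import numpy as np


def features_to_adjust(features, min_unique=100):
    """Return a boolean mask of columns with more than `min_unique` distinct values."""
    features = np.asarray(features)
    n_unique = np.array(
        [np.unique(features[:, j]).size for j in range(features.shape[1])]
    )
    return n_unique > min_unique, n_unique


# Synthetic example (assumed data): a continuous, a constant, and a low-count feature.
rng = np.random.default_rng(1)
n_samples = 500
features = np.column_stack([
    rng.normal(size=n_samples),          # continuous measurement
    np.full(n_samples, 65535.0),         # constant, e.g. a saturated max intensity
    rng.integers(0, 5, size=n_samples),  # only a handful of distinct values
])
mask, n_unique = features_to_adjust(features)
print(dict(zip(["continuous", "constant", "low_count"], mask.tolist())))
```

Only the columns where the mask is true would then be passed to the regression; the others are left unchanged.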

Visualizing # of unique feature values vs cell count

2.2 Adding cell count back as a feature

After regressing out cell count, we can add cell count back as a separate feature. However, we found that it is later filtered out at the feature_select step of the pipeline. The reason is that, as an integer count feature, cell count has a unique values / sample size ratio of ~0.06 (see visualization below), which is below the cutoff value unique_cut=0.1 that is used as one of the criteria to filter out low variance features in pycytominer. It turns out that earlier versions of pycytominer had a more relaxed cutoff value of 0.01, which was later replaced by 0.1, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count from being removed by this criterion, we use feature_select with unique_cut=0.01, as per the original pycytominer default value. This results in a different number of features being selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.
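To make the cutoff arithmetic concrete, here is a small sketch of the unique values / sample size ratio. It assumes a feature is dropped when its ratio falls below the cutoff; the helper `unique_ratio` and the synthetic counts are hypothetical and do not call pycytominer.

```python
import numpy as np


def unique_ratio(values):
    """Number of distinct values divided by the number of samples."""
    values = np.asarray(values)
    return np.unique(values).size / values.size


# Synthetic integer counts (assumed data): 10,000 samples taking 600 distinct values.
cell_count = np.arange(10_000) % 600 + 1000
ratio = unique_ratio(cell_count)

# Assumed rule: a feature is kept only if its ratio is at least the cutoff.
kept_with_cut_0_1 = ratio >= 0.1
kept_with_cut_0_01 = ratio >= 0.01
print(ratio, kept_with_cut_0_1, kept_with_cut_0_01)
```

With a ratio near 0.06, an integer count feature is dropped under a cutoff of 0.1 but kept under 0.01, which matches the behavior described above.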

Cell count unique values / sample size ratio

3. Results

3.1 Same well, different ORF

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, diff ORF	raw->subset	0.0636	0.139 (51/368)
same well, diff ORF	raw->subset->cc adjust	0.0583	0.0217 (8/368)
same well, diff ORF	raw->subset->well correct	0.114	0.25 (92/368)
same well, diff ORF	raw->subset->cc adjust->well correct	0.379	0.723 (266/368)

Same well, different ORF plots

3.2 Same ORF, different well

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, diff well	raw->subset	0.00974	0.0 (0/37)
same ORF, diff well	raw->subset->cc adjust	0.0202	0.027 (1/37)
same ORF, diff well	raw->subset->well correct	0.0166	0.027 (1/37)
same ORF, diff well	raw->subset->cc adjust->well correct	0.00834	0.0 (0/37)

Same ORF, different well plots

3.3 Same ORF, same well

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, same well	raw->subset	0.195	0.903 (3297/3653)
same ORF, same well	raw->subset->cc adjust	0.0856	0.417 (1524/3653)
same ORF, same well	raw->subset->well correct	0.286	0.93 (3397/3653)
same ORF, same well	raw->subset->cc adjust->well correct	0.538	0.989 (3612/3653)

Same ORF, same well plots

Observations:

cc adjustment corrects for plate effects better than well mean correction
a combination of both actually makes things WORSE (perhaps, due to an overcorrection)

Per-batch plate effects visualizations

I gathered plate effect visualizations from jump-orf-data QC into per-batch composite images that can be viewed below.

Missing panels in percent replicating plots designate control plates, for which this metric was not calculated.

2021_04_26_Batch1

Cell count

Percent replicating

2021_05_10_Batch3

Cell count

Percent replicating

2021_05_17_Batch4

Cell count

Percent replicating

2021_05_31_Batch2

Cell count

Percent replicating

2021_06_07_Batch5

Cell count

Percent replicating

2021_06_14_Batch6

Cell count

Percent replicating

2021_06_21_Batch7

Cell count

Percent replicating

2021_07_12_Batch8

Cell count

Percent replicating

2021_07_26_Batch9

Cell count

Percent replicating

2021_08_02_Batch10

Cell count

Percent replicating

2021_08_09_Batch11

Cell count

Percent replicating

2021_08_23_Batch12

Cell count

Percent replicating

2021_08_30_Batch13

Cell count

Percent replicating

Document project plan

Benchmark

same well-position, different perturbation mAP - this should go down
same perturbation, different well-position mAP - this should go up
same perturbation, same well-position mAP - this will go down, but we don't want it to go down by too much

Exclude controls in all 3 cases
Use notebooks in evalzoo to produce metrics
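The three benchmark settings can be illustrated with a simplified mAP sketch. This is not the evalzoo implementation: it assumes cosine similarity, treats every non-matching profile as a negative, and uses hypothetical function names, well labels, and toy profiles.

```python
import numpy as np


def average_precision(similarities, is_match):
    """AP for one query: rank candidates by similarity, average the precision at each match."""
    order = np.argsort(-np.asarray(similarities, dtype=float))
    hits = np.asarray(is_match, dtype=bool)[order]
    if not hits.any():
        return float("nan")  # this query has no matching profile
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())


def mean_average_precision(profiles, match_matrix):
    """Mean AP over all queries that have at least one match (self-pairs excluded)."""
    profiles = np.asarray(profiles, dtype=float)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    similarity = unit @ unit.T  # cosine similarity between all profile pairs
    aps = []
    for i in range(len(profiles)):
        others = np.arange(len(profiles)) != i
        ap = average_precision(similarity[i, others], match_matrix[i, others])
        if not np.isnan(ap):
            aps.append(ap)
    return float(np.mean(aps)) if aps else float("nan")


# Toy data (assumed): profiles are dominated by well position rather than by ORF.
wells = np.array(["A01", "A01", "B02", "B02"])
orfs = np.array(["x", "y", "x", "y"])
profiles = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

same_well = wells[:, None] == wells[None, :]
same_orf = orfs[:, None] == orfs[None, :]

map_same_well_diff_orf = mean_average_precision(profiles, same_well & ~same_orf)
map_same_orf_diff_well = mean_average_precision(profiles, same_orf & ~same_well)
print(map_same_well_diff_orf, map_same_orf_diff_well)
```

In this toy case the same well, different ORF score is high and the same ORF, different well score is low, which is the pattern a position effect correction should reverse.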

Dummy method

Add noise and compare with baseline

Simple method

Mean subtract per-feature per-well position

Complex method

Estimate the 16x24 function per feature using a CNN

Baseline correction (subtracting well mean)

To establish a simple baseline for plate layout correction, we calculated and subtracted the mean from profiles that come from the same well. To speed up mAP calculation, we selected a subset of plates that have different plate layouts, but share some of the ORFs. The assessments that we performed included:

raw: taking a subset of raw data, and calculating mAPs (no correction)
subset->correct: taking a subset of raw data, correcting this subset, and calculating mAPs
correct->subset: correcting all of raw data, taking a subset, and calculating mAPs
The metric calculation step also included preprocessing, i.e. RobustMAD and feature selection.
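A minimal sketch of the per-well mean subtraction, assuming profiles are stored in a pandas DataFrame with a well-position column. The column names (`Metadata_Well`, `feat_1`, `feat_2`) and the function name are hypothetical.

```python
import pandas as pd


def subtract_well_mean(profiles, feature_cols, well_col="Metadata_Well"):
    """Subtract, per feature, the mean over all profiles sharing a well position."""
    corrected = profiles.copy()
    # transform("mean") returns the per-well mean aligned to the original rows.
    well_means = profiles.groupby(well_col)[feature_cols].transform("mean")
    corrected[feature_cols] = profiles[feature_cols] - well_means
    return corrected


# Toy profiles (assumed data): two wells observed on two plates.
profiles = pd.DataFrame({
    "Metadata_Well": ["A01", "A01", "B02", "B02"],
    "Metadata_Plate": ["p1", "p2", "p1", "p2"],
    "feat_1": [1.0, 3.0, 10.0, 14.0],
    "feat_2": [0.5, 0.5, 2.0, 4.0],
})
corrected = subtract_well_mean(profiles, ["feat_1", "feat_2"])
print(corrected)
```

In these terms, subset->correct and correct->subset differ only in which rows contribute to the well means: the selected subset in the first case, the full dataset in the second.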

mAPs were calculated in 3 settings (as per #2):

profiles that come from the same well positions, but have different ORFs
profiles that come from different well positions, but have the same ORFs
profiles that come from the same well positions and have the same ORFs

The results are shown below.

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, different ORF	raw	0.0634	0.139 (51/368)
same well, different ORF	subset->correct	0.0594	0.0353 (13/368)
same well, different ORF	correct->subset	0.0978	0.166 (61/368)

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same ORF, different well	raw	0.0096	0.0 (0/37)
same ORF, different well	subset->correct	0.0135	0.0 (0/37)
same ORF, different well	correct->subset	0.0102	0.027 (1/37)

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, same ORF	raw	0.197	0.902 (3295/3653)
same well, same ORF	subset->correct	0.165	0.775 (2831/3653)
same well, same ORF	correct->subset	0.243	0.684 (2499/3653)

Related work: Adjustment of plate positional effects

https://www.nature.com/articles/s42003-022-04343-3

See this figure in particular, but the paper in general is worth reading.

Classification-based analysis of plate position effect

Quick notes, will update in a day or two

I tried a bunch of things but only two were notable

classification of eGFP (so-called poscon) vs. ORF negcons is very high – this could be useful in debugging
well position prediction is very high, now need to figure out if we can narrow it down to features

Ongoing work in #20

Cell count variation

1. Exploring cell count variability

Due to the strong presence of per-batch/plate patterns in cell count (CC) visualizations (#7), we wanted to check whether cell count variability is related to position effect retrievability metrics (#9). To do so, we added a Metadata_Count_Cells column to the metadata (from jump-cellpainting/morphmap@dbbd1c3) and calculated its coefficient of variation (CoV). We then empirically chose a cutoff value of CoV=0.12 to split ORFs into low and high CC variability.
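The CoV split could be computed along these lines. This sketch assumes the CoV is taken per ORF over its replicate wells with the sample standard deviation; the grouping column `Metadata_ORF` and the toy counts are hypothetical, while `Metadata_Count_Cells` and the 0.12 cutoff come from the text.

```python
import pandas as pd


def cell_count_cov(metadata, group_col="Metadata_ORF", count_col="Metadata_Count_Cells"):
    """Coefficient of variation (std / mean) of cell count within each group."""
    grouped = metadata.groupby(group_col)[count_col]
    return grouped.std() / grouped.mean()  # pandas std uses ddof=1 by default


# Toy metadata (assumed data): one stable ORF and one with highly variable cell counts.
metadata = pd.DataFrame({
    "Metadata_ORF": ["a", "a", "a", "b", "b", "b"],
    "Metadata_Count_Cells": [1000, 1100, 900, 500, 1000, 1500],
})
cov = cell_count_cov(metadata)

cutoff = 0.12
low_variability = sorted(cov[cov <= cutoff].index)
high_variability = sorted(cov[cov > cutoff].index)
print(cov.to_dict(), low_variability, high_variability)
```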

Cell counts	Cell count CoV	Cell counts split @ CoV=0.12

2. Subsetting `same ORF, same well` mAP based on low vs high cell count variability

Based on low/high variability, we can select ORFs from the subset that we used to calculate mAP for raw and baseline-corrected data (see #9). Because the same ORF, different well and same well, different ORF settings have few samples, it makes sense to look at same ORF, same well. Columns are the same as in #9:

"subset": a subset of raw uncorrected data
"subset->correct": a subset of raw profiles that were then corrected by subtracting per-well mean on this subset
"correct->subset": a subset of corrected profiles, which were corrected by subtracting per-well mean on full data

2.1 Low cell count variability ORFs

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, same ORF	raw	0.231	0.969 (2519/2600)
same well, same ORF	subset->correct	0.177	0.816 (2121/2600)
same well, same ORF	correct->subset	0.242	0.7 (1821/2600)

Low CC CoV visualization

2.2 High cell count variability ORFs

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, same ORF	raw	0.113	0.737 (776/1053)
same well, same ORF	subset->correct	0.134	0.674 (710/1053)
same well, same ORF	correct->subset	0.245	0.644 (678/1053)

High CC CoV visualization

Metrics on uncorrected data differ substantially between low and high variability subsets. Per-well mean subtraction reduces this difference.

2.3 All ORFs

For reference, the results from #9:

Setting	Data	mmAP	Fraction retrieved (p<0.05)
same well, same ORF	raw	0.197	0.902 (3295/3653)
same well, same ORF	subset->correct	0.165	0.775 (2831/3653)
same well, same ORF	correct->subset	0.243	0.684 (2499/3653)

3. Visualizing distributional relationships between mAP and cell count variability (all ORFs)

Instead of splitting ORFs into low/high variability, we can also plot all of their mAPs vs CoVs. There are very few ORFs that have both high cell count variability and high mAP values or significance.

mAPs vs CoVs color-coded by p-values

CoVs vs p-values color-coded by mAP values
