position-effect-correction's Introduction

position-effect-correction

Setup

  1. Install Python
  2. Install Poetry
  3. Install Poetry Environment: poetry install --no-root
  4. Activate Poetry Environment: poetry shell
  5. Install Pycytominer: poetry run pip install <path_to_pycytominer>, where <path_to_pycytominer> is a GitHub link or a locally cloned repository.

For Linux, see

To update a poetry environment, run poetry update and poetry install --no-root again.

position-effect-correction's People

Contributors

alxndrkalinin, shntnu

Watchers

James Cloos, Anne Carpenter, Henry Ferrara, Beth Cimini, Niranj Chandrasekaran

position-effect-correction's Issues

Regressing out cell count

1. Motivation

Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that: (1) there are patterns in cell count variation across well positions / plates / batches, and (2) this variation is related to the ability to retrieve ORF replicates, i.e. ORFs with high cell count variability tend to have lower mAP values.

2. Approach

To address this, we explored regressing cell count out of the other features and measuring the effect of this correction on retrievability metrics. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model of that feature as a function of cell count and replace the actual feature values with the residuals from this model.
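A minimal sketch of this per-feature residualization, assuming the profiles are in a pandas DataFrame with a Metadata_Count_Cells column; the function name and feature-column handling are illustrative, not the project's actual code:

```python
import numpy as np
import pandas as pd

def regress_out_cell_count(df: pd.DataFrame, feature_cols: list[str],
                           cc_col: str = "Metadata_Count_Cells") -> pd.DataFrame:
    """Replace each feature with the residuals of a linear fit feature ~ cell count."""
    adjusted = df.copy()
    cc = df[cc_col].to_numpy(dtype=float)
    X = np.column_stack([np.ones_like(cc), cc])  # intercept + cell count
    for col in feature_cols:
        y = df[col].to_numpy(dtype=float)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares fit
        adjusted[col] = y - X @ beta                  # keep the residuals
    return adjusted
```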

2.1 Constant and low count features

Because plate effect correction is the first step in the preprocessing pipeline, all features are present in the dataset, including those that have constant values across all samples (e.g., a min/max intensity value can be 0/65535). When fitting a linear model for these features, the resulting residuals are not exactly zero due to rounding. Instead, they are small numbers that can correlate well with cell count, producing an effect opposite to the desired one.

Effects of regressing out cell count on a constant feature

[figures: before / after]

We visualized the number of unique values per feature vs. correlation with cell count to confirm that no features with fewer than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it would help to skip regressing out cell count from features that are not highly correlated with cell count in the first place.

Visualizing # of unique feature values vs. cell count

[figure: unique_vs_cc]
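A minimal sketch of applying this unique-value filter before residualization, reusing the illustrative helper from the sketch above (the 100-value cutoff is the one described in the text):

```python
# Only adjust features with enough distinct values; constant or near-constant
# features are left untouched to avoid the rounding artifact described above.
features_to_adjust = [c for c in feature_cols if df[c].nunique() > 100]
adjusted = regress_out_cell_count(df, features_to_adjust)
```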

2.2 Adding cell count back as a feature

After regressing out cell count, we can add cell count back as a separate feature. However, we found that it is later filtered out at the feature_select step of the pipeline. The reason is that, as an integer count feature, cell count has a unique values / sample size ratio of ~0.06 (see the visualization below), which is below the cutoff unique_cut=0.1 used as one of the criteria to filter out low-variance features in pycytominer. It turns out that earlier versions of pycytominer had a more relaxed cutoff value of 0.01, which was later replaced by 0.1, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count from being removed by this criterion, we run feature selection with unique_cut=0.01, as per the original pycytominer default. This results in a different number of features being selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.

Cell count unique values / sample size ratio

[figure: unique_size_ratio]
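For reference, a minimal sketch of relaxing the cutoff when running pycytominer feature selection; this assumes feature_select exposes the unique_cut parameter used by its variance_threshold operation (check the installed version's signature), and the input file name is hypothetical:

```python
import pandas as pd
from pycytominer import feature_select

profiles = pd.read_parquet("profiles_cc_adjusted.parquet")  # hypothetical input file

selected = feature_select(
    profiles,
    features="infer",
    operation=["variance_threshold", "correlation_threshold", "drop_na_columns"],
    unique_cut=0.01,  # relaxed cutoff so the integer cell count feature survives
)
```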

3. Results

3.1 Same well, different ORF

| Setting | Data | mmAP | Fraction retrieved (p<0.05) |
| --- | --- | --- | --- |
| same well, diff ORF | raw->subset | 0.0636 | 0.139 (51/368) |
| same well, diff ORF | raw->subset->cc adjust | 0.0583 | 0.0217 (8/368) |
| same well, diff ORF | raw->subset->well correct | 0.114 | 0.25 (92/368) |
| same well, diff ORF | raw->subset->cc adjust->well correct | 0.379 | 0.723 (266/368) |

Same well, different ORF plots

[figure: same_well_diff_orf]

3.2 Same ORF, different well

| Setting | Data | mmAP | Fraction retrieved (p<0.05) |
| --- | --- | --- | --- |
| same ORF, diff well | raw->subset | 0.00974 | 0.0 (0/37) |
| same ORF, diff well | raw->subset->cc adjust | 0.0202 | 0.027 (1/37) |
| same ORF, diff well | raw->subset->well correct | 0.0166 | 0.027 (1/37) |
| same ORF, diff well | raw->subset->cc adjust->well correct | 0.00834 | 0.0 (0/37) |

Same ORF, different well plots

[figure: same_orf_diff_well]

3.3 Same ORF, same well

| Setting | Data | mmAP | Fraction retrieved (p<0.05) |
| --- | --- | --- | --- |
| same ORF, same well | raw->subset | 0.195 | 0.903 (3297/3653) |
| same ORF, same well | raw->subset->cc adjust | 0.0856 | 0.417 (1524/3653) |
| same ORF, same well | raw->subset->well correct | 0.286 | 0.93 (3397/3653) |
| same ORF, same well | raw->subset->cc adjust->well correct | 0.538 | 0.989 (3612/3653) |

Same ORF, same well plots

[figure: same_orf_diff_well]

Observations:

  • cc adjustment corrects for plate effects better than well mean correction
  • a combination of both actually makes things WORSE (perhaps due to overcorrection)

Per-batch plate effects visualizations

I gathered plate effect visualizations from jump-orf-data QC into per-batch composite images that can be viewed below.

Missing panels in percent replicating plots designate control plates, for which this metric was not calculated.

  • 2021_04_26_Batch1: [figures: Cell count, Percent replicating]
  • 2021_05_10_Batch3: [figures: Cell count, Percent replicating]
  • 2021_05_17_Batch4: [figures: Cell count, Percent replicating]
  • 2021_05_31_Batch2: [figures: Cell count, Percent replicating]
  • 2021_06_07_Batch5: [figures: Cell count, Percent replicating]
  • 2021_06_14_Batch6: [figures: Cell count, Percent replicating]
  • 2021_06_21_Batch7: [figures: Cell count, Percent replicating]
  • 2021_07_12_Batch8: [figures: Cell count, Percent replicating]
  • 2021_07_26_Batch9: [figures: Cell count, Percent replicating]
  • 2021_08_02_Batch10: [figures: Cell count, Percent replicating]
  • 2021_08_09_Batch11: [figures: Cell count, Percent replicating]
  • 2021_08_23_Batch12: [figures: Cell count, Percent replicating]
  • 2021_08_30_Batch13: [figures: Cell count, Percent replicating]

Document project plan

Benchmark

  1. same well-position, different perturbation mAP - this should go down
  2. same perturbation, different well-position mAP - this should go up
  3. same perturbation, same well-position mAP - this will go down but we don't want it to go down by too much
  • Exclude controls in all 3 cases
  • Use notebooks in evalzoo to produce metrics

Dummy method

Add noise and compare with baseline

Simple method

Mean subtract per-feature per-well position

Complex method

Estimate the 16x24 function per feature using a CNN

Baseline correction (subtracting well mean)

To establish a simple baseline for plate layout correction, we calculated and subtracted the mean of profiles that come from the same well position. To speed up mAP calculation, we selected a subset of plates that have different plate layouts but share some of the ORFs. The assessments that we performed included:

  • raw: taking a subset of raw data and calculating mAPs (no correction)
  • subset->correct: taking a subset of raw data, correcting this subset, and calculating mAPs
  • correct->subset: correcting all of the raw data, taking a subset, and calculating mAPs

The metric calculation step also included preprocessing, i.e. RobustMAD and feature selection.
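A minimal sketch of this per-well-position mean subtraction, assuming a pandas DataFrame with a Metadata_Well column and morphological feature columns (function and column names are illustrative):

```python
import pandas as pd

def subtract_well_mean(df: pd.DataFrame, feature_cols: list[str],
                       well_col: str = "Metadata_Well") -> pd.DataFrame:
    """Center each feature on the mean of all profiles sharing a well position."""
    corrected = df.copy()
    well_means = df.groupby(well_col)[feature_cols].transform("mean")
    corrected[feature_cols] = df[feature_cols] - well_means
    return corrected
```

For the correct->subset variant, the well means would be computed on the full data before subsetting; for subset->correct, on the subset itself.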

mAPs were calculated in 3 settings (as per #2):

  • profiles that come from the same well positions but have different ORFs
  • profiles that come from different well positions but have the same ORFs
  • profiles that come from the same well positions and have the same ORFs

The results are shown below.

| Setting | Data | mmAP | Percent retrieved (p<0.05) |
| --- | --- | --- | --- |
| same well, different ORF | raw | 0.0634 | 0.139 (51/368) |
| same well, different ORF | subset->correct | 0.0594 | 0.0353 (13/368) |
| same well, different ORF | correct->subset | 0.0978 | 0.166 (61/368) |

[figure: same_well_diff_pert]

| Setting | Data | mmAP | Percent retrieved (p<0.05) |
| --- | --- | --- | --- |
| same ORF, different well | raw | 0.0096 | 0.0 (0/37) |
| same ORF, different well | subset->correct | 0.0135 | 0.0 (0/37) |
| same ORF, different well | correct->subset | 0.0102 | 0.027 (1/37) |

[figure: same_well_diff_pert]

| Setting | Data | mmAP | Percent retrieved (p<0.05) |
| --- | --- | --- | --- |
| same well, same ORF | raw | 0.197 | 0.902 (3295/3653) |
| same well, same ORF | subset->correct | 0.165 | 0.775 (2831/3653) |
| same well, same ORF | correct->subset | 0.243 | 0.684 (2499/3653) |

[figure: same_well_diff_pert]

Classification-based analysis of plate position effect

Quick notes, will update in a day or two

I tried a bunch of things, but only two were notable:

  • classification accuracy of eGFP (so-called poscon) vs. ORF negcons is very high; this could be useful in debugging
  • well position prediction accuracy is also very high; we now need to figure out whether we can narrow it down to specific features (a sketch of such a probe is below)
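A minimal sketch of a well-position probe, assuming profiles in a pandas DataFrame with a Metadata_Well column and morphological feature columns; the logistic-regression choice and names are illustrative, not the exact model used here:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def well_position_predictability(df: pd.DataFrame, feature_cols: list[str],
                                 well_col: str = "Metadata_Well") -> float:
    """Cross-validated accuracy of predicting well position from profile features."""
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, df[feature_cols], df[well_col], cv=5)
    return scores.mean()
```

Inspecting the fitted coefficients, or refitting on feature subsets, would be one way to narrow the positional signal down to individual features.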

Ongoing work in #20

Cell count variation

1. Exploring cell count variability

Due to the strong per-batch/plate patterns in the cell count (CC) visualizations (#7), we wanted to check whether cell count variability is related to the position effect retrievability metrics (#9). To do so, we added a Metadata_Count_Cells column to the metadata (from jump-cellpainting/morphmap@dbbd1c3) and calculated its coefficient of variation (CoV) for each ORF. We then empirically chose a cutoff of CoV=0.12 to split ORFs into low and high CC variability groups.

[figures: Cell counts (cell_counts), Cell count CoV (cell_counts_cov), Cell counts split @ CoV=0.12 (cell_counts_cov_thresh)]
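A minimal sketch of the CoV computation and split, assuming a per-well profile table with an ORF identifier column (Metadata_Symbol here is an assumption) and a Metadata_Count_Cells column:

```python
import pandas as pd

COV_CUTOFF = 0.12  # empirically chosen cutoff from the plots above

def split_by_cell_count_variability(df: pd.DataFrame,
                                    orf_col: str = "Metadata_Symbol",
                                    cc_col: str = "Metadata_Count_Cells") -> pd.Series:
    """Label each ORF as 'low' or 'high' cell count variability by its CoV."""
    grouped = df.groupby(orf_col)[cc_col]
    cov = grouped.std() / grouped.mean()  # coefficient of variation per ORF
    return cov.map(lambda v: "high" if v > COV_CUTOFF else "low")
```

The resulting labels can then be joined back onto the profile table to form the low/high subsets used below.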

2. Subsetting same ORF, same well mAP based on low vs high cell count variability

Based on low/high variability, we can select ORFs from the subset that we used to calculate mAP for raw and baseline-corrected data (see #9). Due to the low number of samples in the same ORF, different well and same well, different ORF settings, it makes sense to focus on same ORF, same well. Columns are the same as in #9:

  • "subset": a subset of raw uncorrected data
  • "subset->correct": a subset of raw profiles that were then corrected by subtracting per-well mean on this subset
  • "correct->subset": a subset of corrected profiles, which were corrected by subtracting per-well mean on full data

2.1 Low cell count variability ORFs

| Setting | Data | mmAP | Percent retrieved (p<0.05) |
| --- | --- | --- | --- |
| same well, same ORF | raw | 0.231 | 0.969 (2519/2600) |
| same well, same ORF | subset->correct | 0.177 | 0.816 (2121/2600) |
| same well, same ORF | correct->subset | 0.242 | 0.7 (1821/2600) |

Low CC CoV visualization

[figure: low_var_same_same]

2.2 High cell count variability ORFs

| Setting | Data | mmAP | Percent retrieved (p<0.05) |
| --- | --- | --- | --- |
| same well, same ORF | raw | 0.113 | 0.737 (776/1053) |
| same well, same ORF | subset->correct | 0.134 | 0.674 (710/1053) |
| same well, same ORF | correct->subset | 0.245 | 0.644 (678/1053) |

High CC CoV visualization

[figure: high_var_same_same]

Metrics on uncorrected data differ substantially between low and high variability subsets. Per-well mean subtraction reduces this difference.

2.3 All ORFs

For reference, the results from #9:

| Setting | Data | mmAP | Percent retrieved (p<0.05) |
| --- | --- | --- | --- |
| same well, same ORF | raw | 0.197 | 0.902 (3295/3653) |
| same well, same ORF | subset->correct | 0.165 | 0.775 (2831/3653) |
| same well, same ORF | correct->subset | 0.243 | 0.684 (2499/3653) |

[figure: same_well_diff_pert]

3. Visualizing distributional relationships between mAP and cell count variability (all ORFs)

Instead of splitting ORFs into low/high variability groups, we can also plot mAPs vs. CoVs for all ORFs. Very few ORFs have both high cell count variability and high mAP values/significance.

mAPs vs. CoVs color-coded by p-values

[figure: map_vs_ccv]

CoVs vs. p-values color-coded by mAP values

[figure: pval_vs_ccv]
