1. Motivation
Exploratory visual QC (#7) and retrievability metrics (#12) analyses showed that (1) cell count varies in patterns across well positions, plates, and batches, and (2) this variation is related to the ability to retrieve ORF replicates: ORFs with high cell count variability tend to have lower mAP values.
2. Approach
To address this, we explored regressing out cell count from the other features and evaluating the effect of this correction on retrievability metrics. As the first step, we added cell count as a feature by aggregating all of the metadata early in the preprocessing pipeline (d91cbd5). Then, for each feature, we fit a linear model that predicts the feature from cell count and replace the actual feature values with the residuals from this model.
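A minimal, dependency-free sketch of this step, reading "regressing out cell count" as an ordinary least-squares fit of each feature on cell count (with an intercept) and keeping the residuals. The function name `regress_out` and the toy data are illustrative assumptions, not code from the pipeline, which would more likely use vectorized NumPy/pandas operations.

```python
def regress_out(feature, cell_count):
    """Residuals of the least-squares fit: feature ~ intercept + slope * cell_count."""
    n = len(feature)
    mean_x = sum(cell_count) / n
    mean_y = sum(feature) / n
    var_x = sum((x - mean_x) ** 2 for x in cell_count)
    if var_x == 0:
        # Cell count is constant: there is no trend to remove, only the mean.
        return [y - mean_y for y in feature]
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cell_count, feature))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(cell_count, feature)]


# Toy example (made-up numbers): a feature that drifts with cell count.
cell_count = [100, 150, 200, 250, 300]
noise = [0.3, -0.2, 0.1, -0.4, 0.2]
feature = [2.0 + 0.01 * c + e for c, e in zip(cell_count, noise)]

residuals = regress_out(feature, cell_count)
```

By construction, the residuals have zero mean and zero linear correlation with cell count, which is the property the correction relies on.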
2.1 Constant and low count features
Because plate effect correction is the first step in the preprocessing pipeline, all features are still present in the dataset, including those that have constant values across all samples (e.g. min/max intensity value can be 0/65535). When a linear model is fit to such a feature, the resulting residuals are not exactly zero because of floating-point rounding. Instead, they are small nonzero numbers that can correlate well with cell count, producing the opposite of the desired effect.
Effects of regressing out cell count on a constant feature (figure panels: Before, After)
We visualized the number of unique values per feature vs correlation to cell count to confirm that no features with fewer than a few hundred unique values have high correlations with cell count. Based on this result, we only regress out cell count from features that have more than 100 unique values. One idea we did not explore is whether it would help to skip the regression for features that are not highly correlated with cell count in the first place.
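The unique-value filter can be sketched as follows. This is an illustration under assumptions: features are held in a plain dict of lists, the helper names (`residualize`, `adjust_features`) and feature names are hypothetical, and only the threshold of 100 unique values comes from the text above.

```python
MIN_UNIQUE = 100  # threshold stated in the text


def residualize(feature, cell_count):
    """Residuals of the least-squares fit: feature ~ intercept + slope * cell_count."""
    n = len(feature)
    mean_x = sum(cell_count) / n
    mean_y = sum(feature) / n
    var_x = sum((x - mean_x) ** 2 for x in cell_count)
    if var_x == 0:
        return [y - mean_y for y in feature]
    slope = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(cell_count, feature)) / var_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(cell_count, feature)]


def adjust_features(features, cell_count, min_unique=MIN_UNIQUE):
    """Regress out cell count only from features with more than min_unique unique values."""
    adjusted = {}
    for name, values in features.items():
        if len(set(values)) > min_unique:
            adjusted[name] = residualize(values, cell_count)
        else:
            # Constant or low-count feature: leave untouched to avoid
            # turning rounding noise into a cell-count-correlated signal.
            adjusted[name] = list(values)
    return adjusted


# Toy data (made up): 200 samples, one constant and one continuous feature.
cell_count = list(range(50, 250))
features = {
    "Intensity_Max": [65535.0] * 200,                               # 1 unique value
    "Area": [3.0 * c + (i % 7) for i, c in enumerate(cell_count)],  # 200 unique values
}
adjusted = adjust_features(features, cell_count)
```

Here `Intensity_Max` passes through unchanged, while `Area` is replaced by its residuals.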
Visualizing # of unique feature values vs cell count
2.2 Adding cell count back as a feature
After regressing out cell count, we can add cell count back as a separate feature. However, we found that it is later filtered out at the `feature_select` step of the pipeline. The reason is that, as an integer count feature, cell count has a unique values / sample size ratio of ~0.06 (see visualization below), which is below the cutoff value `unique_cut=0.1` that pycytominer uses as one of the criteria for filtering out low-variance features. Earlier versions of pycytominer had a more relaxed cutoff of `0.01`, which was later replaced by `0.1`, probably because of a typo (see cytomining/pycytominer#282). To prevent cell count from being removed by this criterion, we use `feature_selection` with `unique_cut=0.01`, as per the original pycytominer default. This results in a different number of features selected from any subset, so we reran preprocessing for all uncorrected and cc-adjusted subsets.
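To make the ratio criterion concrete, here is a small standalone sketch. It does not call pycytominer; `unique_ratio` and `passes_unique_cut` are hypothetical helpers, the data are made up to give a ratio of 0.06, and the handling of a ratio exactly equal to the cutoff is an assumption.

```python
def unique_ratio(values):
    """Number of unique values divided by the number of samples."""
    return len(set(values)) / len(values)


def passes_unique_cut(values, unique_cut):
    # Assumption: a feature is dropped when its ratio falls below the cutoff.
    return unique_ratio(values) >= unique_cut


# Toy integer cell counts: 1000 samples taking only 60 distinct values.
cell_count = [100 + (i % 60) for i in range(1000)]

ratio = unique_ratio(cell_count)                     # 60 / 1000
kept_at_0_1 = passes_unique_cut(cell_count, 0.1)     # stricter cutoff
kept_at_0_01 = passes_unique_cut(cell_count, 0.01)   # relaxed cutoff
```

With a ratio of 0.06, the feature is dropped at a cutoff of 0.1 but kept at 0.01, which mirrors the behavior described above for cell count.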
Cell count unique values / sample size ratio
3. Results
3.1 Same well, different ORF
| Setting | Data | mmAP | Fraction retrieved (p<0.05) |
|---|---|---|---|
| same well, diff ORF | raw->subset | 0.0636 | 0.139 (51/368) |
| same well, diff ORF | raw->subset->cc adjust | 0.0583 | 0.0217 (8/368) |
| same well, diff ORF | raw->subset->well correct | 0.114 | 0.25 (92/368) |
| same well, diff ORF | raw->subset->cc adjust->well correct | 0.379 | 0.723 (266/368) |
Same well, different ORF plots
3.2 Same ORF, different well
| Setting | Data | mmAP | Fraction retrieved (p<0.05) |
|---|---|---|---|
| same ORF, diff well | raw->subset | 0.00974 | 0.0 (0/37) |
| same ORF, diff well | raw->subset->cc adjust | 0.0202 | 0.027 (1/37) |
| same ORF, diff well | raw->subset->well correct | 0.0166 | 0.027 (1/37) |
| same ORF, diff well | raw->subset->cc adjust->well correct | 0.00834 | 0.0 (0/37) |
Same ORF, different well plots
3.3 Same ORF, same well
| Setting | Data | mmAP | Fraction retrieved (p<0.05) |
|---|---|---|---|
| same ORF, same well | raw->subset | 0.195 | 0.903 (3297/3653) |
| same ORF, same well | raw->subset->cc adjust | 0.0856 | 0.417 (1524/3653) |
| same ORF, same well | raw->subset->well correct | 0.286 | 0.93 (3397/3653) |
| same ORF, same well | raw->subset->cc adjust->well correct | 0.538 | 0.989 (3612/3653) |
Same ORF, same well plots
Observations:
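- cc adjustment corrects for plate effects better than well mean correction (same well, diff ORF: fraction retrieved drops from 0.139 to 0.0217 with cc adjustment but rises to 0.25 with well correction)
- a combination of both actually makes things worse (0.723 retrieved for same well, diff ORF), perhaps due to overcorrection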