2021_09_01_varchamp's Issues

First Pass - Pilot Variant Painting data analysis

Goal:

  • What proportion of variants and which ones show a signal relative to their WT?

Basic Analysis using mean profiles:

  • Using correlation coefficients (by Marzieh)

    • Replicate correlation + null distributions
    • list of correlation coefficient scores for each pair
  • Using MAP (by @yhan8)

    • #15
    • list of map scores for each pair

Evaluation of well position effect

@Zitong-Chen-16 @shntnu @AnneCarpenter : the concerning well position analysis results. Would love to hear your thoughts!

Background

The team generated a few batches containing many repeated controls to try to understand well position effects. There are two batches, each with four plates. Each plate contains 48 wells with ALK_REF and 48 wells with ALK_VAR (referred to as "REF" and "VAR" from here on). The ALK locus was chosen because it is a positive protein mislocalization control.

The objective here was to determine how well our production data analysis strategy detects phenotypic effects (morphological/localization differences between REF and VAR) compared to technical well position effects.

Our current method for calling variants with significant phenotypic effects is to train a binary classifier to distinguish a given VAR from its WT using three plates within a batch, and to test on the fourth plate, measuring success with an F1 score. Since we expect some well position effect, we also train binary classifiers to differentiate cells from pairs of repeated control wells (the well position null). The REF-VAR F1 scores are compared to the distribution of control F1 scores to decide which REF-VAR pairs outperform the well position null. This null is imperfect because repeated control wells are usually physically closer together than REF-VAR pairs (need to confirm this), which motivated the larger analysis described here.
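The comparison of a single REF-VAR F1 score against the well position null can be sketched as an empirical p-value; all numbers below are made-up placeholders, not real scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins: F1 scores of control-pair classifiers (the well position
# null) and the F1 score of one REF-VAR classifier.
null_f1 = rng.normal(0.55, 0.05, size=1000)
ref_var_f1 = 0.78

# Empirical p-value: fraction of null scores at least as large as the
# observed REF-VAR score (+1 correction so p is never exactly 0).
p_emp = (np.sum(null_f1 >= ref_var_f1) + 1) / (len(null_f1) + 1)
significant = p_emp < 0.05
```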

Methods

Features were split into 3 sets: protein features, non-protein fluorescent features (3 channels), and brightfield features (3 channels). Considering the 48 REF wells and 48 VAR wells, we define every possible pair of well positions: REF-REF pairs (48 choose 2 = 1128), VAR-VAR pairs (48 choose 2 = 1128), and REF-VAR pairs (48 × 48 = 2304).
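The pair enumeration above can be sketched with itertools (the well labels are hypothetical placeholders for the real well positions):

```python
from itertools import combinations, product

# Hypothetical well labels standing in for the 48 REF and 48 VAR well positions.
ref_wells = [f"REF_{i:02d}" for i in range(48)]
var_wells = [f"VAR_{i:02d}" for i in range(48)]

ref_ref_pairs = list(combinations(ref_wells, 2))     # 48 choose 2 = 1128
var_var_pairs = list(combinations(var_wells, 2))     # 48 choose 2 = 1128
ref_var_pairs = list(product(ref_wells, var_wells))  # 48 * 48 = 2304
```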

For each feature set (protein, non-protein, brightfield) and comparison (REF-REF, VAR-VAR, REF-VAR), we train a binary XGBoost classifier on three plates from one batch and test on the fourth plate from that batch. We also train one null for reference: the REF-REF pairs are used to train another classifier after shuffling the well position labels of the training data. The F1 score and feature importance scores are saved from each classifier.
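A minimal sketch of the train-on-three-plates / test-on-the-fourth setup with a shuffled-label null, using synthetic data and scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost (the actual features, cell counts, and model settings are not specified here):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def make_plate(n_cells=200, n_feats=20, shift=0.5):
    """Synthetic plate: n_cells per well for two wells, with a small mean shift."""
    well_a = rng.normal(0.0, 1.0, (n_cells, n_feats))
    well_b = rng.normal(shift, 1.0, (n_cells, n_feats))
    X = np.vstack([well_a, well_b])
    y = np.array([0] * n_cells + [1] * n_cells)
    return X, y

# Train on three "plates", test on the held-out fourth.
plates = [make_plate() for _ in range(3)]
X_tr = np.vstack([X for X, _ in plates])
y_tr = np.concatenate([y for _, y in plates])
X_te, y_te = make_plate()

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
f1_real = f1_score(y_te, clf.predict(X_te))

# Null for reference: shuffle the well labels of the training data only.
null_clf = GradientBoostingClassifier(random_state=0).fit(X_tr, rng.permutation(y_tr))
f1_null = f1_score(y_te, null_clf.predict(X_te))
```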

Since this is all quite computationally intensive, this analysis considers only the four plates in batch 4 (ignoring batch 6) and only looks at one training & test set (there are 4 possible sets, one with each of the 4 plates as the test plate).

Results

Figure 1: the distribution of F1 scores across all classifiers, faceted by comparison type and feature set.

image

The takeaways here:

  • The distributions of F1 scores are virtually identical between REF-VAR, REF-REF, and VAR-VAR pairs
  • The brightfield features are best at predicting well position, followed by the non-protein features, then the protein features
  • The null is centered on 0.5 and has a relatively narrow spread. For both the protein and non-protein fluorescent channels, there are classifiers that do significantly better and significantly worse than the null. Sam and I have talked through a potential explanation for this.
  • The distributions of protein and non-protein F1 scores unfortunately look suspiciously similar to the distribution of F1 scores from our production classifiers in this notebook (comparing all REF-VAR pairs across all loci in our real data); compare the orange 'allele' distributions to the protein and non-protein distribution shapes above:
image image

Figure 2: a comparison of feature importance scores across comparison types. Feature importance (FIP) scores were ranked for each classifier (1 = best, highest = worst) and averaged across all classifiers with the same comparison type and feature set. Here are pairwise scatterplots of mean FIP scores from the protein feature classifiers for all comparison types. Each point corresponds to a single feature.
image
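The rank-then-average procedure described in the Figure 2 caption can be sketched as follows (the feature names and importance values are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
features = [f"feat_{i}" for i in range(10)]  # hypothetical feature names

# Each row = one classifier's feature importance scores (made-up values).
fip = pd.DataFrame(rng.random((5, 10)), columns=features)

# Rank within each classifier (1 = most important), then average the ranks
# across all classifiers of the same comparison type and feature set.
ranks = fip.rank(axis=1, ascending=False)
mean_rank = ranks.mean(axis=0)
```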

Key takeaways:

  • The classifiers learn to use the same features regardless of whether we are looking at REF-REF, VAR-VAR, or REF-VAR pairs of wells. We appear to always be learning well position and not morphology (😱)
  • The null prioritizes random features (as expected from a shuffled null)
  • These takeaways are the same for the non-protein and brightfield classifiers

Figure 3: a scatterplot comparing the F1 score for the classifier for each pair of wells to the physical distance between those two wells on the plate.
image

Key takeaways:

  • Only the protein feature REF-REF, VAR-VAR, and REF-VAR classifiers have a significant correlation between F1 score and physical distance (r = 0.16, 0.22, 0.24; p = 10⁻⁸, 10⁻¹⁴, 10⁻³⁵, respectively).
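The F1-vs-distance correlations can be computed with scipy.stats.pearsonr; here is a sketch on simulated well positions, with all values made up and chosen only to mimic a weak positive correlation like those reported above:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_pairs = 300  # made-up number of well pairs

# Hypothetical (row, col) positions for the two wells in each pair.
pos_a = rng.integers(0, 16, (n_pairs, 2)).astype(float)
pos_b = rng.integers(0, 16, (n_pairs, 2)).astype(float)
dist = np.linalg.norm(pos_a - pos_b, axis=1)  # physical well-to-well distance

# Simulated F1 scores with a weak positive dependence on distance.
f1 = 0.6 + 0.01 * dist + rng.normal(0.0, 0.1, n_pairs)

r, p = pearsonr(dist, f1)
```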

Selecting negative and positive control ORFs

Negative controls:

  • We want negative controls to:
    • Have no signature confirmed by existing datasets.
      • Having no signature is usually equivalent to having low replicate correlations for negative controls. Relying on that rule, Niranj searched for negative controls with low replicate correlations based on the CPJUMP1 and JUMP production experiments.

Positive controls:

  • We want positive controls to:
    • Have strong WT-to-mutant mislocalization
      • We check the manual impact score by Jessie
    • Have a strong WT phenotype in the rest of the Cell Painting channels compared to controls
      • We check the replicate correlation in the CPJUMP1 dataset.

Processing Pipeline

Instructions on the processing steps and parameters in each step:

1- Clean the platemap and save the cleaned version in metadata/reprocessed

  • The input platemap for this project has been inconsistent across batches and also within the experiment
  • We should check this input metadata and make it consistent with a standard for each batch of data
  • Sometimes even the sqlite file has irregular column naming that we have to handle separately.

2- Save a subset of intensity features for transfection efficiency exploration and parameter selection of transfection detection

  • Save folder: plate_raw_intensity_features

3- Read Intensity features and save their distribution in results/intensity_dists

4- Based on the fixed parameters for transfection detection, generate and save population level profiles and also save transfected single cells for visualization and subpopulation analysis

  • Save folder for mean profiles: '/population_profiles/'+batchName+'/'+plateName
  • Save folder for transfected single cell profiles '/singlecell_profiles/'+batchName+'/'+plateName

5- Read data for analysis

  • Parameters:
    - single cell scaling: 'sc_per_plate_scaling': 'raw' or 'sc_scaled_per_plate'
    - well level profile z-scoring: 'zscored_profiles': 'untransfected' or 'untransfected_stringent'

6- Calculate replicate correlation of profiles

  • Save curve plots and values to results/replicate_corr_curves
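One common way to compute a replicate correlation of this kind is the mean pairwise Pearson correlation between replicate profiles; the exact metric used in this pipeline is not specified in the step above, so treat the following synthetic sketch as an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic well-level profiles: 4 replicates x 100 features, built as
# a shared treatment signal plus per-replicate noise.
signal = rng.normal(0.0, 1.0, (1, 100))
replicates = signal + rng.normal(0.0, 1.0, (4, 100))

# Replicate correlation: mean pairwise Pearson correlation across replicates.
corr = np.corrcoef(replicates)
iu = np.triu_indices_from(corr, k=1)  # off-diagonal upper triangle
rep_corr = corr[iu].mean()

# A null: pairwise correlations among unrelated (pure-noise) profiles.
null_corr = np.corrcoef(rng.normal(0.0, 1.0, (4, 100)))[iu].mean()
```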

7- Calculate WT-MT impact scores and save

  • Approach 1: average replicate level profiles and score treatment level profiles
    • save the results in: /results/Impact-Scores/Method-MeanProfiles/impact_scores_trt_todaydate
  • Approach 2: calculate impact scores per plate
    • save the results in: /results/Impact-Scores/Method-MeanProfiles/impact_scores_perplate_todaydate
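The two approaches can be sketched as follows; the impact function below is a placeholder Euclidean distance, since the actual scoring method is not specified in this step, and all profile values are made up:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
feat_cols = [f"f{i}" for i in range(50)]  # hypothetical feature names

# Made-up well-level profiles: one WT and one MT well on each of 4 plates.
meta = pd.DataFrame({
    "plate": np.repeat(["P1", "P2", "P3", "P4"], 2),
    "allele": ["WT", "MT"] * 4,
})
df = pd.concat(
    [meta, pd.DataFrame(rng.normal(size=(8, 50)), columns=feat_cols)], axis=1
)

def impact(wt, mt):
    # Placeholder score: Euclidean distance between the two profiles.
    return float(np.linalg.norm(wt - mt))

# Approach 1: average replicates into treatment-level profiles, score once.
trt = df.groupby("allele")[feat_cols].mean()
score_trt = impact(trt.loc["WT"].to_numpy(), trt.loc["MT"].to_numpy())

# Approach 2: score WT vs. MT separately within each plate.
scores_per_plate = {
    plate: impact(
        g.loc[g["allele"] == "WT", feat_cols].to_numpy().ravel(),
        g.loc[g["allele"] == "MT", feat_cols].to_numpy().ravel(),
    )
    for plate, g in df.groupby("plate")
}
```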

Evaluation of Negative and Positive Control Selections

This issue documents information and evaluation results for the negative and positive controls included in the VarChAMP Cell Painting experiments. Based on the previous discussion in GH issue #3, the following treatments were selected as controls in batch B1A1R1.

  • Negative Control for Morphological Changes: MAPK9, RHEB, SLIRP, & PRKACB
    • Labeled "NC" in the "node_type" column of the metadata file.
    • Treatments selected because they have low replicate correlations in CPJUMP1 experiments.
  • Positive Control for Morphological Changes: PTK2B
    • Labeled "PC" in "node_type".
    • Selected because its function makes it likely to induce morphological changes, but this hasn't been confirmed in our assay.
  • Positive Control for Protein Instability/Mislocalization - ALK vs. ALK R1275Q
    • Labeled "PC" in "node_type".
    • Selected because the protein of variant R1275Q should mislocalize to the ER.
  • Transduction/Selection Control: 516 - TC
    • Labeled "TC" in "node_type".
    • There should be no remaining cells in the well after selection.

Plates 1 to 4 in B1A1R1 contain four replicates of each control treatment on each plate. Plate 4 also contains 21 candidate controls (one replicate on each plate) that could be used as controls in future experiments.

For our reference, the annotations in the "node_type" column in plate maps are:

  • "allele" - missense mutant
  • "disease_wt" - reference/WT allele
  • "TC" - transduction/selection control
  • "PC" - positive control
  • "NC" - negative control
  • "cPC" - candidate positive control (morphology)
  • "cPPC" - candidate positive control (protein)
  • "cNC" - candidate negative control (morphology)

Why do we treat the protein channel differently?

⚠️ tedium ahead! I confused myself after writing this down, so I suggest we just discuss it in person, but feel free to add notes.


I’d find it so helpful if we could clearly articulate two things:

1. In the context of controls, why should we think of the protein channel differently from a Cell Painting channel, say, the ER channel for the sake of this discussion?

It is seemingly obvious, but if you take a simplified/abstracted view that all channels are just measuring some aspect of a cell’s phenotype, it’s less obvious to me.

I get it for no-protein negative controls — there’s nothing to mark in the protein channel. I'll note that there's no equivalent for the ER channel, that is, there isn't a "no-ER" negative control, and that's because the ER is always present and will therefore always be marked. Perhaps the only equivalent would be if a hypothetical negative control somehow destroyed the ability of Con A to bind to ER (without affecting the ER itself). In this case, all perturbations that didn't destroy the ability of Con A to bind to ER – whether or not they affected ER itself – would have a phenotype.

But what about a protein control that had any localization other than the protein of interest? Here you say the protein of interest will always have a phenotype. Why? Again, it is seemingly obvious because if we know the negative control has localization X, then any protein that doesn't have localization X will have a phenotype. Is there an equivalent for the ER channel? I suppose the equivalent is again a hypothetical negative control that would somehow change – but not destroy – the ability of Con A's binding to ER (without affecting the ER itself). In this case, all perturbations that didn't similarly change the ability of Con A to bind to ER – whether or not they affected ER itself – would have a phenotype.

All this leads to the next question

2. What is a good negative control, when we limit ourselves only to the phenotype observed in the protein channel?


Context:

Shantanu Singh
That said, if we’ve settled on negative controls, we can still use the framework for reporting phenotypic activity.

Anne Carpenter
(which would be for the non-protein channels - the protein channel would always have a phenotype if we are comparing to no-protein controls, or if we choose a protein control that had any localization other than the protein of interest)

poscon negcon selection experiment by Chloe

Notes from Chloe's emails:

  • Email on Aug 19, 2022:

    • For our positive controls, ideally we’d like to establish a reference ORF paired with two mutants, one showing strong shifts and one subtle in the protein channel as well as detectable changes in morphology. In this case, profiling would especially be helpful. For the NegCons, we must slim down our selection to only 4 ORFs – I’m not sure if you guys have preference for selection there.
  • Email on Sep 15, 2022:
     

    • Regarding the PosCons, we’d like to select either IMPDH1 or ALK as our reference allele,
      plus two of their respective variants (one which shows strong morphological shifts/localization patterns,
      and one that’s subtle). For NegCons, we can only select 4 to include in our screen –
      I’ll leave it up to you guys which 4 best suit your needs.

    • You can disregard all wells that are not labelled either PosCon or NegCon for this screen.
      And please keep in mind each quadrant received a varying dose of viral supernatant.
      The amount I settled on for our final pipeline is 6 uL, so perhaps you want to pay attention to the wells which received a vTitre = 6.
