broadinstitute / 2021_09_01_varchamp
License: MIT License
@Zitong-Chen-16 @shntnu @AnneCarpenter: here are the concerning well position analysis results. Would love to hear your thoughts!
The team generated a few batches containing many repeated controls to try to understand well position effects. There are two batches, each with four plates. Each plate contains 48 wells with ALK_REF and 48 wells with ALK_VAR (referred to as "REF" and "VAR" from here on). The ALK locus was chosen because it is a positive protein mislocalization control.
The objective here was to determine how well our production data analysis strategy detects phenotypic effects (morphological/localization differences between REF and VAR) compared to technical well position effects.
Our current method for calling variants with significant phenotypic effects is to train a binary classifier to distinguish a given VAR from its WT using three plates within a batch and to test on the fourth plate, measuring success with an F1 score. Since we expect some well position effect, we also train binary classifiers to differentiate cells from pairs of repeated control wells (the well position null). The REF-VAR F1 scores are compared to the distribution of control F1 scores to decide which REF-VAR well pairs outperform the well position null. This well position null is imperfect, because repeated controls are usually physically closer together than REF-VAR pairs (need to confirm this). Hence the larger analysis described here.
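The F1 score used here is the standard harmonic mean of precision and recall for the positive class; a minimal stdlib sketch (the production pipeline presumably uses a library implementation such as scikit-learn's):

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g. a degenerate classifier that calls every cell VAR (1) on a balanced test set
print(f1_score([1, 1, 0, 0], [1, 1, 1, 1]))  # 2*0.5*1.0/1.5 ≈ 0.667
```

A classifier that cannot do better than this on a REF-VAR pair is effectively chance-level on a balanced set.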
Features were split into 3 sets: protein features, non-protein fluorescent features (3 channels), and brightfield features (3 channels). Considering the 48 REF wells and 48 VAR wells, we define every possible pair of two well positions: REF-REF pairs (48 choose 2 = 1128), VAR-VAR pairs (48 choose 2 = 1128), and REF-VAR pairs (48*48 = 2304).
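The pair counts above can be reproduced with `itertools`; a small sketch, with placeholder well labels standing in for the real well positions:

```python
from itertools import combinations, product

# placeholder labels for the 48 REF and 48 VAR well positions
ref_wells = [f"REF_{i}" for i in range(48)]
var_wells = [f"VAR_{i}" for i in range(48)]

ref_ref = list(combinations(ref_wells, 2))     # 48 choose 2 unordered pairs
var_var = list(combinations(var_wells, 2))     # 48 choose 2 unordered pairs
ref_var = list(product(ref_wells, var_wells))  # 48 * 48 cross pairs

print(len(ref_ref), len(var_var), len(ref_var))  # 1128 1128 2304
```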
For each feature set (protein, non-protein, brightfield) and comparison (REF-REF, VAR-VAR, REF-VAR), we train a binary XGBoost classifier on three plates from one batch and test on the fourth plate from that batch. We also train one null for reference: the REF-REF pairs are used to train another classifier after shuffling the well position labels of the training data. The F1 score and feature importance scores are saved from each classifier.
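The shuffled null amounts to permuting the well-position labels of the training cells before fitting; a minimal sketch of just the shuffling step (the XGBoost fit itself is omitted, and the well labels below are hypothetical):

```python
import random

# hypothetical training rows: each cell carries a well-position label
labels = ["A01", "A01", "B02", "B02", "C03", "C03"]

rng = random.Random(42)  # fixed seed so the null is reproducible
shuffled = labels[:]
rng.shuffle(shuffled)    # breaks any true cell-to-well association

# the label multiset is preserved; only the assignment is randomized
assert sorted(shuffled) == sorted(labels)
```

Any F1 score the classifier achieves on these shuffled labels reflects overfitting or leakage rather than a real well-position signal.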
Since this is all quite computationally intensive, this analysis considers only the four plates in batch 4 (ignoring batch 6) and only one train/test split (there are 4 possible splits, one with each of the 4 plates held out as the test plate).
Figure 1: the distribution of F1 scores across all classifiers, faceted by comparison type and feature set.
The takeaways here:
Figure 2: a comparison of feature importance scores across comparison types. Feature importance (FIP) scores were ranked for each classifier (1 = best, highest = worst) and averaged across all classifiers with the same comparison type and feature set. Here are pairwise scatterplots of mean FIP scores from the protein feature classifiers for all comparison types. Each point corresponds to a single feature.
Key takeaways:
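Averaging ranks rather than raw importance values keeps classifiers with differently scaled importances comparable. A stdlib sketch of the ranking-and-averaging step, assuming each classifier reports a dict of feature name → importance (the feature names below are made up):

```python
from statistics import fmean

def rank_features(importances):
    """Rank features by importance: 1 = most important."""
    ordered = sorted(importances, key=importances.get, reverse=True)
    return {feat: rank for rank, feat in enumerate(ordered, start=1)}

def mean_rank(per_classifier):
    """Average each feature's rank across classifiers of one comparison type."""
    ranks = [rank_features(imp) for imp in per_classifier]
    return {f: fmean(r[f] for r in ranks) for f in ranks[0]}

clf_a = {"Intensity_Mean_GFP": 0.9, "Texture_ER": 0.05, "AreaShape": 0.05}
clf_b = {"Intensity_Mean_GFP": 0.7, "Texture_ER": 0.2, "AreaShape": 0.1}
print(mean_rank([clf_a, clf_b]))  # the GFP feature averages rank 1.0
```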
Figure 3: a scatterplot comparing the F1 score for the classifier for each pair of wells to the physical distance between those two wells on the plate.
Key takeaways:
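The well-to-well distance on the x-axis of Figure 3 can be derived from alphanumeric well positions; a sketch in grid units (multiply by the plate's well pitch, e.g. 4.5 mm for a 384-well plate, to get millimetres — an assumption, since the plate format isn't stated in this thread):

```python
import math

def well_to_rc(well):
    """Convert an 'A01'-style well label to zero-based (row, column) indices."""
    return ord(well[0].upper()) - ord("A"), int(well[1:]) - 1

def well_distance(w1, w2):
    """Euclidean distance between two wells, in grid units."""
    (r1, c1), (r2, c2) = well_to_rc(w1), well_to_rc(w2)
    return math.hypot(r1 - r2, c1 - c2)

print(well_distance("A01", "A04"))  # 3.0 (same row, three columns apart)
print(well_distance("A01", "D01"))  # 3.0 (same column, three rows apart)
```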
1- Clean platemap and save the cleaned version in metadata/reprocessed
2- Save a subset of intensity features for transfection-efficiency exploration and for parameter selection of transfection detection
plate_raw_intensity_features
3- Read Intensity features and save their distribution in results/intensity_dists
4- Based on the fixed parameters for transfection detection, generate and save population level profiles and also save transfected single cells for visualization and subpopulation analysis
'/population_profiles/'+batchName+'/'+plateName
'/singlecell_profiles/'+batchName+'/'+plateName
5- Read data for analysis
'sc_per_plate_scaling':'raw' or 'sc_scaled_per_plate'
zscored_profiles: 'untransfected','untransfected_stringent'
6- Calculate replicate correlation of profiles
7- Calculate WT-MT impact scores and save
/results/Impact-Scores/Method-MeanProfiles/impact_scores_trt_todaydate
/results/Impact-Scores/Method-MeanProfiles/impact_scores_perplate_todaydate
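The "Method-MeanProfiles" path suggests impact scores are computed from per-treatment mean profiles. A hypothetical sketch, assuming the score is the Euclidean distance between the mean WT and mean MT feature vectors (the actual metric isn't specified in these notes):

```python
import math

def mean_profile(profiles):
    """Feature-wise mean of a list of per-well feature vectors."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def impact_score(wt_profiles, mt_profiles):
    """Hypothetical WT-MT impact score: distance between mean profiles."""
    wt, mt = mean_profile(wt_profiles), mean_profile(mt_profiles)
    return math.dist(wt, mt)

# toy two-feature profiles from two wells per treatment
wt = [[0.0, 1.0], [0.2, 0.8]]
mt = [[1.0, 0.0], [0.8, 0.2]]
print(impact_score(wt, mt))  # distance between [0.1, 0.9] and [0.9, 0.1]
```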
This issue documents information and evaluation results about the negative and positive controls we included in the VarChAMP CellPainting experiments. Following the previous discussion in GH issue #3, the following treatments were selected as controls in batch B1A1R1.
Plates 1 to 4 in B1A1R1 contain four replicates of each control treatment on each plate. Plate 4 also contains 21 candidate controls (one replicate on each plate) that could be used as controls in future experiments.
For our reference, the annotations in the "node_type" column in plate maps are:
For context, see #5 (comment) and the comments that follow
I’d find it so helpful if we could clearly articulate two things:
1. In the context of controls, why should we think of the protein channel differently from a Cell Painting channel, say, the ER channel for the sake of this discussion?
It is seemingly obvious, but if you take a simplified/abstracted view that all channels are just measuring some aspect of a cell’s phenotype, it’s less obvious to me.
I get it for no-protein negative controls — there’s nothing to mark in the protein channel. I'll note that there's no equivalent for the ER channel, that is, there isn't a "no-ER" negative control, and that's because the ER is always present and will therefore always be marked. Perhaps the only equivalent would be if a hypothetical negative control somehow destroyed the ability of Con A to bind to ER (without affecting the ER itself). In this case, all perturbations that didn't destroy the ability of Con A to bind to ER – whether or not they affected ER itself – would have a phenotype.
But what about a protein control that had any localization other than the protein of interest? Here you say the protein of interest will always have a phenotype. Why? Again, it is seemingly obvious because if we know the negative control has localization X, then any protein that doesn't have localization X will have a phenotype. Is there an equivalent for the ER channel? I suppose the equivalent is again a hypothetical negative control that would somehow change – but not destroy – the ability of Con A's binding to ER (without affecting the ER itself). In this case, all perturbations that didn't similarly change the ability of Con A to bind to ER – whether or not they affected ER itself – would have a phenotype.
All this leads to the next question
2. What is a good negative control, when we limit ourselves only to the phenotype observed in the protein channel?
Context:
Shantanu Singh
That said, if we’ve settled on negative controls, we can still use the framework for reporting phenotypic activity.
Anne Carpenter
(which would be for the non-protein channels - the protein channel would always have a phenotype if we are comparing to no-protein controls, or if we choose a protein control that had any localization other than the protein of interest)
Notes from Chloe's emails:
Email on Aug 19, 2022:
Email on Sep 15, 2022:
Regarding the PosCons, we’d like to select either IMPDH1 or ALK as our reference allele, plus two of their respective variants (one which shows strong morphological shifts/localization patterns, and one that’s subtle). For NegCons, we can only select 4 to include in our screen – I’ll leave it up to you guys which 4 best suit your needs.
You can disregard all wells that are not labelled either PosCon or NegCon for this screen.
And please keep in mind each quadrant received a varying dose of viral supernatant.
The amount I settled on for our final pipeline is 6 uL, so perhaps you want to pay attention to the wells which received a vTitre = 6.
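Selecting only the control wells dosed at the final titre is a simple platemap filter; a sketch assuming hypothetical `well`, `control_type`, and `vTitre` fields in the platemap rows:

```python
# toy platemap rows; field names are assumptions, not the real schema
platemap = [
    {"well": "A01", "control_type": "PosCon", "vTitre": 6},
    {"well": "A02", "control_type": "NegCon", "vTitre": 3},
    {"well": "B01", "control_type": "other",  "vTitre": 6},
]

# keep only PosCon/NegCon wells that received the final 6 uL dose
keep = [
    row for row in platemap
    if row["control_type"] in ("PosCon", "NegCon") and row["vTitre"] == 6
]
print([row["well"] for row in keep])  # ['A01']
```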