broadinstitute / 2021_09_01_varchamp

License: MIT License

2021_09_01_varchamp's Introduction

VarChAMP: Variant Characterization across the Mendelian Proteome

We aim to functionally characterize approximately 100,000 coding variants across Mendelian disease genes, addressing the significant gap in understanding the impact of human genomic variations. By analyzing the phenotypic impacts of these variants, we seek to elucidate genotype-phenotype relationships in inherited disorders. We will create a searchable database detailing these variant effects, accessible through the IGVF consortium, which will contribute to public health by aiding in the diagnosis and treatment of Mendelian disorders.

Documents

GDrive folder (internal): link

What's in this repo?

This repo contains the analysis scripts and notebooks for the VarChAMP project. The data is stored in a separate repo, 2021_09_01_VarChAMP-data, which is added as a submodule to this repo. Profiles from all the plates are in 2021_09_01_VarChAMP-data/profiles. All levels of profiles downstream of the aggregation step in the pycytominer workflow are in that folder.

How to use this repo?

Fork the repo

Clone the repo

git clone git@github.com:<YOUR USER NAME>/2021_09_01_VarChAMP.git

Download the contents of the submodule

git submodule update --init --recursive
cd 2021_09_01_VarChAMP-data
dvc pull
git lfs pull

Install the conda environment within each folder before running the notebooks. We use mamba to manage the computational environment. To install mamba see instructions. After installing mamba, run the following to create and activate the environment:

# First, install the conda environment
mamba env create --force --file environment.yml

# If you had already installed this environment and now want to update it
mamba env update --file environment.yml --prune

# Then, activate the environment and you're all set!
environment_name=$(grep "^name:" environment.yml | awk '{print $2}')
mamba activate "$environment_name"

Run the notebooks

2021_09_01_varchamp's People

yhan8 zitong-chen-16 jessica-ewald

2021_09_01_varchamp's Issues

Selecting negative and positive control ORFs

Negative controls:

We want negative controls to:
- Have no signature confirmed by existing datasets.
  - For negative controls, having no signature is usually equivalent to having low replicate correlation. Relying on that rule, Niranj searched for negative controls with low replicate correlations in two datasets: CPJUMP1 and the JUMP production experiments.

Positive controls:

We want positive controls to:
- Show strong WT-to-mutant mislocalization
  - We check the manual impact scores assigned by Jessie.
- Show a strong WT phenotype in the remaining Cell Painting channels compared with controls
  - We check the replicate correlation in the CPJUMP1 dataset.

Does low replicate correlation necessarily mean no signature?

I'm creating this issue because this assumption is used in different contexts and I'm not yet convinced why it is true. @shntnu, I'm sure you have explained this to me before, but hopefully this issue is the last time I question this assumption :)

Replicate MAP + null distributions

First Pass - Pilot Variant Painting data analysis

Goal:

What proportion of variants and which ones show a signal relative to their WT?

Basic Analysis using mean profiles:

Using correlation coefficients (by Marzieh)
- Replicate correlation + null distributions
- list of correlation coefficient scores for each pair
Using MAP (mean average precision; by @yhan8)
- #15
- list of MAP scores for each pair
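
A minimal sketch of the correlation-based analysis listed above, written in plain Python so it runs without the project environment. The function names, the use of the median as the per-treatment summary, and the way non-replicate pairs are sampled for the null are all assumptions for illustration, not the project's implementation.

```python
import random
import statistics
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return cov / norm


def replicate_correlation(profiles, labels):
    """Median pairwise correlation among the replicates of each label."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {
        label: statistics.median(pearson(a, b) for a, b in combinations(rows, 2))
        for label, rows in groups.items()
        if len(rows) > 1  # a single replicate has no pairs to correlate
    }


def null_distribution(profiles, labels, n_pairs=1000, seed=0):
    """Correlations between randomly drawn pairs of non-replicate profiles."""
    candidates = [
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] != labels[j]
    ]
    if not candidates:
        raise ValueError("need at least two different labels")
    rng = random.Random(seed)
    return [
        pearson(profiles[i], profiles[j])
        for i, j in (rng.choice(candidates) for _ in range(n_pairs))
    ]
```

A treatment's replicate correlation can then be compared against a chosen percentile of the null distribution to decide whether it shows a signal; which percentile to use is a project decision that this sketch does not settle.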

poscon negcon selection experiment by Chloe

Notes from Chloe's emails:

Email on Aug 19, 2022:
- For our positive controls, ideally we’d like to establish a reference ORF paired with two mutants, one showing strong shifts and one subtle in the protein channel as well as detectable changes in morphology. In this case, profiling would especially be helpful. For the NegCons, we must slim down our selection to only 4 ORFs – I’m not sure if you guys have preference for selection there.
Email on Sep 15, 2022:
- Regarding the PosCons, we’d like to select either IMPDH1 or ALK as our reference allele,
  plus two of their respective variants (one which shows strong morphological shifts/localization patterns,
  and one that’s subtle). For NegCons, we can only select 4 to include in our screen –
  I’ll leave it up to you guys which 4 best suit your needs.
- You can disregard all wells that are not labelled either PosCon or NegCon for this screen.
  And please keep in mind each quadrant received a varying dose of viral supernatant.
  The amount I settled on for our final pipeline is 6 uL, so perhaps you want to pay attention to the wells which received a vTitre = 6.

Cleanup repo

Why do we treat the protein channel differently?

⚠️ tedium ahead! I confused myself after writing it down so I suggest we just discuss this in person, but feel free to add notes.

I’d find it so helpful if we can clearly articulate two things

1. In the context of controls, why should we think of the protein channel differently from a Cell Painting channel, say, the ER channel for the sake of this discussion?

It is seemingly obvious, but if you take a simplified/abstracted view that all channels are just measuring some aspect of a cell’s phenotype, it’s less obvious to me.

I get it for no-protein negative controls — there’s nothing to mark in the protein channel. I'll note that there's no equivalent for the ER channel, that is, there isn't a "no-ER" negative control, and that's because the ER is always present and will therefore always be marked. Perhaps the only equivalent would be if a hypothetical negative control somehow destroyed the ability of Con A to bind to ER (without affecting the ER itself). In this case, all perturbations that didn't destroy the ability of Con A to bind to ER – whether or not they affected ER itself – would have a phenotype.

But what about a protein control that had any localization other than the protein of interest? Here you say the protein of interest will always have a phenotype. Why? Again, it is seemingly obvious because if we know the negative control has localization X, then any protein that doesn't have localization X will have a phenotype. Is there an equivalent for the ER channel? I suppose the equivalent is again a hypothetical negative control that would somehow change – but not destroy – the ability of Con A's binding to ER (without affecting the ER itself). In this case, all perturbations that didn't similarly change the ability of Con A to bind to ER – whether or not they affected ER itself – would have a phenotype.

All this leads to the next question

2. What is a good negative control, when we limit ourselves only to the phenotype observed in the protein channel?

Context:

Shantanu Singh
That said, if we’ve settled on negative controls, we can still use the framework for reporting phenotypic activity.

Anne Carpenter
(which would be for the non-protein channels - the protein channel would always have a phenotype if we are comparing to no-protein controls, or if we choose a protein control that had any localization other than the protein of interest)

Can single-cell-level classification score be used to determine variant impact?

For context, see #5 (comment) and the comments that follow

Processing Pipeline

Instructions on the processing steps and parameters in each step:

1- Clean the platemap and save the cleaned version in metadata/reprocessed

The input platemap for this project has been inconsistent across batches and also within the experiment.
We should check this input metadata and make it consistent with a standard for each batch of data.
Sometimes even the SQLite file has irregular column naming, which has to be handled separately.

2- Save a subset of intensity features for transfection efficiency exploration and parameter selection of transfection detection

Save folder: plate_raw_intensity_features

3- Read Intensity features and save their distribution in results/intensity_dists

4- Based on the fixed parameters for transfection detection, generate and save population level profiles and also save transfected single cells for visualization and subpopulation analysis

Save folder for mean profiles: '/population_profiles/'+batchName+'/'+plateName
Save folder for transfected single cell profiles: '/singlecell_profiles/'+batchName+'/'+plateName
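
Step 4 can be pictured roughly as follows. This is a sketch under stated assumptions: that transfection detection is a fixed threshold on a single intensity feature, and that the population-level profile is the per-feature mean of the transfected cells. The feature names (`protein_intensity`, `area`) and the threshold value are hypothetical placeholders, not the project's actual parameters.

```python
import statistics


def split_transfected(cells, intensity_key, threshold):
    """Partition single cells by whether an intensity feature exceeds a fixed threshold."""
    transfected = [c for c in cells if c[intensity_key] > threshold]
    untransfected = [c for c in cells if c[intensity_key] <= threshold]
    return transfected, untransfected


def mean_profile(cells, feature_keys):
    """Average each feature across cells to get a population-level profile."""
    if not cells:
        raise ValueError("no cells to aggregate")
    return {key: statistics.fmean([c[key] for c in cells]) for key in feature_keys}


# Hypothetical single-cell records for one well.
cells = [
    {"protein_intensity": 0.9, "area": 100.0},
    {"protein_intensity": 0.1, "area": 80.0},
    {"protein_intensity": 0.7, "area": 120.0},
]
transfected, untransfected = split_transfected(cells, "protein_intensity", 0.5)
well_profile = mean_profile(transfected, ["area"])
```

The transfected cells would be written to the single-cell folder and `well_profile` to the population folder named above.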

5- Read data for analysis

Parameters:
- single-cell scaling: 'sc_per_plate_scaling': 'raw' or 'sc_scaled_per_plate'
- well-level profile z-scoring: 'zscored_profiles': 'untransfected' or 'untransfected_stringent'
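
As an illustration of the z-scoring option, the sketch below scales well-level profiles by the per-feature mean and standard deviation of a reference set of wells. It assumes that 'untransfected' and 'untransfected_stringent' name the reference population used for scaling; that reading, and the function name, are assumptions rather than something the parameter list states.

```python
import statistics


def zscore_against_reference(profiles, reference_profiles):
    """Scale each feature by the mean and sample standard deviation of the reference wells.

    Needs at least two reference profiles. Features that are constant in the
    reference set are mapped to 0.0 instead of dividing by zero.
    """
    columns = list(zip(*reference_profiles))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]
    return [
        [
            (value - mean) / std if std > 0 else 0.0
            for value, mean, std in zip(row, means, stds)
        ]
        for row in profiles
    ]
```

Switching between the two parameter values would then amount to passing a different set of reference wells.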

6- Calculate replicate correlation of profiles

Save curve plots and values to results/replicate_corr_curves

7- Calculate WT-MT impact scores and save

Approach 1: average replicate level profiles and score treatment level profiles
- save the results in: /results/Impact-Scores/Method-MeanProfiles/impact_scores_trt_todaydate
Approach 2: calculate impact scores per plate
- save the results in: /results/Impact-Scores/Method-MeanProfiles/impact_scores_perplate_todaydate
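
The difference between the two approaches can be sketched as below. The scoring function is not defined in this issue, so `distance` (one minus the Pearson correlation) is a stand-in chosen only for illustration; the function names and the plate-keyed dictionaries are likewise assumptions.

```python
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length feature vectors."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return cov / norm


def distance(wt_profile, mt_profile):
    """Stand-in impact score (an assumption): 1 - Pearson correlation."""
    return 1.0 - pearson(wt_profile, mt_profile)


def mean_vector(rows):
    """Per-feature mean across a list of profiles."""
    return [statistics.fmean(col) for col in zip(*rows)]


def impact_treatment_level(wt_by_plate, mt_by_plate, score=distance):
    """Approach 1: average the replicate profiles first, then score once."""
    wt_mean = mean_vector(list(wt_by_plate.values()))
    mt_mean = mean_vector(list(mt_by_plate.values()))
    return score(wt_mean, mt_mean)


def impact_per_plate(wt_by_plate, mt_by_plate, score=distance):
    """Approach 2: score WT against mutant separately within each shared plate."""
    return {
        plate: score(wt_by_plate[plate], mt_by_plate[plate])
        for plate in wt_by_plate
        if plate in mt_by_plate
    }
```

Approach 1 yields one score per WT-MT pair, while Approach 2 yields one score per plate, which makes plate-to-plate variability visible before any averaging.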

Evaluation of Negative and Positive Control Selections

This issue documents information and evaluation results for the negative and positive controls included in the VarChAMP Cell Painting experiments. Following the discussion in GH issue #3, the following treatments were selected as controls in batch B1A1R1.

Negative Control for Morphological Changes: MAPK9, RHEB, SLIRP, & PRKACB
- Labeled "NC" in "node_type" column in metadata file.
- Treatments selected because they have low replicate correlations in CPJUMP1 experiments.
Positive Control for Morphological Changes: PTK2B
- Labeled "PC" in "node_type".
- Selected because its function is likely to induce morphological changes but hasn't been confirmed in our assay.
Positive Control for Protein Instability/Mislocalization - ALK vs. ALK R1275Q
- Labeled "PC" in "node_type".
- Selected because the protein of variant R1275Q should mislocalize to the ER.
Transduction/Selection Control: 516 - TC
- Labeled "TC" in "node_type".
- There should be no remaining cells in the well after selection.

Plates 1 to 4 in B1A1R1 each contain four replicates of each control treatment. Plate 4 also contains 21 candidate controls (one replicate on each plate) that could be used as controls in future experiments.

For our reference, the annotations in the "node_type" column in plate maps are:

"allele" - missense mutant
"disease_wt" - reference/WT allele
"TC" - transduction/selection control
"PC" - positive control
"NC" - negative control
"cPC" - candidate positive control (morphology)
"cPPC" - candidate positive control (protein)
"cNC" - candidate negative control (morphology)
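
For scripted use, the annotations above can be kept as a lookup table and used to select wells from a platemap. This is a small sketch: the "node_type" column name comes from the description above, while the "well" key and the helper name are hypothetical.

```python
NODE_TYPES = {
    "allele": "missense mutant",
    "disease_wt": "reference/WT allele",
    "TC": "transduction/selection control",
    "PC": "positive control",
    "NC": "negative control",
    "cPC": "candidate positive control (morphology)",
    "cPPC": "candidate positive control (protein)",
    "cNC": "candidate negative control (morphology)",
}


def wells_by_node_type(platemap_rows, node_types):
    """Return the platemap rows whose "node_type" is one of the requested annotations."""
    wanted = set(node_types)
    unknown = wanted - set(NODE_TYPES)
    if unknown:
        raise ValueError(f"unknown node_type values: {sorted(unknown)}")
    return [row for row in platemap_rows if row["node_type"] in wanted]


# Hypothetical platemap rows; a real platemap has more columns.
rows = [
    {"well": "A01", "node_type": "NC"},
    {"well": "A02", "node_type": "allele"},
    {"well": "A03", "node_type": "PC"},
]
controls = wells_by_node_type(rows, ["NC", "PC"])
```

Rejecting unknown annotations up front helps catch the inconsistent platemap labels mentioned in the processing pipeline issue.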