broadinstitute / depmap_omics

What you need to process the Quarterly DepMap-Omics releases from Terra

Home Page: https://depmap.org/portal/

Topics: depmap, cancer-genomics, cloud-computing, data-science

depmap_omics's Introduction

depmap_omics

This repository contains the code that processes data for the biannual DepMap data release. The state of the pipeline for each release can be found under the "Releases" tab of this repo.

Getting Started

The processing pipeline relies on several external tools; see the installation steps below.

Installation

git clone https://github.com/BroadInstitute/depmap_omics.git && cd depmap_omics

pip install -e .

⚠️ This repository depends on other repos

Some important data and code come from the genepy library.

Follow the instructions on the genepy page to install that package.

⚠️ You need the following R and python packages

  1. Install Jupyter notebooks and the Google Cloud SDK:
  • install the Google Cloud SDK.
  • authenticate your SDK account by running gcloud auth application-default login in the terminal, and follow the instructions to log in.
  2. Install the R packages, including GSVA for ssGSEA in R: run R -e 'if(!requireNamespace("BiocManager", quietly = TRUE)){install.packages("BiocManager")};BiocManager::install(c("GSEABase", "erccdashboard", "GSVA", "DESeq2"));'

  3. For Python, install the packages in the requirements.txt file: pip install -r requirements.txt

⚠️ Follow instructions here to set up Terra and obtain access to services required for running the pipeline.

Repository File Structure

ccle_tasks/ Contains a notebook for each of the additional processing steps the CCLE team performs, as well as one-off tasks run by the omics team

data/ Contains important information used for processing, including Terra workspace configurations from past quarters

depmapomics/ Contains the core Python code used in the pipeline, called by the processing Jupyter notebooks

*_pipeline/ Contains some of the workflows' WDL files and the scripts used by these workflows

temp/ Contains temporary files that can be removed after processing (should be empty)

documentation/ Contains some additional files and diagrams for documenting the pipelines

tests/ Contains automated pytest functions used internally for development

Jupyter notebooks: RNA_CCLE.ipynb contains the DepMap processing pipeline for expression and fusions (from RNAseq data), and WGS_CCLE.ipynb contains the DepMap processing pipeline for copy number and mutations (from WGS/WES data)

Pipeline Walkthrough

The processing pipelines are encapsulated in two Jupyter notebooks (RNA_CCLE.ipynb and WGS_CCLE.ipynb). Each is divided into four steps: uploading and preprocessing, running the Terra pipelines, local postprocessing, and QC and upload. Here is a detailed walkthrough. (Note that the steps marked "internal only" are run as part of DepMap's data processing but are not meant for external users to reproduce, due to various dependencies that are unique to our team at the Broad. The "internal only" functions below can be found in the depmap_omics_upload repo.)

1. Uploading and Preprocessing (internal only)

Currently, sequenced data for DepMap is generated by the Genomics Platform (GP) at the Broad, which deposits it into several different Terra workspaces. The first step of this pipeline is therefore to look at these workspaces and

  • identify new samples by looking at the bam files and comparing them with bams we have already onboarded
  • remove duplicates and samples with broken file paths
  • onboard new samples and new versions of old cell lines if we find any

2. Running Terra Pipelines

We use Dalmatian to send requests to Terra, so before running this part, external users need to make sure that the dalmatian WorkspaceManager object is initialized with the right workspace and that the functions are given the correct workflow names as inputs. You can then run the RNAseq and/or WGS pipelines on your samples.
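
As a minimal sketch of what that initialization looks like: the workspace, workflow configuration, and sample-set names below are placeholders, not DepMap's actual ones, and create_submission's exact signature can vary across dalmatian versions.

import dalmatian

wm = dalmatian.WorkspaceManager("my-namespace/my-rnaseq-workspace")  # hypothetical workspace
samples = wm.get_samples()  # sample metadata table as a pandas DataFrame
print(samples.index)        # the sample IDs the workflows will run on

# launch a workflow configuration on a sample set (names are illustrative)
wm.create_submission("RNA_pipeline", "all_samples", etype="sample_set")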

For more in-depth documentation on what our pipelines contain, including the packages, input references, and parameters, please refer to this summary of the DepMap processing pipeline.

3. Downloading and Postprocessing (sections under on local in the notebooks)

This step will do a set of tasks:

  • clean the workspaces by deleting large files that are no longer needed, including unmapped bams.
  • retrieve interesting QC results from the workspaces.
  • copy realigned bam files to our own data storage bucket (internal only).
  • download the outputs from the Terra pipelines (sketched below).
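
A hedged sketch of pulling workflow outputs back down with dalmatian, reusing the WorkspaceManager from above; the output column name is an assumption and depends on how your workflows are configured.

import dalmatian

wm = dalmatian.WorkspaceManager("my-namespace/my-rnaseq-workspace")  # hypothetical workspace
samples = wm.get_samples()
# each workflow writes its outputs into columns of the sample table;
# "rsem_genes_tpm" is an illustrative column name, not necessarily the real one
output_paths = samples["rsem_genes_tpm"].dropna().tolist()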

The main postprocessing steps for each pipeline are as follows:

Copy Number

copynumbers.py contains the main postprocessing function postProcess(), responsible for postprocessing segments and creating gene-level (relative and absolute) CN files and a genomic feature table. Gene mapping information is retrieved from BioMart version nov2020. The function also applies the following filters to the segment and CN data:

  • Remove chrY segments from cell lines whose chrY segment count is greater than 150
  • Mark samples that have more than 1500 segments as QC failures and remove them
  • Remove genes whose Entrez ID is NaN in BioMart from the gene-level matrices

Internal only: dm_omics.cnPostProcessing() calls the above function on both WES and WGS data, merges them, renames the indices into ProfileIDs, and uploads them to taiga.

Note: to get exactly the same results as in DepMap, be sure to apply genecn = genecn.apply(lambda x: np.log2(1+x)) to the genecn dataframe in the CNV pipeline.
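
A minimal runnable version of that transform, assuming genecn is a gene-level relative CN matrix loaded as a pandas DataFrame (the file name is a placeholder):

import numpy as np
import pandas as pd

genecn = pd.read_csv("gene_level_cn.csv", index_col=0)  # hypothetical gene-level CN file
genecn = genecn.apply(lambda x: np.log2(1 + x))         # log2(1 + x), matching DepMap's released values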

Mutation

mutations.py contains postProcess(), the function responsible for postprocessing aggregated MAF files, genotyped mutation matrices (hotspot and damaging), binary guide mutation matrices, and structural variants (SVs).

Internal only: dm_omics.mutationPostProcessing() calls the above function on both WES and WGS data, merges them, renames the indices into ProfileIDs, removes genes whose Hugo symbol is not in BioMart, generates individual mutation datasets for each variant type, and uploads them to taiga. It also generates and uploads a binary matrix of germline mutations.

Expression

expressions.py contains the main postprocessing function responsible for postprocessing aggregated expression data from RSEM: it removes duplicates and QC failures, renames genes, filters and log-transforms values, and generates transcript-level, gene-level, and protein-coding gene-level expression data files. Gene mapping information is retrieved from BioMart version nov2020. Optionally, it also generates single-sample GSEA (ssGSEA) data.
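
A hedged sketch of the kind of deduplication and log transform this involves; the file and column names are illustrative, not the pipeline's actual identifiers.

import numpy as np
import pandas as pd

tpm = pd.read_csv("rsem_genes_tpm.tsv", sep="\t", index_col=0)  # hypothetical RSEM aggregate
tpm = tpm[~tpm.index.duplicated()]  # drop duplicated gene rows
log_tpm = np.log2(tpm + 1)          # log2(TPM + 1), the form DepMap releases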

Internal only: dm_omics.expressionPostProcessing() is a wrapper for the above function. It renames the indices into ProfileIDs and uploads the files to taiga.

Fusion

Functions that postprocess aggregated fusion data can be found in fusions.py. We apply filters to the fusion table to reduce the number of artifacts in the dataset (a hedged code sketch of these filters follows the list). Specifically, we filter out the following:

  • Remove fusions involving mitochondrial chromosomes, HLA genes, or immunoglobulin genes
  • Remove red herring fusions (from the STAR-Fusion annotations column)
  • Remove fusions recurrent in CCLE (>= 25 samples)
  • Remove fusions that have (SpliceType="INCL_NON_REF_SPLICE" AND LargeAnchorSupport="No" AND FFPM < 0.1)
  • Remove fusions with FFPM < 0.05 (STAR-Fusion suggests using 0.1, but looking at the translocation data, this seems too aggressive)
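
A minimal pandas sketch of those filters, assuming a STAR-Fusion-style table; the column names FFPM, SpliceType, LargeAnchorSupport, annots, and CCLE_count (a precomputed recurrence count) are assumptions, not necessarily the pipeline's actual identifiers.

import pandas as pd

fusions = pd.read_csv("fusions_aggregated.tsv", sep="\t")  # hypothetical input

red_herring = fusions["annots"].str.contains("RED_HERRING", na=False)
low_support = (
    (fusions["SpliceType"] == "INCL_NON_REF_SPLICE")
    & (fusions["LargeAnchorSupport"] == "No")
    & (fusions["FFPM"] < 0.1)
)
keep = (
    ~red_herring
    & ~low_support
    & (fusions["CCLE_count"] < 25)   # drop fusions recurrent in CCLE
    & (fusions["FFPM"] >= 0.05)      # minimum expression support
)
# (the mitochondrial/HLA/immunoglobulin gene filter is omitted here for brevity)
fusions = fusions[keep]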

Internal only: dm_omics.fusionPostProcessing() is a wrapper for the above function. It renames the indices into ProfileIDs and uploads the data to taiga.

4. QC, Grouping and Uploading to the Portal (internal use only)

We then perform the following QC tasks for each dataset:

CN

Once the CN files are saved, we load them back into Python and run some validations; in brief:

  • check summary statistics: mean, max, var...
  • compare to the previous release: same mean, max, var...
  • checkAmountOfSegments: flag any samples with a very high number of segments (see the sketch below)
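
A minimal sketch of that segment-count check, assuming a long-format segment table with a per-sample ID column; the file name, the DepMap_ID column, and the reuse of the 1500 cutoff from the CN filters above are all assumptions.

import pandas as pd

segments = pd.read_csv("segments.csv")             # hypothetical segment file
n_segments = segments.groupby("DepMap_ID").size()  # segments per sample
flagged = n_segments[n_segments > 1500]            # very high segment counts
print(flagged)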

Mutation

Compare to the previous release (Broad only)

We compare the results to the previous release's MAF. Namely:

  • Count the total number of mutations per cell line, split by type (SNP, INS, DEL)
  • Count the total number of mutations observed by position (group by chromosome, start position, end position and count the number of mutations)
REMARK:

Overall, the filters applied after the CGA pipeline are the following (a hedged code sketch of this logic follows the list):

We remove every variant that:

  • has AF < 0.1
  • OR has coverage < 4
  • OR has alt coverage = 1
  • OR is not in a coding region
  • OR is in ExAC with a frequency of > 0.005%
    • except if it is either
      • in TCGA > 3 times
      • OR in COSMIC > 10 times
    • AND in a set of known cancer regions
  • OR exists in > 5% of the CCLE samples
    • except if it is in TCGA > 5 times
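
A minimal pandas sketch of that boolean logic. All column names (AF, coverage, alt_count, is_coding, exac_af, tcga_count, cosmic_count, in_cancer_region, ccle_fraction) are illustrative assumptions, not the pipeline's actual identifiers; is_coding and in_cancer_region are assumed to be boolean columns.

import pandas as pd

maf = pd.read_csv("aggregated.maf", sep="\t")  # hypothetical aggregated MAF

# rescue clause for the ExAC frequency filter
exac_rescue = (
    ((maf["tcga_count"] > 3) | (maf["cosmic_count"] > 10))
    & maf["in_cancer_region"]
)

drop = (
    (maf["AF"] < 0.1)
    | (maf["coverage"] < 4)
    | (maf["alt_count"] == 1)
    | ~maf["is_coding"]
    | ((maf["exac_af"] > 0.00005) & ~exac_rescue)        # 0.005% expressed as a fraction
    | ((maf["ccle_fraction"] > 0.05) & ~(maf["tcga_count"] > 5))
)
maf = maf[~drop]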

RNA

Once the expression files are saved, we do the following validations:

  • check summary statistics: mean, max, var...
  • compare to the previous release: same mean, max, var...
  • QC on the number of genes with zero counts in each sample (see the sketch below)
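
A minimal sketch of that zero-count check, assuming counts is a genes x samples matrix; the file name is a placeholder and no particular threshold is implied by the source.

import pandas as pd

counts = pd.read_csv("gene_counts.csv", index_col=0)  # hypothetical counts matrix
zero_genes = (counts == 0).sum(axis=0)                # genes with zero counts, per sample
print(zero_genes.sort_values(ascending=False).head())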

After QC, data is uploaded to taiga for all portal audiences according to release dates in Gumbo.

@jkobject @gkugener @gmiller @5im1z @__BroadInsitute

If you have any feedback or run into any issues, feel free to post an issue on the GitHub repo.

depmap_omics's People

Contributors

5im1z, colganwi, dependabot[bot], dna-dave, gulatide, iboyle-broad, javadnoorb, jkobject, kugenerg, millergw, pgm, qinqian


depmap_omics's Issues

Availability of PureCN copy number calls for DepMap 23Q2

Hello,

Thank you so much for this repo!

Just wanted to know - for the most recent DepMap data release, I was able to find the relative CN and CN ratios per gene, but couldn't find the absolute copy numbers.

I saw from a discussion on the DepMap forum that ABSOLUTE calls are available up to CCLE 2019, but that PureCN is being optimized for future releases.

(https://forum.depmap.org/t/classifying-copy-number-alterations/2287)

Just wanted to check in to see if this data is available for the latest release, and if so, where I could find it.

Thank you so much!

New cell lines in 23q2 release not at Cellosaurus (no RRID)

Hi DepMap team,

I noticed that there are 5 new cell lines in the DepMap 23q2 release that are not annotated at Cellosaurus currently. Can we work on adding them? Happy to help if possible.

DepMap ID   Name    Source
ACH-001134  MYLA    Academic lab
ACH-001172  U251MGDM    HSRRB via Bandhyopadhyay (Broad)
ACH-002002  A375SKINCJ2 Cory Johannessen (Broad)
ACH-002471  PSS008  Alejandro Sweet-Cordero (UCSF)
ACH-002834  PSS131R Alejandro Sweet-Cordero (UCSF)

See related issue filed with the Cellosaurus team calipho-sib/cellosaurus#8

Best,
Mike

Using custom cell-line data

Hi,
First, I am attempting to run copynumbers.py, but I get the error "No module named 'gumbo_client'". I ran pip install gumbo_client, but got "No matching distribution found for gumbo_client", so I'm stuck. How can I solve this?

Second, using this pipeline, I want to obtain copy number alterations from my own cell-line WES data. How can I do that?

Thank you.

What are the differences between the gene expression profiles in DepMap and the Cancer Cell Line Encyclopedia (CCLE)?

In 2018, CCLE provided RNAseq data for 1,019 cancer cell lines (https://sites.broadinstitute.org/ccle/), which were also made available through DepMap. However, the recent DepMap release contains gene expression profiles for 1,408 cancer cell lines. We observed that expression values for the same gene in the same cell line differ between DepMap and CCLE. This raises some questions: has DepMap reused CCLE's raw RNAseq data (for the cell lines overlapping between CCLE 2018 and DepMap) but with a different analysis pipeline? And were only certain cancer cell lines in DepMap sequenced by DepMap itself?

Mutect2 clustered events correction

While examining the somatic mutation pipeline, I noticed recent "debug" commits making corrections for Mutect2 clustered_events: d901564, 559daef. These seem to change clustered_events to PASS when there are <=2 non-germline mutations in a +/- 50 bp region around the site.

It happens that I have also been looking for a reasonable way to prevent some clustered_events from being filtered out, as this filter is known to sometimes remove potentially "real" somatic mutations (e.g. this post). I am therefore very curious about what you found after applying the correction above -- does it improve performance (e.g. reducing false negatives without greatly increasing false positives)? And is this approach official yet -- will the next DepMap release adopt this correction for clustered_events?

What is the pseudo normal used to call mutations in cell lines?

Hello,

I'm trying to process WGS data from a cell line in a way similar to your pipeline at DepMap. I noticed in your workflow doc a note under the mutations slide saying "this pipeline requires a matched normal, so we use a pseudo normal for all cell lines samples". Could you explain what this pseudo normal is, and would it be possible for you to share this data with me?

Thank you

productionalization tasks

The following is a placeholder for all the issues/PRs that need to be opened for productionalization:

  1. Automation:
  • Replace notebooks with command-line functions
  • Automate the preprocessing step
  • Automate the postprocessing step
  • Link all steps together
  2. Code maintenance:
  • Save the workflows' dockers in a GC docker repository
  • Save WDL files on GitHub
  • Save scripts that are on GCS on GitHub
  • Sync workflows to the GitHub version
  3. Enhancements:
  • Create a master docker image shell with all required installations
  • Consolidate all the workflows for each pipeline into one master workflow (by doing task imports)
  • Make all workflows with our scripts get the files from a git pull of our GitHub (so that they are always up to date)
  • Record and save the commit hash during the pipeline run
  • Submit each pipeline through Sparklespray (SNV, CNV, fusion, expression)
  • Implement embarrassingly parallel slow jobs using Sparklespray (e.g. Skyros/DMC/Public taiga uploads)
  • Implement a function to auto-upload WDL scripts to Terra and update the workflows
  • Implement QCs for comparison to older data as a form of continuous integration
  4. Cost reduction:
  • Implement a function to auto-erase useless data after a run is finished/failed
  • Use GP buckets instead of cclebams
  5. Consolidate data:
  • Consistent disease name, subtype, …
  • Find SM-ids
  • Complete media type and condition
  • Analyze data from different sources collectively (add CCLF data to our workspaces)
  6. Documentation:
  • Document the workflows and code

Error in installation

Thank you for providing such a good way to download the DepMap database for those who have difficulty installing depmap from Bioconductor (like me). However, I still have some issues installing from GitHub...
It shows:

install_github("depmap_omics")
Error in parse_repo_spec(repo) :
Invalid git repo specification: 'depmap_omics'

May I ask if there is any solution to this issue? Thank you in advance for any response.

yuwei

Duplicate RNA-seq omics profiles for cell lines

Hi DepMap team,

I'm trying to load transcript-level counts from the RNA-seq expression data in the OmicsExpressionTranscriptsTPMLogp1Profile.csv file. I'm having trouble uniquely resolving the profile identifiers (e.g. "PR-lqUArB", "PR-pOBrMJ") to the model identifiers ("ACH-000029") using the OmicsProfiles.csv file. I'm seeing 16 cell lines that currently have this issue in the 23q2 release.

Is there a rational way to pick between the duplicate profiles for the 16 cell lines?

Here's a working example (in R) with more details:

library(dplyr)
library(pipette)
df <-
    import(
        con = "https://figshare.com/ndownloader/files/40449635",
        format = "csv"
    ) |>
    filter(Datatype == "rna") |>
    arrange(ModelID)
dupes <- sort(df[["ModelID"]][duplicated(df[["ModelID"]])])
print(dupes)
##  [1] "ACH-000029" "ACH-000095" "ACH-000143" "ACH-000206" "ACH-000328"
##  [6] "ACH-000337" "ACH-000455" "ACH-000468" "ACH-000517" "ACH-000532"
## [11] "ACH-000556" "ACH-000597" "ACH-000700" "ACH-000931" "ACH-000975"
## [16] "ACH-001192"
df <- df[df[["ModelID"]] %in% dupes, ]
print(df)
##      ProfileID ModelConditionID    ModelID Datatype WESKit
## 28   PR-lqUArB   MC-000029-BMZc ACH-000029      rna   <NA>
## 29   PR-pOBrMJ   MC-000029-BMZc ACH-000029      rna   <NA>
## 94   PR-6E5fvI   MC-000095-UcYl ACH-000095      rna   <NA>
## 95   PR-9bHyjI   MC-000095-UcYl ACH-000095      rna   <NA>
## 143  PR-dlwhbG   MC-000143-xMKb ACH-000143      rna   <NA>
## 144  PR-eLOZCF   MC-000143-xMKb ACH-000143      rna   <NA>
## 207  PR-by8s63   MC-000206-Jmpg ACH-000206      rna   <NA>
## 208  PR-xissjH   MC-000206-Jmpg ACH-000206      rna   <NA>
## 328  PR-DjTYZp   MC-000328-gA4f ACH-000328      rna   <NA>
## 329  PR-S409MD   MC-000328-gA4f ACH-000328      rna   <NA>
## 338  PR-ZJC2Tm   MC-000337-VmHG ACH-000337      rna   <NA>
## 339  PR-zvd6KC   MC-000337-VmHG ACH-000337      rna   <NA>
## 456  PR-HCodtv   MC-000455-QvVM ACH-000455      rna   <NA>
## 457  PR-JWn3XA   MC-000455-QvVM ACH-000455      rna   <NA>
## 470  PR-8iDtve   MC-000468-c6hY ACH-000468      rna   <NA>
## 471  PR-qf7nCW   MC-000468-c6hY ACH-000468      rna   <NA>
## 518  PR-aOml9R   MC-000517-kcbL ACH-000517      rna   <NA>
## 519  PR-i9CVhO   MC-000517-kcbL ACH-000517      rna   <NA>
## 534  PR-Q8g8M0   MC-000532-NN9r ACH-000532      rna   <NA>
## 535  PR-t6ctGM   MC-000532-NN9r ACH-000532      rna   <NA>
## 559  PR-1hnFd4   MC-000556-YK2Z ACH-000556      rna   <NA>
## 560  PR-Iug0GM   MC-000556-YK2Z ACH-000556      rna   <NA>
## 601  PR-3ATZmJ   MC-000597-RDyO ACH-000597      rna   <NA>
## 602  PR-w00MaJ   MC-000597-RDyO ACH-000597      rna   <NA>
## 704  PR-hs3wNI   MC-000700-mndS ACH-000700      rna   <NA>
## 705  PR-uQ6qid   MC-000700-mndS ACH-000700      rna   <NA>
## 934  PR-CHQ9Av   MC-000931-3a7D ACH-000931      rna   <NA>
## 935  PR-k0J8JP   MC-000931-3a7D ACH-000931      rna   <NA>
## 979  PR-eCRyEu   MC-000975-PUD5 ACH-000975      rna   <NA>
## 980  PR-s28NRl   MC-000975-PUD5 ACH-000975      rna   <NA>
## 1056 PR-RJkk8B   MC-001192-OhhV ACH-001192      rna   <NA>
## 1057 PR-V4rEyG   MC-001192-OhhV ACH-001192      rna   <NA>

Best,
Mike

CCLE_mutations.csv MAF GRCh37

Hello!

By any chance, is there a way to release the mutational profiles of CCLE as .vcf files instead of .maf? Also, related to the same data, are you planning to align against GRCh38 instead of GRCh37, as you did, for example, with the transcriptome?

Thank you!

Pedro

poetry add gumbo_client does not work due to psycopg2

With conda Python 3.9, poetry add git+https://github.com/broadinstitute/gumbo_client.git raises the error:

Note: This error originates from the build backend, and is likely not a problem with poetry but with psycopg2 (2.9.5) not supporting PEP 517 builds. You can verify this by running 'pip wheel --use-pep517 "psycopg2 (==2.9.5) ; python_version >= "3.6""'.

A simple workaround for now is to install gumbo_client manually: pip install git+https://github.com/broadinstitute/gumbo_client.git.

Capture kit used in WES samples

I want to first thank you all for the great work on the CCLE/DepMap database!

Recently I have been processing whole-exome sequencing samples from the CCLE_PRJNA523380 project. I am looking for the capture kit and PoN that were used for each of the WES samples:

• I found the sample_for_pon.tsv file containing 1668 lanes of records (said on the GitHub page to be from GTEx). Can you let me know if these samples are also used for the CCLE WES CNV process? And can you let me know where I can directly download the PoN file that corresponds to the CCLE WES data?

• On the GitHub page, you say you are using Illumina ICE intervals and Agilent intervals for WES samples, but I can only find a wes_agilent_hg19_baits.interval_list, which is very different from any Agilent-released capture kit bed file. Can you let me know exactly which capture kit has been used for exome capture for all the CCLE WES data?
