d3b-bixu-data-assembly's Issues

Incomplete PNOC008 clinical data in data assembly histologies

Some issues that we should look into to ensure PNOC008 sample information is correctly filled in the data assembly histologies file:

  1. Histology information is missing from the data assembly file. The KF clinical file does not help here because it is also missing much of this information. Should we fill it in manually?
# columns relevant to generalized patient reporting
hist_cols_used <- c("Kids_First_Biospecimen_ID","sample_id","cohort","cohort_participant_id",
                    "parent_aliquot_id","aliquot_id",
                    "experimental_strategy","RNA_library",
                    "sample_type","tumor_descriptor",
                    "pathology_diagnosis","integrated_diagnosis",
                    "short_histology","broad_histology","molecular_subtype",
                    "gtex_group","cancer_group",
                    "OS_days","OS_status")

# read most recent data assembly and keep only columns relevant to generalized patient reporting
data_assembly <- read.delim('~/Downloads/histologies-add-bixops-uptoPNOC008-52.tsv')
data_assembly <- data_assembly %>%
  filter(cohort %in% c("PNOC", "CBTN")) %>%
  mutate(parent_aliquot_id = NA,
         gtex_group = NA, 
         cancer_group = NA) %>%
  dplyr::select(all_of(hist_cols_used)) # 3687 rows

# read PNOC008 clinical manifest from KF data tracker
pnoc008_clinical <- readxl::read_xlsx('~/Projects/OMPARE/data/manifest/pnoc008_manifest.xlsx')

# integrated_diagnosis, short_histology, broad_histology, molecular_subtype missing entirely
data_assembly %>%
  dplyr::filter(cohort_participant_id %in% pnoc008_clinical$`Research ID`) %>%
  dplyr::select(integrated_diagnosis, short_histology, broad_histology, molecular_subtype) %>% 
  unique()

  integrated_diagnosis short_histology broad_histology molecular_subtype
1                   NA            <NA>            <NA>              <NA>
  2. Missing survival information:
# combine PNOC008 clinical with data assembly on common cohort participant identifiers
pnoc008_clinical %>%
  dplyr::select(`Research ID`, `Last Known Status`,
                `Age at Diagnosis (in days)`, `Age at Collection (in days)`,
                `Age at Age At Last Known Status (if deceased, this is days to death)`) %>%
  inner_join(data_assembly %>%
               dplyr::select(cohort_participant_id, OS_status, OS_days) %>%
               unique(), by = c("Research ID" = "cohort_participant_id"))

The first five columns are from the KF clinical file and the last two are from the data assembly file:

Research ID Last Known Status Age at Diagnosis (in days) Age at Collection (in days) Age at Age At Last Known Status (if deceased, this is days to death) OS_status OS_days
C3064299 Living 4065 4067 NA NA NA
C3064422 Deceased 2758 2758 3074 NA NA
C3070818 Living 737 737 NA NA NA
C3070941 Living 4143 4143 NA NA NA
C3071064 Living 6784 6784 NA NA NA
C3071310 Living 3922 3922 NA NA NA
C3071433 Living NA NA NA NA NA
C3077337 Living 5151 5151 NA NA NA
C3077460 Living 2288 2288 NA NA NA
C3077583 Living 4049 4049 NA NA NA
C3077706 Living 5233 5233 NA NA NA
C3077829 Living 7309 NA NA NA NA
C3078075 Living 3480 3480 NA NA NA
C3077952 Living 2801 2801 NA NA NA
C3078198 Living 3147 3147 NA NA NA
C3183978 Living NA 2435 NA NA NA
C3172416 Living NA 1552 NA NA NA
C3172539 Living NA 5688 NA NA NA
C3172662 Living NA 4867 NA NA NA
C3172785 Living NA 6904 NA NA NA
C3172908 Living NA 3072 NA NA NA
C3173031 Living NA 5080 NA NA NA
C3173154 Living NA 3863 NA NA NA
C3173277 Living NA 5353 NA NA NA
C3173400 Living NA 1886 NA NA NA
C3173523 Living NA 1825 NA NA NA
C3173646 Living NA 4532 NA NA NA
C3173769 Living NA 2038 NA NA NA
C3505500 Living NA 4715 NA NA NA
C3505623 Living NA 1065 NA NA NA
C3505746 Living NA 3376 NA NA NA
C3505869 Living NA 6783 NA NA NA
C3505992 Living NA 1991 NA NA NA
C3506115 Living NA 5569 NA NA NA
C3506238 Living NA 1765 NA NA NA
C3506361 Living NA 1126 NA NA NA
C3506484 Living NA 2952 NA NA NA
C3506607 Living NA 6026 NA NA NA
C3506730 Living NA 4687 NA NA NA
C3506853 Living NA 5874 NA NA NA
C3506976 Living NA 4352 NA NA NA
C3507099 Living NA 4015 NA NA NA
C3507222 Living NA 4991 NA NA NA
C3507345 Living NA 6300 NA NA NA
C3507468 Living NA 6817 NA NA NA
C3507714 Living NA 2069 NA NA NA
C3507591 Living NA 4869 NA NA NA
C3507837 Living NA 4595 NA NA NA
C3507960 Living NA 4169 NA NA NA
C3508083 Living NA 3165 NA NA NA
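
A minimal sketch of how the missing survival fields could be derived from the KF clinical columns, using two rows copied from the table above. It assumes OS_status is the upper-cased Last Known Status and OS_days is age at last known status minus age at diagnosis. Neither definition is confirmed in this thread, and the short column names and the derive_survival helper are hypothetical.

```r
# Two rows copied from the table above, with shortened (hypothetical) column names.
clinical <- data.frame(
  research_id = c("C3064299", "C3064422"),
  last_known_status = c("Living", "Deceased"),
  age_at_diagnosis_days = c(4065, 2758),
  age_at_last_known_status_days = c(NA, 3074),
  stringsAsFactors = FALSE
)

# Assumption: OS_status is the upper-cased status and OS_days is
# age at last known status minus age at diagnosis.
derive_survival <- function(df) {
  df$OS_status <- toupper(df$last_known_status)
  df$OS_days <- df$age_at_last_known_status_days - df$age_at_diagnosis_days
  df
}

survival <- derive_survival(clinical)
print(survival[, c("research_id", "OS_status", "OS_days")])
```

Under these assumptions OS_days stays NA for living patients with no age at last known status, so the manifest would still need that field filled in before survival can be used.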

cc: @aadamk

Duplicated gene_id in merged expression matrices

To be discussed in the next toolkit meeting.

Using the latest merged TPM/Count expression matrices from data assembly project:

  1. gene_id column is duplicated
# TPM matrix
tpm <- readRDS('~/Downloads/gene-expression-rsem-tpm.BS_MADCWWMX.rds')
grep('gene_id', colnames(tpm))
[1]    1 2527

This is also the case with the counts matrix.
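
A possible workaround on the reporting side, sketched on a toy data frame (the values are made up): drop the repeated column only after confirming it is identical to the first copy.

```r
# Toy stand-in for the merged matrix, with gene_id appearing twice.
tpm_toy <- data.frame(
  gene_id = c("ENSG01_A", "ENSG02_B"),
  BS_1 = c(1.5, 0),
  gene_id = c("ENSG01_A", "ENSG02_B"),
  BS_2 = c(3.2, 7.1),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

is_dup <- duplicated(colnames(tpm_toy))
dup_idx <- which(is_dup)
first_idx <- match(colnames(tpm_toy)[dup_idx], colnames(tpm_toy))

# Only drop a repeated column if it matches the first copy exactly.
same <- vapply(seq_along(dup_idx), function(k) {
  identical(tpm_toy[[dup_idx[k]]], tpm_toy[[first_idx[k]]])
}, logical(1))
stopifnot(all(same))

tpm_clean <- tpm_toy[, !is_dup, drop = FALSE]
print(colnames(tpm_clean))
```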

cc: @aadamk

Collapse and combine GTEx TPM and expected_count to KFDRC collapsed RSEM files

Background:

For OT, each data release will need to include processed GTEx v8 data in gene-expression-rsem-tpm-collapsed.rds and gene-counts-rsem-expected_count-collapsed.rds.

For the last data release I used:

  • a script to collapse GTEx without removing unexpressed genes (i.e. those with rowSum == 0)
    output gtex-gene-expression-rsem-tpm-collapsed.rds
  • a script to collapse the merged PBTA+GMKF RSEM file, again without removing unexpressed genes (rowSum == 0)
    output pbta-gmkf-gene-expression-rsem-tpm-collapsed.rds
  • a notebook to combine the two files
    output gene-expression-rsem-tpm-collapsed.rds

Required update

Update collapse-rnaseq so that it can also add the processed GTEx files.
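
A rough sketch of the collapse-then-combine flow on toy matrices, for discussion. It assumes duplicated gene symbols are resolved by keeping the row with the highest mean, and that the two collapsed matrices are combined on their shared gene symbols. Both choices are assumptions and may differ from the actual scripts; collapse_by_symbol is a hypothetical helper.

```r
# Hypothetical helper: one row per gene symbol, keeping the row with the
# highest mean. Genes with rowSum == 0 are kept, not filtered out.
collapse_by_symbol <- function(mat, symbols) {
  ord <- order(symbols, -rowMeans(mat))
  keep <- ord[!duplicated(symbols[ord])]
  out <- mat[keep, , drop = FALSE]
  rownames(out) <- symbols[keep]
  out
}

# Toy PBTA+GMKF matrix: TP53 appears twice, MYCN is all zero.
pbta <- matrix(c(1, 5, 0, 2, 6, 0), nrow = 3,
               dimnames = list(NULL, c("BS_1", "BS_2")))
pbta_symbols <- c("TP53", "TP53", "MYCN")

# Toy GTEx matrix with one sample.
gtex <- matrix(c(4, 0, 9), nrow = 3, dimnames = list(NULL, "GTEX_1"))
gtex_symbols <- c("TP53", "MYCN", "EGFR")

pbta_collapsed <- collapse_by_symbol(pbta, pbta_symbols)
gtex_collapsed <- collapse_by_symbol(gtex, gtex_symbols)

# Assumption: combine on gene symbols present in both matrices.
shared <- intersect(rownames(pbta_collapsed), rownames(gtex_collapsed))
combined <- cbind(pbta_collapsed[shared, , drop = FALSE],
                  gtex_collapsed[shared, , drop = FALSE])
print(combined)
```

Combining on the union of symbols instead would need a fill value for genes missing from one source, which is something to decide in the discussion.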

CC @zhangb1 @yuankunzhu @jharenza for discussion.

Minor issues with merged histologies

Updating this ticket with the latest histologies:

There are some minor issues with the merged histologies file that we can discuss in the next toolkit meeting. Using the latest merged histologies from the data assembly project:

  1. Duplicate rows corresponding to Kids_First_Biospecimen_ID. For each of these duplicated rows, one row has NA and one row is properly populated for certain fields.
dat <- read.delim('~/Downloads/tumor-board-1022-histology-ops-BS_ST7KGV85.tsv')
length(unique(dat$Kids_First_Biospecimen_ID)) # 36829 unique BS ids
dim(dat) # 39814 number of total rows

# one example is BS_007JTNB8 where the first row has NA for fields like cns_region, short_histology, etc but second row is properly populated 
dat %>%
  filter(Kids_First_Biospecimen_ID == "BS_007JTNB8")
  Kids_First_Biospecimen_ID      cns_region sample_id aliquot_id Kids_First_Participant_ID
1               BS_007JTNB8            <NA> 7316-2558     655073               PT_1MW98VR1
2               BS_007JTNB8 Posterior fossa 7316-2558     655073               PT_1MW98VR1
  experimental_strategy sample_type  composition  tumor_descriptor               primary_site
1                   WGS       Tumor Solid Tissue Initial CNS Tumor Cerebellum/Posterior Fossa
2                   WGS       Tumor Solid Tissue Initial CNS Tumor Cerebellum/Posterior Fossa
  reported_gender  race              ethnicity age_at_diagnosis_days pathology_diagnosis
1            Male White Not Hispanic or Latino                  1872          Ependymoma
2            Male White Not Hispanic or Latino                  1872          Ependymoma
  integrated_diagnosis short_histology broad_histology Notes germline_sex_estimate RNA_library
1                 <NA>            <NA>            <NA>  <NA>                  <NA>        <NA>
2                 <NA> Ependymal tumor Ependymal tumor  <NA>                  Male        <NA>
  OS_days OS_status PFS_days cohort age_last_update_days seq_center normal_fraction
1     687    LIVING      687   CBTN                 2559  NantOmics       0.3224044
2     687    LIVING      687   PBTA                 2559  NantOmics       0.3224044
  tumor_fraction tumor_ploidy cancer_predispositions     molecular_subtype cohort_participant_id
1      0.6775956            2        None documented                  <NA>               C632220
2      0.6775956            2        None documented EPN, To be classified               C632220
   extent_of_tumor_resection
1 Gross/Near total resection
2 Gross/Near total resection
  2. RNA sample information is unavailable for many samples, so it has to be pulled from the warehouse and added to the master histology file
# an example with patient 47, 48 and 49
dat %>%
  filter(cohort_participant_id %in% c("C3507468", "C3507591", "C3507714")) %>%
  dplyr::select(cohort_participant_id, experimental_strategy) %>%
  unique()

  cohort_participant_id experimental_strategy
1              C3507714                   WXS
2              C3507591                   WXS
3              C3507468                   WXS
  3. short_histology is missing for many samples
# an example with patient 47, 48 and 49
dat %>%
  filter(cohort_participant_id %in% c("C3507468", "C3507591", "C3507714")) %>%
  dplyr::select(cohort_participant_id, short_histology) %>%
  unique()

  cohort_participant_id short_histology
1              C3507714            <NA>
2              C3507591            <NA>
3              C3507468            <NA>
  4. Among the RNA samples that are present in the master histology file, 189 have no RNA_library info
# ids where RNA_library is NA
unique_bs_ids <- dat %>%
  filter(sample_type == "Tumor",
         experimental_strategy == "RNA-Seq",
         is.na(RNA_library)) %>%
  pull(Kids_First_Biospecimen_ID) %>%
  unique()

# 188 are from CBTN and 1 from GMKF
dat %>%
  filter(Kids_First_Biospecimen_ID %in% unique_bs_ids) %>%
  group_by(cohort)  %>%
  summarise(n = n())

# A tibble: 2 × 2
  cohort     n
  <chr>  <int>
1 CBTN     188
2 GMKF       1
  5. A few PNOC008 patients are missing. We don't have data for patients 1, 7 and 50, so those can be ignored, but we do have data for 43 and 44, and these patients do not appear in the master histology file:
# latest data assembly histology file
dat = read.delim('tumor-board-1022-histology-ops-BS_ST7KGV85.tsv')

# pnoc008 clinical data manifest from Kids First data tracker
pnoc008_manifest = readxl::read_xlsx('FV_D8H9K61X_Copy of 03.01.2022 PNOC008 Clinical Data Manifest.xlsx', sheet = 2)

# identify ids that are in clinical data manifest but not in data assembly histology file
pnoc008_ids <- pnoc008_manifest$`Research ID`
pnoc008_not_found <- setdiff(pnoc008_ids, dat$cohort_participant_id)
pnoc008_manifest[which(pnoc008_manifest$`Research ID` %in% pnoc008_not_found),"PNOC Subject ID"]

# A tibble: 5 × 1
  `PNOC Subject ID`
  <chr>            
1 P-01             
2 P-07             
3 P-43             
4 P-44             
5 P-50 
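
For the duplicate rows in the first item, one stopgap is to keep the most complete row per biospecimen. The sketch below uses a toy table: the BS_007JTNB8 values come from the example above and BS_TOY00001 is a made-up ID. Note that the two duplicate rows also differ in cohort (CBTN vs PBTA), which this approach does not reconcile.

```r
hist_toy <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_007JTNB8", "BS_007JTNB8", "BS_TOY00001"),
  cns_region = c(NA, "Posterior fossa", "Hemispheric"),
  short_histology = c(NA, "Ependymal tumor", NA),
  cohort = c("CBTN", "PBTA", "CBTN"),
  stringsAsFactors = FALSE
)

# Sort so that, within each biospecimen, the row with the fewest NAs comes first.
n_missing <- rowSums(is.na(hist_toy))
ord <- order(hist_toy$Kids_First_Biospecimen_ID, n_missing)
sorted <- hist_toy[ord, ]

# Keep only the first (most complete) row per biospecimen.
deduped <- sorted[!duplicated(sorted$Kids_First_Biospecimen_ID), ]
print(deduped)
```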

cc: @aadamk

Differences in OT and data assembly information

Currently, the generalized patient reporting code cannot be automated because, for each new patient report, I have to manually check and map all missing information across the various file sources, i.e. KF clinical data (new patients are not in OT), Open Targets, and the data assembly histology file.

Here are differences in some columns between OT histology file and data assembly file (To be discussed in toolkit as well):

# minimal set of columns to be used for pediatric samples within the generalized reporting code 
hist_cols_used <- c("Kids_First_Biospecimen_ID","sample_id","cohort","cohort_participant_id",
                    "parent_aliquot_id","aliquot_id",
                    "experimental_strategy","RNA_library",
                    "sample_type","tumor_descriptor",
                    "pathology_diagnosis","integrated_diagnosis",
                    "short_histology","broad_histology","molecular_subtype",
                    "OS_days","OS_status")

# read data assembly histology and subset to minimal columns
# add parent_aliquot_id as this field is not present in the data assembly file
data_assembly <- read.delim('~/Downloads/histologies-add-bixops-uptoPNOC008-52.tsv')
data_assembly <- data_assembly %>%
  filter(cohort %in% c("PNOC", "CBTN")) %>%
  mutate(parent_aliquot_id = NA) %>%
  dplyr::select(all_of(hist_cols_used)) # 3687

# read open targets and subset to minimal columns
open_targets <- read.delim('~/Projects/PediatricOpenTargets/OpenPedCan-analysis/data/histologies.tsv')
open_targets <- open_targets %>%
  dplyr::filter(cohort == "PBTA") %>%
  dplyr::select(all_of(hist_cols_used)) # 2984

# subset both files on common samples
common_samples <- intersect(data_assembly$Kids_First_Biospecimen_ID, open_targets$Kids_First_Biospecimen_ID)
data_assembly <- data_assembly %>%
  filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
  arrange(Kids_First_Biospecimen_ID)
open_targets <- open_targets %>%
  filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
  arrange(Kids_First_Biospecimen_ID)

# replace NA to blank so we can compare the columns in both histology files
data_assembly[is.na(data_assembly)] <- "" 
open_targets[is.na(open_targets)] <- ""

# check differences in the minimal set of columns only
for(i in seq_along(hist_cols_used)){
  col_to_check <- hist_cols_used[i]
  diff_rows <- open_targets[which(open_targets[,col_to_check] != data_assembly[,col_to_check]),] %>%
    nrow() 
  print(paste(col_to_check, diff_rows, sep = ": "))
}

[1] "Kids_First_Biospecimen_ID: 0"
[1] "sample_id: 0"
[1] "cohort: 2984"
[1] "cohort_participant_id: 0"
[1] "parent_aliquot_id: 2840"
[1] "aliquot_id: 0"
[1] "experimental_strategy: 0"
[1] "RNA_library: 0"
[1] "sample_type: 0"
[1] "tumor_descriptor: 94"
[1] "pathology_diagnosis: 80"
[1] "integrated_diagnosis: 1364"
[1] "short_histology: 2111"
[1] "broad_histology: 2111"
[1] "molecular_subtype: 1615"
[1] "OS_days: 472"
[1] "OS_status: 0"

So, in short:

  1. parent_aliquot_id is absent in data assembly
  2. there are many differences in cohort (e.g. OT uses PBTA and data assembly uses PNOC/CBTN), tumor_descriptor, pathology_diagnosis, integrated_diagnosis, short_histology, broad_histology, molecular_subtype and OS_days
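
The comparison above relies on both files having the same row order after arrange(). A sketch of a key-based alternative that joins on the biospecimen ID and treats two NAs as equal is shown below. The count_diffs helper and the toy values are hypothetical.

```r
# Toy stand-ins for the data assembly (da) and Open Targets (ot) files.
da <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_A", "BS_B", "BS_C"),
  short_histology = c("HGAT", NA, "LGAT"),
  OS_days = c(100, 200, NA),
  stringsAsFactors = FALSE
)
ot <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_C", "BS_A", "BS_B"),
  short_histology = c("LGAT", "LGAT", NA),
  OS_days = c(NA, 100, 250),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join on the key, then count mismatches per column.
count_diffs <- function(x, y, key, cols) {
  merged <- merge(x, y, by = key, suffixes = c(".x", ".y"))
  vapply(cols, function(col) {
    a <- merged[[paste0(col, ".x")]]
    b <- merged[[paste0(col, ".y")]]
    # Two NAs count as equal; NA versus a value counts as a difference.
    sum(!((is.na(a) & is.na(b)) | (!is.na(a) & !is.na(b) & a == b)))
  }, integer(1))
}

diffs <- count_diffs(da, ot, "Kids_First_Biospecimen_ID",
                     c("short_histology", "OS_days"))
print(diffs)
```

Joining on the ID keeps rows aligned regardless of row order. If an ID is duplicated in either file, merge() returns every pairing, so the counts would need checking in that case.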

cc: @aadamk

Data assembly requests

Hi @zhangb1,

Some data assembly requests:

  1. The merged STAR-Fusion file pbta-BS_A7Y1Y314-fusion-starfusion.tsv.gz does not look right: it contains NAs. Could you please check?

For the progression sample reports I was able to fetch the STAR-Fusion output file for Patient-43-P and merge it into v11, so I no longer need it, but I wanted to point this out for future patients.

  2. Would it be possible to generate collapsed Counts and TPM matrices (i.e. rownames are unique gene symbols)? Not high priority, but it would be nice to have so we don't have to collapse in the reporting code.

  3. Would it be possible to integrate the fusion annotation code to generate the fusion-putative-oncogenic.tsv? This is not high priority but would be nice to have.
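
For the first request, a quick way to locate the NAs is to count them per column and pull the affected rows. The sketch below runs on a toy table. The column names are an illustrative subset, not the confirmed layout of the merged file.

```r
# Toy stand-in for the merged fusion table (hypothetical columns and values).
fusion_toy <- data.frame(
  FusionName = c("KIAA1549--BRAF", NA, "C11orf95--RELA"),
  JunctionReadCount = c(12, 4, NA),
  tumor_id = c("BS_1", "BS_2", "BS_3"),
  stringsAsFactors = FALSE
)

# Number of NAs in each column.
na_per_column <- colSums(is.na(fusion_toy))
print(na_per_column)

# Rows that contain at least one NA.
rows_with_na <- fusion_toy[!complete.cases(fusion_toy), ]
print(rows_with_na)
```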

cc: @aadamk

Thanks!

Histology file fields

Comparing the primary sample information for Patients 43 and 35 from v11 with the corresponding progression samples in the new histology file histologies_v11-base-add-BS_AG4BP2PM.tsv, I think some of the fields have missing information. Is this because no information is available?

  1. pathology_diagnosis should be High-grade glioma/astrocytoma (WHO grade III/IV) (?). This is only for the +1 patient.
  2. broad_histology should be Diffuse astrocytic and oligodendroglial tumor (?). This is only for the +1 patient.
  3. short_histology should be HGAT (?). This is only for the +1 patient.
  4. RNA_library seems to have moved to PFS_days for BS_A7Y1Y314 (C3506976). This is from histologies_v11-base-add-BS_AG4BP2PM.tsv. Seems to have been resolved.
  5. molecular_subtype seems to be missing entirely from histologies_v11-base-add-BS_AG4BP2PM.tsv and histologies_v11-base-add-BS_HSXARQ1K.tsv. I think it was present in v11 histologies.tsv (but not in histologies-base.tsv). So they are NA across all samples.
  6. Just noticed the following discrepancy, two samples BS_J4E9SW51 and BS_H1XPVS9A (both correspond to C334437) are annotated as LGAT in v10/v11 but HGAT in data assembly histology files:
# OpenPedCan v10 histology
v10_histology <- read_tsv('data/OpenPedCan-analysis/data/v10/histologies.tsv')
v10_histology %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "Low-grade astrocytic tumor" "Low-grade astrocytic tumor"

# OpenPedCan v11 histology
v11_histology <- read_tsv('data/OpenPedCan-analysis/data/v11/histologies.tsv')
v11_histology %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "LGAT" "LGAT"

# data assembly file generated with BS_HSXARQ1K
dat <- read_tsv('histologies_v11-base-add-BS_HSXARQ1K.tsv')
dat %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "HGAT" "HGAT"

# data assembly file with BS_AG4BP2PM
dat <- read_tsv('histologies_v11-base-add-BS_AG4BP2PM.tsv')
dat %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "HGAT" "HGAT"

I manually filled these for the report generation so don't need these. Just wanted to make a note of it.
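
A sketch of how this kind of discrepancy could be flagged automatically. The toy data frames reuse the v11 and data assembly values shown above; the list layout and variable names are assumptions.

```r
# Toy stand-ins for two histology releases, using the values reported above.
releases <- list(
  v11 = data.frame(
    Kids_First_Biospecimen_ID = c("BS_J4E9SW51", "BS_H1XPVS9A"),
    short_histology = c("LGAT", "LGAT"),
    stringsAsFactors = FALSE
  ),
  data_assembly = data.frame(
    Kids_First_Biospecimen_ID = c("BS_H1XPVS9A", "BS_J4E9SW51"),
    short_histology = c("HGAT", "HGAT"),
    stringsAsFactors = FALSE
  )
)

ids <- c("BS_J4E9SW51", "BS_H1XPVS9A")

# One column per release, one row per biospecimen ID.
field_by_release <- sapply(releases, function(df) {
  df$short_histology[match(ids, df$Kids_First_Biospecimen_ID)]
})
rownames(field_by_release) <- ids

# IDs whose value differs between releases.
discordant <- ids[apply(field_by_release, 1, function(v) length(unique(v)) > 1)]
print(field_by_release)
print(discordant)
```

A release that uses long names (as v10 does with "Low-grade astrocytic tumor") would need mapping to abbreviations first, otherwise every sample would be flagged.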

cc @aadamk
