d3b-center / d3b-bixu-data-assembly

License: Apache License 2.0

Perl 0.03% R 0.48% Shell 0.11% Dockerfile 0.02% Common Workflow Language 0.41% Python 0.46% HTML 98.49%

d3b-bixu-data-assembly's Issues

Incomplete PNOC008 clinical data in data assembly histologies

Some issues to look into so that PNOC008 sample information is correctly filled in the data assembly histologies file:

Histology information is missing from the data assembly file. The KF clinical file does not help here either, as it is also missing a lot of information. Should we try to fill this information in manually?

# columns relevant to generalized patient reporting
hist_cols_used <- c("Kids_First_Biospecimen_ID","sample_id","cohort","cohort_participant_id",
                    "parent_aliquot_id","aliquot_id",
                    "experimental_strategy","RNA_library",
                    "sample_type","tumor_descriptor",
                    "pathology_diagnosis","integrated_diagnosis",
                    "short_histology","broad_histology","molecular_subtype",
                    "gtex_group","cancer_group",
                    "OS_days","OS_status")

# read most recent data assembly and keep only columns relevant to generalized patient reporting
library(dplyr)
data_assembly <- read.delim('~/Downloads/histologies-add-bixops-uptoPNOC008-52.tsv')
data_assembly <- data_assembly %>%
  filter(cohort %in% c("PNOC", "CBTN")) %>%
  # these fields are not present in the data assembly file, so add them as NA
  mutate(parent_aliquot_id = NA,
         gtex_group = NA,
         cancer_group = NA) %>%
  dplyr::select(all_of(hist_cols_used)) # 3687 rows

# read PNOC008 clinical manifest from KF data tracker
pnoc008_clinical <- readxl::read_xlsx('~/Projects/OMPARE/data/manifest/pnoc008_manifest.xlsx')

# integrated_diagnosis, short_histology, broad_histology, molecular_subtype missing entirely
data_assembly %>%
  dplyr::filter(cohort_participant_id %in% pnoc008_clinical$`Research ID`) %>%
  dplyr::select(integrated_diagnosis, short_histology, broad_histology, molecular_subtype) %>% 
  unique()

  integrated_diagnosis short_histology broad_histology molecular_subtype
1                   NA            <NA>            <NA>              <NA>

Missing survival information:

# combine PNOC008 clinical with data assembly on common cohort participant identifiers
pnoc008_clinical %>%
  dplyr::select(`Research ID`, `Last Known Status`,
                `Age at Diagnosis (in days)`, `Age at Collection (in days)`,
                `Age at Age At Last Known Status (if deceased, this is days to death)`) %>%
  inner_join(data_assembly %>%
               dplyr::select(cohort_participant_id, OS_status, OS_days) %>%
               unique(), by = c("Research ID" = "cohort_participant_id"))

The first five columns are from the KF clinical file and the last two are from the data assembly file:

Research ID	Last Known Status	Age at Diagnosis (in days)	Age at Collection (in days)	Age at Age At Last Known Status (if deceased, this is days to death)	OS_status	OS_days
C3064299	Living	4065	4067	NA	NA	NA
C3064422	Deceased	2758	2758	3074	NA	NA
C3070818	Living	737	737	NA	NA	NA
C3070941	Living	4143	4143	NA	NA	NA
C3071064	Living	6784	6784	NA	NA	NA
C3071310	Living	3922	3922	NA	NA	NA
C3071433	Living	NA	NA	NA	NA	NA
C3077337	Living	5151	5151	NA	NA	NA
C3077460	Living	2288	2288	NA	NA	NA
C3077583	Living	4049	4049	NA	NA	NA
C3077706	Living	5233	5233	NA	NA	NA
C3077829	Living	7309	NA	NA	NA	NA
C3078075	Living	3480	3480	NA	NA	NA
C3077952	Living	2801	2801	NA	NA	NA
C3078198	Living	3147	3147	NA	NA	NA
C3183978	Living	NA	2435	NA	NA	NA
C3172416	Living	NA	1552	NA	NA	NA
C3172539	Living	NA	5688	NA	NA	NA
C3172662	Living	NA	4867	NA	NA	NA
C3172785	Living	NA	6904	NA	NA	NA
C3172908	Living	NA	3072	NA	NA	NA
C3173031	Living	NA	5080	NA	NA	NA
C3173154	Living	NA	3863	NA	NA	NA
C3173277	Living	NA	5353	NA	NA	NA
C3173400	Living	NA	1886	NA	NA	NA
C3173523	Living	NA	1825	NA	NA	NA
C3173646	Living	NA	4532	NA	NA	NA
C3173769	Living	NA	2038	NA	NA	NA
C3505500	Living	NA	4715	NA	NA	NA
C3505623	Living	NA	1065	NA	NA	NA
C3505746	Living	NA	3376	NA	NA	NA
C3505869	Living	NA	6783	NA	NA	NA
C3505992	Living	NA	1991	NA	NA	NA
C3506115	Living	NA	5569	NA	NA	NA
C3506238	Living	NA	1765	NA	NA	NA
C3506361	Living	NA	1126	NA	NA	NA
C3506484	Living	NA	2952	NA	NA	NA
C3506607	Living	NA	6026	NA	NA	NA
C3506730	Living	NA	4687	NA	NA	NA
C3506853	Living	NA	5874	NA	NA	NA
C3506976	Living	NA	4352	NA	NA	NA
C3507099	Living	NA	4015	NA	NA	NA
C3507222	Living	NA	4991	NA	NA	NA
C3507345	Living	NA	6300	NA	NA	NA
C3507468	Living	NA	6817	NA	NA	NA
C3507714	Living	NA	2069	NA	NA	NA
C3507591	Living	NA	4869	NA	NA	NA
C3507837	Living	NA	4595	NA	NA	NA
C3507960	Living	NA	4169	NA	NA	NA
C3508083	Living	NA	3165	NA	NA	NA
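One way the empty OS_status and OS_days columns could be back-filled from the KF clinical columns is sketched below. This is only a sketch under two assumptions that are not confirmed in this issue: that OS_status is the upper-cased vital status (the merged histologies file uses values such as LIVING), and that OS_days is the age at last known status minus the age at diagnosis. The short column names are hypothetical stand-ins for the manifest headers, and the three rows are copied from the table above.

```r
# Three rows copied from the table above (ages are in days).
# Column names are shortened, hypothetical stand-ins for the manifest headers.
clin <- data.frame(
  research_id = c("C3064299", "C3064422", "C3070818"),
  last_known_status = c("Living", "Deceased", "Living"),
  age_at_diagnosis = c(4065, 2758, 737),
  age_at_last_known_status = c(NA, 3074, NA),
  stringsAsFactors = FALSE
)

# Assumption: OS_status is the upper-cased last known status.
clin$OS_status <- toupper(clin$last_known_status)

# Assumption: OS_days = age at last known status - age at diagnosis.
# Rows missing either age stay NA instead of being guessed.
clin$OS_days <- clin$age_at_last_known_status - clin$age_at_diagnosis

print(clin[, c("research_id", "OS_status", "OS_days")])
```

With this definition only the deceased patient gets a value; living patients stay NA because the manifest has no age at last known status for them.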

cc: @aadamk

Duplicated gene_id in merged expression matrices

To be discussed in the next toolkit meeting.

Using the latest merged TPM/Count expression matrices from data assembly project:

gene_id column is duplicated

# TPM matrix
tpm <- readRDS('~/Downloads/gene-expression-rsem-tpm.BS_MADCWWMX.rds')
grep('gene_id', colnames(tpm))
[1]    1 2527

This is also the case with counts matrix
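Until the merged files are fixed, a possible workaround on the reporting side is to drop the repeated column after checking that it really is a copy of the first one. The toy data frame below is a made-up stand-in for the merged matrix (the gene ids and sample names are hypothetical); only the column name gene_id comes from the real file.

```r
# Made-up stand-in for the merged matrix: gene_id appears twice.
tpm <- data.frame(
  gene_id = c("ENSG01_A", "ENSG02_B"),
  BS_1 = c(1.5, 0),
  gene_id = c("ENSG01_A", "ENSG02_B"),
  BS_2 = c(2.0, 3.1),
  check.names = FALSE  # keep the duplicated column name as-is
)

dup_idx <- which(colnames(tpm) == "gene_id")

# Only drop the repeats if they hold exactly the same values as the first copy.
same <- all(vapply(dup_idx[-1],
                   function(i) identical(tpm[[i]], tpm[[dup_idx[1]]]),
                   logical(1)))
stopifnot(same)

# Keep the first occurrence of every column name.
tpm_dedup <- tpm[, !duplicated(colnames(tpm)), drop = FALSE]
print(colnames(tpm_dedup))
```

The identity check matters: if the two gene_id columns ever differ, the merge itself is misaligned and dropping a column would hide that.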

cc: @aadamk

Collapse and combine GTEx TPM and expected_count to KFDRC collapsed RSEM files

Background:

For OT we will need each data release to include processed GTEx v8 data in gene-expression-rsem-tpm-collapsed.rds and gene-counts-rsem-expected_count-collapsed.rds.

In the last data release I used:

a script to collapse GTEx without removing non-expressed genes (i.e., genes with rowSum == 0)
output: gtex-gene-expression-rsem-tpm-collapsed.rds
a script to collapse the merged PBTA+GMKF RSEM file, again without removing genes with rowSum == 0
output: pbta-gmkf-gene-expression-rsem-tpm-collapsed.rds
a notebook to combine the two files
output: gene-expression-rsem-tpm-collapsed.rds

Required update

Update collapse-rnaseq so that it can also handle the GTEx processed files.

GTEx v8 processed files can be downloaded from:
TPM: https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct.gz
expected_count: https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz

The update should:
keep genes with rowSum == 0
combine the files, filtering for common genes
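The requested behaviour could look roughly like the sketch below. It is not the existing collapse-rnaseq code: collapse_by_symbol is a hypothetical helper, the rule for resolving repeated gene symbols (keep the row with the highest mean) is an assumption, and the two small data frames stand in for the real GTEx and PBTA+GMKF inputs.

```r
# Hypothetical helper: collapse to one row per gene symbol.
# Assumption: among rows sharing a symbol, keep the one with the highest mean.
collapse_by_symbol <- function(df, symbol_col = "gene_symbol") {
  sample_cols <- setdiff(colnames(df), symbol_col)
  row_means <- rowMeans(as.matrix(df[, sample_cols, drop = FALSE]))
  # sort by symbol, then by decreasing mean, so the first row per symbol wins
  ord <- order(df[[symbol_col]], -row_means)
  df <- df[ord, , drop = FALSE]
  df <- df[!duplicated(df[[symbol_col]]), , drop = FALSE]
  mat <- as.matrix(df[, sample_cols, drop = FALSE])
  rownames(mat) <- df[[symbol_col]]
  mat  # genes with rowSum == 0 are deliberately kept
}

# Made-up inputs standing in for the GTEx and PBTA+GMKF matrices.
gtex <- data.frame(gene_symbol = c("A", "A", "B", "C"),
                   GTEX_1 = c(1, 5, 0, 2), GTEX_2 = c(1, 7, 0, 4))
pbta <- data.frame(gene_symbol = c("A", "B", "D"),
                   BS_1 = c(3, 0, 9), BS_2 = c(4, 0, 1))

gtex_mat <- collapse_by_symbol(gtex)
pbta_mat <- collapse_by_symbol(pbta)

# Combine the two collapsed matrices on their common genes only.
common_genes <- intersect(rownames(pbta_mat), rownames(gtex_mat))
combined <- cbind(pbta_mat[common_genes, , drop = FALSE],
                  gtex_mat[common_genes, , drop = FALSE])
print(combined)
```

Gene B has a row sum of zero in both inputs and still appears in the combined matrix, which is the behaviour asked for above. Genes C and D are dropped because each is present in only one input.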

CC @zhangb1 @yuankunzhu @jharenza for discussion.

Minor issues with merged histologies

Updating this ticket with the latest histologies:

There are some minor issues with the merged histologies file that we can discuss in the next toolkit meeting. Using the latest merged histologies from the data assembly project:

Duplicate rows corresponding to Kids_First_Biospecimen_ID. For each of these duplicated rows, one row has NA and one row is properly populated for certain fields.

dat <- read.delim('~/Downloads/tumor-board-1022-histology-ops-BS_ST7KGV85.tsv')
length(unique(dat$Kids_First_Biospecimen_ID)) # 36829 unique BS ids
dim(dat) # 39814 number of total rows

# one example is BS_007JTNB8 where the first row has NA for fields like cns_region, short_histology, etc but second row is properly populated 
> dat %>%
+     filter(Kids_First_Biospecimen_ID == "BS_007JTNB8")
  Kids_First_Biospecimen_ID      cns_region sample_id aliquot_id Kids_First_Participant_ID
1               BS_007JTNB8            <NA> 7316-2558     655073               PT_1MW98VR1
2               BS_007JTNB8 Posterior fossa 7316-2558     655073               PT_1MW98VR1
  experimental_strategy sample_type  composition  tumor_descriptor               primary_site
1                   WGS       Tumor Solid Tissue Initial CNS Tumor Cerebellum/Posterior Fossa
2                   WGS       Tumor Solid Tissue Initial CNS Tumor Cerebellum/Posterior Fossa
  reported_gender  race              ethnicity age_at_diagnosis_days pathology_diagnosis
1            Male White Not Hispanic or Latino                  1872          Ependymoma
2            Male White Not Hispanic or Latino                  1872          Ependymoma
  integrated_diagnosis short_histology broad_histology Notes germline_sex_estimate RNA_library
1                 <NA>            <NA>            <NA>  <NA>                  <NA>        <NA>
2                 <NA> Ependymal tumor Ependymal tumor  <NA>                  Male        <NA>
  OS_days OS_status PFS_days cohort age_last_update_days seq_center normal_fraction
1     687    LIVING      687   CBTN                 2559  NantOmics       0.3224044
2     687    LIVING      687   PBTA                 2559  NantOmics       0.3224044
  tumor_fraction tumor_ploidy cancer_predispositions     molecular_subtype cohort_participant_id
1      0.6775956            2        None documented                  <NA>               C632220
2      0.6775956            2        None documented EPN, To be classified               C632220
   extent_of_tumor_resection
1 Gross/Near total resection
2 Gross/Near total resection
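As a stop-gap on the reporting side, the duplicated rows could be reduced to one row per biospecimen by keeping the most complete row. This is a sketch of that idea only, not a fix for the merge itself. The first two rows below are trimmed from the BS_007JTNB8 example above; the third row (BS_EXAMPLE1 and its values) is made up so that the toy table also contains a non-duplicated id.

```r
# Trimmed copy of the BS_007JTNB8 example plus one made-up, non-duplicated row.
dat <- data.frame(
  Kids_First_Biospecimen_ID = c("BS_007JTNB8", "BS_007JTNB8", "BS_EXAMPLE1"),
  cns_region = c(NA, "Posterior fossa", "Hemispheric"),
  short_histology = c(NA, "Ependymal tumor", NA),
  cohort = c("CBTN", "PBTA", "CBTN"),
  stringsAsFactors = FALSE
)

# Count missing fields per row, then sort so the most complete row of each
# biospecimen comes first.
n_missing <- rowSums(is.na(dat))
dat_sorted <- dat[order(dat$Kids_First_Biospecimen_ID, n_missing), ]

# Keep the first (most complete) row per biospecimen id.
dat_dedup <- dat_sorted[!duplicated(dat_sorted$Kids_First_Biospecimen_ID), ]
print(dat_dedup)
```

Note that the two BS_007JTNB8 rows also disagree on cohort (CBTN vs PBTA), so picking the most complete row silently picks one cohort label as well; conflicts between non-missing values like this would still need a manual decision.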

RNA sample information is not available for many samples, so it has to be pulled from the warehouse and added to the master histology file.

# an example with patient 47, 48 and 49
dat %>%
  filter(cohort_participant_id %in% c("C3507468", "C3507591", "C3507714")) %>%
  dplyr::select(cohort_participant_id, experimental_strategy) %>%
  unique()

  cohort_participant_id experimental_strategy
1              C3507714                   WXS
2              C3507591                   WXS
3              C3507468                   WXS

short_histology is missing for many samples

# an example with patient 47, 48 and 49
dat %>%
  filter(cohort_participant_id %in% c("C3507468", "C3507591", "C3507714")) %>%
  dplyr::select(cohort_participant_id, short_histology) %>%
  unique()

  cohort_participant_id short_histology
1              C3507714            <NA>
2              C3507591            <NA>
3              C3507468            <NA>

For RNA samples that are present in the master histology file, there are 189 samples with no RNA_library info

# ids where RNA_library is NA
unique_bs_ids <- dat %>%
  filter(sample_type == "Tumor",
         experimental_strategy == "RNA-Seq",
         is.na(RNA_library)) %>%
  pull(Kids_First_Biospecimen_ID) %>%
  unique()

# 188 are from CBTN and 1 from GMKF
dat %>%
  filter(Kids_First_Biospecimen_ID %in% unique_bs_ids) %>%
  group_by(cohort)  %>%
  summarise(n = n())

# A tibble: 2 × 2
  cohort     n
  <chr>  <int>
1 CBTN     188
2 GMKF       1

A couple of PNOC008 patients are missing. We don't have data for patients 1, 7 and 50, so those can be ignored, but we do have data for patients 43 and 44, and these patients do not seem to be present in the master histology file:

# latest data assembly histology file
dat = read.delim('tumor-board-1022-histology-ops-BS_ST7KGV85.tsv')

# pnoc008 clinical data manifest from Kids First data tracker
pnoc008_manifest = read_xlsx('FV_D8H9K61X_Copy of 03.01.2022 PNOC008 Clinical Data Manifest.xlsx', sheet = 2)

# identify ids that are in clinical data manifest but not in data assembly histology file
pnoc008_ids <- pnoc008_manifest$`Research ID`
pnoc008_not_found <- setdiff(pnoc008_ids, dat$cohort_participant_id)
pnoc008_manifest[which(pnoc008_manifest$`Research ID` %in% pnoc008_not_found),"PNOC Subject ID"]

# A tibble: 5 × 1
  `PNOC Subject ID`
  <chr>            
1 P-01             
2 P-07             
3 P-43             
4 P-44             
5 P-50

cc: @aadamk

Differences in OT and data assembly information

Currently, the generalized patient reporting code cannot be automated because, for each new patient report, I have to manually check and map all missing information between the various file sources, i.e. KF clinical data (new patients are not in OT), Open Targets and the data assembly histology file.

Here are differences in some columns between OT histology file and data assembly file (To be discussed in toolkit as well):

# minimal set of columns to be used for pediatric samples within the generalized reporting code 
hist_cols_used <- c("Kids_First_Biospecimen_ID","sample_id","cohort","cohort_participant_id",
                    "parent_aliquot_id","aliquot_id",
                    "experimental_strategy","RNA_library",
                    "sample_type","tumor_descriptor",
                    "pathology_diagnosis","integrated_diagnosis",
                    "short_histology","broad_histology","molecular_subtype",
                    "OS_days","OS_status")

# read data assembly histology and subset to minimal columns
# add parent_aliquot_id as this field is not present in the data assembly file
data_assembly <- read.delim('~/Downloads/histologies-add-bixops-uptoPNOC008-52.tsv')
data_assembly <- data_assembly %>%
  filter(cohort %in% c("PNOC", "CBTN")) %>%
  mutate(parent_aliquot_id = NA) %>%
  dplyr::select(all_of(hist_cols_used)) # 3687

# read open targets and subset to minimal columns
open_targets <- read.delim('~/Projects/PediatricOpenTargets/OpenPedCan-analysis/data/histologies.tsv')
open_targets <- open_targets %>%
  dplyr::filter(cohort == "PBTA") %>%
  dplyr::select(all_of(hist_cols_used)) # 2984

# subset both files on common samples
common_samples <- intersect(data_assembly$Kids_First_Biospecimen_ID, open_targets$Kids_First_Biospecimen_ID)
data_assembly <- data_assembly %>%
  filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
  arrange(Kids_First_Biospecimen_ID)
open_targets <- open_targets %>%
  filter(Kids_First_Biospecimen_ID %in% common_samples) %>%
  arrange(Kids_First_Biospecimen_ID)

# replace NA to blank so we can compare the columns in both histology files
data_assembly[is.na(data_assembly)] <- "" 
open_targets[is.na(open_targets)] <- ""

# check differences in the minimal set of columns only
# (both files hold the same samples in the same order after the arrange() above)
for (col_to_check in hist_cols_used) {
  # count rows where the two files disagree; NAs were replaced with "" above
  diff_rows <- sum(open_targets[[col_to_check]] != data_assembly[[col_to_check]], na.rm = TRUE)
  print(paste(col_to_check, diff_rows, sep = ": "))
}

[1] "Kids_First_Biospecimen_ID: 0"
[1] "sample_id: 0"
[1] "cohort: 2984"
[1] "cohort_participant_id: 0"
[1] "parent_aliquot_id: 2840"
[1] "aliquot_id: 0"
[1] "experimental_strategy: 0"
[1] "RNA_library: 0"
[1] "sample_type: 0"
[1] "tumor_descriptor: 94"
[1] "pathology_diagnosis: 80"
[1] "integrated_diagnosis: 1364"
[1] "short_histology: 2111"
[1] "broad_histology: 2111"
[1] "molecular_subtype: 1615"
[1] "OS_days: 472"
[1] "OS_status: 0"

So, in short:

parent_aliquot_id is absent in data assembly
there are many differences in cohort (e.g. OT uses PBTA whereas data assembly uses PNOC/CBTN), tumor_descriptor, pathology_diagnosis, integrated_diagnosis, short_histology, broad_histology, molecular_subtype and OS_days

cc: @aadamk

Data assembly requests

Hi @zhangb1,

Some data assembly requests:

The merged STAR-Fusion file pbta-BS_A7Y1Y314-fusion-starfusion.tsv.gz does not look right: there are NAs in it. Could you please check?

For the progression sample reports I was able to fetch the STAR-Fusion output file for Patient-43-P and merge it into v11, so I no longer need it, but I wanted to point this out for future patients.

Would it be possible to generate collapsed matrices of Counts and TPM (i.e., rownames are unique gene symbols)? Not high priority, but it would be nice to have so that we don't have to collapse in the reporting code.
Would it be possible to integrate the fusion annotation code to generate fusion-putative-oncogenic.tsv? This is also not high priority but would be nice to have.

cc: @aadamk

Thanks!

Histology file fields

Comparing the primary sample information for Patients 43 and 35 from v11 with the corresponding progression samples in the new histology file histologies_v11-base-add-BS_AG4BP2PM.tsv, I think some of the fields have missing information. Is this because no information is available?

pathology_diagnosis should be High-grade glioma/astrocytoma (WHO grade III/IV) (?). This is only for the +1 patient.
broad_histology should be Diffuse astrocytic and oligodendroglial tumor (?). This is only for the +1 patient.
short_histology should be HGAT (?). This is only for the +1 patient.
~~RNA_library seems to have moved to PFS_days for BS_A7Y1Y314 (C3506976). This is from histologies_v11-base-add-BS_AG4BP2PM.tsv~~ Seems to have been resolved.
molecular_subtype seems to be missing entirely from histologies_v11-base-add-BS_AG4BP2PM.tsv and histologies_v11-base-add-BS_HSXARQ1K.tsv. I think it was present in v11 histologies.tsv (but not in histologies-base.tsv). So they are NA across all samples.
Just noticed the following discrepancy, two samples BS_J4E9SW51 and BS_H1XPVS9A (both correspond to C334437) are annotated as LGAT in v10/v11 but HGAT in data assembly histology files:

# OpenPedCan v10 histology
v10_histology <- read_tsv('data/OpenPedCan-analysis/data/v10/histologies.tsv')
v10_histology %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "Low-grade astrocytic tumor" "Low-grade astrocytic tumor"

# OpenPedCan v11 histology
v11_histology <- read_tsv('data/OpenPedCan-analysis/data/v11/histologies.tsv')
v11_histology %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "LGAT" "LGAT"

# data assembly file generated with BS_HSXARQ1K
dat <- read_tsv('histologies_v11-base-add-BS_HSXARQ1K.tsv')
dat %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "HGAT" "HGAT"

# data assembly file with BS_AG4BP2PM
dat <- read_tsv('histologies_v11-base-add-BS_AG4BP2PM.tsv')
dat %>%
  filter(Kids_First_Biospecimen_ID %in% c("BS_J4E9SW51", "BS_H1XPVS9A")) %>%
  pull(short_histology)
[1] "HGAT" "HGAT"

I filled these in manually for the report generation, so I don't need them fixed right now. I just wanted to make a note of it.

cc @aadamk
