d3b-cds-manifest-prep's People

Contributors

chris-s-friedman

d3b-cds-manifest-prep's Issues

Submission Package bug -

Describe the bug

One file in genomic_info is not in the file manifest.

Expected behavior

all files in genomic_info should be in file

Version ID

0.8.1

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info
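
The expected behavior in this issue (every file in genomic_info also appears in file) can be sketched as a set difference between the ID columns of the two manifests. This is a minimal sketch, not the repository's QC code: the `file_id` column name, the in-memory row format, and the example IDs are assumptions made for illustration.

```python
def ids_missing_from(child_rows, parent_rows, key):
    """Return the sorted IDs that appear in child_rows but not in parent_rows."""
    parent_ids = {row[key] for row in parent_rows}
    child_ids = {row[key] for row in child_rows}
    return sorted(child_ids - parent_ids)


# Hypothetical rows; real manifests could be loaded with csv.DictReader.
file_manifest = [{"file_id": "GF_EXAMPLE1"}, {"file_id": "GF_EXAMPLE2"}]
genomic_info = [{"file_id": "GF_EXAMPLE1"}, {"file_id": "GF_EXAMPLE3"}]

missing = ids_missing_from(genomic_info, file_manifest, key="file_id")
print(missing)
```

The same helper could be pointed at `sample_id` to compare genomic_info against the sample manifest.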

Submission Package bug - PT_95S99RWE not in diagnoses manifest

Describe the bug

PT_95S99RWE is not in the diagnosis manifest

Expected behavior

PT_95S99RWE should have a diagnosis

Version ID

0.13.0

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info

Submission Package bug -

Describe the bug

23 diagnosis ids are not unique

Expected behavior

all ids need to be unique

Version ID

0.8.1

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info
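
One way to surface the non-unique diagnosis IDs described in this issue is to count each ID and report those seen more than once. The sketch below assumes rows are dictionaries with a `diagnosis_id` key; the IDs shown are made up.

```python
from collections import Counter


def duplicated_ids(rows, key):
    """Return a dict of {id: count} for every ID that appears more than once."""
    counts = Counter(row[key] for row in rows)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical diagnosis rows for illustration only.
diagnosis = [
    {"diagnosis_id": "DG_EXAMPLE_0"},
    {"diagnosis_id": "DG_EXAMPLE_1"},
    {"diagnosis_id": "DG_EXAMPLE_0"},
]

dupes = duplicated_ids(diagnosis, "diagnosis_id")
print(dupes)
```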

Feature Request: CDS v1.x.x

CDS v1.x.x

This version of CDS will have data from a few sources:

Note: the below list will be updated periodically with links to the related manifests.

  1. CBTN X01 source and Harmonized data
    a. source data file-sample-participant mapping: https://data-tracker.kidsfirstdrc.org/study/SD_BHJXBDQK/documents/SF_Z4B1Q5XE
    b. harmonized data post-harmonization manifest: https://data-tracker.kidsfirstdrc.org/study/SD_BHJXBDQK/documents/SF_R8XTMZAN
  2. CBTN Pre-X01 DNA and RNA files that went through new gencode
  3. CBTN Pre-X01 data that was not included in v0.14.1 that can be identified as coming from a particular participant and sample
  4. PNOC008 samples collected and analyzed after the file-sample-participant manifest for v0.14.1 was closed.

To establish item 3: these are participants, samples, and files that are released in CAVATICA, in OpenPedCan Histologies v12, or on PedCBioPortal, but are not in CDS v0.14.1.


Edits

  • Edit 1: 2023-02-16 - add links for cbtn x01 source file-sample-participant mapping and harmonized post-harmonization data manifests

Feature Request: try to incorporate bailey's testing harness

Is your feature request related to a problem? Please describe

try to incorporate bailey's testing and reporting harness into qc

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional Context

No response

Submission Package bug - Diagnoses Not Reported

Describe the bug

Investigate why diagnoses have the text Not Reported

Expected behavior

no diagnosis should be Not Reported.

Version ID

0.9.0

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info
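
As a starting point for the investigation in this issue, the rows carrying the placeholder text can be pulled out with a simple filter. This is only a sketch: it assumes the diagnosis text lives in a `primary_diagnosis` column (the column name used in the diagnosis table elsewhere in these issues), and the participant IDs are invented.

```python
def rows_with_placeholder(rows, column, placeholder="Not Reported"):
    """Return rows whose value in `column` equals the placeholder (case-insensitive)."""
    target = placeholder.lower()
    return [row for row in rows if row[column].strip().lower() == target]


# Hypothetical diagnosis rows for illustration only.
diagnosis_rows = [
    {"participant_id": "PT_EXAMPLE1", "primary_diagnosis": "Craniopharyngioma"},
    {"participant_id": "PT_EXAMPLE2", "primary_diagnosis": "Not Reported"},
]

flagged = rows_with_placeholder(diagnosis_rows, "primary_diagnosis")
print([row["participant_id"] for row in flagged])
```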

Submission Package bug - germline samples are in the diagnosis-sample map

Describe the bug

germline samples are in the diagnosis sample map

Expected behavior

germline samples shouldn't be in the diagnosis sample map. participants that have only germline samples will have a diagnosis in the diagnosis file but will not have a diagnosis in the diagnosis-sample map file.

Version ID

0.13.0

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info
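
The expected behavior in this issue amounts to filtering the diagnosis-sample map by sample type while leaving the diagnosis file untouched. A minimal sketch follows; it assumes the set of germline sample IDs has already been determined elsewhere (how that is done is not stated in the issue), and all IDs are hypothetical.

```python
def drop_germline_mappings(mapping_rows, germline_sample_ids):
    """Keep only diagnosis-sample rows whose sample is not germline."""
    return [row for row in mapping_rows if row["sample_id"] not in germline_sample_ids]


# Hypothetical IDs for illustration only.
germline_samples = {"BS_GERMLINE1"}
mapping = [
    {"diagnosis_id": "DG_EXAMPLE_0", "sample_id": "BS_TUMOR1"},
    {"diagnosis_id": "DG_EXAMPLE_1", "sample_id": "BS_GERMLINE1"},
]

filtered = drop_germline_mappings(mapping, germline_samples)
print(filtered)
```

Because only the mapping is filtered, a participant with nothing but germline samples still keeps a row in the diagnosis file.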

Submission Package bug - diagnosis-sample mapping

Describe the bug

CDS found that some samples had multiple diagnoses associated with them in the diagnosis-sample mapping. They expect a sample to only have one diagnosis.

For example:

in the diagnosis-sample mapping table:

diagnosis_id        sample_id
DG__BS_1GFP3T8N__0  BS_1GFP3T8N
DG__BS_1GFP3T8N__1  BS_1GFP3T8N
DG__BS_1GFP3T8N__2  BS_1GFP3T8N

and in the diagnosis table

diagnosis_id        primary_diagnosis                                 participant_id
DG__BS_1GFP3T8N__0  Craniopharyngioma                                 PT_P1F0AHMT
DG__BS_1GFP3T8N__1  High-grade glioma/astrocytoma (WHO grade III/IV)  PT_P1F0AHMT
DG__BS_1GFP3T8N__2  Low-grade glioma/astrocytoma (WHO grade I/II)     PT_P1F0AHMT

Is this expected and true?

Expected behavior

from the cds team:

If this is expected (e.g. because of heterogeneity in the tumor), would it be possible to modify the sample_ids so that there could be 1:1 mapping of sample ID : diagnosis ID?
If the participant_age_at_collection and the anatomic_site of a set of samples was the same, but they each had a unique diagnosis, could a secondary user infer that the tumor was heterogeneous for tumor grade or type?

Version ID

0.9.0

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info
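
The situation CDS reported in this issue (one sample mapped to several diagnoses) can be detected by grouping the diagnosis-sample mapping by `sample_id`. The sketch below uses the three rows quoted in the issue; the function name and the in-memory row format are assumptions.

```python
from collections import defaultdict


def samples_with_multiple_diagnoses(mapping_rows):
    """Return {sample_id: sorted diagnosis_ids} for samples with more than one diagnosis."""
    by_sample = defaultdict(set)
    for row in mapping_rows:
        by_sample[row["sample_id"]].add(row["diagnosis_id"])
    return {
        sample: sorted(diagnoses)
        for sample, diagnoses in by_sample.items()
        if len(diagnoses) > 1
    }


# Rows taken from the example in the issue.
mapping_rows = [
    {"diagnosis_id": "DG__BS_1GFP3T8N__0", "sample_id": "BS_1GFP3T8N"},
    {"diagnosis_id": "DG__BS_1GFP3T8N__1", "sample_id": "BS_1GFP3T8N"},
    {"diagnosis_id": "DG__BS_1GFP3T8N__2", "sample_id": "BS_1GFP3T8N"},
]

multi = samples_with_multiple_diagnoses(mapping_rows)
print(multi)
```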

Submission Package bug - Sequencing File Information is Missing

Describe the bug

Sequencing Information is incomplete for some files to be submitted in the second CDS release.

There are 8756 unique sequencing experiments associated with files being submitted.

The export from the dataservice with information about each of these experiments is here.

Platform

Accepted values:

AB Capillary
ABI Solid
BGISEQ
Complete Genomics
Helicos
Illumina
Ion Torrent
LS 454
Oxford Nanopore
PacBio SMRT

Actual Values

platform      count
Illumina      8503
Not Reported  242
Other         11

The issue is with the last two platforms. We need to decide what platform these experiments were performed on.
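
The comparison above between accepted and actual platform values can be reproduced with a small membership check. The accepted values and counts are copied from this issue; the function name is an assumption and this is a sketch rather than the project's validation code.

```python
# Accepted platform values as listed in this issue.
ACCEPTED_PLATFORMS = {
    "AB Capillary", "ABI Solid", "BGISEQ", "Complete Genomics", "Helicos",
    "Illumina", "Ion Torrent", "LS 454", "Oxford Nanopore", "PacBio SMRT",
}


def invalid_value_counts(value_counts, accepted):
    """Return the subset of {value: count} whose value is not an accepted enum."""
    return {value: n for value, n in value_counts.items() if value not in accepted}


# Actual platform counts as reported in this issue.
observed = {"Illumina": 8503, "Not Reported": 242, "Other": 11}

invalid = invalid_value_counts(observed, ACCEPTED_PLATFORMS)
print(invalid)
print(sum(observed.values()))  # total experiments covered by the counts
```

The counts add up to the 8756 unique sequencing experiments mentioned at the top of the issue, so no experiment is left out of the breakdown.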

The 11 experiments where the platform is Other are all RNA-Seq samples sequenced at BGI, with an instrument model of DNBSeq.

@chris-s-friedman to get the platform for the above from bix

For the 242, their composition of strategy, instrument model, and sequencing center is below. Note that none of these experiments have a value for instrument model.

library_strategy  instrument_model  sequencing_center_id  sequencing_center_name  count
RNA-Seq           Not Reported      SC_2ZBAMKK0           Novogene                81
WGS               Not Reported      SC_2ZBAMKK0           Novogene                131
WGS               Not Reported      SC_FAD4KCQG           BGI                     15
WGS               Not Reported      SC_N1EVHSME           NantOmics               10
WGS               Not Reported      SC_WWEQ9HFY           BGI@CHOP Genome Center  5

@chris-s-friedman to look through past files to get previously investigated platform

Instrument Model

Actual Values

Instrument Model  Count
Not Reported      5838
HiSeq             1809
HiSeq X           1007
Novaseq 6000      91
DNBSeq            11

None of these instrument models are accepted values in their data model

Neither HiSeq nor HiSeq X is an accepted value, but the data model does have values for HiSeq X Five and HiSeq X Ten.

There is no Novaseq instrument model in their enumerated values.

There is no DNBSeq instrument model in their enumerated values.

@baileyckelly to ask ccdi if these values above are acceptable

Of the Not Reported instrument models:

  1. 199 experiments are cbtn experiments from pre-x01
  2. 76 experiments are pnoc 003/008 experiments created before february 2023
  3. 5449 experiments are from cbtn x01
  4. 40 experiments are pnoc 003/008 experiments created on 2/6/2023 and 2/8/2023 that look to be associated with cbtn x01
  5. 74 experiments are associated with cbtn x01 under the study ID SD_8C478S85, High Incidence of Pediatric CNS Tumors, D3B-PCNST.

Items 1 and 2 will need some further investigation.

Items 3, 4, and 5 are all from the cbtn x01 and should all have similar instrument models.

Library Selection

For RNA-Seq samples, this is missing for all pre-x01 data.
For WGS, WXS, and Targeted Capture, this is missing for pre-x01 data and x01 data.


From the metadata template:

For sequencing files, please try to provide all metadata, if applicable, for the following properties: avg_read_length, number_of_reads, number_of_bp, coverage

Number of Reads

missing for 3192 experiments. All pre x01

Mean read length

missing for 3192 experiments. All pre x01

Coverage

Missing for all experiments

number of bp

missing for all experiments

Expected behavior

No response

Version ID

None

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info

Submission Package bug - sample in genomic_info not in sample

Describe the bug

One sample in genomic_info is not in the sample manifest.

Expected behavior

all samples in genomic info should be in sample

Version ID

0.8.1

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info

order IDs in the output manifests

Is your feature request related to a problem? Please describe.
The output order of entities in the different output manifests is not controlled. Occasionally this order changes between versions without underlying changes to the data, which causes unexpected diffs when updating manifests in GitHub. Ordered IDs would make it easier to understand diffs between versions.

Describe the solution you'd like
Order the output manifests by key ID
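
The requested ordering could be as simple as sorting each manifest's rows by its key column before writing. A minimal sketch, assuming rows are dictionaries and that each manifest has a single key column (here `sample_id`, with invented IDs):

```python
def sort_manifest(rows, key):
    """Return a new list of rows ordered by the given key column."""
    return sorted(rows, key=lambda row: row[key])


# Hypothetical rows in an arbitrary order.
sample_rows = [
    {"sample_id": "BS_EXAMPLE3"},
    {"sample_id": "BS_EXAMPLE1"},
    {"sample_id": "BS_EXAMPLE2"},
]

ordered = sort_manifest(sample_rows, "sample_id")
print([row["sample_id"] for row in ordered])
```

Sorting on a stable key means two builds from the same data produce identical files, so a diff only shows real changes.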

Submission Package bug - Diagnoses should be at the event level, not the aliquot level

Describe the bug

No response

Expected behavior

See the model google sheet here.

Diagnoses should be at the sample/ event level (7316 number).

Diagnosis IDs should take the form dg_[7316-1234]_[x], where the id starts with DG, then the 7316 number, then the diagnosis number within that event when events have multiple diagnoses attached to them.
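
The ID scheme described above can be sketched as a small helper that numbers the diagnoses within one event. Several details are assumptions, since the issue does not pin them down: an uppercase `DG` prefix, single underscores as separators, and numbering that starts at 0 (as in the existing `DG__..__0` IDs).

```python
def diagnosis_ids_for_event(event_id, diagnoses):
    """Build one ID per diagnosis: DG_<event id>_<index within the event>."""
    # Assumption: indices start at 0; the issue does not say where numbering begins.
    return [f"DG_{event_id}_{index}" for index, _ in enumerate(diagnoses)]


# "7316-1234" is the placeholder event number used in the issue text.
ids = diagnosis_ids_for_event("7316-1234", ["first diagnosis", "second diagnosis"])
print(ids)
```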

Version ID

0.12.0

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info

Submission Package bug - genomic_info values not in enums

Describe the bug

There are 3 values for platform, strategy and library that are not in the allowed enum

Expected behavior

all values for these columns should be enums

Version ID

0.8.1

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info

Submission Package bug - missing sample and files

Describe the bug

one sample (BS_0J5MCBZV) and three files (GF_8A1T39FW, GF_H6Z3Q10Y, GF_QN2WX9M5) are missing from the genomic_info manifest.

Expected behavior

These items should be in the genomic_info manifest

Version ID

0.8.0

Affected file(s)

  • sample
  • participant
  • diagnosis
  • diagnosis_sample_mapping
  • file
  • file_sample_participant_map
  • genomic_info
