icgc-argo / argo-dictionary
Development of the ARGO Data Dictionary
Home Page: https://docs.icgc-argo.org/dictionary
License: GNU Affero General Public License v3.0
Will generate a codelist containing all possible stage groups. This codelist will be used in the following fields:
Will update the codelist used in clinical_staging_system, pathological_staging_system and recurrence_staging_system to make sure they contain all the staging systems recommended by AJCC.
Will also add a script validation check (similar to the one we use for tumour grading) to check the stage group value against the staging system specified.
Work in progress: https://docs.google.com/spreadsheets/d/1RCUqJ2DeK5vynGrYzhLOu-UAs9Kvxx_4L_5sWo_TtHc/edit?usp=sharing
A few projects have asked what the primary diagnosis should be since they use different methods to determine the diagnosis. In reality, the primary diagnosis can be made on the basis of different methods and each project/cancer type may use different ways to determine the diagnosis (ie. imaging or biopsy or clinical only etc.).
Propose adding a new extended (optional) field called "basis_of_diagnosis" in the Primary Diagnosis table which will let data submitters specify the method of primary diagnosis. This field will consist of a controlled vocabulary recommended by the IARC and WHO:
Both WHO (see https://apps.who.int/iris/bitstream/handle/10665/96612/9789241548496_eng.pdf page 27, Table 26) and the ENCR (European Network of Cancer Registries) recommend this codelist (see here). Only the most valid basis of diagnosis is required (i.e. only one value needs to be submitted for this clinical field). The "Notes" section will contain details and a link to WHO/IARC for reference.
Currently, the data submitter is expected to submit clinical data for the tumour-specific fields when tumour_normal_designation is Normal, which is incorrect.
The following tumour-related fields should only need to be completed if tumour_normal_designation is Tumour. If tumour_normal_designation is Normal, then these fields should not be filled in:
pathological_tumour_staging_system
pathological_T_category
pathological_N_category
pathological_M_category
pathological_stage_group
tumour_grading_system
tumour_grade
percent_tumour_cells
percent_proliferating_cells
percent_stromal_cells
percent_necrosis
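A minimal sketch of this rule as a validation function, assuming the specimen row has already been joined with `tumour_normal_designation` from the sample registration file. The function name `checkTumourOnlyFields` and the `{ valid, message }` return shape are hypothetical and are not taken from the dictionary's actual script API.

```javascript
// Tumour-related specimen fields listed in the issue above.
const TUMOUR_ONLY_FIELDS = [
  'pathological_tumour_staging_system',
  'pathological_T_category',
  'pathological_N_category',
  'pathological_M_category',
  'pathological_stage_group',
  'tumour_grading_system',
  'tumour_grade',
  'percent_tumour_cells',
  'percent_proliferating_cells',
  'percent_stromal_cells',
  'percent_necrosis',
];

// Treat undefined, null, and blank strings as "not filled in".
function isFilled(value) {
  return value !== undefined && value !== null && String(value).trim() !== '';
}

// Hypothetical validator: `row` is assumed to be a plain object keyed by field name.
function checkTumourOnlyFields(row) {
  const designation = String(row.tumour_normal_designation || '').trim().toLowerCase();
  if (designation !== 'normal') {
    return { valid: true, message: 'Ok' };
  }
  const filled = TUMOUR_ONLY_FIELDS.filter((name) => isFilled(row[name]));
  if (filled.length > 0) {
    return {
      valid: false,
      message: `These fields must be empty when tumour_normal_designation is Normal: ${filled.join(', ')}`,
    };
  }
  return { valid: true, message: 'Ok' };
}
```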
Beta-testers from the PACA-CA project were confused as to how to submit the disease_status_at_followup since apparently the term "relapse" is not used in pancreatic cancer (they use recurrence).
Based on discussions with the working group, it was confirmed that the terms "relapse" and "recurrence" are actually synonymous, so it's been proposed we allow both terms.
disease_status_at_followup codelist.
Some fields in the specimen file depend on whether tumour_normal_designation = tumour (in the sample-registration file). This is not reflected in the dictionary labels.
The following tumour-related fields should only need to be completed if tumour_normal_designation is Tumour. These should have dependency tags.
pathological_tumour_staging_system
pathological_T_category
pathological_N_category
pathological_M_category
pathological_stage_group
tumour_grading_system
tumour_grade
percent_tumour_cells
percent_proliferating_cells
percent_stromal_cells
percent_necrosis
@hknahal these pathological_T_category, pathological_N_category and pathological_M_category are already dependent on pathological_tumour_staging_system, so not sure if those need to be dependent on tumour_normal_designation too.
The error message from validation scripts doesn't show up on the UI.
I uploaded a sample registration file, with invalid field data, on Platform UI (Dev).
Specimen Type: 'Cell line - derived from normal'
Tumour_Normal_Designation: 'Tumour'
The error outputted should have been:
"Invalid specimen_type. Specimen_type cannot be set to normal type value (Normal or Cell line - derived from normal) when tumour_normal_designation is set to Tumour."
Instead, the error was the following:
The value is not permissible for this field (see image below).
As discussed with working group:
In our dictionary, we use the term “Monoclonal antibodies (for liquid tumours)”. However, monoclonal antibodies are not always associated with liquid tumours. For example, in HER2+ breast cancer, patients get treated with monoclonal antibodies (sometimes called immunotherapy). Proposed the following changes:
Instead of the two current categories “Immunotherapy” and “Monoclonal antibodies” use:
Immunotherapy, monoclonal antibodies other than immune checkpoint inhibitors
Immunotherapy, immune checkpoint inhibitors
Immunotherapy, cell-based
Immunotherapy, other immunomodulatory substances
Remove Monoclonal antibodies (for liquid tumours) from the treatment_type codelist and leave just Immunotherapy
Will most likely add a few more fields to this table after gathering more information from working group.
sampleSubmitterId --> submitter_sample_id
specimenClass, specimenType: these should align with the registration field names tumor_normal_designation and specimen_tissue_source
"data_type": "sequencing_file",
"submitter_read_group_id": "C0HVY.2"
"file" entity
male, female, unspecified should be Male/Female/Other
study_id and a field program_id. These are redundant; we don't need both. We should either rename study --> program OR just explain to put the program id in study.
"matched_normal_submitter_sample_id" that is only filled in if the sample is a tumor_type.
I cannot submit decimals for drug dosages:
these values should allow decimals
change from type integer to number
dictionary locations
argo-dictionary/schemas/chemotherapy.json
Line 81 in 69f864d
Add two optional fields in Treatment table for clinical trial information
clinical_trials_database: Will consist of a codelist which will contain (will add more if requested):
"NCI Clinical Trials"
"EU Clinical Trials Register"
clinical_trial_number: This will have a script validation that uses a regex to ensure the format of the clinical trial number matches the selected clinical_trials_database.
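A minimal sketch of that regex check. The number formats are assumptions, not taken from the issue: "NCI Clinical Trials" is assumed to use ClinicalTrials.gov-style identifiers (`NCT` followed by eight digits) and "EU Clinical Trials Register" is assumed to use EudraCT numbers (`YYYY-NNNNNN-CC`). The function name and return shape are hypothetical.

```javascript
// Assumed identifier formats per database; adjust once the codelist is finalized.
const TRIAL_NUMBER_PATTERNS = new Map([
  ['NCI Clinical Trials', /^NCT\d{8}$/],
  ['EU Clinical Trials Register', /^\d{4}-\d{6}-\d{2}$/],
]);

// Hypothetical validator: `row` is assumed to be a plain object keyed by field name.
function checkClinicalTrialNumber(row) {
  const number = String(row.clinical_trial_number || '').trim();
  // Both fields are optional, so an empty trial number is accepted.
  if (number === '') {
    return { valid: true, message: 'Ok' };
  }
  const pattern = TRIAL_NUMBER_PATTERNS.get(row.clinical_trials_database);
  if (!pattern) {
    return {
      valid: false,
      message: 'clinical_trials_database must be set to a known database when clinical_trial_number is provided.',
    };
  }
  if (!pattern.test(number)) {
    return {
      valid: false,
      message: `clinical_trial_number does not match the format expected for ${row.clinical_trials_database}.`,
    };
  }
  return { valid: true, message: 'Ok' };
}
```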
After discussions with working group, decided we should accept non-malignant neoplasms as well. Comment from working group:
"I see no problem in listing non-malignant neoplasms as well. For some longitudinal cancer studies an early diagnosis of a precursor lesion and its genetic information might be highly relevant."
Will update regex for following fields:
cancer_type_code
cancer_type_prior_malignancy
In many files, I shouldn't be able to put in T, N, M values when staging system is NOT any edition of the AJCC cancer staging system.
This occurs in the following files:
specimen file: pathological_tumour_staging_system
Primary Diagnosis file: clinical_tumour_staging_system
Follow up file: recurrence_tumour_staging_system and posttherapy_tumour_staging_system
The T, N, M fields should be filled out only when the staging system field in these files is set to an edition of AJCC.
For example:
Steps to reproduce the behaviour:
I should get an error that the T, N, M values shouldn't be filled in if the staging system field is not any edition of AJCC
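One way to express this check once and reuse it across the specimen, primary diagnosis and follow up files is sketched below. It assumes every AJCC codelist value contains the text "AJCC"; the function name, parameters and return shape are hypothetical.

```javascript
// Treat undefined, null, and blank strings as "not filled in".
function isFilled(value) {
  return value !== undefined && value !== null && String(value).trim() !== '';
}

// Assumption: any edition of AJCC is recognisable by the substring "AJCC".
// `stagingSystemField` and `tnmFields` name the fields to inspect in `row`.
function checkTnmRequiresAjcc(row, stagingSystemField, tnmFields) {
  const system = String(row[stagingSystemField] || '');
  if (/ajcc/i.test(system)) {
    return { valid: true, message: 'Ok' };
  }
  const filled = tnmFields.filter((name) => isFilled(row[name]));
  if (filled.length === 0) {
    return { valid: true, message: 'Ok' };
  }
  return {
    valid: false,
    message: `${filled.join(', ')} should be empty because ${stagingSystemField} is not an edition of AJCC.`,
  };
}

// Usage with the specimen file's field names ("Binet" is only an illustrative value).
const specimenResult = checkTnmRequiresAjcc(
  { pathological_tumour_staging_system: 'Binet', pathological_T_category: 'T1' },
  'pathological_tumour_staging_system',
  ['pathological_T_category', 'pathological_N_category', 'pathological_M_category']
);
console.log(specimenResult.valid); // false
```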
Currently all scripts listed in the dictionary are placeholders. We need to update these to include real script based validation.
An example script is here: https://wiki.oicr.on.ca/display/icgcargotech/Validation+Script+Template
There is one field that will use the same script across multiple files. This is the clinical staging system and its associated fields.
(https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/donor.json), please implement the following script based validations in the dictionary:
staging_system field, when any AJCC value is selected, then the t_category, n_category, m_category, stage_group, and stage_suffix fields are required to be filled in.
staging_system groupings contain stage_group & clinical_stage_suffix fields.
follow_up, primary_diagnosis, and specimen schemas
Post-therapy tumour staging system should not be dependent on whether the disease status is relapse or progression.
We can remove the following script validations for posttherapy_tumour_staging_system:
https://github.com/icgc-argo/argo-dictionary/blob/master/references/validationFunctions/follow_up/posttherapy_tumour_staging_system.js#L2
https://github.com/icgc-argo/argo-dictionary/blob/master/references/validationFunctions/follow_up/posttherapy_tumour_staging_system.js#L3
The regular expression for tumour_histological_type is incorrect. For example: 8140/3 will fail validation.
I need to fix this regex at: https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/specimen.json#L141
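As an illustration only, a pattern along these lines would accept `8140/3`. It assumes the field holds an ICD-O-3 morphology code: four digits starting with 8 or 9, a slash, and a one-digit behaviour code (0, 1, 2, 3, 6 or 9). This is a sketch, not the regex currently in specimen.json, and the names below are hypothetical.

```javascript
// Assumed shape of an ICD-O-3 morphology code, e.g. 8140/3.
const ICD_O_3_MORPHOLOGY = /^[89]\d{3}\/[012369]$/;

// Hypothetical helper that checks a single submitted value.
function isValidHistologicalType(code) {
  return ICD_O_3_MORPHOLOGY.test(String(code).trim());
}

console.log(isValidHistologicalType('8140/3')); // true
console.log(isValidHistologicalType('8140'));   // false
```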
Beta testers are confused as to where they find their program_id.
For each program_id entry in the dictionary, please add the following note:
This is the unique id that is assigned to your program. When you are logged in, the program_id can be found in the top title bar of your Submission Dashboard. PACA-AU is an example of a program_id.
Currently all scripts listed in the dictionary are placeholders. We need to update these to include real script based validation.
An example script is here: https://wiki.oicr.on.ca/display/icgcargotech/Validation+Script+Template
For the sample_registration file (https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/sample_registration.json ), please implement the following script based validations in the dictionary:
Add script validation to specimen_type. The logic should be:
If tumour_normal_designation = Normal, then specimen_type can be equal to one of: Normal, Normal - tissue adjacent to primary tumour, Cell line - derived from normal.
Otherwise, tumour_normal_designation = Tumour. If tumour_normal_designation = Tumour, then specimen_type can be equal to any value EXCEPT: Normal, Cell line - derived from normal.
Add the correct tags to the dictionary schema so they will appear in the Dictionary viewer. A guide for tags can be found here: https://wiki.oicr.on.ca/pages/viewpage.action?pageId=134938807
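The specimen_type logic described above can be sketched as follows. The function name and `{ valid, message }` return shape are hypothetical, the field is spelled `tumour_normal_designation` here as an assumption, and the error text for the Tumour case reuses the message quoted in the UI bug report earlier on this page.

```javascript
// Values allowed when tumour_normal_designation is Normal.
const NORMAL_SPECIMEN_TYPES = [
  'Normal',
  'Normal - tissue adjacent to primary tumour',
  'Cell line - derived from normal',
];

// Values that are not allowed when tumour_normal_designation is Tumour.
const NOT_ALLOWED_FOR_TUMOUR = ['Normal', 'Cell line - derived from normal'];

// Hypothetical validator: `row` is assumed to be a plain object keyed by field name.
function checkSpecimenType(row) {
  const designation = row.tumour_normal_designation;
  const specimenType = row.specimen_type;

  if (designation === 'Normal' && !NORMAL_SPECIMEN_TYPES.includes(specimenType)) {
    return {
      valid: false,
      message: `Invalid specimen_type. When tumour_normal_designation is set to Normal, specimen_type must be one of: ${NORMAL_SPECIMEN_TYPES.join('; ')}.`,
    };
  }
  if (designation === 'Tumour' && NOT_ALLOWED_FOR_TUMOUR.includes(specimenType)) {
    return {
      valid: false,
      message:
        'Invalid specimen_type. Specimen_type cannot be set to normal type value (Normal or Cell line - derived from normal) when tumour_normal_designation is set to Tumour.',
    };
  }
  return { valid: true, message: 'Ok' };
}
```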
The existing codelist for Gleason values (ie. when tumour_grading_system is gleason) contains dashes (ie. gleason 2–6). When a data submitter copies and pastes a gleason value from the browser to an excel file, the dash renders as a special character.
Consider replacing existing codelist with gleason grade groups which don't contain dashes. Can provide mapping information between grade groups and scores in "Notes" section:
We need to be able to answer the questions like this:
Which tumour was this treatment related to?
Which sample was associated with this primary diagnosis?
Which diagnosis was this follow_up related to?
submitter_primary_diagnosis_id on primary_diagnosis file
submitter_primary_diagnosis_id on specimen file
submitter_primary_diagnosis_id on treatment file
submitter_primary_diagnosis_id on follow_up file
submitter_treatment_id on follow_up file (optional)
For this dictionary change we will have some dependencies on clinical:
Create a system within argo dictionary to run unit tests on the validation functions.
Before npm run compile can proceed, all unit tests should pass first.
Look into different testing frameworks.
--
Many of the codelists used in the dictionary are from external standards (ie. NCI, ICD-10, ICD-O-3, RECIST etc.) Include a reference to these external standards so users know where the codelist was derived from.
There are currently values in the dictionary that contain commas, which will conflict with the new array of values feature.
Remove commas for terms. Affected controlled vocabulary lists:
https://github.com/icgc-argo/argo-dictionary/blob/master/references/list.json#L243
https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/specimen.json#L178-L182
https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/specimen.json#L199-L201
We currently ask for the patient's "weight" in the Donor level. This would be baseline weight, but it would also be good to have the patient's weight at followup since it can change:
Use case from data submitter for PACA-CA project:
Email from Rajshree:
The weight of patients may vary substantially from diagnosis till resection. Neoadjuvant chemotherapy, if given prior to surgery, can result in significant appetite changes; biliary obstruction due to a pancreatic mass can alter eating habits as can cause nausea and vomiting. So the weight can drop significantly and is monitored at each clinic visit
Will add a new optional field called "weight_at_followup" in the Followup table
When a user makes updates to data they have already submitted, and that data has dependencies, check that the dependent fields are filled out correctly.
For example:
The submitter should be prompted to remove the dependent values too.
From Christina:
We should check that the dependent fields are left empty in this case during initial submission and subsequent update. And give error if they are filled out in the wrong context
\b should be escaped as \\b in the regex (see submitter_id for an example), because in a JS string \b is the backspace character.
Beta feedback: would be easier to read the values if they were in alphabetic order.
Alphabetize permissible values for all value lists in the dictionary
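For the `\b` escaping issue noted above, a small illustration of why the backslash must be doubled when a regex is stored as a string (as it is in a JSON schema). The pattern itself is made up for the example and is not one of the dictionary's regexes.

```javascript
// In a string literal, "\b" is the backspace character (U+0008), not a word boundary.
const wrong = new RegExp('\bDO\b');   // pattern contains two backspace characters
const right = new RegExp('\\bDO\\b'); // pattern contains the word-boundary token \b

console.log('\b'.charCodeAt(0));   // 8
console.log(wrong.test('DO 123')); // false
console.log(right.test('DO 123')); // true
```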
We need a codelist for all the T, N and M values when the cancer staging system is AJCC. Some cancer types only allow certain values for T. For example, breast cancer allows Tis but other cancer types do not.
Currently all scripts listed in the dictionary are placeholders. We need to update these to include real script based validation.
An example script is here: https://wiki.oicr.on.ca/display/icgcargotech/Validation+Script+Template
For the donor file (https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/donor.json), please implement the following script based validations in the dictionary:
1) Add script validation to cause_of_death
This field should be filled in if vital_status=deceased
2) Add the correct tags to the dictionary schema so they will appear in the Dictionary viewer. A guide for tags can be found here: https://wiki.oicr.on.ca/pages/viewpage.action?pageId=134938807
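A minimal sketch of the cause_of_death rule in item 1. The function name and return shape are hypothetical, and the comparison is made case-insensitive as an assumption about how vital_status values are spelled.

```javascript
// Hypothetical validator: `row` is assumed to be a plain object keyed by field name.
function checkCauseOfDeath(row) {
  const vitalStatus = String(row.vital_status || '').trim().toLowerCase();
  const causeOfDeath = String(row.cause_of_death || '').trim();

  // cause_of_death is required only when the donor is deceased.
  if (vitalStatus === 'deceased' && causeOfDeath === '') {
    return {
      valid: false,
      message: 'cause_of_death must be provided when vital_status is deceased.',
    };
  }
  return { valid: true, message: 'Ok' };
}
```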
Currently all scripts listed in the dictionary are placeholders. We need to update these to include real script based validation. An example script is here: https://wiki.oicr.on.ca/display/icgcargotech/Validation+Script+Template
For the follow_up schema (https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/follow_up.json), please implement the following script based validations in the dictionary:
relapse_interval: if disease_status_at_followup = relapse, then this field must have a value.
method_of_progression_status: if disease_status_at_followup = progression or relapse, then this field must have a value.
anatomic_site_progression_or_recurrences: if disease_status_at_followup = progression or relapse, then this field must have a value.
recurrence_tumour_staging_system: if disease_status_at_followup = progression or relapse, then this field must have a value.
posttherapy_tumour_staging_system: if disease_status_at_followup = progression or relapse, then this field can have a value, otherwise it should not have a value.
The current regexes for submitter_donor_id, submitter_specimen_id and submitter_sample_id do not prevent a data submitter from submitting an identifier that has the same prefix as ARGO reserved prefixes (i.e. DO*, SP* and SA*).
In ICGC, we had a script validation to prevent data submitters from using ICGC reserved prefixes. We can do the same, or modify the existing regex.
Exclude following prefixes:
submitter_donor_id: exclude DO prefix
submitter_specimen_id: exclude SP prefix
submitter_sample_id: exclude SA prefix
submitter_treatment_id: exclude TR prefix
submitter_primary_diagnosis_id: exclude PD prefix
submitter_follow_up_id: exclude FU prefix
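The exclusions above could be checked with a script validation along these lines. This is a sketch under assumptions: the match is case-sensitive, and the function name and return shape are hypothetical. The existing regexes are not shown in the issue, so they are not modified here.

```javascript
// Reserved prefix for each submitter identifier field, as listed above.
const RESERVED_PREFIXES = new Map([
  ['submitter_donor_id', 'DO'],
  ['submitter_specimen_id', 'SP'],
  ['submitter_sample_id', 'SA'],
  ['submitter_treatment_id', 'TR'],
  ['submitter_primary_diagnosis_id', 'PD'],
  ['submitter_follow_up_id', 'FU'],
]);

// Hypothetical validator for a single field value.
function checkReservedPrefix(fieldName, value) {
  const prefix = RESERVED_PREFIXES.get(fieldName);
  if (prefix !== undefined && String(value).trim().startsWith(prefix)) {
    return {
      valid: false,
      message: `${fieldName} cannot start with the reserved prefix ${prefix}.`,
    };
  }
  return { valid: true, message: 'Ok' };
}
```

If modifying the regex is preferred over a script, the same rule can be expressed with a negative lookahead at the start of the pattern, for example `^(?!DO)` for submitter_donor_id.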
We should use the drug names from DrugBank for chemotherapy_drug_name, but this will obviously make the codelist really long. Can we store the DrugBank information in a database and then have a script validation here to cross-check the submitted drug name?
https://wiki.oicr.on.ca/pages/viewpage.action?spaceKey=icgcargotech&title=Drug+Name+Specification
The codelist for "tumour_grade" when tumour_grading_system is "lymphoid neoplasms" looks like this:
low grade or indolent nhl
high grade or aggressive nhl
A data submitter may confuse the "or" as an option to submit either "low grade" or "indolent nhl". But they need to submit the whole phrase (ie. "low grade or indolent nhl"). If they submit just "low grade", the validation fails like this:
Format the newline response to show this:
- `low grade or indolent nhl`
- `high grade or aggressive nhl`
Currently all scripts listed in the dictionary are placeholders. We need to update these to include real script based validation. An example script is here: https://wiki.oicr.on.ca/display/icgcargotech/Validation+Script+Template
For the specimen schema (https://github.com/icgc-argo/argo-dictionary/blob/master/schemas/specimen.json), please implement the following script based validations in the dictionary:
tumour_grade: select a value based on tumour_grading_system according to this matrix:
Default | Gleason | Nottingham | Brain cancer | ISUP for renal cell carcinoma | Lymphoid neoplasms |
---|---|---|---|---|---|
Gx - cannot be assessed | Gleason X: Gleason score cannot be determined | G1 (Low grade or well differentiated) | Grade I | Grade 1: Tumor cell nucleoli invisible or small and basophilic at 400 x magnification | Low grade or indolent NHL |
G1 well differentiated/low grade | Gleason 2–6: The tumor tissue is well differentiated | G2 (Intermediate grade or moderately differentiated) | Grade II | Grade 2: Tumor cell nucleoli conspicuous at 400 x magnification but inconspicuous at 100 x magnification | High grade or aggressive NHL |
G2 moderately differentiated/intermediate grade | Gleason 7: The tumor tissue is moderately differentiated | G3 (High grade or poorly differentiated) | Grade III | Grade 3: Tumor cell nucleoli eosinophilic and clearly visible at 100 x magnification | |
G3 poorly differentiated/high grade | Gleason 8–10: The tumor tissue is poorly differentiated or undifferentiated | | Grade IV | Grade 4: Tumors showing extreme nuclear pleomorphism and/or containing tumor giant cells and/or the presence of any proportion of tumor showing sarcomatoid and/or rhabdoid dedifferentiation | |
G4 undifferentiated/high grade | | | | | |
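A sketch of how the matrix could drive a script validation. Only two of the six grading systems are included to keep the example short; the remaining columns would be added the same way. The function name, return shape and exact casing of the values are assumptions.

```javascript
// Subset of the matrix above: permissible tumour_grade values per grading system.
const GRADES_BY_SYSTEM = new Map([
  ['Brain cancer', ['Grade I', 'Grade II', 'Grade III', 'Grade IV']],
  ['Lymphoid neoplasms', ['Low grade or indolent NHL', 'High grade or aggressive NHL']],
]);

// Hypothetical validator: `row` is assumed to be a plain object keyed by field name.
function checkTumourGrade(row) {
  const allowed = GRADES_BY_SYSTEM.get(row.tumour_grading_system);
  if (!allowed) {
    return {
      valid: false,
      message: `No codelist is registered for tumour_grading_system "${row.tumour_grading_system}".`,
    };
  }
  if (!allowed.includes(row.tumour_grade)) {
    return {
      valid: false,
      message: `tumour_grade must be one of: ${allowed.join('; ')}`,
    };
  }
  return { valid: true, message: 'Ok' };
}
```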
There are programs with multiple primary sites or "pancancer" programs.
Thus, we need to identify primary_site at the donor level so that each donor can be associated with the correct primary site.
GDC Reference:
https://docs.gdc.cancer.gov/Data_Dictionary/viewer/#?view=table-definition-view&id=case&anchor=primary_site
{
"name": "primary_site",
"valueType": "string",
"description": "The text term used to describe the primary site of disease, as categorized by the World Health Organization's (WHO) International Classification of Diseases for Oncology (ICD-O). This categorization groups cases into general categories.",
"meta": {
"displayName": "Primary Site",
"core": true
},
"restrictions": {
"required": true,
"enum": [
"Accessory sinuses",
"Adrenal gland",
"Anus and anal canal",
"Base of tongue",
"Bladder",
"Bones, joints and articular cartilage of limbs",
"Bones, joints and articular cartilage of other and unspecified sites",
"Brain",
"Breast",
"Bronchus and lung",
"Cervix uteri",
"Colon",
"Connective, subcutaneous and other soft tissues",
"Corpus uteri",
"Esophagus",
"Eye and adnexa",
"Floor of mouth",
"Gallbladder",
"Gum",
"Heart, mediastinum, and pleura",
"Hematopoietic and reticuloendothelial systems",
"Hypopharynx",
"Kidney",
"Larynx",
"Lip",
"Liver and intrahepatic bile ducts",
"Lymph nodes",
"Meninges",
"Nasal cavity and middle ear",
"Nasopharynx",
"Oropharynx",
"Other and ill-defined digestive organs",
"Other and ill-defined sites",
"Other and ill-defined sites in lip, oral cavity and pharynx",
"Other and ill-defined sites within respiratory system and intrathoracic organs",
"Other and unspecified female genital organs",
"Other and unspecified major salivary glands",
"Other and unspecified male genital organs",
"Other and unspecified parts of biliary tract",
"Other and unspecified parts of mouth",
"Other and unspecified parts of tongue",
"Other and unspecified urinary organs",
"Other endocrine glands and related structures",
"Ovary",
"Palate",
"Pancreas",
"Parotid gland",
"Penis",
"Peripheral nerves and autonomic nervous system",
"Placenta",
"Prostate gland",
"Pyriform sinus",
"Rectosigmoid junction",
"Rectum",
"Renal pelvis",
"Retroperitoneum and peritoneum",
"Skin",
"Small intestine",
"Spinal cord, cranial nerves, and other parts of central nervous system",
"Stomach",
"Testis",
"Thymus",
"Thyroid gland",
"Tonsil",
"Trachea",
"Ureter",
"Uterus, NOS",
"Vagina",
"Vulva",
"Unknown",
"Not Reported"
]
}
}
Based on user feedback, there are two clinical fields that will require re-naming. The current clinical field names are confusing some data submitters. After discussions with working group, it was agreed to update the following:
central_pathology_confirmed -> reference_pathology_confirmed. Will also update description so it is more clear.
application_form -> radiation_therapy_type
Dictionary repo is only able to reference one validation function per field
It's currently not possible to have a field validated in two ways.
Need to make a system where any number of validation functions can be referenced, and they get combined, and return one comprehensive result.
-reference to a script whose sole purpose is to combine n number of functions
-import the required functions in the field
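The first option (a script whose sole purpose is to combine n functions) might look like the sketch below. It assumes each validation function takes a row and returns `{ valid, message }`; that shape and the name `combineValidators` are hypothetical, not the repo's existing interface.

```javascript
// Combine any number of validators into one that reports every failure.
function combineValidators(...validators) {
  return function combined(row) {
    const failures = validators
      .map((validate) => validate(row))
      .filter((result) => !result.valid);

    if (failures.length === 0) {
      return { valid: true, message: 'Ok' };
    }
    // Join all failure messages so the submitter sees one comprehensive result.
    return {
      valid: false,
      message: failures.map((result) => result.message).join(' '),
    };
  };
}

// Usage with two toy validators (illustrative only).
const requireA = (row) => (row.a ? { valid: true, message: 'Ok' } : { valid: false, message: 'a is required.' });
const requireB = (row) => (row.b ? { valid: true, message: 'Ok' } : { valid: false, message: 'b is required.' });
const validateBoth = combineValidators(requireA, requireB);

console.log(validateBoth({ a: 1 }).message); // "b is required."
```

Running every function before returning, rather than stopping at the first failure, keeps the combined message complete at the cost of a little extra work per row.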
I was confused at what the last two errors meant
Suggestion to change:
[specimen_acquisition_interval] requires [donor.vital_status], [donor.survival_time] in order to complete validation. Please upload data for all fields in this clinical data submission.
to
specimen_acquisition_interval requires vital_status for this donor in order to complete validation. Please upload donor data for this specimen using the donor.tsv file.
Wondering whether we should say "and survival_time" in this error, since donors that are alive don't require survival time.
If so, we can write:
specimen_acquisition_interval requires vital_status and survival_time (for deceased donors) in order to complete validation. Please upload donor data for this specimen using the donor.tsv file.