icgc-argo / argo-metadata-schemas
Repo to host ARGO metadata schemas defined using JSON Schema
License: GNU Affero General Public License v3.0
We need a mapping for the Elasticsearch indices that will be used by the File Repository. Currently, the index only contains information directly from Maestro, which means only data found in Song.
We need to supplement this information with 1) clinical data and 2) release information.
Start from Maestro mapping as an example (https://github.com/icgc-argo/argo-metadata-schemas/blob/master/mappings/maestro_file_centric.json).
Format the index to include this data:
embargo_stage
file-level field at the top level
file_id
labels
object with string keys and array values. Example:
{
  "labels": {
    [key: string]: string[];
  }
}
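In the ES mapping, labels with arbitrary string keys could be handled with a dynamic template so every sub-field is indexed as a keyword. A minimal sketch, with the keyword types for embargo_stage and file_id assumed (the real mapping should follow the Maestro mapping referenced above):

```json
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels_as_keywords": {
          "path_match": "labels.*",
          "mapping": { "type": "keyword" }
        }
      }
    ],
    "properties": {
      "embargo_stage": { "type": "keyword" },
      "file_id": { "type": "keyword" },
      "labels": { "type": "object", "dynamic": true }
    }
  }
}
```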
curl --location --request GET 'https://song.rdpc-qa.cancercollaboratory.org/studies/TEST-PR/analysis/328ebe1e-28a1-4670-8ebe-1e28a16670d9' \
--header 'Authorization: Bearer 58bb9610-2184-48b2-a189-34016622c0c4'
The dynamic portion of SONG schemas has been changed since the last time ES mappings were created.
Here are the changes: 0.1.0...0.3.0 (look at each individual schema file for its specific changes)
Summary of what needs to be synced:
library_strategy changed to experiment_strategy
variant_class should be a root property for file_centric mappings. Populate with null if SONG metadata does not have it.
New fields under file.info that cannot be defined in the dynamic SONG schema. See example payloads: here and here
file.info.data_category to be added to file_centric mappings as data_category
file.info.analysis_tools to be added to file_centric mappings as analysis_tools
data_category and analysis_tools under file.info are meant to be at the same level as file.dataType in SONG, but since we cannot change SONG's base schema, we added them under info. So in terms of the ES mapping, data_category and analysis_tools should be added wherever the data_type field goes.
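Since Elasticsearch needs no special mapping type for arrays of strings, the two new fields can sit beside data_type as plain keywords. A sketch of the relevant mapping fragment, with the keyword type assumed to match data_type:

```json
{
  "data_type": { "type": "keyword" },
  "data_category": { "type": "keyword" },
  "analysis_tools": { "type": "keyword" }
}
```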
Update all SONG schemas for ARGO to have session_id in addition to run_id.
Currently, input_files and inputs are both at the top level for data bundles; it is much more natural and easier to handle if input_files is moved under inputs.
We need some high level user stories for how the data curation person or team will create, update, and manage the data dictionaries and their versions.
We can scope the requirements to only be related to the backend service at a high level.
A typo of allof instead of allOf results in all experiment info being unvalidated.
Get the schema and an example payload. The payload currently validates silently.
The current payload has Illumina when the value should be ILLUMINA.
Looking at the schema, note allof:
Correcting the typo results in proper enforcement of the schema.
The schema should catch invalid enum values.
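JSON Schema validators ignore unknown keywords, so a misspelled allof is treated as a harmless annotation and none of its subschemas are enforced; that is why Illumina passed. With the keyword corrected, the enum is applied. A minimal sketch (the enum values here are illustrative, not the full ARGO list):

```json
{
  "properties": {
    "experiment": {
      "type": "object",
      "allOf": [
        {
          "properties": {
            "platform": { "enum": ["ILLUMINA", "PACBIO", "NANOPORE"] }
          }
        }
      ]
    }
  }
}
```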
When the SONG payload is validated against the current schema, and "sequencing_date" is null, the validation fails producing the following error:
In SONG:
[SubmitService::schema.violation] - #/experiment/sequencing_date: #: 2 subschemas matched instead of one
This appears to be related to this section of the schema where "sequencing_date" is expected to match only one of the formats.
"sequencing_date": {
"type": "string",
"oneOf": [
{
"format": "date"
},
{
"format": "date-time"
}
],
"example": [
"2019-06-16",
"2019-06-16T20:20:39+00:00"
]
},
Changing "oneOf" to "anyOf" may resolve the issue: the format keyword only constrains string values, so when "sequencing_date" is null both subschemas match vacuously and "oneOf" fails, while "anyOf" only requires at least one match.
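A sketch of the corrected property, assuming null should also be accepted explicitly via a type union (note that whether format is actually asserted at all depends on the validator's configuration):

```json
"sequencing_date": {
  "type": ["string", "null"],
  "anyOf": [
    { "format": "date" },
    { "format": "date-time" }
  ],
  "example": [
    "2019-06-16",
    "2019-06-16T20:20:39+00:00"
  ]
}
```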
SONG Payload used in this example:
{
"studyId": "PACA-CA",
"analysisType": {
"name": "sequencing_experiment"
},
"samples": [
{
"submitterSampleId": "PCSI_0216_Pa_P_526",
"matchedNormalSubmitterSampleId": "PCSI_0216_St_R",
"sampleType": "Total DNA",
"specimen": {
"submitterSpecimenId": "PCSI_0216_Pa_P_526",
"specimenType": "Primary tumour",
"tumourNormalDesignation": "Tumour",
"specimenTissueSource": "Other"
},
"donor": {
"submitterDonorId": "PCSI_0216",
"gender": "Male"
}
}
],
"files": [
{
"dataType": "submittedReads",
"fileName": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"fileSize": 109017064104,
"fileType": "BAM",
"fileMd5sum": "b24f248925d1babae355001e1d0200f0",
"fileAccess": "controlled"
}
],
"read_groups": [
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_595_WG",
"platform_unit": "150106_D00353_0088_BC61P0ANXX_1_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150106_D00353_0088_BC61P0ANXX_1_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_607_WG",
"platform_unit": "150106_D00353_0088_BC61P0ANXX_2_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150106_D00353_0088_BC61P0ANXX_2_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_595_WG",
"platform_unit": "150115_D00331_0120_AC5U99ANXX_5_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150115_D00331_0120_AC5U99ANXX_5_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_607_WG",
"platform_unit": "150115_D00331_0120_AC5U99ANXX_6_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150115_D00331_0120_AC5U99ANXX_6_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_646_WG",
"platform_unit": "150327_D00355_0082_BC6JNLANXX_1_AGTTCC",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150327_D00355_0082_BC6JNLANXX_1_AGTTCC"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_646_WG",
"platform_unit": "150408_D00355_0084_BC6DB1ANXX_1_AGTTCC",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150408_D00355_0084_BC6DB1ANXX_1_AGTTCC"
}
],
"experiment": {
"submitter_sequencing_experiment_id": "TEST-EXP-132",
"library_strategy": "WGS",
"sequencing_center": "",
"platform": "ILLUMINA",
"platform_model": null,
"sequencing_date": null
},
"read_group_count": 6
}
This will allow earlier payloads which are missing session_id
information to be valid against the latest schemas.
We need some control over what characters are allowed in read group IDs; otherwise, odd characters (like & and /) may cause trouble in downstream data processing.
Add a pattern to the submitter_read_group_id field in the relevant JSON schema: [a-zA-Z0-9_\:\.\-]
Note that other characters may be added to the above pattern later if deemed necessary.
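The whitelist can be checked quickly with a regex. This sketch anchors the character class and adds a + quantifier, both of which are assumptions; the issue only specifies the class itself:

```python
import re

# Proposed character whitelist for submitter_read_group_id.
# Anchoring (^...$) and the + quantifier are assumptions; the issue
# only gives the character class [a-zA-Z0-9_\:\.\-].
READ_GROUP_ID_PATTERN = re.compile(r"^[a-zA-Z0-9_\:\.\-]+$")

def is_valid_read_group_id(rg_id: str) -> bool:
    """Return True if the ID contains only whitelisted characters."""
    return READ_GROUP_ID_PATTERN.fullmatch(rg_id) is not None

# IDs from the example payload pass:
assert is_valid_read_group_id("150106_D00353_0088_BC61P0ANXX_1_NoIndex")
# Characters like '&' or '/' are rejected:
assert not is_valid_read_group_id("rg&1")
assert not is_valid_read_group_id("lane/1")
```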
The current qc_metrics schema has some issues when used by sanger_qc:
short_name is a valid enum value but not required
sequencing_alignment is a valid enum value
platform is optional, not required
Add the following conditional fields:
Field | Attribute | Description | Permissible value |
---|---|---|---|
ega_file_id | Conditional Required | EGA File Unique Accession ID | ^EGAF[0-9]{1,32}$ |
ega_dataset_id | Optional | EGA Dataset Accession ID | ^EGAD[0-9]{1,32}$ |
ega_experiment_id | Optional | EGA Experiment ID | ^EGAX[0-9]{1,32}$ |
ega_sample_id | Optional | EGA Sample Accession ID | ^EGAN[0-9]{1,32}$ |
ega_study_id | Optional | EGA Study Accession ID | ^EGAS[0-9]{1,32}$ |
ega_run_id | Optional | EGA Run Accession ID | ^EGAR[0-9]{1,32}$ |
ega_policy_id | Optional | EGA Policy Accession ID | ^EGAP[0-9]{1,32}$ |
ega_analysis_id | Optional | EGA Analysis Accession ID | ^EGAZ[0-9]{1,32}$ |
ega_submission_id | Optional | EGA Submission ID | ^EGAB[0-9]{1,32}$ |
ega_dac_id | Optional | EGA DAC Accession ID | ^EGAC[0-9]{1,32}$ |
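These accessions can be validated directly with the regexes from the table. A quick sketch (only a few entries shown; the dict is easy to extend to the full table):

```python
import re

# Accession patterns copied from the table above; each EGA entity type
# has a distinct prefix followed by 1-32 digits.
EGA_PATTERNS = {
    "ega_file_id": r"^EGAF[0-9]{1,32}$",
    "ega_dataset_id": r"^EGAD[0-9]{1,32}$",
    "ega_sample_id": r"^EGAN[0-9]{1,32}$",
    "ega_study_id": r"^EGAS[0-9]{1,32}$",
}

def validate_ega_id(field: str, value: str) -> bool:
    """Return True if value matches the accession pattern for field."""
    return re.fullmatch(EGA_PATTERNS[field], value) is not None

assert validate_ega_id("ega_file_id", "EGAF00001234567")
assert not validate_ega_id("ega_file_id", "EGAD00001234567")  # wrong prefix
assert not validate_ega_id("ega_study_id", "EGAS")            # digits missing
```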
We will need to update the ARGO file-centric and analysis-centric mappings to include the new fields.
Add two optional fields under the workflow section:
Other than sequencing_experiment, the following schemas will also need to be updated to reflect the changes for targeted-seq and wxs if applicable.
This should include creating a dictionary for an entity (for example donor)
Based on Junjun's feedback we need to add a normalizer; this ticket is to investigate the benefit and add it to the File Repository ARGO mappings.
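A typical use is a lowercase normalizer on keyword fields, so exact-match filters and aggregations become case-insensitive. A sketch of what the index settings could look like (the field choice here is illustrative):

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_id": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}
```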
The conditionally required field library_strandedness for RNA-Seq data is NOT enforced in the schema.
POG-CA has started submitting RNA-Seq data; however, current payloads are missing library_strandedness:
https://submission-song.rdpc.cancercollaboratory.org/studies/POG-CA/analysis/8e3605d7-328e-4acf-b605-d7328eeacf8f
"experiment": {
"platform": "ILLUMINA",
"platform_model": "Illumina HiSeq 2000",
"sequencing_date": "2013-03-28",
"sequencing_center": "BCCAGSC",
"experimental_strategy": "RNA-Seq",
"submitter_sequencing_experiment_id": "A10969"
},
The payload should have failed validation.
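One way to enforce the conditional requirement is a draft-07 if/then block keyed on experimental_strategy (field names taken from the payload above; this is a sketch, not the exact ARGO schema wording):

```json
{
  "if": {
    "properties": {
      "experimental_strategy": { "const": "RNA-Seq" }
    },
    "required": ["experimental_strategy"]
  },
  "then": {
    "required": ["library_strandedness"]
  }
}
```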
For WXS and Targeted Sequencing data, we would like to add capture_kit as an optional field under experiment in ALL schemas.
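A minimal sketch of the addition, assuming the field is a nullable free-text string (permissible kit values, if any, are not specified here):

```json
"experiment": {
  "properties": {
    "capture_kit": {
      "type": ["string", "null"]
    }
  }
}
```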
As we have many SONGs, and the same schema in different SONGs will have different system versions, a field in the schema indicating the centralized release version tracked in Git (like 0.5.0) would be very helpful.
The plan is to add the field to the dynamic portion of the schema and update the version whenever we make a release.
We need to verify that the analyzers in the Maestro file-centric and other mappings will actually work as expected, to avoid reindexing later as much as possible.
Set up a demo index and use the _analyze API to check that autocomplete and search will work as desired.
Read_groups required fields
Tasks:
sequencing_experiment to read_group
In sequencing_experiment, move read_group to under bioentities
Currently, the read_group entity can be submitted one by one independently, which is generally OK; however, this makes it impossible to know how many read groups to expect for a particular sample. Knowing the total number of read groups for a sample is important if we ultimately aim to programmatically launch analytic workflows depending on input data availability.
A possible solution is to impose a rule that users must submit all read groups (with the same library strategy) belonging to one sample at once.
To accommodate exceptions, such as additional sequencing done to increase coverage for a sample, adding new read groups should be allowed. This means the rule can be overridden when needed.
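The rule could be enforced by comparing the declared read_group_count against the submitted read_groups array, with the override modeled as a flag. A sketch using the field names from the payload above (the allow_additional flag is a hypothetical name for the override):

```python
# Sketch of the proposed rule: all read groups for a sample must arrive
# together, so the declared read_group_count can be checked against the
# submitted read_groups array (field names as in the payload above).
# The allow_additional flag models the override for top-up sequencing.

def check_read_groups(payload: dict, allow_additional: bool = False) -> bool:
    declared = payload["read_group_count"]
    submitted = len(payload["read_groups"])
    if allow_additional:
        # Top-up sequencing may add read groups beyond the original count.
        return submitted >= declared
    return submitted == declared

payload = {"read_group_count": 6, "read_groups": [{}] * 6}
assert check_read_groups(payload)
payload["read_groups"].append({})          # a 7th, undeclared read group
assert not check_read_groups(payload)
assert check_read_groups(payload, allow_additional=True)
```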