icgc-argo / argo-metadata-schemas
Repo to host ARGO metadata schemas defined using JSON Schema
License: GNU Affero General Public License v3.0
We need a mapping for the Elasticsearch indices that will be used by the File Repository. Currently, the index only contains information directly from Maestro, which means only data found in Song.
We need to supplement this information with 1) clinical data and 2) release information.
Start from Maestro mapping as an example (https://github.com/icgc-argo/argo-metadata-schemas/blob/master/mappings/maestro_file_centric.json).
Format the index to include this data:
embargo_stage
file-level field at the top level
file_id
labels
object with string keys and array values. Example:
{
  "labels": {
    [key: string]: string[];
  }
}
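In the ES mapping, labels with arbitrary string keys could be handled with a dynamic template so every sub-field is indexed as a keyword. A minimal sketch, with the keyword types for embargo_stage and file_id assumed (the real mapping should follow the Maestro mapping referenced above):

```json
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels_as_keywords": {
          "path_match": "labels.*",
          "mapping": { "type": "keyword" }
        }
      }
    ],
    "properties": {
      "embargo_stage": { "type": "keyword" },
      "file_id": { "type": "keyword" },
      "labels": { "type": "object", "dynamic": true }
    }
  }
}
```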
curl --location --request GET 'https://song.rdpc-qa.cancercollaboratory.org/studies/TEST-PR/analysis/328ebe1e-28a1-4670-8ebe-1e28a16670d9' \
--header 'Authorization: Bearer 58bb9610-2184-48b2-a189-34016622c0c4'
The dynamic portion of SONG schemas has been changed since the last time ES mappings were created.
Here are the changes: 0.1.0...0.3.0 (look at each individual schema file for its specific changes)
Summary of what needs to be synced:
library_strategy changed to experiment_strategy
variant_class should be a root property for file_centric mappings. Populate with null if SONG metadata does not have it.
New fields under file.info that cannot be defined in the dynamic SONG schema. See example payloads: here and here
file.info.data_category to be added to file_centric mappings as data_category
file.info.analysis_tools to be added to file_centric mappings as analysis_tools
data_category and analysis_tools under file.info are meant to be at the same level as file.dataType in SONG, but since we cannot change SONG's base schema, we added them under info. So in terms of the ES mapping, data_category and analysis_tools should be added wherever the data_type field goes.
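Since Elasticsearch needs no special mapping type for arrays of strings, the two new fields can sit beside data_type as plain keywords. A sketch of the relevant mapping fragment, with the keyword type assumed to match data_type:

```json
{
  "data_type": { "type": "keyword" },
  "data_category": { "type": "keyword" },
  "analysis_tools": { "type": "keyword" }
}
```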
Update all SONG schemas for ARGO to have session_id in addition to run_id.
Currently, input_files and inputs are both at the top level for data bundles; it is much more natural and easier to handle if input_files is moved under inputs.
We need some high level user stories for how the data curation person or team will create, update, and manage the data dictionaries and their versions.
We can scope the requirements to only be related to the backend service at a high level.
A typo of allof instead of allOf results in all experiment info being unvalidated.
Get the schema and an example payload. The payload currently validates silently.
The current payload has Illumina when the value should be ILLUMINA.
Looking at the schema, note allof:
Correcting the typo results in proper enforcement of the schema.
The schema should catch invalid enum values.
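JSON Schema validators ignore unknown keywords, so a misspelled allof is treated as a harmless annotation and none of its subschemas are enforced; that is why Illumina passed. With the keyword corrected, the enum is applied. A minimal sketch (the enum values here are illustrative, not the full ARGO list):

```json
{
  "properties": {
    "experiment": {
      "type": "object",
      "allOf": [
        {
          "properties": {
            "platform": { "enum": ["ILLUMINA", "PACBIO", "NANOPORE"] }
          }
        }
      ]
    }
  }
}
```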
When the SONG payload is validated against the current schema, and "sequencing_date" is null, the validation fails producing the following error:
In SONG:
[SubmitService::schema.violation] - #/experiment/sequencing_date: #: 2 subschemas matched instead of one
This appears to be related to this section of the schema where "sequencing_date" is expected to match only one of the formats.
"sequencing_date": {
"type": "string",
"oneOf": [
{
"format": "date"
},
{
"format": "date-time"
}
],
"example": [
"2019-06-16",
"2019-06-16T20:20:39+00:00"
]
},
Changing "oneOf" to "anyOf" may resolve the issue: the format keyword only constrains string values, so when "sequencing_date" is null both subschemas match vacuously and "oneOf" fails, while "anyOf" only requires at least one match.
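A sketch of the corrected property, assuming null should also be accepted explicitly via a type union (note that whether format is actually asserted at all depends on the validator's configuration):

```json
"sequencing_date": {
  "type": ["string", "null"],
  "anyOf": [
    { "format": "date" },
    { "format": "date-time" }
  ],
  "example": [
    "2019-06-16",
    "2019-06-16T20:20:39+00:00"
  ]
}
```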
SONG Payload used in this example:
{
"studyId": "PACA-CA",
"analysisType": {
"name": "sequencing_experiment"
},
"samples": [
{
"submitterSampleId": "PCSI_0216_Pa_P_526",
"matchedNormalSubmitterSampleId": "PCSI_0216_St_R",
"sampleType": "Total DNA",
"specimen": {
"submitterSpecimenId": "PCSI_0216_Pa_P_526",
"specimenType": "Primary tumour",
"tumourNormalDesignation": "Tumour",
"specimenTissueSource": "Other"
},
"donor": {
"submitterDonorId": "PCSI_0216",
"gender": "Male"
}
}
],
"files": [
{
"dataType": "submittedReads",
"fileName": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"fileSize": 109017064104,
"fileType": "BAM",
"fileMd5sum": "b24f248925d1babae355001e1d0200f0",
"fileAccess": "controlled"
}
],
"read_groups": [
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_595_WG",
"platform_unit": "150106_D00353_0088_BC61P0ANXX_1_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150106_D00353_0088_BC61P0ANXX_1_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_607_WG",
"platform_unit": "150106_D00353_0088_BC61P0ANXX_2_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150106_D00353_0088_BC61P0ANXX_2_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_595_WG",
"platform_unit": "150115_D00331_0120_AC5U99ANXX_5_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150115_D00331_0120_AC5U99ANXX_5_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_607_WG",
"platform_unit": "150115_D00331_0120_AC5U99ANXX_6_NoIndex",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150115_D00331_0120_AC5U99ANXX_6_NoIndex"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_646_WG",
"platform_unit": "150327_D00355_0082_BC6JNLANXX_1_AGTTCC",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150327_D00355_0082_BC6JNLANXX_1_AGTTCC"
},
{
"file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
"insert_size": null,
"is_paired_end": true,
"library_name": "PCSI_0216_Pa_P_PE_646_WG",
"platform_unit": "150408_D00355_0084_BC6DB1ANXX_1_AGTTCC",
"read_length_r1": null,
"read_length_r2": null,
"sample_barcode": null,
"submitter_read_group_id": "150408_D00355_0084_BC6DB1ANXX_1_AGTTCC"
}
],
"experiment": {
"submitter_sequencing_experiment_id": "TEST-EXP-132",
"library_strategy": "WGS",
"sequencing_center": "",
"platform": "ILLUMINA",
"platform_model": null,
"sequencing_date": null
},
"read_group_count": 6
}
This will allow earlier payloads which are missing session_id
information to be valid against the latest schemas.
We need some control over what characters are allowed in read group IDs; otherwise, odd characters (like & and /) may cause trouble in downstream data processing.
Add a pattern to the submitter_read_group_id field in the relevant JSON schema: [a-zA-Z0-9_\:\.\-]
Note that other characters may be added to the above pattern later if deemed necessary.
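The whitelist can be checked quickly with a regex. This sketch anchors the character class and adds a + quantifier, both of which are assumptions; the issue only specifies the class itself:

```python
import re

# Proposed character whitelist for submitter_read_group_id.
# Anchoring (^...$) and the + quantifier are assumptions; the issue
# only gives the character class [a-zA-Z0-9_\:\.\-].
READ_GROUP_ID_PATTERN = re.compile(r"^[a-zA-Z0-9_\:\.\-]+$")

def is_valid_read_group_id(rg_id: str) -> bool:
    """Return True if the ID contains only whitelisted characters."""
    return READ_GROUP_ID_PATTERN.fullmatch(rg_id) is not None

# IDs from the example payload pass:
assert is_valid_read_group_id("150106_D00353_0088_BC61P0ANXX_1_NoIndex")
# Characters like '&' or '/' are rejected:
assert not is_valid_read_group_id("rg&1")
assert not is_valid_read_group_id("lane/1")
```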
The current qc_metrics schema has some issues when used by sanger_qc:
short_name is a valid enum value but not required
sequencing_alignment is a valid enum value
platform is optional, not required
Add the following conditional fields:
Field | Attribute | Description | Permissible value |
---|---|---|---|
ega_file_id | Conditional Required | EGA File Unique Accession ID | ^EGAF[0-9]{1,32}$ |
ega_dataset_id | Optional | EGA Dataset Accession ID | ^EGAD[0-9]{1,32}$ |
ega_experiment_id | Optional | EGA Experiment ID | ^EGAX[0-9]{1,32}$ |
ega_sample_id | Optional | EGA Sample Accession ID | ^EGAN[0-9]{1,32}$ |
ega_study_id | Optional | EGA Study Accession ID | ^EGAS[0-9]{1,32}$ |
ega_run_id | Optional | EGA Run Accession ID | ^EGAR[0-9]{1,32}$ |
ega_policy_id | Optional | EGA Policy Accession ID | ^EGAP[0-9]{1,32}$ |
ega_analysis_id | Optional | EGA Analysis Accession ID | ^EGAZ[0-9]{1,32}$ |
ega_submission_id | Optional | EGA Submission ID | ^EGAB[0-9]{1,32}$ |
ega_dac_id | Optional | EGA DAC Accession ID | ^EGAC[0-9]{1,32}$ |
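These accessions can be validated directly with the regexes from the table. A quick sketch (only a few entries shown; the dict is easy to extend to the full table):

```python
import re

# Accession patterns copied from the table above; each EGA entity type
# has a distinct prefix followed by 1-32 digits.
EGA_PATTERNS = {
    "ega_file_id": r"^EGAF[0-9]{1,32}$",
    "ega_dataset_id": r"^EGAD[0-9]{1,32}$",
    "ega_sample_id": r"^EGAN[0-9]{1,32}$",
    "ega_study_id": r"^EGAS[0-9]{1,32}$",
}

def validate_ega_id(field: str, value: str) -> bool:
    """Return True if value matches the accession pattern for field."""
    return re.fullmatch(EGA_PATTERNS[field], value) is not None

assert validate_ega_id("ega_file_id", "EGAF00001234567")
assert not validate_ega_id("ega_file_id", "EGAD00001234567")  # wrong prefix
assert not validate_ega_id("ega_study_id", "EGAS")            # digits missing
```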
We will need to update the ARGO file-centric and analysis-centric mappings to include the new fields.
Add two optional fields under the workflow section:
Other than sequencing_experiment, the following schemas will also need to be updated to reflect the changes for targeted-seq and wxs if applicable.
This should include creating a dictionary for an entity (for example donor)
Based on Junjun's feedback we need to add a normalizer; this ticket is to investigate the benefit and add it to the File Repository ARGO mappings.
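A typical use is a lowercase normalizer on keyword fields, so exact-match filters and aggregations become case-insensitive. A sketch of what the index settings could look like (the field choice here is illustrative):

```json
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_id": {
        "type": "keyword",
        "normalizer": "lowercase_normalizer"
      }
    }
  }
}
```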
The conditionally required field library_strandedness for RNA-Seq data is NOT enforced in the schema.
POG-CA has started submitting RNA-Seq data; however, current payloads are missing library_strandedness:
https://submission-song.rdpc.cancercollaboratory.org/studies/POG-CA/analysis/8e3605d7-328e-4acf-b605-d7328eeacf8f
"experiment": {
"platform": "ILLUMINA",
"platform_model": "Illumina HiSeq 2000",
"sequencing_date": "2013-03-28",
"sequencing_center": "BCCAGSC",
"experimental_strategy": "RNA-Seq",
"submitter_sequencing_experiment_id": "A10969"
},
The payload should have failed validation.
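One way to enforce the conditional requirement is a draft-07 if/then block keyed on experimental_strategy (field names taken from the payload above; this is a sketch, not the exact ARGO schema wording):

```json
{
  "if": {
    "properties": {
      "experimental_strategy": { "const": "RNA-Seq" }
    },
    "required": ["experimental_strategy"]
  },
  "then": {
    "required": ["library_strandedness"]
  }
}
```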
For WXS and Targeted Sequencing data, we would like to add capture_kit as an optional field under experiment in ALL schemas.
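A minimal sketch of the addition, assuming the field is a nullable free-text string (permissible kit values, if any, are not specified here):

```json
"experiment": {
  "properties": {
    "capture_kit": {
      "type": ["string", "null"]
    }
  }
}
```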
As we have many SONGs, and the same schema in different SONGs will have different system versions, a field in the schema indicating the centralized release version tracked in Git (like 0.5.0) would be very helpful.
The plan is to add the field to the dynamic portion of the schema and update the version whenever we make a release.
We need to verify that the analyzers in the Maestro file-centric and other mappings will actually work as expected, to avoid reindexing later as much as possible.
Set up a demo index and use the _analyze API to check that autocomplete and search will work as desired.
Read_groups required fields
Tasks:
sequencing_experiment to read_group
In sequencing_experiment, move read_group to under bioentities
Currently, the read_group entity can be submitted one by one independently, which is generally OK; however, this makes it impossible to know how many read groups to expect for a particular sample. Knowing the total number of read groups for a sample is important if we ultimately aim to programmatically launch analytic workflows depending on input data availability.
A possible solution is to impose a rule that users must submit all read groups (with the same library strategy) belonging to one sample at once.
To accommodate exceptions, such as additional sequencing done to increase coverage for a sample, adding new read groups should be allowed. This means the rule can be overridden when needed.
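The rule could be enforced by comparing the declared read_group_count against the submitted read_groups array, with the override modeled as a flag. A sketch using the field names from the payload above (the allow_additional flag is a hypothetical name for the override):

```python
# Sketch of the proposed rule: all read groups for a sample must arrive
# together, so the declared read_group_count can be checked against the
# submitted read_groups array (field names as in the payload above).
# The allow_additional flag models the override for top-up sequencing.

def check_read_groups(payload: dict, allow_additional: bool = False) -> bool:
    declared = payload["read_group_count"]
    submitted = len(payload["read_groups"])
    if allow_additional:
        # Top-up sequencing may add read groups beyond the original count.
        return submitted >= declared
    return submitted == declared

payload = {"read_group_count": 6, "read_groups": [{}] * 6}
assert check_read_groups(payload)
payload["read_groups"].append({})          # a 7th, undeclared read group
assert not check_read_groups(payload)
assert check_read_groups(payload, allow_additional=True)
```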