argo-metadata-schemas's People

Contributors

andricdu, anncatton, blabadi, buwujiu, d8660091, edsu7, joneubank, junjun-zhang, lindaxiang, mistryrn, rosibaj, samrichca

Forkers

rdeborja

argo-metadata-schemas's Issues

Confirm Platform File_Centric Mapping

We need a mapping for the Elasticsearch indices that will be used by the File Repository. Currently, the index only contains information directly from Maestro, which means only data found in SONG.

We need to supplement this information with 1) clinical data and 2) release information.

Start from Maestro mapping as an example (https://github.com/icgc-argo/argo-metadata-schemas/blob/master/mappings/maestro_file_centric.json).

Format the index to include this data:

  • 1. Include all release fields from the File Service. These are file-level fields at the top level:
  • embargo_stage
  • file_id
  • labels: an object with string keys and array values. Example:
{
  "labels": {
    [key:string]: string[];
  }
}
  • 2. Include all file information that originates in an analysis. Keep the structure of the file as it is now; this point is just for reference.
  • The data source for reference is a SONG analysis.
  • Example:
curl --location --request GET 'https://song.rdpc-qa.cancercollaboratory.org/studies/TEST-PR/analysis/328ebe1e-28a1-4670-8ebe-1e28a16670d9' \
--header 'Authorization: Bearer 58bb9610-2184-48b2-a189-34016622c0c4'
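
A minimal sketch of the top-level additions in Elasticsearch mapping syntax. The field types are assumptions, not the agreed mapping; `flattened` is one way to index an object with arbitrary string keys like labels:

```json
{
  "mappings": {
    "properties": {
      "embargo_stage": { "type": "keyword" },
      "file_id": { "type": "keyword" },
      "labels": { "type": "flattened" }
    }
  }
}
```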

Expected Outcome

  • A mapping file in the metadata repository that we all agree on!

Feature Request: update ES index mapping to sync with dynamic SONG schemas

Detailed Description

The dynamic portion of the SONG schemas has changed since the ES mappings were last created.
Here are the changes: 0.1.0...0.3.0 (see each individual schema file for its specific changes)

Summary of what needs to be synced:

  • library_strategy changed to experiment_strategy
  • new field: variant_class, which should be a root property of the file_centric mappings. Populate it with null if the SONG metadata does not have it

New fields under file.info cannot be defined in the dynamic SONG schema. See example payloads: here and here

  • file.info.data_category to be added to file_centric mappings as data_category
  • file.info.analysis_tools to be added to file_centric mappings as analysis_tools

data_category and analysis_tools under file.info are meant to be at the same level as file.dataType in SONG, but since we cannot change SONG's base schema, we added them under info.

So in the ES mapping, data_category and analysis_tools should be added wherever the data_type field goes.
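
As a sketch, the two new fields could sit next to data_type in the file_centric mapping. The keyword types are assumptions; an Elasticsearch keyword field also accepts arrays, so analysis_tools needs no special array mapping:

```json
{
  "data_type": { "type": "keyword" },
  "data_category": { "type": "keyword" },
  "analysis_tools": { "type": "keyword" }
}
```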

Move input_files to under inputs

Currently, input_files and inputs are both at the top level of data bundles. Moving input_files under inputs makes the structure much more natural and easier to handle.
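
The proposed shape, illustratively (the file name is a placeholder, and any sibling keys under inputs are omitted):

```json
{
  "inputs": {
    "input_files": ["example_input.bam"]
  }
}
```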

Dictionary Management User Stories

We need some high-level user stories for how the data curation person or team will create, update, and manage the data dictionaries and their versions.

We can scope the requirements to only be related to the backend service at a high level.

🐛 Typo in metaschema resulting in improperly passed param 🐛

Describe the bug

A typo of allof instead of allOf results in all experiment info going unvalidated.

Steps To Reproduce

Get the schema and an example payload. The payload currently validates silently.
[screenshot]
The current payload has Illumina when the value should be ILLUMINA.
[screenshot]

Looking at the schema, note allof:
[screenshot]

Correcting the typo results in proper enforcement of the schema.
[screenshot]

Expected behaviour

Schema should catch invalid enum values
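
As background, JSON Schema validators treat unrecognized keywords such as allof as ordinary data and silently ignore them, so the subschemas underneath are never applied. Only the correctly cased keyword is enforced; the fragment below is illustrative, not the actual metaschema:

```json
{
  "allOf": [
    {
      "properties": {
        "platform": { "enum": ["ILLUMINA"] }
      }
    }
  ]
}
```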

🐛 sequencing_experiment: null value for sequencing_date fails validation against schema

Describe the bug

When the SONG payload is validated against the current schema and "sequencing_date" is null, validation fails, producing the following error:

In SONG:

[SubmitService::schema.violation] - #/experiment/sequencing_date: #: 2 subschemas matched instead of one

This appears to be related to this section of the schema where "sequencing_date" is expected to match only one of the formats.

"sequencing_date": {
  "type": "string",
  "oneOf": [
    { "format": "date" },
    { "format": "date-time" }
  ],
  "example": [
    "2019-06-16",
    "2019-06-16T20:20:39+00:00"
  ]
},

Possible solution:

Changing "oneOf" to "anyOf" may resolve the issue.
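
Since format only constrains string instances, a non-string value vacuously satisfies both branches, so oneOf (exactly one match) fails while anyOf (at least one match) would pass. A sketch of the corrected fragment — note the null in the type union is an assumption to let null validate at all; the issue itself only proposes the oneOf-to-anyOf change:

```json
"sequencing_date": {
  "type": ["string", "null"],
  "anyOf": [
    { "format": "date" },
    { "format": "date-time" }
  ]
}
```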

SONG Payload used in this example:

{
    "studyId": "PACA-CA",
    "analysisType": {
        "name": "sequencing_experiment"
    },
    "samples": [
        {
            "submitterSampleId": "PCSI_0216_Pa_P_526",
            "matchedNormalSubmitterSampleId": "PCSI_0216_St_R",
            "sampleType": "Total DNA",
            "specimen": {
                "submitterSpecimenId": "PCSI_0216_Pa_P_526",
                "specimenType": "Primary tumour",
                "tumourNormalDesignation": "Tumour",
                "specimenTissueSource": "Other"
            },
            "donor": {
                "submitterDonorId": "PCSI_0216",
                "gender": "Male"
            }
        }
    ],
    "files": [
        {
            "dataType": "submittedReads",
            "fileName": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "fileSize": 109017064104,
            "fileType": "BAM",
            "fileMd5sum": "b24f248925d1babae355001e1d0200f0",
            "fileAccess": "controlled"
        }
    ],
    "read_groups": [
        {
            "file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "insert_size": null,
            "is_paired_end": true,
            "library_name": "PCSI_0216_Pa_P_PE_595_WG",
            "platform_unit": "150106_D00353_0088_BC61P0ANXX_1_NoIndex",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "submitter_read_group_id": "150106_D00353_0088_BC61P0ANXX_1_NoIndex"
        },
        {
            "file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "insert_size": null,
            "is_paired_end": true,
            "library_name": "PCSI_0216_Pa_P_PE_607_WG",
            "platform_unit": "150106_D00353_0088_BC61P0ANXX_2_NoIndex",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "submitter_read_group_id": "150106_D00353_0088_BC61P0ANXX_2_NoIndex"
        },
        {
            "file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "insert_size": null,
            "is_paired_end": true,
            "library_name": "PCSI_0216_Pa_P_PE_595_WG",
            "platform_unit": "150115_D00331_0120_AC5U99ANXX_5_NoIndex",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "submitter_read_group_id": "150115_D00331_0120_AC5U99ANXX_5_NoIndex"
        },
        {
            "file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "insert_size": null,
            "is_paired_end": true,
            "library_name": "PCSI_0216_Pa_P_PE_607_WG",
            "platform_unit": "150115_D00331_0120_AC5U99ANXX_6_NoIndex",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "submitter_read_group_id": "150115_D00331_0120_AC5U99ANXX_6_NoIndex"
        },
        {
            "file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "insert_size": null,
            "is_paired_end": true,
            "library_name": "PCSI_0216_Pa_P_PE_646_WG",
            "platform_unit": "150327_D00355_0082_BC6JNLANXX_1_AGTTCC",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "submitter_read_group_id": "150327_D00355_0082_BC6JNLANXX_1_AGTTCC"
        },
        {
            "file_r1": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "file_r2": "b24f248925d1babae355001e1d0200f0.PCSI_0216_Pa_P_526.bam",
            "insert_size": null,
            "is_paired_end": true,
            "library_name": "PCSI_0216_Pa_P_PE_646_WG",
            "platform_unit": "150408_D00355_0084_BC6DB1ANXX_1_AGTTCC",
            "read_length_r1": null,
            "read_length_r2": null,
            "sample_barcode": null,
            "submitter_read_group_id": "150408_D00355_0084_BC6DB1ANXX_1_AGTTCC"
        }
    ],
    "experiment": {
        "submitter_sequencing_experiment_id": "TEST-EXP-132",
        "library_strategy": "WGS",
        "sequencing_center": "",
        "platform": "ILLUMINA",
        "platform_model": null,
        "sequencing_date": null
    },
    "read_group_count": 6
}

add pattern for submitter_read_group_id to avoid odd characters

Detailed Description

We need some control over which characters are allowed in read group IDs; otherwise, odd characters (like & and /) may cause trouble in downstream data processing.

Possible Implementation

Add a pattern to the submitter_read_group_id field in the relevant JSON schema: [a-zA-Z0-9_\:\.\-]

Note that other characters may be added to the above pattern if deemed necessary later.
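
A quick sketch of the proposed restriction, assuming the character class is anchored and repeated so the entire ID must consist of allowed characters (the helper name is illustrative, not part of any schema):

```python
import re

# Proposed character class from the issue, applied with fullmatch and
# repeated with + so every character in the ID must be allowed.
READ_GROUP_ID_PATTERN = re.compile(r"[a-zA-Z0-9_:.\-]+")

def is_valid_read_group_id(value: str) -> bool:
    """Accept only letters, digits, and the characters _ : . -"""
    return READ_GROUP_ID_PATTERN.fullmatch(value) is not None

print(is_valid_read_group_id("150106_D00353_0088_BC61P0ANXX_1_NoIndex"))  # True
print(is_valid_read_group_id("bad&group/id"))                             # False
```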

Make the qc_metrics schema support both alignment and variant QC

The current qc_metrics schema has some issues when used by sanger_qc:

  1. #/workflow/short_name: short_name is not a valid enum value,
  2. #/workflow/inputs/0/analysis_type: sequencing_alignment is not a valid enum value,
  3. #/workflow/inputs/1/analysis_type: sequencing_alignment is not a valid enum value,
  4. #/experiment: required key [platform] not found,

Possible Implementation

  1. add short_name as a valid enum value but do not make it required
  2. add sequencing_alignment as a valid enum value
  3. add sequencing_alignment as a valid enum value
  4. make platform optional rather than required
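
A sketch of the enum change for points 2 and 3 (the other enum values shown are assumptions about the existing schema, not taken from it):

```json
"analysis_type": {
  "type": "string",
  "enum": ["sequencing_experiment", "sequencing_alignment", "qc_metrics"]
}
```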

Add support for EGA IDs under `file`/`info`

Add the following conditional fields

| Field | Attribute | Description | Permissible value |
| --- | --- | --- | --- |
| ega_file_id | Conditional Required | EGA File Unique Accession ID | `^EGAF[0-9]{1,32}$` |
| ega_dataset_id | Optional | EGA Dataset Accession ID | `^EGAD[0-9]{1,32}$` |
| ega_experiment_id | Optional | EGA Experiment ID | `^EGAX[0-9]{1,32}$` |
| ega_sample_id | Optional | EGA Sample Accession ID | `^EGAN[0-9]{1,32}$` |
| ega_study_id | Optional | EGA Study Accession ID | `^EGAS[0-9]{1,32}$` |
| ega_run_id | Optional | EGA Run Accession ID | `^EGAR[0-9]{1,32}$` |
| ega_policy_id | Optional | EGA Policy Accession ID | `^EGAP[0-9]{1,32}$` |
| ega_analysis_id | Optional | EGA Analysis Accession ID | `^EGAZ[0-9]{1,32}$` |
| ega_submission_id | Optional | EGA Submission ID | `^EGAB[0-9]{1,32}$` |
| ega_dac_id | Optional | EGA DAC Accession ID | `^EGAC[0-9]{1,32}$` |
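
The accession patterns above can be exercised directly; a sketch with a few of the patterns copied from the table (the dict and function names are illustrative, not part of any schema):

```python
import re

# Accession patterns copied from the table above.
EGA_ID_PATTERNS = {
    "ega_file_id": r"^EGAF[0-9]{1,32}$",
    "ega_dataset_id": r"^EGAD[0-9]{1,32}$",
    "ega_sample_id": r"^EGAN[0-9]{1,32}$",
    "ega_study_id": r"^EGAS[0-9]{1,32}$",
}

def validate_ega_id(field: str, value: str) -> bool:
    """Check a value against the accession pattern for the given field."""
    return re.match(EGA_ID_PATTERNS[field], value) is not None

print(validate_ega_id("ega_file_id", "EGAF00001234567"))   # True
print(validate_ega_id("ega_study_id", "EGAF00001234567"))  # False
```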

Update mapping for release fields

We will need to update the ARGO file-centric and analysis-centric mappings to include the new fields:

  • release stage
  • program access date
  • publish date [needs to be recorded in SONG]
  • argo file id

Exit Criteria

  • PR with mapping updates including all of these concepts.

related schema changes for targeted-seq and wxs

Other than sequencing_experiment, the following schemas will also need to be updated to reflect the changes for targeted-seq and wxs if applicable.

  • sequencing_alignment
  • qc_metrics
  • variant_calling
  • variant_calling_supplement
  • variant_processing
  • splice_junctions
  • supplement

🐛 Conditional required field `library_strandedness` for RNA-Seq data is NOT enforced in the schema

Describe the bug

The conditional required field library_strandedness for RNA-Seq data is NOT enforced in the schema.

Steps To Reproduce

POG-CA has started submitting RNA-Seq; however, current payloads are missing library_strandedness:
https://submission-song.rdpc.cancercollaboratory.org/studies/POG-CA/analysis/8e3605d7-328e-4acf-b605-d7328eeacf8f

  "experiment": {
    "platform": "ILLUMINA",
    "platform_model": "Illumina HiSeq 2000",
    "sequencing_date": "2013-03-28",
    "sequencing_center": "BCCAGSC",
    "experimental_strategy": "RNA-Seq",
    "submitter_sequencing_experiment_id": "A10969"
  },

Expected behaviour

The payload should fail validation.
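
One way to enforce this in JSON Schema draft-07 is an if/then conditional. This is a sketch only; the experimental_strategy field name and "RNA-Seq" spelling are taken from the payload above, and the surrounding experiment schema is omitted:

```json
{
  "if": {
    "properties": {
      "experimental_strategy": { "const": "RNA-Seq" }
    },
    "required": ["experimental_strategy"]
  },
  "then": {
    "required": ["library_strandedness"]
  }
}
```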

Feature Request - Add centralized version field into SONG dynamic schema

Detailed Description

As we run many SONG instances, and the same schema will have different system versions in different SONGs, a field in the schema indicating the centralized release version tracked in Git (like 0.5.0) would be very helpful.

Possible Implementation

The plan is to add the field to the dynamic portion of the schema and update the version whenever we make a release.
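
For instance, the dynamic schema could carry a constant version property (the field name schema_version is a hypothetical choice, not decided in the issue):

```json
"properties": {
  "schema_version": {
    "type": "string",
    "const": "0.5.0"
  }
}
```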

Verify and test mapping analyzers

We need to verify that the analyzers in the Maestro file-centric and other mappings will actually work as expected.

Detailed Description

Avoid reindexing later as much as possible.

Possible Implementation

Set up a demo index and use the _analyze API to check that autocomplete and search will work as desired.
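
For example, posting a request body like this to /<demo-index>/_analyze returns the tokens that the named field's analyzer would produce (the field and text values are placeholders):

```json
{
  "field": "file_name",
  "text": "TEST-PR.example_sample.bam"
}
```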

Should all read groups with the same library strategy from the same sample be submitted all together?

Currently, read_group entities can be submitted one by one independently, which is generally OK; however, this makes it impossible to know how many read groups to expect for a particular sample. Knowing the total number of read groups for a sample is important if we ultimately aim to programmatically launch analytic workflows depending on input data availability.

A possible solution is to impose a rule that users must submit all read groups (with the same library strategy) belonging to one sample at once.

To accommodate exceptions, such as additional sequencing done to increase coverage for a sample, adding new read groups should still be allowed. This means the rule can be overridden when needed.
