icgc-argo / donor-submission-aggregator
Event-driven donor aggregation service for the ICGC-ARGO Submission System
License: GNU Affero General Public License v3.0
Rule #1: the total number of alignment runs should not be greater than the total number of samples, using sequencing_alignment analyses as an example (if the dashboard is in fact reporting the correct number).
Rule #2: the total number of Sanger VC runs should be less than or equal to the number of Tumour samples. As a rule right now, it should be roughly 1:1.
Rule #3: if Sanger runs exist, then Alignments should not be 0.
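The three rules can be sketched as simple sanity checks. This is a hypothetical helper (names and shape are illustrative, not the aggregator's API); the counts would come from the dashboard and the RDPC analyses queries:

```typescript
// Illustrative only: counts per donor, as reported by the dashboard/RDPC.
interface DonorCounts {
  totalSamples: number;
  tumourSamples: number;
  alignmentRuns: number;
  sangerVcRuns: number;
}

function violatedRules(c: DonorCounts): string[] {
  const violations: string[] = [];
  // Rule #1: alignment runs must not exceed total samples.
  if (c.alignmentRuns > c.totalSamples) violations.push("rule1");
  // Rule #2: Sanger VC runs must not exceed tumour samples.
  if (c.sangerVcRuns > c.tumourSamples) violations.push("rule2");
  // Rule #3: if Sanger runs exist, alignments must not be 0.
  if (c.sangerVcRuns > 0 && c.alignmentRuns === 0) violations.push("rule3");
  return violations;
}
```

A QA script along these lines could flag the anomalies tracked in the spreadsheet below automatically.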
A sheet of running issues that have been observed are here: https://docs.google.com/spreadsheets/d/1zdf2Xy5L4ggWASBzPbsCXSQ0bIG5McH4O7UhClu1mXk/edit#gid=0
Prod RDPC data has been indexed into QA to test this. The following query has been used for QA, looking at the QA Platform but against the production RDPC API:
While the dashboard is counting runs, for the QA @rosibaj has been looking at analyses, as this is the information that should be consistent with the numbers reported on the dashboard.
Prod API: https://api.rdpc.cancercollaboratory.org/graphql
Query:
query test_num_in_dash {
  analyses(filter: {donorId: "DO35424", analysisState: PUBLISHED, analysisType: "sequencing_alignment"}) {
    analysisId
    #analysisState
    #analysisType
    #analysisVersion
    experiment
    donors {
      donorId
      submitterDonorId
      specimens {
        specimenId
        submitterSpecimenId
        tumourNormalDesignation
        samples {
          sampleId
          submitterSampleId
          matchedNormalSubmitterSampleId
        }
      }
    }
    files {
      dataType
    }
    inputForRuns {
      runId
    }
    workflow {
      runId
      workflowName
      run {
        runId
        state
        inputAnalyses {
          analysisId
          analysisType
        }
      }
    }
  }
}
Before deployment to production: currently, if anything fails, we ignore the error. For this to be production ready, we need to build some fault tolerance (retries) into the flow. A couple of places:
A change has been made to the rdpc-gateway API which involves an API structure change: icgc-argo/rdpc-gateway#58 (comment). In order to be compatible with the latest API, the donor aggregator query needs to be updated.
The donor dashboard aggregator service populates the index based on the spec'd data needs: https://wiki.oicr.on.ca/display/icgcargotech/Program+Dashboard+Page+Specs#ProgramDashboardPageSpecs-2.4.2MolecularDataChart

Line | Source of Data
---|---
Raw Reads | This is a count of the number of donors with the minimum number of raw reads submitted. The minimum accepted number means the donor has at least 1 Tumour/Normal pair. Count the number of donors in the time interval with analyses that fit these criteria: at least 1 analysis of (analysis_type = sequencing_experiment AND tumour_normal_designation = Tumour) AND at least 1 analysis of (analysis_type = sequencing_experiment AND tumour_normal_designation = Normal)

The raw-reads portion is populated with the correct data as per the logic specified above.

Devops and all that jazz:
Add logic to the aggregator to support Mutect2. The GATK Mutect2 workflow is the one to filter on for this set of data.

Expected Behavior
The aggregator got stuck consuming incorrectly formatted messages (i.e. missing studyId or programId). When consuming an invalid message, it throws an error and does nothing, so the event is never marked as processed, causing the aggregator to stall.
In the event that an invalid message is passed to the aggregator:
log invalid messages out to Slack, indicating the topic, partition, and offset of the event
pass invalid messages to the DLQ. Example implementation: https://github.com/icgc-argo/files-service/blob/master/src/kafka.ts
have the aggregator move on to the next message

FOLLOWUP FOR NEXT TICKET:
pass invalid messages to the DLQ. Example implementation: https://github.com/icgc-argo/files-service/blob/master/src/kafka.ts
The file src/indexProgram/transformToEsDonor.ts
contains the transformation logic from Mongo to Elasticsearch documents. Currently this just contains some placeholder values.
Tests will need to be updated
Follow-up to #113
src/index.ts
is the main entry point to the indexer. It is currently set up as a script for manual trigger. This will need to be turned into a web service with kafka integration.
Requirements:
Currently we have a pretty extreme exponential retry policy that looks like this:
{
factor: 2,
retries: 100,
minTimeout: 1000,
maxTimeout: Infinity,
}
This means that if indexing fails for a given program, it will keep retrying that one program practically forever. Then, if it ever exhausts all the retries, it does nothing except completely ignore that program forever and move on.
Problems that arise from this:
Solutions:
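One option is to bound the policy: cap both the retry count and the maximum delay. The values below are illustrative, not tuned recommendations, and the delay formula follows the usual node-retry-style exponential backoff used by the options object above:

```typescript
// Illustrative bounded policy: stop retrying after a handful of attempts
// and cap the backoff instead of letting it grow toward Infinity.
const boundedPolicy = { factor: 2, retries: 8, minTimeout: 1_000, maxTimeout: 60_000 };

// Delay (ms) before retry attempt n (1-based), node-retry style:
// minTimeout * factor^(n-1), clamped to maxTimeout.
function backoffDelay(n: number, p = boundedPolicy): number {
  return Math.min(p.minTimeout * Math.pow(p.factor, n - 1), p.maxTimeout);
}
```

After the final attempt, the program could be parked and surfaced (e.g. flagged for a later SYNC re-run) instead of being silently ignored.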
Take the query from the RDPC and transform the data into the needed shape.
The donor aggregator is taking more than 5 minutes to start. To reduce the long start-up time, we need to adjust these Kafka consumer settings:
sessionTimeout
rebalanceTimeout
heartbeatInterval
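With kafkajs these are passed when creating the consumer. The values below are placeholders for illustration (the groupId is assumed, and the numbers are not tuned recommendations); the constraint that matters is that heartbeats arrive several times within each session timeout window:

```typescript
// Placeholder values for illustration only.
const consumerConfig = {
  groupId: "donor-submission-aggregator", // assumed name
  sessionTimeout: 30_000,     // broker waits this long for a heartbeat
  rebalanceTimeout: 60_000,   // max time to re-join during a rebalance
  heartbeatInterval: 3_000,   // how often the consumer heartbeats
};

// Sanity check: Kafka requires heartbeatInterval well below sessionTimeout
// (rule of thumb: no more than one third of it).
function isSaneConsumerConfig(c: typeof consumerConfig): boolean {
  return c.heartbeatInterval <= c.sessionTimeout / 3;
}
```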
prod-submission-song with TEST-INTL program

2021-03-31T17:31:15.110Z info: starts processing RDPC event for program TEST-INTL
2021-03-31T17:31:15.118Z info: Existing index settings match default settings, obtaining a new index name from rollcall, clone=true.
2021-03-31T17:31:16.409Z info: obtained new index name: donor_centric_program_testintl_re_8
2021-03-31T17:31:16.804Z info: Enabled WRITE to index : donor_centric_program_testintl_re_8
2021-03-31T17:31:16.804Z info: Processing program: TEST-INTL from https://api.rdpc.cancercollaboratory.org/graphql.
2021-03-31T17:31:16.804Z info: fetching ego public key...
2021-03-31T17:31:17.355Z warn: No document to index for program TEST-INTL
2021-03-31T17:31:17.355Z info: releasing index donor_centric_program_testintl_re_8 to alias donor_submission_summary
logs when publishing an analysis:
info: starts processing RDPC event for program TEST-INTL
2021-03-31T17:33:03.267Z info: Existing index settings match default settings, obtaining a new index name from rollcall, clone=true.
2021-03-31T17:33:04.687Z info: obtained new index name: donor_centric_program_testintl_re_9
2021-03-31T17:33:05.133Z info: Enabled WRITE to index : donor_centric_program_testintl_re_9
2021-03-31T17:33:05.134Z info: Processing program: TEST-INTL from https://api.rdpc.cancercollaboratory.org/graphql.
2021-03-31T17:33:05.134Z info: fetching ego public key...
2021-03-31T17:33:06.240Z info: streaming analyses with Specimens for donor DO250552
2021-03-31T17:33:06.240Z info: fetching ego public key...
2021-03-31T17:33:06.269Z info: Fetching sequencing experiment analyses with specimens from rdpc.....
2021-03-31T17:33:06.307Z info: Streaming 5 of sequencing experiment analyses with specimens...
2021-03-31T17:33:06.307Z info: fetching ego public key...
2021-03-31T17:33:06.331Z info: Fetching sequencing experiment analyses with specimens from rdpc.....
2021-03-31T17:33:06.706Z info: streaming analyses for donor DO250552
2021-03-31T17:33:06.707Z info: fetching ego public key...
2021-03-31T17:33:06.733Z info: Starting to query sequencing_experiment analyses for alignment workflow runs
2021-03-31T17:33:06.823Z info: Streaming 5 of sequencing_experiment analyses...
2021-03-31T17:33:06.823Z info: fetching ego public key...
2021-03-31T17:33:06.845Z info: Starting to query sequencing_experiment analyses for alignment workflow runs
2021-03-31T17:33:06.878Z info: streaming analyses for donor DO250552
2021-03-31T17:33:06.879Z info: fetching ego public key...
2021-03-31T17:33:06.900Z info: Starting to query sequencing_alignment analyses for sanger variant calling workflow runs
2021-03-31T17:33:06.931Z info: streaming analyses for donor DO250552
2021-03-31T17:33:06.931Z info: fetching ego public key...
2021-03-31T17:33:06.955Z info: Starting to query sequencing_alignment analyses for mutect2 workflow runs
2021-03-31T17:33:06.994Z info: Begin bulk indexing donors of program TEST-INTL...
2021-03-31T17:33:07.692Z info: Successfully indexed all donors of program TEST-INTL to index: donor_centric_program_testintl_re_9
2021-03-31T17:33:07.693Z info: releasing index donor_centric_program_testintl_re_9 to alias donor_submission_summary
2021-03-31T17:33:08.631Z info: TEST-INTL duration: 110761
UNPUBLISH and SUPPRESS SHOULD edit the document.
Changes being made in Rollcall here for vault support: overture-stack/rollcall#25
Make sure Rollcall is pointed to the NEW Elasticsearch as part of this configuration
The aggregator connects to the rdpc-api, which now requires AUTH headers to allow for data extraction. We need to add support for authenticated requests to rdpc using app to app tokens. The aggregator needs to:
send auth headers to rdpc-api on all calls for authorization
be registered as an application in ego
be able to generate an app token from ego that is sent in the headers
register aggregator as an application in ego; store secrets in vault
write some code / use a library to refresh the auth token regularly:
hold onto the JWT in memory; use it normally;
each time we reach out to the rdpc-api, do a preflight check on the JWT's expiry as self-validation before making the call; if it has not expired, use it; if it has expired,
issue a new application token and use it in the auth headers to the rdpc-gateway-api
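The preflight check can be a purely local decode of the JWT's exp claim, with no round trip to ego. A sketch, assuming a standard JWT with a base64url-encoded JSON payload:

```typescript
// Decode the JWT payload locally and compare its `exp` claim (seconds since
// epoch) against the clock; refresh from ego only when it is (nearly) expired.
function isJwtExpired(jwt: string, skewSeconds = 60): boolean {
  const payloadPart = jwt.split(".")[1];
  if (!payloadPart) return true; // malformed token: treat as expired
  const payload = JSON.parse(Buffer.from(payloadPart, "base64").toString("utf8"));
  if (typeof payload.exp !== "number") return true;
  // The skew forces an early refresh so we never send a token that expires mid-request.
  return payload.exp <= Date.now() / 1000 + skewSeconds;
}
```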
The donor dashboard aggregator service populates the index based on the spec'd data needs: https://wiki.oicr.on.ca/display/icgcargotech/Program+Dashboard+Page+Specs#ProgramDashboardPageSpecs-2.4.2MolecularDataChart

Line | Source of Data
---|---
Alignment | This is a count of the number of donors with the minimum number of alignments that have completed processing. The minimum accepted number means the donor has at least 1 Tumour/Normal pair worth of alignments. Count the number of donors in the time interval with analyses that fit these criteria: at least 1 analysis of (analysis_type = sequencing_alignment AND tumour_normal_designation = Tumour) AND at least 1 analysis of (analysis_type = sequencing_alignment AND tumour_normal_designation = Normal)

The donor-dashboard-aggregator mapping has a new section/fields that deliver the time data for the spec'd lines of the chart.
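The Alignment criteria can be encoded as a predicate over a donor's analyses. A sketch only; the field names follow the RDPC query shapes in these issues and are assumptions:

```typescript
// Minimal shape for this check; real analyses carry much more.
interface AnalysisRow {
  analysisType: string;
  tumourNormalDesignation: "Tumour" | "Normal";
}

// A donor counts toward the Alignment line when it has at least one Tumour
// and at least one Normal sequencing_alignment analysis (a T/N pair).
function countsForAlignmentLine(analyses: AnalysisRow[]): boolean {
  const aligned = analyses.filter(a => a.analysisType === "sequencing_alignment");
  return (
    aligned.some(a => a.tumourNormalDesignation === "Tumour") &&
    aligned.some(a => a.tumourNormalDesignation === "Normal")
  );
}
```

The chart value for an interval is then just the count of donors for which this predicate holds.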
molecular data chart
I was looking at the dashboard for another reason, but noticed some data anomalies.
Steps to reproduce the behavior:
Sort the Sanger VC column descending and look at donor DO234422. On the dashboard you can see they have 1 in-progress Sanger and 1 in-progress Mutect2:

query test_num_in_dash {
analyses(
filter: {
donorId: "DO234422"
analysisState: PUBLISHED
analysisType: "variant_calling"
#analysisId: "8ebe3a61-e06c-4616-be3a-61e06ca616ce"
}
) {
info{
totalHits
}
content {
analysisId
analysisState
analysisType
#analysisVersion
updatedAt
firstPublishedAt
studyId
experiment
donors {
# donorId
# submitterDonorId
specimens {
# specimenId
# submitterSpecimenId
tumourNormalDesignation
# samples {
# sampleId
# submitterSampleId
# matchedNormalSubmitterSampleId
# }
}
}
repositories{
code
}
files {
dataType
}
inputForRuns {
runId
state
}
workflow{
runId
run{
runId
state
startTime
completeTime
}
inputs
}
}
}
}
You can see there that the run which generated these is listed as COMPLETE on Sanger, for run ID wes-1e647943c32e4f6a92a2b0d62631ec22.
Looking at the epoch time in the results, I can see it was completed a few days ago:
Completed Sunday March 9, 2021:
This should show a completed run.
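The completeTime/firstPublishedAt values in the results are epoch milliseconds, sometimes serialized as strings. A throwaway helper (not part of the aggregator) for checking completion dates:

```typescript
// Epoch values from the API are milliseconds, sometimes as strings.
function epochToIso(epochMs: string | number): string {
  return new Date(Number(epochMs)).toISOString();
}
```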
Blocked by: icgc-argo/rdpc-gateway#44
Time to wrap up and write down the complex logic!
We have a swagger for the donor-submission-aggregator
Add an endpoint that takes a program-id and queues a task to update that program.
For on-demand update of a program (instead of triggered by an event)
Endpoint in a swagger that can be tested with a Program id in QA.
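A sketch of the queueing side (hypothetical names throughout): the HTTP layer would be a route in the existing swagger-documented service that calls enqueue with the program-id and returns immediately, while a worker drains the queue:

```typescript
// Hypothetical in-memory task queue for on-demand program re-indexing.
class ProgramTaskQueue {
  private pending: string[] = [];

  // Returns false when the program is already queued (dedupe), so repeated
  // requests for the same program don't trigger redundant re-indexing.
  enqueue(programId: string): boolean {
    if (this.pending.includes(programId)) return false;
    this.pending.push(programId);
    return true;
  }

  next(): string | undefined {
    return this.pending.shift();
  }
}
```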
investigate RDPC queries needed for the dashboard
write a query to talk to the RDPC gateway
--- consider the query load out to the rdpcs while designing the queries.
--- there are not too many requests
--- batch requests in smaller chunks
transform the data from the RDPC into the correct shape needed for the platform
populate the data in the index mapping (https://github.com/icgc-argo/donor-submission-aggregator/blob/develop/src/elasticsearch/donorIndexMapping.json)
The aggregator index was initialized in camelCase. For consistency across the platform, we want all mappings/indices to be the same.
Convert the donor submission summary indices to snake case
Currently there is no read step in the indexing. For future indexing steps, we need the indexing flow to be this:
Update aggregator and have it working the same as it currently does, but with the new format.
Currently all the indices created are going to one shard. We might want to increase this for better workload distribution.
Set the default to 3.
Shards allow distribution of workload across cluster nodes. Multiple shards will prevent big programs from putting stress on one ES node.
Rollcall has a feature to accept index settings at index creation; we should pass ours there. @jaserud has more info for rollcall.
Integrate with new Elasticsearch using secret in rollcall
The secrets have all been configured in Vault, just need to pull from there.
The problem: recently we updated the default ES index settings in this ticket: #54. This resulted in Rollcall being unable to clone existing indices because of the mismatch in ES settings, as ES only allows an index clone when the old and new index have the same settings.
Old index: number_of_shards = 1, number_of_replicas = 1
Expected results:
all donor-aggregator indices in qa and dev should be updated to new settings:
"settings": {
"index.number_of_shards": 3,
"index.number_of_replicas": 2
}
Possible Solution:
Use the PUT /_settings endpoint, or set cloneFromReleasedIndex = false on the CreateResolvableIndexRequest, which disables cloning.

donor dashboard aggregator
service to populate the index based on the spec'd data needs: https://wiki.oicr.on.ca/display/icgcargotech/Program+Dashboard+Page+Specs#ProgramDashboardPageSpecs-2.4.2MolecularDataChart

Line | Source of Data
---|---
Sanger VC | This is a count of the number of donors with the minimum number of Sanger variant calling runs that have completed processing. Count the number of donors in the interval with at least one analysis that fits these criteria: analysis_type = variant_calling AND (workflow_name = Sanger WGS Variant Calling OR workflow_name = Sanger WXS Variant Calling)
Mutect2 | This is a count of the number of donors with the minimum number of Mutect2 variant calling runs that have completed processing. Count the number of donors in the interval with at least one analysis that fits these criteria: analysis_type = variant_calling AND workflow_name = GATK Mutect2 Variant Calling
The donor-dashboard-aggregator mapping has a new section/fields that deliver the time data for the spec'd lines of the chart.
The RDPC issues events for different items.
Discuss which RDPC events may need to be subscribed to in order to get the information that we need. We may need to subscribe to a new topic (@lepsalex will know which topics and can discuss this with us).
Consider the case of multiple RDPCs - can events self identify the RDPC that the event originated from?
Verify if we can tell the donor/program from the RDPC event as input to the donor aggregator
Get song publish event
Reach out to rdpc api
initiate calculation of RDPC metrics for the dashboard
publish dashboard index
For this POC, just show one number calculating on the dashboard:
This is the count of donors with a RUNNING workflow.
The bug here is that it asks rollcall for a cloned index before deciding what to do: https://github.com/icgc-argo/donor-submission-aggregator/blob/7c09127a21f867e19074[…]5f539e7cde79fd17fe8/src/programQueueProcessor/eventProcessor.ts. It should instead ask for the index based on the type of the event, which for SYNC should be new.
On re-index, the index should actually be re-done. Not done from clone. The steps above will result in the removed donor no longer being in the index.
Take the query from the RDPC and transform the data into the needed shape.
This column shows how many of the sequencing reads that have been registered are ACTUALLY submitted. For that donor, count:
Wiki Specs describing content: https://wiki.oicr.on.ca/display/icgcargotech/Program+Dashboard+Page+Specs#ProgramDashboardPageSpecs-SummaryTableDescription
Given a program_name,
We are subscribing to analysis update events, which come with an analysis ID, yet we are looping through all analyses. We can reduce the amount of processing done by adding an ANALYSIS filter.
Update by ANALYSIS rather than by the whole program.
We need to add incremental updates to the donor-submission-aggregator
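Conceptually, the incremental path narrows the work from program-wide to just the donors attached to the analysis in the event. A sketch; shapes are assumptions based on the event payloads quoted in these issues:

```typescript
// Given the analysis from an update event, re-index only its donors instead
// of looping over every analysis in the program. Shapes are illustrative.
interface AnalysisEvent {
  analysisId: string;
  studyId: string;
  donors: Array<{ donorId: string }>;
}

function donorsToReindex(event: AnalysisEvent): string[] {
  // Dedupe in case the same donor appears more than once on the analysis.
  return [...new Set(event.donors.map(d => d.donorId))];
}
```

The RDPC query would then filter on these donor IDs (plus the analysis ID) rather than fetching the whole program.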
A bug was found when indexing test-pr in qa:
2021-06-23T20:01:53.112Z info: Begin processing event: SYNC - TEST-PR
2021-06-23T20:01:54.826Z info: Obtaining new index, first for program.
2021-06-23T20:01:59.990Z info: Obtained new index name: donor_centric_program_testpr_re_10963
2021-06-23T20:02:02.209Z info: Enabled index writing for: donor_centric_program_testpr_re_10963
2021-06-23T20:02:02.868Z info: streaming 2 donor(s) from chunk #0 of program TEST-PR duration: 458
2021-06-23T20:02:02.874Z info: Processing program: TEST-PR from https://api.rdpc-qa.cancercollaboratory.org/graphql.
2021-06-23T20:02:02.879Z info: Fetching sequencing experiment analyses with specimens from rdpc.....
2021-06-23T20:02:03.799Z info: Streaming 14 of sequencing_experiment analyses with specimens and samples...
2021-06-23T20:02:03.801Z info: Fetching sequencing experiment analyses with specimens from rdpc.....
2021-06-23T20:02:03.881Z info: Streaming 50 of variant calling analyses for sanger/mutect first published dates...
2021-06-23T20:02:03.883Z warn: Failed to index program TEST-PR on attempt #1: TypeError: Cannot read property 'toLocaleLowerCase' of null
2021-06-23T20:02:06.666Z info: Index was removed: donor_centric_program_testpr_re_10963
2021-06-23T20:02:07.674Z info: Obtaining new index, first for program.
2021-06-23T20:02:11.068Z info: Obtained new index name: donor_centric_program_testpr_re_10963
2021-06-23T20:02:12.608Z info: Enabled index writing for: donor_centric_program_testpr_re_10963
2021-06-23T20:02:13.105Z info: streaming 2 donor(s) from chunk #0 of program TEST-PR duration: 487
2021-06-23T20:02:13.114Z info: Processing program: TEST-PR from https://api.rdpc-qa.cancercollaboratory.org/graphql.
2021-06-23T20:02:13.114Z info: Fetching sequencing experiment analyses with specimens from rdpc.....
2021-06-23T20:02:13.162Z info: Streaming 14 of sequencing_experiment analyses with specimens and samples...
2021-06-23T20:02:13.163Z info: Fetching sequencing experiment analyses with specimens from rdpc.....
2021-06-23T20:02:13.251Z info: Streaming 50 of variant calling analyses for sanger/mutect first published dates...
2021-06-23T20:02:13.252Z warn: Failed to index program TEST-PR on attempt #2: TypeError: Cannot read property 'toLocaleLowerCase' of null
2021-06-23T20:02:13.945Z info: Index was removed: donor_centric_program_testpr_re_10963
2021-06-23T20:02:15.965Z info: Obtaining new index, first for program.
The cause was that workflowName was null in rdpc:
{
"analysisId": "08d245c8-caf0-4264-9245-c8caf07264a0",
"analysisType": "variant_calling",
"firstPublishedAt": "1606929512353",
"workflow": {
"workflowName": null
},
"donors": [
{
"donorId": "DO250183"
}
]
},
This broke the aggregator, as it was expecting a value:
const workflowName = analysis.workflow.workflowName.toLocaleLowerCase();
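A minimal fix is to tolerate the null before lowercasing, via optional chaining. This is a suggested sketch, not necessarily the shipped fix; in the real code a missing name should probably also be logged so the analysis can be investigated:

```typescript
// Sketch: treat a missing workflowName as "" instead of crashing the indexer.
function normalizeWorkflowName(workflow: { workflowName: string | null } | null): string {
  return workflow?.workflowName?.toLocaleLowerCase() ?? "";
}
```

Downstream filters (sanger/mutect matching) then simply skip analyses whose normalized name is empty.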
As we expand the aggregator to handle multiple topics, we need to maintain a proper processing queue for events that relate to the same programs.
basically implement the Kafka event aggregation design here: https://drive.google.com/file/d/1dAJL7yre96sxA0WWMLklAujp-xCeJ_-O/view?usp=sharing
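The core property of that design — events for the same program run strictly in submission order, while different programs don't block each other — can be sketched with per-program FIFO queues. This is a synchronous sketch for brevity; the real flow is asynchronous Kafka consumption:

```typescript
// Per-program FIFO: tasks for one program run in submission order; tasks for
// different programs use independent queues.
class ProgramEventSerializer {
  private queues = new Map<string, Array<() => void>>();
  private draining = new Set<string>();

  submit(programId: string, task: () => void): void {
    const queue = this.queues.get(programId) ?? [];
    queue.push(task);
    this.queues.set(programId, queue);
    // Tasks submitted while a drain is in flight are picked up by its loop.
    if (!this.draining.has(programId)) this.drain(programId);
  }

  private drain(programId: string): void {
    this.draining.add(programId);
    const queue = this.queues.get(programId)!;
    while (queue.length > 0) queue.shift()!();
    this.draining.delete(programId);
  }
}
```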
src/rollCall/index.ts
contains the rollcall proxy, this needs to be updated to expose an actual rollcall client.
Please include integration tests with a rollcall image.
The issue is observed in dev and qa: when a CREATE event is received, the donor aggregator fails to index the analysis, as it was expecting either a publish, unpublish, or suppress event.
Message that broke indexing:
{
"topic": "song_analysis",
"key": null,
"value": {
"analysisId": "dfb1d6c3-c21a-47f4-b1d6-c3c21a47f483",
"studyId": "ROSI-RU",
"state": "UNPUBLISHED",
"action": "CREATE",
"songServerId": "song.collab",
"analysis": {
"analysisId": "dfb1d6c3-c21a-47f4-b1d6-c3c21a47f483",
"studyId": "ROSI-RU",
"analysisState": "UNPUBLISHED",
"createdAt": "2021-06-25T14:21:20.239744",
"updatedAt": "2021-06-25T14:21:20.239774",
"firstPublishedAt": null,
"publishedAt": null,
"analysisStateHistory": [],
"samples": [
{
"sampleId": "SA622678",
"specimenId": "SP222655",
"submitterSampleId": "sample-6.1",
"matchedNormalSubmitterSampleId": null,
"sampleType": "Amplified DNA",
"specimen": {
"specimenId": "SP222655",
"donorId": "DO262424",
"submitterSpecimenId": "specimen-6.1",
"tumourNormalDesignation": "Normal",
"specimenTissueSource": "Blood derived",
"specimenType": "Normal"
},
"donor": {
"donorId": "DO262424",
"studyId": "ROSI-RU",
"gender": "Male",
"submitterDonorId": "Donor-6"
}
}
],
"files": [
{
"info": {
"analysis_tools": [
"BWA-MEM",
"biobambam2:bammarkduplicates2"
],
"data_category": "Sequencing Reads"
},
"objectId": "78f92452-6abf-59d0-b673-f5d692891b21",
"studyId": "ROSI-RU",
"analysisId": "dfb1d6c3-c21a-47f4-b1d6-c3c21a47f483",
"fileName": "ROSI-RU.DO262424.SA622678.wxs.20210625.aln.cram",
"fileSize": 1971126,
"fileType": "CRAM",
"fileMd5sum": "0056c5d7f00e5c3f3466c6982b7eb8da",
"fileAccess": "controlled",
"dataType": "Aligned Reads"
},
{
"info": {
"analysis_tools": [
"BWA-MEM",
"biobambam2:bammarkduplicates2"
],
"data_category": "Sequencing Reads"
},
"objectId": "fd16cee0-11e1-538a-abb3-4432d56d140a",
"studyId": "ROSI-RU",
"analysisId": "dfb1d6c3-c21a-47f4-b1d6-c3c21a47f483",
"fileName": "ROSI-RU.DO262424.SA622678.wxs.20210625.aln.cram.crai",
"fileSize": 509,
"fileType": "CRAI",
"fileMd5sum": "6eccbecad3f97563117a6d6a4d769a82",
"fileAccess": "controlled",
"dataType": "Aligned Reads Index"
}
],
"analysisType": {
"name": "sequencing_alignment",
"version": 11
},
"experiment": {
"experimental_strategy": "WXS",
"platform": "ILLUMINA",
"platform_model": "HiSeq 2000",
"sequencing_center": "EXT",
"sequencing_date": "2014-12-12",
"submitter_sequencing_experiment_id": "TEST_EXP"
},
"read_group_count": 67,
"workflow": {
"genome_build": "GRCh38_hla_decoy_ebv",
"inputs": [
{
"analysis_type": "sequencing_experiment",
"input_analysis_id": "c52c6e97-2b13-451c-ac6e-972b13751c86"
}
],
"run_id": "wes-9669b00a389d472c98562720b839c195",
"session_id": "d8573c63-a72f-402c-8476-466ac0cfac4b",
"workflow_name": "DNA Seq Alignment",
"workflow_version": "1.5.1"
}
}
},
"partition": 0,
"offset": 11279
},
Expected behaviour: the aggregator should fetch the latest analysis and index the donor whenever an event is received.
Two things to pull:
Needs a "useVault" option for each data store.