The dream-challenge from data2health

Develop Synapse management pipeline

Prepare material for the next DREAM newsletter

From Julie:

Hi Thomas
We are now aiming for end of July to align with an announcement for a challenge for the Allen Institute. Just plan to have something to me by mid July.
Thanks

Problem with permissions of UW logs

When the pipeline at UW writes the pipeline-generated log file to Synapse, the permissions are set so that only the ehrdreamservice account can access it. We need to change this so that all log files written to Synapse can be read by the submitter and by the EHR DREAM admin team.

Review of the challenge data curation script

Jupyter notebook describing the curation of the data: https://github.com/data2health/DREAM-Challenge/tree/master/documentation

Draft the challenge timeline

Background: DREAM challenges typically provide a figure representing the timeline of the challenge. This figure, which is often used in presentations too, should give participants the start and end dates of the different phases of the challenge (open phase, leaderboard rounds, validation phase). Here is an example of a timeline:

Task: Draft the challenge timeline, which will be displayed on the EHR Challenge wiki.

there are some issues on NCATS that lead submissions to stop randomly

On Monday 14, @thomasyu888 reported that

there are some issues on NCATS that lead submissions to stop randomly, Tom is restarting the submissions but we need a better fix.

@thomasyu888 Can you provide a better description of the issue?

Adjust gold standard benchmarking answers for 180 days

We needed to adjust the gold standard benchmarks to calculate time to death of patients using 180 days rather than the 6 month function. This was to make the prediction window more consistent.
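A minimal sketch of why a fixed 180-day window is more consistent than a calendar 6-month shift. The function names, the idea of an `index_date` anchor, and the inclusive boundary are assumptions made for illustration; they are not taken from the challenge's scoring code.

```python
import calendar
from datetime import date, timedelta


def add_months(d, months):
    """Shift a date by calendar months, clamping the day to the month's end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def died_within_180_days(index_date, death_date):
    """Hypothetical label: True if death falls 0 to 180 days after index_date."""
    if death_date is None:  # no recorded death
        return False
    return timedelta(0) <= death_date - index_date <= timedelta(days=180)


def six_month_window_days(index_date):
    """Length in days of a calendar 6-month window starting at index_date."""
    return (add_months(index_date, 6) - index_date).days
```

With a calendar shift, the window length depends on the start date (for example, `six_month_window_days(date(2019, 1, 1))` and `six_month_window_days(date(2019, 3, 1))` differ), whereas the 180-day rule gives every patient the same window.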

Incorporate new UW data into challenge data

Identify the Challenge Questions

Describe the Challenge Question(s) that participants will address
- Include a brief description of the input data. Another page is used to provide detailed information about the data.
- Describe the format of the predictions that models must generate

Please add content to the wiki page 2.1 - Challenge Questions

Troubleshoot challenge infrastructure issues relative to the configuration of UW servers

Waiting on #33 to get access to the UW server on which we will troubleshoot the issues that the challenge infrastructure has due to the specific configuration of the UW servers.

Make sure that invalid submissions in UW queue are propagated to the main queue

@thomasyu888 Is it correct that you are doing some work on tidying the submission IDs? Can you please keep track of what you are doing here and close this ticket once the task is completed?

Apply Sage submission pipeline to UW environments

Onboard Mt. Sinai into the evaluation network.

We are working with the Mt. Sinai site to onboard them into the DREAM Challenge evaluation network.

Instantiate UW server replica and give access to Sage IT Engineers

There are currently technical issues that prevent UW engineers from running the challenge infrastructure developed by Sage on their server. To speed up the resolution, it was decided yesterday that @jprosser will instantiate a second server with the same configuration (at least OS and security wise) as the server that will be used for the challenge. This replica will hold no data. @jprosser will then give the Sage team (Bruce, @thomasyu888 and myself) ssh access to this second server so that we can troubleshoot the errors more effectively.

Come up with 4-5 suggestions of names for this challenge

On March 18, Justin Guinney asked us to come up with 4-5 suggestions of names for this challenge.

Notes:

The "Patient mortality Challenge" name used in some documents is too specific as this challenge will address additional questions (?)

Create and release the data dictionary

We have created a list of concepts that appear in more than 100 patients in the UW data. We are currently reviewing this list and plan to release it shortly.
Currently, the codes are uploaded here.

@yy6linda and @tschaffter can you review this data?
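As a rough illustration of the filtering rule described above (concepts that appear in more than 100 patients), here is a small sketch. The `(person_id, concept_id)` record layout and the function name are assumptions; the actual script used on the UW data is not shown in this ticket.

```python
from collections import defaultdict


def frequent_concepts(records, min_patients=100):
    """Return concept IDs observed in more than `min_patients` distinct patients.

    `records` is an iterable of (person_id, concept_id) pairs (assumed layout).
    Repeated observations for the same patient are counted once.
    """
    patients_by_concept = defaultdict(set)
    for person_id, concept_id in records:
        patients_by_concept[concept_id].add(person_id)
    return sorted(
        concept_id
        for concept_id, patients in patients_by_concept.items()
        if len(patients) > min_patients
    )
```

Counting distinct patients rather than rows matters here: a concept recorded hundreds of times for a single patient should not pass the threshold.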

Run Mortality Models on Internal Data

Make a proposal for the content of the evaluation set for the Validation Phase

consider using both existing data and data that will be collected in the future
report the number of positive and negative subjects expected in this evaluation set
make a new proposal for the timeline that includes the period required to collect the data

Timed out docker containers are not being stopped

When containers run for more than the allotted time (10 hrs.) the toil workflow hook that is running the container is stopped but the submitted docker container is not. It will continue until the administrator manually kills it or it stops on its own.
This could be a problem if the workflow hook thinks that the still running container has been stopped and it pulls in a new submission leading to a memory overflow.
The other issue is the workflow hook is not being stopped gracefully, so logs are not being saved after the time quota.
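One possible shape for a fix, sketched under assumptions: enforce the quota in a wrapper that, on timeout, first stops the container, then shuts the runner down, and always saves logs. The function and callback names are hypothetical and this is not the Synapse Workflow Hook's actual code; in a real deployment `stop_container` might invoke `docker stop` on the submission's container.

```python
import subprocess
import sys

# Stand-in for a long-running submission process (assumption for the demo).
DEMO_CMD = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_quota(cmd, quota_seconds, stop_container, save_logs, grace_seconds=5):
    """Run `cmd`; if it exceeds the quota, stop the container and the runner.

    `stop_container` and `save_logs` are caller-supplied callbacks (hypothetical).
    Returns True if the quota was exceeded, False otherwise.
    """
    proc = subprocess.Popen(cmd)
    timed_out = False
    try:
        proc.wait(timeout=quota_seconds)
    except subprocess.TimeoutExpired:
        timed_out = True
        stop_container()      # make sure the submitted container does not linger
        proc.terminate()      # ask the runner process to exit
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()       # force it if it ignores the request
            proc.wait()
    finally:
        save_logs(timed_out)  # logs are saved whether or not the quota was hit
    return timed_out
```

Because the container is stopped before the next submission can be pulled, two submissions should not end up competing for memory, and the `finally` clause keeps the logs even when the quota is exceeded.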

Identify how best to use the WashU synthetic dataset

These are "meaningful" synthetic data that we could use for a sub-challenge, for example.

We have a call with Randi on June 18 between 8 and 9am PDT.

Contact Randi for additional information about the dataset (number of positive, etc.)
Depending on the number, prepare a proposal on how we recommend using the data

Deploy and test challenge infrastructure at NCATS

Justin mentioned that NCATS will host the synthetic data (Synpuf and/or Wash U) while the challenge data will be hosted at UW.

Notes:

If we want to automatically run the models on Synpuf data before running on the real data, then a copy of Synpuf data must be hosted on UW cloud.
- Is it really beneficial to run models on Synpuf data before running them on the challenge data? The answer depends on how long we expect a model to take to make predictions for all the subjects in the evaluation set.
- If the purpose of NCATS is to host only the Synpuf data, why not use only UW cloud? (I understand that in the long run we want to make our infrastructure available on NCATS so that anyone could host a challenge).

Depending on the above answers:

Contact Usman Sheikh [email protected] and Raju Hemadri [email protected] (NCATS) to initiate the deployment of the infrastructure at NCATS
Connect to NCATS (AWS?) resources allocated to this project
Deploy the Synapse Workflow Hook + Synpuf data
Test the infrastructure

Provide list of organizing institutions and partners

https://docs.google.com/spreadsheets/d/1dSFPVD_T7QprB-dzRZsFFxdmt5qGngU9EuYUY-Oy2DM/edit#gid=0

Provide synthetic data to Sage for infrastructure development and testing

Upload synthetic data, for example to EHR Challenge Staging project

Provide challenge data to Sage to enable development and testing of the submission infrastructure

Background: Sage is taking care of developing the IT infrastructure responsible for:

Pulling participant submissions from Synapse (Docker images)
Running the submissions (training, inference) and pushing results to Synapse

Task: Provide Sage with the following components to enable the development and testing of the IT infrastructure for the EHR Challenge:

Data (synthetic training and evaluation)
Model (docker image, description of input and output)
Gold standard
- synpuf_clean/evaluate/evaluation_patient_status.csv
Scoring script (could be a dummy script at first that will be replaced later)

According to Tom, we could deploy and test an initial version of the IT infrastructure on Sage AWS instances in 1-2 days once we have received the above components.

Quota time leads models sent on the fast lane to fail

On Monday 14, @thomasyu888 reported that models are failing on the NCATS server because they hit the maximum runtime. I believe that the maximum runtime is currently set to 1 hour.

Resolution:

@thomasyu888 to confirm that the maximum runtime is set to 1h
@trberg to report on the status of the Synpuf data update on the NCATS server and on how to size the dataset so that methods can reasonably run on it in less than 1h.

Test challenge infrastructure at UW

@thomasyu888 tested the challenge infrastructure on the Sage network using a part of the Synpuf data and one of Yao's models. The goal of this task is for Tim to confirm that he can run the infrastructure on the UW network using the instructions provided by Tom. Once done, the next step will be to have the infrastructure deployed by NCATS.

Discussion on whether to return additional information to participants

@trberg Can you please describe here what the participant who is requesting additional data would like us to provide. Can you also describe the solution that you had in mind on Monday for us to review? Thanks!

Prepare 10-min presentation for DREAM Directors (May 28)

On May 28, I'll be giving a 10-min overview of the EHR DREAM Challenge during the DREAM Directors meeting.

Here is the information that I plan to include:

Identity of the organizers and partner organizations
Scientific questions
Timeline
Data
- Source of data: synpuf, UW data
- Description of the population and curation protocol
Scoring metrics
Submission & IT infrastructure
Baseline method
- High-level description
- Performance on UW data?
Where we stand regarding the timeline / remaining tasks

Note: Whenever relevant, include link to resources: Synapse/GitHub project, code of the baseline method, etc.

@trberg @yy6linda Can you please point me to existing presentations on the EHR DREAM Challenge? If not done yet, we need to add them to a Google folder so that we can easily refer to and reuse them.

Document how to build and run the baseline method

The goal of this task is to describe how to dockerize the baseline model developed by Yao and run it locally.

Baseline codebase: https://github.com/yy6linda/mortality_prediction_docker_model

Add data and resource information and descriptions

Documentation: https://www.synapse.org/#!Synapse:syn18405992/wiki/589659

Engage CTSA centers with mortality models

Submit revised IRB request

The original IRB was narrow in scope, limiting who could submit models. The new IRB will expand the challenge to enable public submissions.

Finalize Terms of Use and SNOMED license

Convert Measurements and Observations to SNOMED

Build UW infrastructure

Pilot a pipeline that pulls docker submissions from Synapse into UW RIT environments that will train and test on UW OMOP data.

Identify baseline models from literature

Add instructions for submissions

Release Webinar #2 material

Upload slides
Edit and upload webinar recording
Format Q&A
Add announcement to the discussion forum (https://www.synapse.org/#!Synapse:syn18405991/discussion/threadId=6288)

Create challenge advertisement material

Background: DREAM Challenges usually have a graphic banner that advertises the challenge on the home page of the wiki. A placeholder for this banner is actually included in the DREAM Challenge Wiki Template which has been used for the EHR Challenge Staging project. The banner usually includes: 1) the full name of the challenge, 2) one short sentence that describes what the challenge is about, 3) an illustration and 4) the logos of participating organizations.

Task:

Provide name of the challenge (tracked here: #10 )
Provide list of organizing institutions and partners (tracked here: #12 )
- Collect logos
Provide illustration or ideas of illustrations (ideas tracked in this ticket)
Make initial version of the banner (Thomas or X)

Building Mortality Models

Identify evaluation metrics

Identify and describe evaluation metrics used to score submissions
Add description to this wiki page

Finalize Licenses and MoU with Julie

Send SNOMED use license language and finalize the MoU with Sean and Julie

Consider creating sub-challenges where patients are grouped by diseases

E.g. cancer, coronary heart disease, type-II diabetes, chronic obstructive pulmonary disease.

@yy6linda Could you have a look at the number of patients who died for each of these diseases?

Onboard Erick Scott to the Challenge Organizer group

Email address: [email protected]

Add to Monday's call calendar invite
Give access to Google shared folder
Set up bi-weekly call dedicated to MS (Thomas)
Give Erick access to Slack
Add Erick to Synapse groups
Final step: Send email to Erick to orient him with the access to all these resources

Improve workflow hook to enable horizontal deployment

Tess Thyer at Sage is working on improving the Synapse Workflow Hook (used in this challenge) to support, among other things, horizontal deployment of Toil engines to process more submissions when under heavy load.

While horizontal deployment is not a requirement for this challenge, it could be a nice feature to have. It will also be a requirement for the infrastructure that we will deploy at NCATS in the long run.

Create Wiki on Synapse

Create and release Synthetic Data Version 3

We have had a couple of instances of models passing the fast lane but failing on the real UW data. Most of these seem to be caused by small discrepancies between the real data and the synthetic data, mainly the presence of null values in the real data and the absence of such null values in the synthetic data. We will attempt to address these issues in the next version of the synthetic data.
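One way such a discrepancy could be reduced is to blank out a fraction of values in the synthetic tables so that models meet missing values before they reach the real data. The sketch below is only an illustration: the function name, the per-column rates, and the row-as-dict layout are assumptions, not the procedure used to build Version 3.

```python
import random


def inject_nulls(rows, null_rates, seed=0):
    """Return copies of `rows` with values set to None at the given per-column rates.

    `rows` is a list of dicts; `null_rates` maps column name -> probability in [0, 1].
    A fixed seed keeps the generated dataset reproducible.
    """
    rng = random.Random(seed)
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for column, rate in null_rates.items():
            if column in new_row and rng.random() < rate:
                new_row[column] = None
        result.append(new_row)
    return result
```

The rates would ideally be estimated from the real data per column, which only the data host can do; the values passed in here are placeholders.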

Participants are unable to see their dashboard

The dashboard shows an error similar to: Index: -1, Size: 33.

According to Tim, the dashboard is displayed only if there is at least one "successful" submission. Users with only failed submissions cannot see their dashboard.

Onboard WashU into the evaluation network.

We are working with Randi to onboard the WashU site into the DREAM Challenge evaluation network.

Fill in challenge pre-registration page on Synapse

Background: The pre-registration page for the EHR Challenge is available here: https://www.synapse.org/#!Synapse:syn18405991/wiki/589657. The goal of this page is to provide information about the challenge (Overview, Challenge Organizers, Data Contributors, Journal Partners, Funders and Sponsors). This page also enables Synapse users to pre-register in order to receive news about the challenge launch. The pre-registration page is currently visible only to organizers.

Task: Fill in the different sections of the pre-registration page before making it publicly visible.
