
etl-synthea's Introduction

Utility to Load Synthea CSV data to OMOP CDM

Currently supports CDM v5.3 and v5.4

Follow the steps on the Synthea wiki to run the program and generate the files. This builder works off of the CSV files, not the FHIR files. To enable CSV output, set the exporter.csv.export option in the ./src/main/resources/synthea.properties file to true.
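For reference, the relevant setting in ./src/main/resources/synthea.properties is a single line (a minimal excerpt; the other exporter settings can stay at their defaults):

```properties
# Enable CSV output so this builder has files to load.
exporter.csv.export = true
```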

Step by Step Example

 devtools::install_github("OHDSI/ETL-Synthea")

 library(ETLSyntheaBuilder)

 # We are loading a version 5.4 CDM into a local PostgreSQL database called "synthea10".
 # The ETLSyntheaBuilder package leverages the OHDSI/CommonDataModel package for CDM creation.
 # Valid CDM versions are determined by executing CommonDataModel::listSupportedVersions().
 # The strings representing supported CDM versions are currently "5.3" and "5.4". 
 # The Synthea version we use in this example is 2.7.0.
 # However, at this time we also support 3.0.0, 3.1.0 and 3.2.0.
 # Please note that Synthea's master branch is always active, and this package will be updated to support
 # future versions as they become available.
 # The schema to load the Synthea tables is called "native".
 # The schema to load the Vocabulary and CDM tables is "cdm_synthea10".  
 # The username and password are "postgres" and "lollipop".
 # The Synthea and Vocabulary CSV files are located in /tmp/synthea/output/csv and /tmp/Vocabulary_20181119, respectively.
 
 # For those interested in seeing the CDM changes from 5.3 to 5.4, please see: http://ohdsi.github.io/CommonDataModel/cdm54Changes.html
 
cd <- DatabaseConnector::createConnectionDetails(
  dbms     = "postgresql", 
  server   = "localhost/synthea10", 
  user     = "postgres", 
  password = "lollipop", 
  port     = 5432, 
  pathToDriver = "d:/drivers"  
)

cdmSchema      <- "cdm_synthea10"
cdmVersion     <- "5.4"
syntheaVersion <- "2.7.0"
syntheaSchema  <- "native"
syntheaFileLoc <- "/tmp/synthea/output/csv"
vocabFileLoc   <- "/tmp/Vocabulary_20181119"

ETLSyntheaBuilder::CreateCDMTables(connectionDetails = cd, cdmSchema = cdmSchema, cdmVersion = cdmVersion)
                                     
ETLSyntheaBuilder::CreateSyntheaTables(connectionDetails = cd, syntheaSchema = syntheaSchema, syntheaVersion = syntheaVersion)
                                       
ETLSyntheaBuilder::LoadSyntheaTables(connectionDetails = cd, syntheaSchema = syntheaSchema, syntheaFileLoc = syntheaFileLoc)
                                     
ETLSyntheaBuilder::LoadVocabFromCsv(connectionDetails = cd, cdmSchema = cdmSchema, vocabFileLoc = vocabFileLoc)

ETLSyntheaBuilder::CreateMapAndRollupTables(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, cdmVersion = cdmVersion, syntheaVersion = syntheaVersion)

## Optional Step to create extra indices
ETLSyntheaBuilder::CreateExtraIndices(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, syntheaVersion = syntheaVersion)
                                    
ETLSyntheaBuilder::LoadEventTables(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, cdmVersion = cdmVersion, syntheaVersion = syntheaVersion)

Simulating Data with Synthea

For commented code used to convert the Synthea data, see extras/codeToRun.R

For more information on Synthea visit: https://synthetichealth.github.io/synthea/

Get Synthea from GitHub: https://github.com/synthetichealth/synthea

etl-synthea's People

Contributors

abercus, anthonymolinaro, anthonysena, burrowse, clairblacketer, croeder, ericavoss, fdefalco, gowthamrao, iain-s, jamesswiggins, katy-sadowski, mccullen, mgkahn, mtarng, tevariou, wibeasley


etl-synthea's Issues

strings longer than 100 in procedures.description

A recent Synthea data set that I generated contained a procedure with the 105-character description

Human epidermal growth factor receptor 2 gene detection by fluorescence in situ hybridization (procedure)

but procedures.description is allocated as varchar(100) in synthea_ddl.sql.

Increasing it to varchar(250) eliminates the error message.
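Until the DDL is patched, the column can be widened after the Synthea tables are created; a PostgreSQL sketch, assuming the default "native" schema name from the example above:

```sql
-- Hypothetical workaround: widen the description column so long SNOMED
-- procedure descriptions (the shipped DDL uses varchar(100)) fit.
ALTER TABLE native.procedures ALTER COLUMN description TYPE varchar(250);
```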

Variable type for dates not correctly rendered to SQL SERVER

I don't know whether ETLSyntheaBuilder is intended to work with servers other than PostgreSQL, but in case anyone tries SQL Server, the following issue needs to be fixed: the code in ETLSyntheaBuilder::CreateEventTables (and most likely later queries as well) does not render correctly, because it sets TIMESTAMP as the type for date/time columns; on SQL Server this should be something like DATETIME.

Running the line ETLSyntheaBuilder::CreateEventTables(insert your defaults) gives you this error message:

Error in rJava::.jcall(statement, "V", "close") : 
  com.microsoft.sqlserver.jdbc.SQLServerException: A table can only have one timestamp column. Because table 'person' already has one, the column 'death_datetime' cannot be added.

After a bit of drilling, you find that two columns in the rendered SQL table-creation code for the person table (and quite a few other tables) are set to TIMESTAMP.

CREATE TABLE cdm_synthea10.person
(
  person_id          BIGINT    NOT NULL,
  gender_concept_id  INTEGER   NOT NULL,
  year_of_birth      INTEGER   NOT NULL,
  month_of_birth     INTEGER   NULL,
  day_of_birth       INTEGER   NULL,
  birth_datetime     TIMESTAMP NULL,
  death_datetime     TIMESTAMP NULL,
  race_concept_id    INTEGER   NOT NULL,

TIMESTAMP in SQL Server is a synonym for rowversion, and has nothing to do with storing dates and times.
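On SQL Server, the appropriate type for a point-in-time value is DATETIME2 (or DATETIME); a sketch of the corrected column definitions, using the schema name from the excerpt above:

```sql
-- SQL Server: DATETIME2 holds dates and times; TIMESTAMP/rowversion is an
-- auto-generated binary version stamp and a table may have only one.
birth_datetime   DATETIME2  NULL,
death_datetime   DATETIME2  NULL,
```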

Final document name

The final iteration of the ETL design document should be renamed to something more self-evident, like "Synthea ETL Specification".

Update documentation

In the README it says v5.3; it should say v6. Instructions for how to use v5.3.1 are also needed.

Sync ETL-Synthea with latest master

In my use of ETL-Synthea, I cloned the latest version of https://github.com/synthetichealth/synthea (which is this commit: synthetichealth/synthea@47d09bf) and found that the CSVs that Synthea produced had a different structure from the expected structure in ETL-Synthea. I've updated the ETL-Synthea package to reflect the new Synthea structure and will contribute this back to this repository once I receive contributor rights. Just noting this here in the meantime.

Condition Records Dropped if Mapped to 0

In RAW_LAUREN, I accidentally used a CONCEPT_ID for the CODE.

CODE = 194696 (INCORRECT CODE AND WOULDN'T HAVE MAPPED TO ANYTHING)
DESCRIPTION = Dysmenorrhea

The records were dropped in the CDM. The record should still be written, just with CONDITION_CONCEPT_ID set to 0.
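The usual ETL convention is to keep unmapped rows by left-joining to the mapping and coalescing to concept_id 0; a sketch (the table and column names here are illustrative, not the package's actual SQL):

```sql
-- Keep every source condition row; unmapped codes land on concept 0
-- instead of being dropped by an inner join.
SELECT c.person_id,
       COALESCE(m.target_concept_id, 0) AS condition_concept_id,
       c.code                           AS condition_source_value
FROM native.conditions c
LEFT JOIN source_to_standard_map m   -- hypothetical mapping table
       ON m.source_code = c.code;
```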

Single rows missing concept name, while running LoadVocabFromCsv

Hello! While inserting the vocabularies with ETLSyntheaBuilder::LoadVocabFromCsv, using vocabularies from Athena merged with the UMLS as instructed, the call stops on two occasions, because rows have NA where there shouldn't be any.

vocabTable[3602658,]
   concept_id concept_name  domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date
1:   36311145         <NA> Meas Value         LOINC           Answer                S    LA12334-1       1970-01-01
   valid_end_date invalid_reason
1:     2099-12-31   

The other row, in concept_synonym.csv:

   concept_id concept_synonym_name language_concept_id
1:   36311145                 <NA>             4180186

I removed these rows, and LoadVocabFromCsv() ran through after that.

Create R project to run the builder

There are quite a few SQL files for this build, some of which are no longer used. An R project would help dictate the order in which the files should be run, and would allow parameterization of the schemas. We could also put the test cases in the project so that it is entirely self-contained.

R script needs lubridate casting in synthea/native and allowing NULLs in concept table?

Thanks for releasing this tool.

I'm finding that I have to make some modifications in order to get the R code (and embedded SQL) to run. Is this surprising?

I'm using Windows 10, Postgres 11, and R 3.5.0

I created my synthea data set from GitBash like this:

./run_synthea -s 42 -p 1000 Massachusetts

So far I have modified LoadSyntheaTables to forcibly cast dates, and to cast UTILIZATION to numeric, not character. See below. The native synthea tables load now.

I also had to allow NULLs in concept.concept_name, concept.concept_code, concept_synonym.concept_synonym_name, and concept_relationship.valid_start_date. Now I can run all of the steps up to LoadEventTables, which gives an error; I have substituted LoadCDMTables for it. See #14
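The NULL relaxations described above can be applied directly to the loaded schema; a PostgreSQL sketch, assuming the cdm_synthea10 schema from the step-by-step example:

```sql
-- Relax NOT NULL constraints that reject the handful of NA rows
-- present in the downloaded vocabulary files.
ALTER TABLE cdm_synthea10.concept              ALTER COLUMN concept_name         DROP NOT NULL;
ALTER TABLE cdm_synthea10.concept              ALTER COLUMN concept_code         DROP NOT NULL;
ALTER TABLE cdm_synthea10.concept_synonym      ALTER COLUMN concept_synonym_name DROP NOT NULL;
ALTER TABLE cdm_synthea10.concept_relationship ALTER COLUMN valid_start_date     DROP NOT NULL;
```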

Modified LoadSyntheaTables

function(connectionDetails, syntheaDatabaseSchema, syntheaFileLoc) {
  csvList <- c(
    "allergies.csv", "conditions.csv", "imaging_studies.csv",
    "medications.csv", "organizations.csv", "procedures.csv",
    "careplans.csv", "encounters.csv", "immunizations.csv",
    "observations.csv", "patients.csv", "providers.csv"
  )
  conn <- DatabaseConnector::connect(connectionDetails)
  # Register the disconnect immediately so the connection is closed even if
  # a later step fails (in the original it was registered after the loop).
  on.exit(DatabaseConnector::disconnect(conn))
  for (csv in csvList) {
    syntheaTable <- data.table::fread(
      file = paste0(syntheaFileLoc, "/", csv),
      stringsAsFactors = FALSE,
      header = TRUE,
      sep = ","
    )
    writeLines(paste0("Loading: ", csv))
    # Force code-like columns to character so values are not mangled.
    if ("CODE" %in% colnames(syntheaTable))
      syntheaTable$CODE <- as.character(syntheaTable$CODE)
    if ("REASONCODE" %in% colnames(syntheaTable))
      syntheaTable$REASONCODE <- as.character(syntheaTable$REASONCODE)
    if ("PHONE" %in% colnames(syntheaTable))
      syntheaTable$PHONE <- as.character(syntheaTable$PHONE)
    # Cast UTILIZATION to numeric, not character (as in the stock package).
    if ("UTILIZATION" %in% colnames(syntheaTable))
      syntheaTable$UTILIZATION <- as.numeric(as.character(syntheaTable$UTILIZATION))

    print(tableName <- paste0(syntheaDatabaseSchema, ".", strsplit(csv, "[.]")[[1]][1]))
    print(head(syntheaTable))

    # Forcibly cast the date columns so insertTable maps them correctly.
    for (dateCol in c("START", "STOP", "DATE", "BIRTHDATE", "DEATHDATE")) {
      if (dateCol %in% colnames(syntheaTable))
        syntheaTable[[dateCol]] <- lubridate::as_date(syntheaTable[[dateCol]])
    }

    DatabaseConnector::insertTable(
      conn,
      tableName,
      data = as.data.frame(syntheaTable),
      dropTableIfExists = FALSE,
      createTable = FALSE,
      progressBar = TRUE
    )
  }
}

Some warnings

Loading: concept_relationship.csv
...snip... 100%

1: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, :
Found and resolved improper quoting out-of-sample. First healed line 50165: <<44822272 "Light-for-dates"without mention of fetal malnutrition, 1,250- 1,499 grams 4180186>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

2: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, :
Found and resolved improper quoting out-of-sample. First healed line 64070: <<2618083 "opt out" physician or practitioner emergency or urgent service Observation HCPCS HCPCS Modifier S GJ 19981001 20991231 >>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
3: In max(nchar(as.character(obj)), na.rm = TRUE) :
no non-missing arguments to max; returning -Inf

and

Running: final_visit_ids.sql
...snip... 100% ... Executing SQL took 0.296 secs

In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema) :
Parameter 'synthea_schema' not found in SQL

Selected OMOP vocabularies

vocabulary_id vocabulary_name vocabulary_reference vocabulary_version vocabulary_concept_id
ABMS Provider Specialty (American Board of Medical Specialties) http://www.abms.org/member-boards/specialty-subspecialty-certificates 2018-06-26 ABMS 45756746
ATC WHO Anatomic Therapeutic Chemical Classification FDB UK distribution package RXNORM 2018-08-12 44819117
CDM OMOP Common DataModel https://github.com/OHDSI/CommonDataModel CDM v6.0.0 32485
Cohort Type OMOP Cohort Type OMOP generated 44819234
Concept Class OMOP Concept Class OMOP generated 44819233
Condition Type OMOP Condition Occurrence Type OMOP generated 44819127
Cost OMOP Cost OMOP generated 581457
Cost Type OMOP Cost Type OMOP generated 5029
CPT4 Current Procedural Terminology version 4 (AMA) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 2018 Release 44819100
Death Type OMOP Death Type OMOP generated 44819135
Device Type OMOP Device Type OMOP generated 44819151
dm+d Dictionary of Medicines and Devices (NHS) https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/24/releases dm+d Version 6.0.0 20180604 232
Domain OMOP Domain OMOP generated 44819147
Drug Type OMOP Drug Exposure Type OMOP generated 44819126
EphMRA ATC Anatomical Classification of Pharmaceutical Products (EphMRA) http://www.ephmra.org/Anatomical-Classification EphMRA ATC 2016 243
Episode OMOP Episode OMOP generated 32523
Episode Type OMOP Episode Type OMOP generated 32542
Ethnicity OMOP Ethnicity OMOP generated 44819134
GCN_SEQNO Clinical Formulation ID (FDB) FDB US distribution package 20151119 Release 44819141
Gender OMOP Gender OMOP generated 44819108
GGR Commented Drug Directory (BCFI) http://www.bcfi.be/nl/download GGR 20180901 581450
HCPCS Healthcare Common Procedure Coding System (CMS) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 20180101 Alpha Numeric HCPCS File 44819101
ICD10 International Classification of Diseases, Tenth Revision (WHO) http://www.who.int/classifications/icd/icdonlineversions/en/ 2016 Release 44819124
ICD10CM International Classification of Diseases, Tenth Revision, Clinical Modification (NCHS) http://www.cdc.gov/nchs/icd/icd10cm.htm ICD10CM FY2019 code descriptions 44819098
ICD10PCS ICD-10 Procedure Coding System (CMS) http://www.cms.gov/Medicare/Coding/ICD10/index.html ICD10PCS 2019 44819125
ICD9CM International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 1 and 2 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 5046
ICD9Proc International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 44819099
ICDO3 International Classification of Diseases for Oncology, Third Edition (WHO) https://seer.cancer.gov/icd-o-3/ ICDO3 SEER Site/Histology Released 09/18/2015 581426
LOINC Logical Observation Identifiers Names and Codes (Regenstrief Institute) http://loinc.org/downloads/loinc 2.65 44819102
Meas Type OMOP Measurement Type OMOP generated 44819152
Medicare Specialty Medicare provider/supplier specialty codes (CMS) http://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/MedicareProviderSupEnroll/Taxonomy.html 2018-06-26 Specialty 44819138
MEDRT Medication Reference Terminology MED-RT (VA) https://nciterms.nci.nih.gov/ncitbrowser/pages/vocabulary.jsf?dictionary=MED-RT MED-RT 2018-09-04 32537
Multum Cerner Multum (Cerner) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html 7/10/2013 44819112
NDC National Drug Code (FDA and manufacturers) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html, http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM070838.zip NDC 20190310 44819105
NDFRT National Drug File - Reference Terminology (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819103
NFC New Form Code (EphMRA) http://www.ephmra.org/New-Form-Codes-Classification NFC 20160704 245
None OMOP Standardized Vocabularies OMOP generated v5.0 13-MAR-19 44819096
Note Type OMOP Note Type OMOP generated 44819146
Obs Period Type OMOP Observation Period Type OMOP generated 44819149
Observation Type OMOP Observation Type OMOP generated 44819129
OSM OpenStreetMap https://www.openstreetmap.org/copyright/en, https://wambachers-osm.website/boundaries/ OSM Release 2019-02-21 32541
PCORNet National Patient-Centered Clinical Research Network (PCORI) OMOP generated 44819148
PHDSC Source of Payment Typology (PHDSC) http://www.phdsc.org/standards/payer-typology-source.asp Version 3.0 32473
Plan Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor 32471
Plan Stop Reason Plan Stop Reason - Reason for termination of the Health Plan 32474
PPI AllOfUs_PPI (Columbia) http://terminology.pmi-ops.org/CodeSystem/ppi Codebook Version 0.3.12 581404
Procedure Type OMOP Procedure Occurrence Type OMOP generated 44819128
Race Race and Ethnicity Code Set (USBC) http://www.cdc.gov/nchs/data/dvs/Race_Ethnicity_CodeSet.pdf Version 1.0 44819109
Relationship OMOP Relationship OMOP generated 44819235
RxNorm RxNorm (NLM) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RxNorm 20190204 44819104
RxNorm Extension RxNorm Extension (OMOP) OMOP generated RxNorm Extension 2019-03-11 252
SNOMED Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html Snomed Release 20190131 44819097
Specimen Type OMOP Specimen Type OMOP generated 581376
SPL Structured Product Labeling (FDA) http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm NDC 20190310 44819140
Sponsor Sponsor - institution or individual financing healthcare transactions 32472
UB04 Point of Origin UB04 Claim Source Inpatient Admission Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Source-Inpatient-Admission-Code 32045
UB04 Pri Typ of Adm UB04 Claim Inpatient Admission Type Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Inpatient-Admission-Type-Code 32046
UB04 Pt dis status UB04 Patient Discharge Status Code (CMS) https://www.resdac.org/cms-data/variables/patient-discharge-status-code 32047
UB04 Typ bill UB04 Type of Bill - Institutional (USHIK) https://ushik.ahrq.gov/ViewItemDetails?&system=apcd&itemKey=196987000 32044
UCUM Unified Code for Units of Measure (Regenstrief Institute) http://aurora.regenstrief.org/~ucum/ucum.html#section-Alphabetic-Index Version 1.8.2 44819107
US Census United States Census Bureau https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html US Census 2017 Release 32570
VA Class VA National Drug File Class (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819122
VA Product VA National Drug File Product (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819120
Visit OMOP Visit OMOP generated 44819119
Visit Type OMOP Visit Type OMOP generated 44819150
Vocabulary OMOP Vocabulary OMOP generated 44819232

Is LoadEventTables (from the documentation) now LoadCDMTables?

I struggled to run the synthea-to-OMOP CDM conversion on a Windows 10 computer, using Postgres 11 and R 3.5.0. I see there have been a lot of commits in the last day or so. Is it possible that there's a mismatch between the latest commit and the documentation? Should I pull a different commit, or from a different branch?

At roughly 2019-03-16-13:00 EDT, I did the following

git clone https://github.com/OHDSI/ETL-Synthea.git

(in GitBash) and ran synthea like this

./run_synthea -s 42 -p 1000 Massachusetts

And then attempted the conversion according to https://github.com/OHDSI/ETL-Synthea#step-by-step-example-r-package. See #13 for some modifications I made to date casting and allowing NULLs. Now everything runs with just a few warnings, up until the last step: ETLSyntheaBuilder::LoadEventTables(cd,"cdm_synthea10","native"), which gives

Error: object 'LoadEventTables' not found

I've already called LoadSyntheaTables and LoadVocabFromCsv. LoadVocabFromSchema isn't necessary on top of LoadVocabFromCsv, is it? So I just used the remaining load function instead: LoadCDMTables. That completed without any intervention or errors, but with the following warnings:

Running: insert_drug_era.sql
...snip... 100% ... Executing SQL took 0.0229 secs

1: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

2: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

3: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'synthea_schema' not found in SQL

4: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

5: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'synthea_schema' not found in SQL

6: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

7: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'synthea_schema' not found in SQL
LoadCDMTables is now running insert_measurement.sql, without any intervention.


I also tried the Windows batch command, which seems to give a similar error

C:\Users\Mark Miller\ETL-Synthea>"C:\Program Files\R\R-3.5.0\bin\x64\Rscript.exe" WindowsLoader.r postgresql localhost ehrlike native cdm_synthea10 postgres SECRETPASSWORD C:\synthea_omop\synthea_output_csv C:\synthea_omop\vocabulary_downloads

Error: 'DropCDMTables' is not an exported object from 'namespace:ETLSyntheaBuilder'

I'm willing to recreate my environment in Ubuntu if that's the best bet, but I don't get the feeling that this issue is OS-related.

scaling issue in insert_drug_era.sql

NOTE: Using the latest main, I always get 0 records in my drug_era table and it finishes in less than a second. I had to use pull request #48 to get the drug_era table populated. I basically took that branch, removed the OMOP 6.0-specific code, and ended up with a working migration after I also took pull request #53 to get the OMOP 5.3.1 death table. I then added some code to migrate the location data in the Synthea patient table to the OMOP location table. Code is at https://github.com/scivm/ETL-Synthea.

Testing with this patched setup, I found that with 100K patients generated from Synthea, the conversion in insert_drug_era.sql takes 1.85 days. 10K patients took less than a minute, and 1K patients took about 15 seconds. I don't think throwing more compute at it will help, since Postgres only uses a single core to run the migration, and I have already tried giving it 8 GB of RAM.

Setup:
Postgres 10 on a Windows 10 laptop with an i7 (8 cores total), 32 GB RAM, and Ubuntu for Windows.

Used 8 GB of ram for shared_buffers

shared_buffers = 8096MB                 # min 128kB
temp_buffers = 128MB                    # min 800kB
work_mem = 64MB                         # min 64kB
maintenance_work_mem = 64MB             # min 1MB
max_stack_depth = 2MB                   # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_worker_processes = 6                # (change requires restart)
max_parallel_workers_per_gather = 2     # taken from max_parallel_workers
max_parallel_workers = 6                # maximum number of max_worker_processes that

Put indexes on vocabulary tables after loading them:

ALTER TABLE cdm.concept ADD CONSTRAINT xpk_concept PRIMARY KEY (concept_id);
ALTER TABLE cdm.vocabulary ADD CONSTRAINT xpk_vocabulary PRIMARY KEY (vocabulary_id);
ALTER TABLE cdm.domain ADD CONSTRAINT xpk_domain PRIMARY KEY (domain_id);
ALTER TABLE cdm.concept_class ADD CONSTRAINT xpk_concept_class PRIMARY KEY (concept_class_id);
ALTER TABLE cdm.concept_relationship ADD CONSTRAINT xpk_concept_relationship PRIMARY KEY (concept_id_1,concept_id_2,relationship_id);
ALTER TABLE cdm.relationship ADD CONSTRAINT xpk_relationship PRIMARY KEY (relationship_id);
ALTER TABLE cdm.concept_ancestor ADD CONSTRAINT xpk_concept_ancestor PRIMARY KEY (ancestor_concept_id,descendant_concept_id);
ALTER TABLE cdm.source_to_concept_map ADD CONSTRAINT xpk_source_to_concept_map PRIMARY KEY (source_vocabulary_id,target_concept_id,source_code,valid_end_date);
ALTER TABLE cdm.drug_strength ADD CONSTRAINT xpk_drug_strength PRIMARY KEY (drug_concept_id, ingredient_concept_id);

Analyzed all tables before starting
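For reference, "analyze all tables" on PostgreSQL is a one-liner per table (or a bare ANALYZE for the whole database), run after the bulk load:

```sql
-- Refresh planner statistics so joins against the newly loaded
-- vocabulary tables get sensible query plans.
ANALYZE cdm.concept;
ANALYZE cdm.concept_relationship;
ANALYZE cdm.concept_ancestor;
```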

duplicate key error in insert_measurement.sql, others for postgresql

I ran into a duplicate key error when running insert_measurement and others. I saw the union of two or three selects that use row_number(), and guessed each was starting from 1. I recreated the sequences and replaced the row_number() calls with nextval, and the problem went away. I assume the row_number() calls were added to give databases a chance to run the individual selects concurrently, and that reintroducing the sequence would serialize and slow down the insert.

The commit that introduced the use of row_number() is b282d80.
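An alternative that avoids both the collisions and the sequence is to number the union as a whole rather than each branch; a sketch (the staging table names are illustrative, not the package's actual SQL):

```sql
-- Number the combined result set once, instead of once per branch,
-- so the generated measurement_id values cannot collide.
SELECT ROW_NUMBER() OVER (ORDER BY u.person_id) AS measurement_id,
       u.*
FROM (
    SELECT person_id, value FROM staged_measurements_a
    UNION ALL
    SELECT person_id, value FROM staged_measurements_b  -- hypothetical staging tables
) u;
```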

actual 1000 people demo data

This how-to guide is valuable. Thank you for writing it.

For our group, however, having a set of CSV files (or, even better, a Postgres dump file) with the final product would help much more.

Consider starting a second repo to host that. I don't think Synthea imposes any restrictions that would make this impossible.

That way the fruit of your labor would be used by many more people! Tagging @clairblacketer

Visit_Occurrence_id in event tables should use the final_visit_ids intermediate table

These queries all join to visit_occurrence to get the visit_occurrence_id. This is incorrect, because not every encounter_id is represented in visit_occurrence, due to how we collapse encounters to create visits (page 3 of the design document). Instead, the join should be to the final_visit_ids table created to link all encounters with their correct visit_occurrence_id.
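In other words, instead of joining each event's encounter to visit_occurrence directly, resolve it through the intermediate table; a sketch (the staging table and column names are assumed from the description above):

```sql
-- Resolve each source encounter to the visit it was collapsed into,
-- rather than assuming a 1:1 match in visit_occurrence.
SELECT co.*,
       fv.visit_occurrence_id
FROM condition_occurrence_staging co   -- hypothetical staging table
JOIN final_visit_ids fv
  ON fv.encounter_id = co.encounter_id;
```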

Error handling

When I execute this code, I get an error like: database "synthea10" does not exist.
Do I have to do some preparation in advance?

./bulk-load synthea10 localhost native cdm_synthea10 postgres lollipop /tmp/synthea/output/csv /tmp/Vocabulary_20181119
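Yes: the target database and both schemas must exist before the loader runs. On PostgreSQL, for example:

```sql
-- Run once (e.g. in psql) before the ETL:
CREATE DATABASE synthea10;
-- then, connected to synthea10:
CREATE SCHEMA native;
CREATE SCHEMA cdm_synthea10;
```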

Duplicated row IDs related to "Parameter 'vocab_schema' not found in SQL"?

After doing a hybrid of the Linux bulk load and some R scripts (#17), I'm seeing duplication of row IDs in some tables. For example, my observations table has

observation_id|person_id|observation_concept_id|observation_date|observation_datetime|observation_type_concept_id|value_as_number|value_as_string|value_as_concept_id|qualifier_concept_id|unit_concept_id|provider_id|visit_occurrence_id|visit_detail_id|observation_source_value|observation_source_concept_id|unit_source_value|qualifier_source_value| 
--------------|---------|----------------------|----------------|--------------------|---------------------------|---------------|---------------|-------------------|--------------------|---------------|-----------|-------------------|---------------|------------------------|-----------------------------|-----------------|----------------------| 
             1|        8|               4174876|      1985-09-13|          1985-09-13|                   38000280|               |               |                  0|                   0|              0|          0|                211|              0|424213003               |                      4174876|                 |                      | 
             1|        1|                     0|      2010-07-26|          2010-07-26|                   38000276|               |               |                  0|                   0|              0|          0|                 10|              0|233604007               |                       255848|                 |                      | 

Does person 8 (0141bae5-c190-4e83-aab0-eed8dc2e91bd) have a 424213003 observation from visit 211 (acf6725e-23ea-4e10-a1f6-d8e2196d51fb) on 1985-09-13 or does person 1 (00002c66-a365-4e88-8e80-d52bcad4869e) have a 233604007 from visit 10 (120aa894-4465-4b04-af96-1928191f1c36) on 2010-07-26?

ubuntu@ip-172-31-88-67:~/synthea/output/csv$ grep 424213003 * | grep 0141bae5-c190-4e83-aab0-eed8dc2e91bd
allergies.csv:1985-09-13,,0141bae5-c190-4e83-aab0-eed8dc2e91bd,acf6725e-23ea-4e10-a1f6-d8e2196d51fb,424213003,Allergy to bee venom

ubuntu@ip-172-31-88-67:~/synthea/output/csv$ grep 233604007 * | grep 120aa894-4465-4b04-af96-1928191f1c36
conditions.csv:2010-07-26,,00002c66-a365-4e88-8e80-d52bcad4869e,120aa894-4465-4b04-af96-1928191f1c36,233604007,Pneumonia
encounters.csv:120aa894-4465-4b04-af96-1928191f1c36,2010-07-26T08:39:03Z,2010-07-27T08:39:03Z,00002c66-a365-4e88-8e80-d52bcad4869e,69c66b65-96fc-3702-a7d1-ea4ba3ac3b87,inpatient,32485007,Hospital admission,105.37,233604007,Pneumonia

Same thing for drug_exposure and measurement.

drug_exposure_id|person_id|drug_concept_id|drug_exposure_start_date|drug_exposure_start_datetime|drug_exposure_end_date|drug_exposure_end_datetime|verbatim_end_date|drug_type_concept_id|stop_reason|refills|quantity|days_supply|sig|route_concept_id|lot_number|provider_id|visit_occurrence_id|visit_detail_id|drug_source_value|drug_source_concept_id|route_source_value|dose_unit_source_value|
----------------|---------|---------------|------------------------|----------------------------|----------------------|--------------------------|-----------------|--------------------|-----------|-------|--------|-----------|---|----------------|----------|-----------|-------------------|---------------|-----------------|----------------------|------------------|----------------------|
               1|        1|        1539464|              2003-12-25|                  2003-12-25|            2004-12-24|                2004-12-24|       2004-12-24|            38000177|           |      0|       0|        365|   |               0|0         |          0|                 11|              0|316672           |               1539464|                  |                      |
               1|        1|       40213154|              2008-09-24|                  2008-09-24|            2008-09-24|                2008-09-24|       2008-09-24|              581452|           |      0|       0|          0|   |               0|0         |          0|                 40|              0|140              |              40213154|                  |                      |
               1|        1|              0|              1981-09-03|                  1981-09-03|            1981-09-03|                1981-09-03|                 |              581452|           |      0|       0|          0|   |               0|0         |          0|                  4|              0|92691004         |                200970|                  |                      |

BTW, source_to_concept_map is empty

Could this have something to do with my vocabulary choices? I'm using everything that can be downloaded without requiring a licence. I have loaded the remotely-downloaded CPT codes.

Possibly relevant warnings?

> local.LoadCDMTables(cd,"cdm_synthea10","native")
Connecting using PostgreSQL driver
Running: insert_person.sql
  |======================================================================| 100%
Executing SQL took 0.00914 secs
Running: insert_observation_period.sql
  |======================================================================| 100%
Executing SQL took 0.027 secs
Running: insert_visit_occurrence.sql
  |======================================================================| 100%
Executing SQL took 0.153 secs
Running: insert_condition_occurrence.sql
  |======================================================================| 100%
Executing SQL took 20.2 secs
Running: insert_observation.sql
  |======================================================================| 100%
Executing SQL took 21.7 secs
Running: insert_measurement.sql
  |======================================================================| 100%
Executing SQL took 8.85 mins
Running: insert_procedure_occurrence.sql
  |======================================================================| 100%
Executing SQL took 47.6 secs
Running: insert_drug_exposure.sql
  |======================================================================| 100%
Executing SQL took 1.3 mins
Running: insert_condition_era.sql
  |======================================================================| 100%
Executing SQL took 0.152 secs
Running: insert_drug_era.sql
  |======================================================================| 100%
Executing SQL took 25.2 secs
Warning messages:
1: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
2: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
3: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'synthea_schema' not found in SQL
4: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
5: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'synthea_schema' not found in SQL
6: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
7: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'synthea_schema' not found in SQL
> 

Conditions getting mapped to 0

Expected behavior

The majority of the source codes are mapped to concepts.

Actual behavior

About 63% of the codes are mapped to 0. It looks like some values are coming across as descriptions rather than SNOMED codes. We need to figure out how to get these mapped.

SELECT '0 RECORDS' AS TYPE, COUNT(*) ROW_COUNT
FROM CONDITION_OCCURRENCE
WHERE CONDITION_CONCEPT_ID =0
UNION ALL
SELECT 'ALL RECORDS' AS TYPE, COUNT(*) ROW_COUNT
FROM CONDITION_OCCURRENCE

0 RECORDS = 4942
ALL RECORDS = 8120
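To see which source values are falling through, it can help to tally the codes that have no entry in the mapping table. This is a hedged Python sketch of that check, not part of the package (the package works in SQL); `vocab_map`, the target concept ID, and the sample rows are illustrative. It mimics the symptom described above: a description string slipping into the code column will never match the map.

```python
from collections import Counter

def unmapped_source_values(rows, vocab_map):
    """Tally condition_source_value frequencies for rows whose source
    code has no entry in the source-to-standard map, i.e. the rows that
    end up with condition_concept_id = 0."""
    return Counter(r["condition_source_value"] for r in rows
                   if r["condition_source_value"] not in vocab_map)

# Illustrative data: one valid SNOMED code, plus two rows where the
# description ("Pneumonia") has slipped into the code column.
vocab_map = {"233604007": 255848}  # hypothetical code -> concept mapping
rows = [
    {"condition_source_value": "233604007"},
    {"condition_source_value": "Pneumonia"},
    {"condition_source_value": "Pneumonia"},
]
print(unmapped_source_values(rows, vocab_map).most_common())
# -> [('Pneumonia', 2)]
```

Sorting the tally by frequency quickly shows whether the failures are a handful of odd codes or a systematic column shift.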


Native measurement values not copied to CDM

Many columns in my cdm_synthea10.measurement table had all null or 0 values at the end of my hybrid ETL (#17)

  • value_as_number
  • value_as_concept_id
  • unit_concept_id
  • range_low
  • range_high
  • provider_id
  • unit_source_value
  • value_source_value

The person, visit and measurement concept IDs were present, so I just copied the raw values and units back from the native schema.

update
	cdm_synthea10.measurement m2
set
	value_source_value = natob.value,
	unit_source_value = natob.units
from
	cdm_synthea10.measurement m
join cdm_synthea10.visit_occurrence vo on
	m.visit_occurrence_id = vo.visit_occurrence_id
join cdm_synthea10.person p on
	p.person_id = vo.person_id
join native.observations natob on
	natob.patient = p.person_source_value
	and natob.encounter = vo.visit_source_value
	and natob.code = m.measurement_source_value
where
	m2.person_id = m.person_id
	and m2.visit_occurrence_id = m.visit_occurrence_id
	and m2.measurement_source_value = m.measurement_source_value;

Then set the unit concept IDs as far as possible

update
	cdm_synthea10.measurement m2
set
	unit_concept_id = sc2stvm.target_concept_id
from
	cdm_synthea10.measurement m
join cdm_synthea10.source_to_standard_vocab_map sc2stvm on
	m.unit_source_value = sc2stvm.source_code
where
	m2.person_id = m.person_id
	and m2.visit_occurrence_id = m.visit_occurrence_id
	and m2.measurement_source_value = m.measurement_source_value;

There are some Synthea units that don't get transformed by that query.

unit_source_value     count
(blank)               22676
mmHg                  20700
{score}               12525
{nominal}              4288
U/L                    3402
mL/min/{1.73_m2}       3117
kU/L                   1200
{count}                 140
{T-score}                82
IU/L                      6
ng/dl                     4
m[IU]/L                   4
M/uL                      2

Fix some manually?

update
	cdm_synthea10.measurement m2
set
	unit_concept_id = 8876
where
	unit_source_value = 'mmHg' 

Then populate value_as_number. This could probably be done much more rigorously.

update
	cdm_synthea10.measurement
set
	value_as_number = cast(value_source_value as decimal)
where
	unit_concept_id > 0

At this point there are incomplete conversions due to the unit dead-ends described above. I haven't done anything with the nominal/descriptive findings that could go in value_as_concept_id. And the following columns remain empty/null/zero:

  • value_as_concept_id
  • range_low
  • range_high
  • provider_id
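Many of the stubborn units above are UCUM curly-brace annotations ({score}, {nominal}, {count}), which the UCUM specification defines as carrying no semantics. Stripping them before the vocabulary lookup, and casting value_source_value only when it is actually numeric, could make both steps more robust. A Python sketch of the idea, illustrative only (whether dropping an annotation like {1.73_m2} is clinically acceptable for a given measurement is a judgment call):

```python
import re

def normalize_ucum(unit: str) -> str:
    """Strip UCUM curly-brace annotations, which are semantically empty
    per the UCUM spec, before looking the unit up in the vocabulary."""
    stripped = re.sub(r"\{[^}]*\}", "", unit)
    return stripped.strip("/").strip()

def safe_decimal(value):
    """Cast value_source_value to a number only when it really is one,
    instead of using unit_concept_id > 0 as a proxy."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print(normalize_ucum("mL/min/{1.73_m2}"))   # -> mL/min
print(normalize_ucum("{score}"))            # -> "" (no unit at all)
print(safe_decimal("120.5"), safe_decimal("Negative"))  # -> 120.5 None
```

Rows whose normalized unit is empty are pure annotations; their values are scores or counts and can still be cast with `safe_decimal` even though no unit_concept_id applies.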

Bulk load script needs documentation for execution.

Script structure is shown as:
# ./bulk-load database native cdm user password /path/to/data /path/to/vocab

With an example of:
# ./bulk-load cdm2 ohdsi ohdsi /synthea/output/csv

The /path/to/vocab is easy to sort out, but what are the native and cdm parameters?

Learnings from using the Synthea data generator for use with ETL-Synthea

I faced a few challenges that I was able to overcome using Synthea which I'll contribute back to the README. For now, I'm adding them to this issue in case others are facing similar issues.

Synthea Setup & Usage

  • After cloning the Synthea repo (synthetichealth/synthea@47d09bf), you need to edit src/main/resources/synthea.properties to update the exporter settings to make sure that CSVs are produced as these are the input into the ETL-Synthea process. During my work, I had to change the following settings:
exporter.fhir.export = false
exporter.fhir.transaction_bundle = false
exporter.hospital.fhir.export = false
exporter.practitioner.fhir.export = false
exporter.csv.export = true

I don't think there is any harm in leaving the FHIR settings set to true, except that you're producing extra data that you won't use with ETL-Synthea.

  • Make sure you have the JAVA_HOME environment variable set. I did not have this set and ran into some Java problems when running ./gradlew build check test.
  • Once the steps above are done, to produce a 1,000 person data set you can run run_synthea.bat -p 1000 which then creates the CSVs in the /output/csv folder in the Synthea directory. Note this location as it is needed for the ETL-Synthea processing.
  • I faced an issue with using the CSVs from the latest Synthea builder as noted in #43 and hope to have these incorporated soon.

Running ETL-Synthea

  • Download the vocabulary files from Athena: http://athena.ohdsi.org/vocabulary/list. You'll need to make sure these files are unzipped and available as CSVs.
  • You'll need to establish an empty database with 2 schemas: cdm will hold the OMOP CDM v5.3.1 tables and native will hold the native Synthea data.
  • ETL-Synthea has functions to create your CDM & Vocabulary tables and load the Vocabulary tables in the cdm schema mentioned above. Alternatively, you can establish your CDM and Vocabulary tables using the v5.3.1 scripts on https://github.com/OHDSI/CommonDataModel.
  • Once my CDM & Vocabulary tables were set and the vocabulary was loaded, I opted to create indices on the vocabulary tables only, using the script on https://github.com/OHDSI/CommonDataModel. The idea was that since the ETL process will use these tables, it would be best if they had some indices to improve performance.
  • I used the R package approach as described in the README: https://github.com/OHDSI/ETL-Synthea#step-by-step-example-r-package.

Post-processing Synthea

Trailing whitespace in "insert_drug_exposure.sql " in LoadCDMTables's queries list

function (connectionDetails, cdmDatabaseSchema, syntheaDatabaseSchema, 
    vocabDatabaseSchema = cdmDatabaseSchema) 
{
    queries <- c("insert_person.sql", "insert_observation_period.sql", 
        "insert_visit_occurrence.sql", "insert_condition_occurrence.sql", 
        "insert_observation.sql", "insert_measurement.sql", "insert_procedure_occurrence.sql", 
        "insert_drug_exposure.sql ", "insert_condition_era.sql", 
        "insert_drug_era.sql")

---snip---

replace with something like

local.LoadCDMTables <- function (connectionDetails, cdmDatabaseSchema, syntheaDatabaseSchema, vocabDatabaseSchema = cdmDatabaseSchema) 
{
    queries <- c("insert_person.sql", "insert_observation_period.sql", 
        "insert_visit_occurrence.sql", "insert_condition_occurrence.sql", 
        "insert_observation.sql", "insert_measurement.sql", "insert_procedure_occurrence.sql", 
        "insert_drug_exposure.sql", "insert_condition_era.sql", 
        "insert_drug_era.sql")
    conn <- DatabaseConnector::connect(connectionDetails)
    for (query in queries) {
        pathToSql <- base::system.file("sql/sql_server", package = "ETLSyntheaBuilder")
        sqlFile <- base::paste0(pathToSql, "/", query)
        sqlQuery <- base::readChar(sqlFile, base::file.info(sqlFile)$size)
        renderedSql <- SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, 
            synthea_schema = syntheaDatabaseSchema, vocab_schema = vocabDatabaseSchema)
        translatedSql <- SqlRender::translate(renderedSql, targetDialect = connectionDetails$dbms)
        writeLines(paste0("Running: ", query))
        DatabaseConnector::executeSql(conn, translatedSql, progressBar = TRUE, 
            reportOverallTime = TRUE)
    }
    on.exit(DatabaseConnector::disconnect(conn))
}

Cleanup mapping & rollup tables used during ETL process

The following R functions are used to create some interim tables that are used during the ETL process:

ETLSyntheaBuilder::CreateVocabMapTables(cd,cdmDatabaseSchema)
ETLSyntheaBuilder::CreateVisitRollupTables(cd,cdmDatabaseSchema,nativeSchema)

After running ETLSyntheaBuilder::LoadEventTables(cd,cdmDatabaseSchema,nativeSchema), I'd expect that these interim tables are removed. I'm guessing that these are currently preserved to debug items during the ETL process.

I'd propose (and I've written) a new function to drop these interim tables. I'll contribute these back in a PR.

LoadVocabFromCsv ERROR: null value in column "concept_name" violates not-null constraint

While following the steps on the Step by Step Example in the read me, I encountered a problem during the ETLSyntheaBuilder::LoadVocabFromCsv(cd,"cdm_synthea10","/tmp/Vocabulary_20181119") step

#13 mentioned this issue and avoided it by "allowing NULLs in concept.concept_name, concept.concept_code, concept_synonym.concept_synonym_name, and concept_relationship.valid_start_date."

How should I allow for these NULLs? Does LoadVocabFromCsv.r need to be adapted?

I am using the latest pull of the master with vocabulary v5

obs_event_field_concept_id

The column obs_event_field_concept_id was added to the observation table but not to insert_observation.sql, which violates its NOT NULL constraint. If unused, it should instead be set to 0.

Example code not working for Master

The basic example to perform the ETL through R did not work correctly (v6); some function and parameter names were inconsistent between readme.md and the actual code.

Made a pull request #64 to fix it.

missing column in synthea ddl

Using the latest synthea version, the ETL scripts fail due to a missing column in the encounters table (organization).

Capture Metadata

Capture the four elements that define a CDM and should reside in the CDM metadata.

Source Data Identification
Vocabulary Version
CDM Version
ETL Version

Broken search paths and bad SQL file names in Linux bulk load

I had trouble creating a Synthea OMOP database with the R scripts (see #13 #14 #15)

So today I tried the Linux bulk loader now on Ubuntu 18 and PostgreSQL 11. I think everything went well up to vocab mapping and core ETL, but then it was all errors after that. See end of post.

I think the search path may not be set correctly in ETL/SQL/AllVisitTable.sql or the ETL steps. Also, the reference to ETL/SQL/AAViTable.sql should be replaced with ETL/SQL/AAVITable.sql

I tried running some of the R scripts individually after that, I think that exposed some other bugs. But it also may have led to duplicate row IDs and mismatched concept realms (see #18)

So finally I set the search path (in DBeaver)

set search_path=cdm_synthea10,native;

and then just copied, pasted, and executed the remaining ETL/SQL/*.sql files in DBeaver, too.

PS there's also a bad filename in

PGOPTIONS="--search_path=$CDM" psql -f "ETL/SQL/cdm_v5.3.constraints_ddl.sql" "postgresql://$USERNAME:$PASSWORD@localhost/$DB"

but maybe you don't care as much about that since it's commented out.

Attempted bulk load session

ubuntu@ip-172-31-88-67:~/ETL-Synthea$ ./bulk-load mimicomop native cdm_synthea10 mimicuser SECRET /home/ubuntu/synthea/output/csv/ /opt/mimicomop/data/vocab
DROP TABLE
CREATE TABLE
---etc---
creating cdm tables
psql:ETL/SQL/drop_cdm_tables.sql:1: ERROR:  table "achilles_analysis" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:2: ERROR:  table "achilles_heel_results" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:3: ERROR:  table "achilles_results" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:4: ERROR:  table "achilles_results_derived" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:5: ERROR:  table "achilles_results_dist" does not exist
DROP TABLE
---etc---
psql:ETL/SQL/drop_cdm_tables.sql:26: ERROR:  table "drug_strength" does not exist
DROP TABLE
---etc---
CREATE TABLE
---etc---
loading data from synthea csv files to relational database
COPY 639
COPY 3096
COPY 8103
COPY 37811
COPY 14695
COPY 8468
COPY 182176
COPY 1152
COPY 31648
loading data from vocab csv files to relational database
COPY 6136766
COPY 60879903
COPY 41501098
COPY 356
COPY 7856597
COPY 46
COPY 2616623
COPY 488
COPY 90
creating vocab maps...
DROP TABLE
SELECT 6136766
CREATE INDEX
CREATE INDEX
DROP TABLE
SELECT 4902384
CREATE INDEX
CREATE INDEX
creating visit logic tables...
psql:ETL/SQL/AllVisitTable.sql:1: NOTICE:  table "ip_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:2: NOTICE:  table "er_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:3: NOTICE:  table "op_visits" does not exist, skipping
DROP TABLE
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:57: ERROR:  relation "encounters" does not exist
LINE 14:    FROM encounters
                 ^
psql:ETL/SQL/AllVisitTable.sql:90: ERROR:  relation "encounters" does not exist
LINE 22:   FROM encounters CL1
                ^
psql:ETL/SQL/AllVisitTable.sql:113: ERROR:  relation "encounters" does not exist
LINE 10:  FROM encounters
               ^
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/AllVisitTable.sql:131: ERROR:  relation "ip_visits" does not exist
LINE 6:    SELECT * FROM IP_VISITS
                         ^
psql:ETL/SQL/AllVisitTable.sql:133: NOTICE:  table "ip_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:134: NOTICE:  table "er_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:135: NOTICE:  table "op_visits" does not exist, skipping
DROP TABLE
ETL/SQL/AAViTable.sql: No such file or directory
DROP TABLE
SELECT 539
Performing ETL
Loading person
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_person.sql:56: ERROR:  relation "patients" does not exist
LINE 51:  from patients p
               ^
Loading observation_period
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_observation_period.sql:24: ERROR:  relation "encounters" does not exist
LINE 18:   join encounters e
                ^
Loading visit_occurrence
psql:ETL/SQL/insert_visit_occurrence.sql:54: ERROR:  relation "all_visits" does not exist
LINE 48: from all_visits av
              ^
Loading observation
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_observation.sql:96: ERROR:  relation "allergies" does not exist
LINE 46: from allergies a
              ^
Loading condition_occurrence
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_condition_occurrence.sql:56: ERROR:  relation "conditions" does not exist
LINE 44: from conditions c
              ^
Loading procedure_occurrence
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_procedure_occurrence.sql:50: ERROR:  relation "procedures" does not exist
LINE 38: from procedures pr
              ^
Loading measurement
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_measurement.sql:102: ERROR:  relation "procedures" does not exist
LINE 50: from procedures pr
              ^
Loading drug exposure
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_drug_exposure.sql:154: ERROR:  relation "conditions" does not exist
LINE 56: from conditions c
              ^
Loading condition_era
DROP SEQUENCE
CREATE SEQUENCE
INSERT 0 0
Loading drug_era
DROP SEQUENCE
CREATE SEQUENCE
INSERT 0 0

Synthea has added additional cost columns to Patient, Medication, and Encounter tables--breaks bulk-load

It appears that the CSV output from Synthea now contains additional columns relating to cost of encounters and medications, and total healthcare expenses to patients. These columns are not created in the native synthea schema by the ETL-Synthea DDL. As a result, bulk-load fails for these three tables, viz.,

ERROR: extra data after last expected column
CONTEXT: COPY encounters, line 2: "1010abf7-397f-8393-9ff0-2c16c1058605,2011-06-20T04:44:41Z,2011-06-20T04:59:41Z,8c2ec300-a8d1-0e5e-88..."
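A quick way to see exactly which columns broke the COPY is to diff the CSV header against the DDL's column list. This is a hedged Python sketch; the DDL list and header string below are illustrative stand-ins for the real encounters DDL and a newer Synthea export:

```python
import csv
import io

def extra_columns(csv_header_line: str, ddl_columns: list) -> list:
    """Report columns present in the Synthea CSV but missing from the
    native-schema DDL -- these cause COPY to fail with
    'extra data after last expected column'."""
    header = next(csv.reader(io.StringIO(csv_header_line)))
    known = {c.lower() for c in ddl_columns}
    return [c for c in header if c.lower() not in known]

# Illustrative: an older encounters DDL vs a newer CSV header.
ddl = ["Id", "Start", "Stop", "Patient", "Provider", "EncounterClass",
       "Code", "Description", "Cost", "ReasonCode", "ReasonDescription"]
header = ("Id,Start,Stop,Patient,Provider,EncounterClass,Code,Description,"
          "Cost,ReasonCode,ReasonDescription,Base_Encounter_Cost,"
          "Total_Claim_Cost,Payer_Coverage")
print(extra_columns(header, ddl))
# -> ['Base_Encounter_Cost', 'Total_Claim_Cost', 'Payer_Coverage']
```

Any columns this reports need to be added to the native DDL (or dropped from the CSV) before the bulk load will succeed.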

readme doesn't acknowledge required host parameter

I just pulled today, and the bulk load script now requires 8 arguments.

DB=$1
HOSTNAME=$2
NATIVE=$3
CDM=$4
USERNAME=$5
PASSWORD=$6
DATA_DIRECTORY=$7
VOCAB_DIRECTORY=$8

The readme still gives a 7 argument example, with no hostname.

./bulk-load synthea10 native cdm_synthea10 postgres lollipop /tmp/synthea/output/csv /tmp/Vocabulary_20181119

Cannot apply foreign keys to CDM after ETL

After running the ETL process, I'd like to add all of the indices and FKs to the resulting CDM. When I attempted to do this, I found that the provider_id in the drug_exposure, measurement and observation tables was set to 0 instead of NULL. Making a note of this here as I have a fix to address this item.
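One way to make the FKs applicable is to turn those placeholder 0s back into NULLs, e.g. UPDATE drug_exposure SET provider_id = NULL WHERE provider_id = 0. Sketched in Python terms (the column list is an illustrative subset, not the actual fix in the package):

```python
FK_COLUMNS = ("provider_id", "visit_detail_id")  # illustrative subset

def nullify_missing_fks(row: dict) -> dict:
    """Replace placeholder 0s in FK columns with None (NULL) so that
    foreign-key constraints can be applied after the ETL; 0 is otherwise
    treated as a real key and fails the constraint check."""
    return {k: (None if k in FK_COLUMNS and v == 0 else v)
            for k, v in row.items()}

row = {"drug_exposure_id": 1, "provider_id": 0, "visit_detail_id": 0}
print(nullify_missing_fks(row))
# -> {'drug_exposure_id': 1, 'provider_id': None, 'visit_detail_id': None}
```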

Parameter 'synthea_schema' not found in SQL

Hi,

I ran into some final issues while running the example R code provided in the following steps:

ETLSyntheaBuilder::CreateVisitRollupTables(cd,"cdm_synthea10","native")

Warning message:
In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema) :
  Parameter 'synthea_schema' not found in SQL

ETLSyntheaBuilder::LoadEventTables(cd,"cdm_synthea10","native")
Error: Error executing SQL: org.postgresql.util.PSQLException: ERROR: column "provider_id" is of type integer but expression is of type text
  Hint: You will need to rewrite or cast the expression.
  Position: 665
An error report has been created at /Users/tine.geldof/errorReport.txt
In addition: Warning messages:
1: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'vocab_schema' not found in SQL
2: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'vocab_schema' not found in SQL
3: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'synthea_schema' not found in SQL
4: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'vocab_schema' not found in SQL

I see the provider_id should be fixed in the latest pull request. Any recommendations on how to handle the 'vocab_schema' and 'synthea_schema' that cannot be found?

Thank you

[Error] CONTEXT: COPY xxxxx

What is my problem?
How can I solve the errors below?

CONTEXT: COPY encounters, line 2: "99edd026-a004-4b3f-9e91-23a824a57e52,2012-06-21T16:43:42Z,2012-06-21T17:13:42Z,cd6499a0-18fe-4c2c-a8..."
COPY 113
ERROR: extra data after last expected column
CONTEXT: COPY medications, line 2: "2010-01-04,2010-12-30,244f6a12-dfe3-4fc4-b77e-ddf5275f21a1,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,1e23..."
COPY 1065
ERROR: extra data after last expected column
CONTEXT: COPY patients, line 2: "cd6499a0-18fe-4c2c-a88c-3bd630f8e20a,1984-04-13,,999-49-3625,S99964365,X30889808X,Mr.,Leslie90,Rice9..."

extra data after last expected column

When I run ./bulk-load, it reports some errors like this:

loading data from synthea csv files to relational database
COPY 526
COPY 3521
COPY 8551
ERROR: extra data after last expected column
CONTEXT: COPY encounters, line 2: "465dbc2e-ee30-418b-a3f8-f08f48479804,1977-11-19T06:59:41Z,1977-11-19T07:59:41Z,8ea2abeb-ef06-4691-98..."
COPY 14420
ERROR: extra data after last expected column
CONTEXT: COPY medications, line 2: "1993-02-11,,8ea2abeb-ef06-4691-98f8-fff89980afd7,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,c39b240b-763b-..."
COPY 282826
ERROR: extra data after last expected column
CONTEXT: COPY patients, line 2: "8ea2abeb-ef06-4691-98f8-fff89980afd7,1974-11-19,,999-91-8917,S99914151,X46842681X,Mr.,Marion502,Thie..."
COPY 36850

I checked the schema's tables and found that the encounters table has gained new cost columns: BASE_ENCOUNTER_COST, TOTAL_CLAIM_COST, PAYER_COVERAGE.

How can I solve this?

observation insert failing on provider_id type

When running the LoadEventTables function, I run into the error below. I think the bolded provider_id should be cast(provider_id as integer)

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: column "provider_id" is of type integer but expression is of type text
  Hint: You will need to rewrite or cast the expression.
  Position: 666

SQL:
insert into synthea_cdm531.observation
(
observation_id,
person_id,
observation_concept_id,
observation_date,
observation_datetime,
observation_type_concept_id,
value_as_number,
value_as_string,
value_as_concept_id,
qualifier_concept_id,
unit_concept_id,
provider_id,
visit_occurrence_id,
visit_detail_id,
observation_source_value,
observation_source_concept_id,
unit_source_value,
qualifier_source_value
)
select row_number()over(order by person_id) observation_id,
person_id,
observation_concept_id,
observation_date,
observation_datetime,
observation_type_concept_id,
value_as_number,
value_as_string,
value_as_concept_id,
qualifier_concept_id,
unit_concept_id,
**provider_id,**
visit_occurrence_id,
visit_detail_id,
observation_source_value,
observation_source_concept_id,
unit_source_value,
qualifier_source_value
from (
select
  p.person_id,
  case when srctostdvm.target_concept_id is NULL then 0 else srctostdvm.target_concept_id end as observation_concept_id,
  a.start observation_date,
  a.start observation_datetime,
  38000280 observation_type_concept_id,
  cast(null as NUMERIC) value_as_number,
  cast(null as varchar) value_as_string,
  0 value_as_concept_id,
  0 qualifier_concept_id,
  0 unit_concept_id,
  NULL provider_id,
  fv.visit_occurrence_id_new visit_occurrence_id,
  0 visit_detail_id,
  a.code observation_source_value,
  coalesce(srctosrcvm.source_concept_id,0) observation_source_concept_id,
  cast(null as varchar) unit_source_value,
  cast(null as varchar) qualifier_source_value
from synthea_native.allergies a
  left join synthea_cdm531.source_to_standard_vocab_map   srctostdvm
on srctostdvm.source_code             = a.code
 and srctostdvm.target_domain_id        = 'Observation'
 and srctostdvm.target_vocabulary_id    = 'SNOMED'
 and srctostdvm.source_vocabulary_id    = 'SNOMED'
 and srctostdvm.target_standard_concept = 'S'
 and srctostdvm.target_invalid_reason IS NULL
left join synthea_cdm531.source_to_source_vocab_map srctosrcvm
  on srctosrcvm.source_code             = a.code
 and srctosrcvm.source_vocabulary_id    = 'SNOMED'
left join synthea_cdm531.final_visit_ids fv
  on fv.encounter_id = a.encounter
join synthea_cdm531.person p
  on p.person_source_value    = a.patient
union all
select
  p.person_id,
  case when srctostdvm.target_concept_id is NULL then 0 else srctostdvm.target_concept_id end as target_concept_id,
  c.start,
  c.start,
  38000276,
  cast(null as NUMERIC),
  cast(null as varchar),
  0,
  0,
  0,
  NULL,
  fv.visit_occurrence_id_new visit_occurrence_id,
  0,
  c.code,
  coalesce(srctosrcvm.source_concept_id,0),
  cast(null as varchar),
  cast(null as varchar)
from synthea_native.conditions c
  left join synthea_cdm531.source_to_standard_vocab_map   srctostdvm
on srctostdvm.source_code             = c.code
 and srctostdvm.target_domain_id        = 'Observation'
 and srctostdvm.target_vocabulary_id    = 'SNOMED'
 and srctostdvm.source_vocabulary_id    = 'SNOMED'
 and srctostdvm.target_standard_concept = 'S'
 and srctostdvm.target_invalid_reason IS NULL
left join synthea_cdm531.source_to_source_vocab_map srctosrcvm
  on srctosrcvm.source_code              = c.code
 and srctosrcvm.source_vocabulary_id     = 'SNOMED'
left join synthea_cdm531.final_visit_ids fv
  on fv.encounter_id                     = c.encounter
join synthea_cdm531.person p
  on p.person_source_value    = c.patient
  ) tmp

Primary key duplicates

The drug_exposure, measurement, and observation insert SQL scripts create duplicate values in the _id fields (e.g. drug_exposure_id) that are supposed to be primary keys.
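If each insert statement (or union branch) numbers its rows independently with row_number(), the IDs collide across statements. One way to avoid that is to draw every _id from a single global sequence. A Python analogue of the idea (illustrative only, not the package's actual SQL):

```python
from itertools import count

def assign_unique_ids(batches, id_field):
    """Re-key rows from several insert batches with one global counter,
    the analogue of drawing every <table>_id from a single database
    sequence instead of restarting row_number() per statement."""
    seq = count(start=1)
    for batch in batches:
        for row in batch:
            row[id_field] = next(seq)
            yield row

batch_a = [{"drug_exposure_id": 1}, {"drug_exposure_id": 2}]
batch_b = [{"drug_exposure_id": 1}]  # would collide with batch_a as-is
rows = list(assign_unique_ids([batch_a, batch_b], "drug_exposure_id"))
print([r["drug_exposure_id"] for r in rows])
# -> [1, 2, 3]  (no duplicates)
```

In SQL the same effect can be had with a sequence/identity column, or by offsetting each statement's row_number() by the table's current max id.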

Load Event Tables

When I execute the Load Event Tables command, ETLSyntheaBuilder::LoadEventTables(cd,"cdm_synthea10","native"),
the error below is output.

I think the attribute of provider_id is wrong.
Could you check it?

Thank you, as always, for your support.

Error:
org.postgresql.util.PSQLException: ERROR: column "provider_id" is of type integer but expression is of type text
ヒント: You will need to rewrite or cast the expression.
位置: 665

'ヒント' means tips
'位置' means line number

Some concept ID columns are all 0

I've run the pipeline using the fork provided by psbrandt at https://github.com/psbrandt/ETL-Synthea. I've used this fork because of the issue described in #47.

I spun up a Postgresql instance and followed the steps as described in the README using RStudio. I obtained all non-license required Athena vocabularies from http://athena.ohdsi.org/vocabulary/list and created native and cdm_synthea10 schemas.

The pipeline ran to completion but the following columns of interest had all of their rows as 0 instead of a relevant OMOP concept id:

  • in table condition_occurrence: condition_concept_id
  • in table drug_exposure: drug_concept_id
  • in table measurement: measurement_concept_id

Is it possible that the code is designed to work with an older version of the Athena vocabularies than what I downloaded two days ago? If not, what would be a useful next step towards debugging?

ETL scripts need both schemas on search path?

I can't get the bulk load script to complete without either

  • replacing the search_path declaration of $CDM with $CDM,$NATIVE in all of the steps after "creating visit logic tables"
  • OR using the same schema for both the native Synthea tables and the OMOP-shaped tables in the bulk loader invocation:

./bulk-load synthea10 127.0.0.1 cdm_synthea10 cdm_synthea10 USERNAME PASSWORD /synthea_data_path /omop_vocab_path
