
etl-synthea's Introduction

Utility to Load Synthea CSV data to OMOP CDM

Currently supports CDM v5.3 and v5.4

Follow the steps on the Synthea wiki to run the program and generate the files. This builder works off of the CSV files, not the FHIR files. To enable CSV output, set the exporter.csv.export option in the ./src/main/resources/synthea.properties file to true.
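For reference, the relevant setting in ./src/main/resources/synthea.properties is a single line (a minimal excerpt; the other exporter settings can stay at their defaults):

```properties
# Enable CSV output so this builder has files to load.
exporter.csv.export = true
```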

Step by Step Example

 devtools::install_github("OHDSI/ETL-Synthea")

 library(ETLSyntheaBuilder)

 # We are loading a version 5.4 CDM into a local PostgreSQL database called "synthea10".
 # The ETLSyntheaBuilder package leverages the OHDSI/CommonDataModel package for CDM creation.
 # Valid CDM versions are determined by executing CommonDataModel::listSupportedVersions().
 # The strings representing supported CDM versions are currently "5.3" and "5.4". 
 # The Synthea version we use in this example is 2.7.0.
 # However, at this time we also support 3.0.0, 3.1.0 and 3.2.0.
 # Please note that Synthea's master branch is always active, and this package will be updated to support
 # future versions as they become available.
 # The schema to load the Synthea tables is called "native".
 # The schema to load the Vocabulary and CDM tables is "cdm_synthea10".  
 # The username and password are "postgres" and "lollipop".
 # The Synthea and Vocabulary CSV files are located in /tmp/synthea/output/csv and /tmp/Vocabulary_20181119, respectively.
 
 # For those interested in seeing the CDM changes from 5.3 to 5.4, please see: http://ohdsi.github.io/CommonDataModel/cdm54Changes.html
 
cd <- DatabaseConnector::createConnectionDetails(
  dbms     = "postgresql", 
  server   = "localhost/synthea10", 
  user     = "postgres", 
  password = "lollipop", 
  port     = 5432, 
  pathToDriver = "d:/drivers"  
)

cdmSchema      <- "cdm_synthea10"
cdmVersion     <- "5.4"
syntheaVersion <- "2.7.0"
syntheaSchema  <- "native"
syntheaFileLoc <- "/tmp/synthea/output/csv"
vocabFileLoc   <- "/tmp/Vocabulary_20181119"

ETLSyntheaBuilder::CreateCDMTables(connectionDetails = cd, cdmSchema = cdmSchema, cdmVersion = cdmVersion)
                                     
ETLSyntheaBuilder::CreateSyntheaTables(connectionDetails = cd, syntheaSchema = syntheaSchema, syntheaVersion = syntheaVersion)
                                       
ETLSyntheaBuilder::LoadSyntheaTables(connectionDetails = cd, syntheaSchema = syntheaSchema, syntheaFileLoc = syntheaFileLoc)
                                     
ETLSyntheaBuilder::LoadVocabFromCsv(connectionDetails = cd, cdmSchema = cdmSchema, vocabFileLoc = vocabFileLoc)

ETLSyntheaBuilder::CreateMapAndRollupTables(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, cdmVersion = cdmVersion, syntheaVersion = syntheaVersion)

## Optional Step to create extra indices
ETLSyntheaBuilder::CreateExtraIndices(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, syntheaVersion = syntheaVersion)
                                    
ETLSyntheaBuilder::LoadEventTables(connectionDetails = cd, cdmSchema = cdmSchema, syntheaSchema = syntheaSchema, cdmVersion = cdmVersion, syntheaVersion = syntheaVersion)

Simulating Data with Synthea

For commented code used to convert the Synthea data, see extras/codeToRun.R

For more information on Synthea visit: https://synthetichealth.github.io/synthea/

Get Synthea from GitHub: https://github.com/synthetichealth/synthea

etl-synthea's People

Contributors

abercus, anthonymolinaro, anthonysena, burrowse, clairblacketer, croeder, ericavoss, fdefalco, gowthamrao, iain-s, jamesswiggins, katy-sadowski, mccullen, mgkahn, mtarng, tevariou, wibeasley


etl-synthea's Issues

strings longer than 100 in procedures.description

A recent Synthea data set that I generated contained a procedure with the 105-character description

Human epidermal growth factor receptor 2 gene detection by fluorescence in situ hybridization (procedure)

but procedures.description is allocated as varchar(100) in synthea_ddl.sql.

Increasing it to varchar(250) eliminates the error message.
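Until the DDL is patched, the column can be widened after the Synthea tables are created; a PostgreSQL sketch, assuming the default "native" schema name from the example above:

```sql
-- Hypothetical workaround: widen the description column so long SNOMED
-- procedure descriptions (the shipped DDL uses varchar(100)) fit.
ALTER TABLE native.procedures ALTER COLUMN description TYPE varchar(250);
```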

Variable type for dates not correctly rendered to SQL SERVER

I don't know whether ETLSyntheaBuilder is intended to work with servers other than PostgreSQL, but in case anyone tries SQL Server, the following issue needs to be fixed: the code in ETLSyntheaBuilder::CreateEventTables (and most likely later queries as well) does not render correctly, because it sets TIMESTAMP as the type for date/time columns; on SQL Server this should be something like DATETIME.

Running the line ETLSyntheaBuilder::CreateEventTables(insert your defaults) gives you this error message:

Error in rJava::.jcall(statement, "V", "close") : 
  com.microsoft.sqlserver.jdbc.SQLServerException: A table can only have one timestamp column. Because table 'person' already has one, the column 'death_datetime' cannot be added.

After a bit of drilling, you find that two columns in the rendered SQL table-creation code for the person table (and quite a few other tables) are set to TIMESTAMP.

CREATE TABLE cdm_synthea10.person
(
  person_id          BIGINT    NOT NULL,
  gender_concept_id  INTEGER   NOT NULL,
  year_of_birth      INTEGER   NOT NULL,
  month_of_birth     INTEGER   NULL,
  day_of_birth       INTEGER   NULL,
  birth_datetime     TIMESTAMP NULL,
  death_datetime     TIMESTAMP NULL,
  race_concept_id    INTEGER   NOT NULL,

TIMESTAMP in SQL Server is a synonym for rowversion, and has nothing to do with storing dates and times.
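On SQL Server, the appropriate type for a point-in-time value is DATETIME2 (or DATETIME); a sketch of the corrected column definitions, using the schema name from the excerpt above:

```sql
-- SQL Server: DATETIME2 holds dates and times; TIMESTAMP/rowversion is an
-- auto-generated binary version stamp and a table may have only one.
birth_datetime   DATETIME2  NULL,
death_datetime   DATETIME2  NULL,
```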

Final document name

The final iteration of the ETL design document should be renamed to something more self-evident, like "Synthea ETL Specification".

Update documentation

In the README it says v5.3; it should say v6. Instructions for how to use v5.3.1 are also needed.

Sync ETL-Synthea with latest master

In my use of ETL-Synthea, I cloned the latest version of https://github.com/synthetichealth/synthea (which is this commit: synthetichealth/synthea@47d09bf) and found that the CSVs that Synthea produced had a different structure from the expected structure in ETL-Synthea. I've updated the ETL-Synthea package to reflect the new Synthea structure and will contribute this back to this repository once I receive contributor rights. Just noting this here in the meantime.

Condition Records Dropped if Mapped to 0

In RAW_LAUREN, I accidentally used a CONCEPT_ID for the CODE.

CODE = 194696 (INCORRECT CODE AND WOULDN'T HAVE MAPPED TO ANYTHING)
DESCRIPTION = Dysmenorrhea

The records were dropped in the CDM. The record should still be written, just with CONDITION_CONCEPT_ID set to 0.
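The usual ETL convention is to keep unmapped rows by left-joining to the mapping and coalescing to concept_id 0; a sketch (the table and column names here are illustrative, not the package's actual SQL):

```sql
-- Keep every source condition row; unmapped codes land on concept 0
-- instead of being dropped by an inner join.
SELECT c.person_id,
       COALESCE(m.target_concept_id, 0) AS condition_concept_id,
       c.code                           AS condition_source_value
FROM native.conditions c
LEFT JOIN source_to_standard_map m   -- hypothetical mapping table
       ON m.source_code = c.code;
```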

Single rows missing concept name, while running LoadVocabFromCsv

Hello! While inserting the vocabularies with ETLSyntheaBuilder::LoadVocabFromCsv, using vocabularies from Athena merged with the UMLS as instructed, the call stops on two occasions, because rows have NA where there shouldn't be any.

vocabTable[3602658,]
   concept_id concept_name  domain_id vocabulary_id concept_class_id standard_concept concept_code valid_start_date
1:   36311145         <NA> Meas Value         LOINC           Answer                S    LA12334-1       1970-01-01
   valid_end_date invalid_reason
1:     2099-12-31   

The other row, in concept_synonym.csv:

   concept_id concept_synonym_name language_concept_id
1:   36311145                 <NA>             4180186

I removed these rows, and LoadVocabFromCsv() ran through after that.

Create R project to run the builder

There are quite a few SQL files for this build, some of which are no longer used. An R project would help dictate the order in which the files should be run, and would allow parameterization of the schemas. We could also put the test cases in the project so that it is entirely self-contained.

R script needs lubridate casting in synthea/native and allowing NULLs in concept table?

Thanks for releasing this tool.

I'm finding that I have to make some modifications in order to get the R code (and embedded SQL) to run. Is this surprising?

I'm using Windows 10, Postgres 11, and R 3.5.0

I created my synthea data set from GitBash like this:

./run_synthea -s 42 -p 1000 Massachusetts

So far I have modified LoadSyntheaTables to forcibly cast dates, and to cast UTILIZATION to numeric, not character. See below. The native synthea tables load now.

I also had to allow NULLs in concept.concept_name, concept.concept_code, concept_synonym.concept_synonym_name, and concept_relationship.valid_start_date. Now I can run all of the steps up to LoadEventTables, which gives an error; I have substituted LoadCDMTables for it. See #14
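The NULL relaxations described above can be applied directly to the loaded schema; a PostgreSQL sketch, assuming the cdm_synthea10 schema from the step-by-step example:

```sql
-- Relax NOT NULL constraints that reject the handful of NA rows
-- present in the downloaded vocabulary files.
ALTER TABLE cdm_synthea10.concept              ALTER COLUMN concept_name         DROP NOT NULL;
ALTER TABLE cdm_synthea10.concept              ALTER COLUMN concept_code         DROP NOT NULL;
ALTER TABLE cdm_synthea10.concept_synonym      ALTER COLUMN concept_synonym_name DROP NOT NULL;
ALTER TABLE cdm_synthea10.concept_relationship ALTER COLUMN valid_start_date     DROP NOT NULL;
```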

Modified LoadSyntheaTables

function(connectionDetails, syntheaDatabaseSchema, syntheaFileLoc) {
  csvList <- c(
    "allergies.csv", "conditions.csv", "imaging_studies.csv",
    "medications.csv", "organizations.csv", "procedures.csv",
    "careplans.csv", "encounters.csv", "immunizations.csv",
    "observations.csv", "patients.csv", "providers.csv"
  )
  conn <- DatabaseConnector::connect(connectionDetails)
  # Register the disconnect immediately so the connection is closed even if
  # a later step fails (in the original it was registered after the loop).
  on.exit(DatabaseConnector::disconnect(conn))
  for (csv in csvList) {
    syntheaTable <- data.table::fread(
      file = paste0(syntheaFileLoc, "/", csv),
      stringsAsFactors = FALSE,
      header = TRUE,
      sep = ","
    )
    writeLines(paste0("Loading: ", csv))
    # Force code-like columns to character so values are not mangled.
    if ("CODE" %in% colnames(syntheaTable))
      syntheaTable$CODE <- as.character(syntheaTable$CODE)
    if ("REASONCODE" %in% colnames(syntheaTable))
      syntheaTable$REASONCODE <- as.character(syntheaTable$REASONCODE)
    if ("PHONE" %in% colnames(syntheaTable))
      syntheaTable$PHONE <- as.character(syntheaTable$PHONE)
    # Cast UTILIZATION to numeric, not character (as in the stock package).
    if ("UTILIZATION" %in% colnames(syntheaTable))
      syntheaTable$UTILIZATION <- as.numeric(as.character(syntheaTable$UTILIZATION))

    print(tableName <- paste0(syntheaDatabaseSchema, ".", strsplit(csv, "[.]")[[1]][1]))
    print(head(syntheaTable))

    # Forcibly cast the date columns so insertTable maps them correctly.
    for (dateCol in c("START", "STOP", "DATE", "BIRTHDATE", "DEATHDATE")) {
      if (dateCol %in% colnames(syntheaTable))
        syntheaTable[[dateCol]] <- lubridate::as_date(syntheaTable[[dateCol]])
    }

    DatabaseConnector::insertTable(
      conn,
      tableName,
      data = as.data.frame(syntheaTable),
      dropTableIfExists = FALSE,
      createTable = FALSE,
      progressBar = TRUE
    )
  }
}

Some warnings

Loading: concept_relationship.csv
...snip... 100%

1: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, :
Found and resolved improper quoting out-of-sample. First healed line 50165: <<44822272 "Light-for-dates"without mention of fetal malnutrition, 1,250- 1,499 grams 4180186>>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.

2: In data.table::fread(file = paste0(vocabFileLoc, "/", csv), stringsAsFactors = FALSE, :
Found and resolved improper quoting out-of-sample. First healed line 64070: <<2618083 "opt out" physician or practitioner emergency or urgent service Observation HCPCS HCPCS Modifier S GJ 19981001 20991231 >>. If the fields are not quoted (e.g. field separator does not appear within any field), try quote="" to avoid this warning.
3: In max(nchar(as.character(obj)), na.rm = TRUE) :
no non-missing arguments to max; returning -Inf

and

Running: final_visit_ids.sql
...snip... 100% ... Executing SQL took 0.296 secs

In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema) :
Parameter 'synthea_schema' not found in SQL

Selected OMOP vocabularies

vocabulary_id vocabulary_name vocabulary_reference vocabulary_version vocabulary_concept_id
ABMS Provider Specialty (American Board of Medical Specialties) http://www.abms.org/member-boards/specialty-subspecialty-certificates 2018-06-26 ABMS 45756746
ATC WHO Anatomic Therapeutic Chemical Classification FDB UK distribution package RXNORM 2018-08-12 44819117
CDM OMOP Common DataModel https://github.com/OHDSI/CommonDataModel CDM v6.0.0 32485
Cohort Type OMOP Cohort Type OMOP generated 44819234
Concept Class OMOP Concept Class OMOP generated 44819233
Condition Type OMOP Condition Occurrence Type OMOP generated 44819127
Cost OMOP Cost OMOP generated 581457
Cost Type OMOP Cost Type OMOP generated 5029
CPT4 Current Procedural Terminology version 4 (AMA) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 2018 Release 44819100
Death Type OMOP Death Type OMOP generated 44819135
Device Type OMOP Device Type OMOP generated 44819151
dm+d Dictionary of Medicines and Devices (NHS) https://isd.hscic.gov.uk/trud3/user/authenticated/group/0/pack/1/subpack/24/releases dm+d Version 6.0.0 20180604 232
Domain OMOP Domain OMOP generated 44819147
Drug Type OMOP Drug Exposure Type OMOP generated 44819126
EphMRA ATC Anatomical Classification of Pharmaceutical Products (EphMRA) http://www.ephmra.org/Anatomical-Classification EphMRA ATC 2016 243
Episode OMOP Episode OMOP generated 32523
Episode Type OMOP Episode Type OMOP generated 32542
Ethnicity OMOP Ethnicity OMOP generated 44819134
GCN_SEQNO Clinical Formulation ID (FDB) FDB US distribution package 20151119 Release 44819141
Gender OMOP Gender OMOP generated 44819108
GGR Commented Drug Directory (BCFI) http://www.bcfi.be/nl/download GGR 20180901 581450
HCPCS Healthcare Common Procedure Coding System (CMS) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html 20180101 Alpha Numeric HCPCS File 44819101
ICD10 International Classification of Diseases, Tenth Revision (WHO) http://www.who.int/classifications/icd/icdonlineversions/en/ 2016 Release 44819124
ICD10CM International Classification of Diseases, Tenth Revision, Clinical Modification (NCHS) http://www.cdc.gov/nchs/icd/icd10cm.htm ICD10CM FY2019 code descriptions 44819098
ICD10PCS ICD-10 Procedure Coding System (CMS) http://www.cms.gov/Medicare/Coding/ICD10/index.html ICD10PCS 2019 44819125
ICD9CM International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 1 and 2 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 5046
ICD9Proc International Classification of Diseases, Ninth Revision, Clinical Modification, Volume 3 (NCHS) http://www.cms.gov/Medicare/Coding/ICD9ProviderDiagnosticCodes/codes.html ICD9CM v32 master descriptions 44819099
ICDO3 International Classification of Diseases for Oncology, Third Edition (WHO) https://seer.cancer.gov/icd-o-3/ ICDO3 SEER Site/Histology Released 09/18/2015 581426
LOINC Logical Observation Identifiers Names and Codes (Regenstrief Institute) http://loinc.org/downloads/loinc 2.65 44819102
Meas Type OMOP Measurement Type OMOP generated 44819152
Medicare Specialty Medicare provider/supplier specialty codes (CMS) http://www.cms.gov/Medicare/Provider-Enrollment-and-Certification/MedicareProviderSupEnroll/Taxonomy.html 2018-06-26 Specialty 44819138
MEDRT Medication Reference Terminology MED-RT (VA) https://nciterms.nci.nih.gov/ncitbrowser/pages/vocabulary.jsf?dictionary=MED-RT MED-RT 2018-09-04 32537
Multum Cerner Multum (Cerner) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html 7/10/2013 44819112
NDC National Drug Code (FDA and manufacturers) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html, http://www.fda.gov/downloads/Drugs/DevelopmentApprovalProcess/UCM070838.zip NDC 20190310 44819105
NDFRT National Drug File - Reference Terminology (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819103
NFC New Form Code (EphMRA) http://www.ephmra.org/New-Form-Codes-Classification NFC 20160704 245
None OMOP Standardized Vocabularies OMOP generated v5.0 13-MAR-19 44819096
Note Type OMOP Note Type OMOP generated 44819146
Obs Period Type OMOP Observation Period Type OMOP generated 44819149
Observation Type OMOP Observation Type OMOP generated 44819129
OSM OpenStreetMap https://www.openstreetmap.org/copyright/en, https://wambachers-osm.website/boundaries/ OSM Release 2019-02-21 32541
PCORNet National Patient-Centered Clinical Research Network (PCORI) OMOP generated 44819148
PHDSC Source of Payment Typology (PHDSC) http://www.phdsc.org/standards/payer-typology-source.asp Version 3.0 32473
Plan Health Plan - contract to administer healthcare transactions by the payer, facilitated by the sponsor 32471
Plan Stop Reason Plan Stop Reason - Reason for termination of the Health Plan 32474
PPI AllOfUs_PPI (Columbia) http://terminology.pmi-ops.org/CodeSystem/ppi Codebook Version 0.3.12 581404
Procedure Type OMOP Procedure Occurrence Type OMOP generated 44819128
Race Race and Ethnicity Code Set (USBC) http://www.cdc.gov/nchs/data/dvs/Race_Ethnicity_CodeSet.pdf Version 1.0 44819109
Relationship OMOP Relationship OMOP generated 44819235
RxNorm RxNorm (NLM) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RxNorm 20190204 44819104
RxNorm Extension RxNorm Extension (OMOP) OMOP generated RxNorm Extension 2019-03-11 252
SNOMED Systematic Nomenclature of Medicine - Clinical Terms (IHTSDO) http://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html Snomed Release 20190131 44819097
Specimen Type OMOP Specimen Type OMOP generated 581376
SPL Structured Product Labeling (FDA) http://www.fda.gov/Drugs/InformationOnDrugs/ucm142438.htm NDC 20190310 44819140
Sponsor Sponsor - institution or individual financing healthcare transactions 32472
UB04 Point of Origin UB04 Claim Source Inpatient Admission Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Source-Inpatient-Admission-Code 32045
UB04 Pri Typ of Adm UB04 Claim Inpatient Admission Type Code (CMS) https://www.resdac.org/cms-data/variables/Claim-Inpatient-Admission-Type-Code 32046
UB04 Pt dis status UB04 Patient Discharge Status Code (CMS) https://www.resdac.org/cms-data/variables/patient-discharge-status-code 32047
UB04 Typ bill UB04 Type of Bill - Institutional (USHIK) https://ushik.ahrq.gov/ViewItemDetails?&system=apcd&itemKey=196987000 32044
UCUM Unified Code for Units of Measure (Regenstrief Institute) http://aurora.regenstrief.org/~ucum/ucum.html#section-Alphabetic-Index Version 1.8.2 44819107
US Census United States Census Bureau https://www.census.gov/geo/maps-data/data/tiger-cart-boundary.html US Census 2017 Release 32570
VA Class VA National Drug File Class (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819122
VA Product VA National Drug File Product (VA) http://www.nlm.nih.gov/research/umls/rxnorm/docs/rxnormfiles.html RXNORM 2018-08-12 44819120
Visit OMOP Visit OMOP generated 44819119
Visit Type OMOP Visit Type OMOP generated 44819150
Vocabulary OMOP Vocabulary OMOP generated 44819232

Is LoadEventTables (from the documentation) now LoadCDMTables?

I struggled to run the synthea-to-OMOP CDM conversion on a Windows 10 computer, using Postgres 11 and R 3.5.0. I see there have been a lot of commits in the last day or so. Is it possible that there's a mismatch between the latest commit and the documentation? Should I pull a different commit, or from a different branch?

At roughly 2019-03-16-13:00 EDT, I did the following

git clone https://github.com/OHDSI/ETL-Synthea.git

(in GitBash) and ran synthea like this

./run_synthea -s 42 -p 1000 Massachusetts

And then attempted the conversion according to https://github.com/OHDSI/ETL-Synthea#step-by-step-example-r-package. See #13 for some modifications I made to date casting and allowing NULLs. Now everything runs with just a few warnings, up until the last step: ETLSyntheaBuilder::LoadEventTables(cd,"cdm_synthea10","native"), which gives

Error: object 'LoadEventTables' not found

I've already called LoadSyntheaTables and LoadVocabFromCsv. LoadVocabFromSchema isn't necessary on top of LoadVocabFromCsv, is it? So I just used the remaining load function instead: LoadCDMTables. That completed without any intervention or errors, but with the following warnings:

Running: insert_drug_era.sql
...snip... 100% ... Executing SQL took 0.0229 secs

1: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

2: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

3: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'synthea_schema' not found in SQL

4: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

5: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'synthea_schema' not found in SQL

6: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'vocab_schema' not found in SQL

7: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
Parameter 'synthea_schema' not found in SQL
LoadCDMTables is now running insert_measurement.sql, without any intervention.


I also tried the Windows batch command, which seems to give a similar error

C:\Users\Mark Miller\ETL-Synthea>"C:\Program Files\R\R-3.5.0\bin\x64\Rscript.exe" WindowsLoader.r postgresql localhost ehrlike native cdm_synthea10 postgres SECRETPASSWORD C:\synthea_omop\synthea_output_csv C:\synthea_omop\vocabulary_downloads

Error: 'DropCDMTables' is not an exported object from 'namespace:ETLSyntheaBuilder'

I'm willing to recreate my environment in Ubuntu if that's the best bet, but I don't get the feeling that this issue is OS-related.

scaling issue in insert_drug_era.sql

NOTE: Using the latest main, I always get 0 records in my drug_era table and it finishes in less than a second. I had to use pull request #48 to get the drug_era table populated. I basically took that branch, removed the OMOP 6.0-specific code, and ended up with a working migration after I also took pull request #53 to get the OMOP 5.3.1 death table. I then added some code to migrate the location data in the Synthea patient table to the OMOP location table. Code is at https://github.com/scivm/ETL-Synthea.

Testing with this patched setup, I found that with 100K patients generated from Synthea, the conversion in insert_drug_era.sql takes 1.85 days. 10K patients took less than a minute, and 1K patients took about 15 seconds. I don't think throwing more compute at it will help, since Postgres only uses a single core to run the migration, and I have already tried giving it 8 GB of RAM.

Setup:
Postgres 10 on a Windows 10 laptop with an i7 (8 cores total), 32 GB RAM, and Ubuntu for Windows.

Used 8 GB of ram for shared_buffers

shared_buffers = 8096MB                 # min 128kB
temp_buffers = 128MB                    # min 800kB
work_mem = 64MB                         # min 64kB
maintenance_work_mem = 64MB             # min 1MB
max_stack_depth = 2MB                   # min 100kB
dynamic_shared_memory_type = posix      # the default is the first option
max_worker_processes = 6                # (change requires restart)
max_parallel_workers_per_gather = 2     # taken from max_parallel_workers
max_parallel_workers = 6                # maximum number of max_worker_processes that

Put indexes on vocabulary tables after loading them:

ALTER TABLE cdm.concept ADD CONSTRAINT xpk_concept PRIMARY KEY (concept_id);
ALTER TABLE cdm.vocabulary ADD CONSTRAINT xpk_vocabulary PRIMARY KEY (vocabulary_id);
ALTER TABLE cdm.domain ADD CONSTRAINT xpk_domain PRIMARY KEY (domain_id);
ALTER TABLE cdm.concept_class ADD CONSTRAINT xpk_concept_class PRIMARY KEY (concept_class_id);
ALTER TABLE cdm.concept_relationship ADD CONSTRAINT xpk_concept_relationship PRIMARY KEY (concept_id_1,concept_id_2,relationship_id);
ALTER TABLE cdm.relationship ADD CONSTRAINT xpk_relationship PRIMARY KEY (relationship_id);
ALTER TABLE cdm.concept_ancestor ADD CONSTRAINT xpk_concept_ancestor PRIMARY KEY (ancestor_concept_id,descendant_concept_id);
ALTER TABLE cdm.source_to_concept_map ADD CONSTRAINT xpk_source_to_concept_map PRIMARY KEY (source_vocabulary_id,target_concept_id,source_code,valid_end_date);
ALTER TABLE cdm.drug_strength ADD CONSTRAINT xpk_drug_strength PRIMARY KEY (drug_concept_id, ingredient_concept_id);

Analyzed all tables before starting
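For reference, "analyze all tables" on PostgreSQL is a one-liner per table (or a bare ANALYZE for the whole database), run after the bulk load:

```sql
-- Refresh planner statistics so joins against the newly loaded
-- vocabulary tables get sensible query plans.
ANALYZE cdm.concept;
ANALYZE cdm.concept_relationship;
ANALYZE cdm.concept_ancestor;
```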

duplicate key error in insert_measurement.sql, others for postgresql

I ran into a duplicate key error when running insert_measurement and others. I saw the union of two or three selects that use row_number(), and guessed each was starting from 1. I recreated the sequences and replaced the row_number() calls with nextval, and the problem went away. I assume the row_number() calls were added to give databases a chance to run the individual selects concurrently, and that reintroducing the sequence would serialize and slow down the insert.

The commit that introduced the use of row_number() is b282d80.
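An alternative that avoids both the collisions and the sequence is to number the union as a whole rather than each branch; a sketch (the staging table names are illustrative, not the package's actual SQL):

```sql
-- Number the combined result set once, instead of once per branch,
-- so the generated measurement_id values cannot collide.
SELECT ROW_NUMBER() OVER (ORDER BY u.person_id) AS measurement_id,
       u.*
FROM (
    SELECT person_id, value FROM staged_measurements_a
    UNION ALL
    SELECT person_id, value FROM staged_measurements_b  -- hypothetical staging tables
) u;
```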

actual 1000 people demo data

This how-to guide is valuable. Thank you for writing it.

For our group, however, having a set of CSV files (or, even better, a Postgres dump file) with the final product would help much more.

Consider starting a second repo to host that. I don't think Synthea imposes any restrictions that would make this impossible.

That way the fruit of your labor would be used by many more people! Tagging @clairblacketer

Visit_Occurrence_id in event tables should use the final_visit_ids intermediate table

These queries all join to visit_occurrence to get the visit_occurrence_id. This is incorrect, because not every encounter_id is represented in visit_occurrence, due to how we collapse encounters to create visits (page 3 of the design document). Instead, the join should be to the final_visit_ids table created to link all encounters with their correct visit_occurrence_id.
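In other words, instead of joining each event's encounter to visit_occurrence directly, resolve it through the intermediate table; a sketch (the staging table and column names are assumed from the description above):

```sql
-- Resolve each source encounter to the visit it was collapsed into,
-- rather than assuming a 1:1 match in visit_occurrence.
SELECT co.*,
       fv.visit_occurrence_id
FROM condition_occurrence_staging co   -- hypothetical staging table
JOIN final_visit_ids fv
  ON fv.encounter_id = co.encounter_id;
```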

Error handling

When I execute this code, I get an error like: database "synthea10" does not exist.
Do I have to do some preparation in advance?

./bulk-load synthea10 localhost native cdm_synthea10 postgres lollipop /tmp/synthea/output/csv /tmp/Vocabulary_20181119
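Yes: the target database and both schemas must exist before the loader runs. On PostgreSQL, for example:

```sql
-- Run once (e.g. in psql) before the ETL:
CREATE DATABASE synthea10;
-- then, connected to synthea10:
CREATE SCHEMA native;
CREATE SCHEMA cdm_synthea10;
```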

Duplicated row IDs related to "Parameter 'vocab_schema' not found in SQL"?

After doing a hybrid of the Linux bulk load and some R scripts (#17), I'm seeing duplication of row IDs in some tables. For example, my observations table has

observation_id|person_id|observation_concept_id|observation_date|observation_datetime|observation_type_concept_id|value_as_number|value_as_string|value_as_concept_id|qualifier_concept_id|unit_concept_id|provider_id|visit_occurrence_id|visit_detail_id|observation_source_value|observation_source_concept_id|unit_source_value|qualifier_source_value| 
--------------|---------|----------------------|----------------|--------------------|---------------------------|---------------|---------------|-------------------|--------------------|---------------|-----------|-------------------|---------------|------------------------|-----------------------------|-----------------|----------------------| 
             1|        8|               4174876|      1985-09-13|          1985-09-13|                   38000280|               |               |                  0|                   0|              0|          0|                211|              0|424213003               |                      4174876|                 |                      | 
             1|        1|                     0|      2010-07-26|          2010-07-26|                   38000276|               |               |                  0|                   0|              0|          0|                 10|              0|233604007               |                       255848|                 |                      | 

Does person 8 (0141bae5-c190-4e83-aab0-eed8dc2e91bd) have a 424213003 observation from visit 211 (acf6725e-23ea-4e10-a1f6-d8e2196d51fb) on 1985-09-13 or does person 1 (00002c66-a365-4e88-8e80-d52bcad4869e) have a 233604007 from visit 10 (120aa894-4465-4b04-af96-1928191f1c36) on 2010-07-26?

ubuntu@ip-172-31-88-67:~/synthea/output/csv$ grep 424213003 * | grep 0141bae5-c190-4e83-aab0-eed8dc2e91bd
allergies.csv:1985-09-13,,0141bae5-c190-4e83-aab0-eed8dc2e91bd,acf6725e-23ea-4e10-a1f6-d8e2196d51fb,424213003,Allergy to bee venom

ubuntu@ip-172-31-88-67:~/synthea/output/csv$ grep 233604007 * | grep 120aa894-4465-4b04-af96-1928191f1c36
conditions.csv:2010-07-26,,00002c66-a365-4e88-8e80-d52bcad4869e,120aa894-4465-4b04-af96-1928191f1c36,233604007,Pneumonia
encounters.csv:120aa894-4465-4b04-af96-1928191f1c36,2010-07-26T08:39:03Z,2010-07-27T08:39:03Z,00002c66-a365-4e88-8e80-d52bcad4869e,69c66b65-96fc-3702-a7d1-ea4ba3ac3b87,inpatient,32485007,Hospital admission,105.37,233604007,Pneumonia

Same thing for drug_exposure and measurement.

drug_exposure_id|person_id|drug_concept_id|drug_exposure_start_date|drug_exposure_start_datetime|drug_exposure_end_date|drug_exposure_end_datetime|verbatim_end_date|drug_type_concept_id|stop_reason|refills|quantity|days_supply|sig|route_concept_id|lot_number|provider_id|visit_occurrence_id|visit_detail_id|drug_source_value|drug_source_concept_id|route_source_value|dose_unit_source_value|
----------------|---------|---------------|------------------------|----------------------------|----------------------|--------------------------|-----------------|--------------------|-----------|-------|--------|-----------|---|----------------|----------|-----------|-------------------|---------------|-----------------|----------------------|------------------|----------------------|
               1|        1|        1539464|              2003-12-25|                  2003-12-25|            2004-12-24|                2004-12-24|       2004-12-24|            38000177|           |      0|       0|        365|   |               0|0         |          0|                 11|              0|316672           |               1539464|                  |                      |
               1|        1|       40213154|              2008-09-24|                  2008-09-24|            2008-09-24|                2008-09-24|       2008-09-24|              581452|           |      0|       0|          0|   |               0|0         |          0|                 40|              0|140              |              40213154|                  |                      |
               1|        1|              0|              1981-09-03|                  1981-09-03|            1981-09-03|                1981-09-03|                 |              581452|           |      0|       0|          0|   |               0|0         |          0|                  4|              0|92691004         |                200970|                  |                      |

BTW, source_to_concept_map is empty

Could this have something to do with my vocabulary choices? I'm using everything that can be downloaded without requiring a licence. I have loaded the remotely-downloaded CPT codes.

Possibly relevant warnings?

> local.LoadCDMTables(cd,"cdm_synthea10","native")
Connecting using PostgreSQL driver
Running: insert_person.sql
  |======================================================================| 100%
Executing SQL took 0.00914 secs
Running: insert_observation_period.sql
  |======================================================================| 100%
Executing SQL took 0.027 secs
Running: insert_visit_occurrence.sql
  |======================================================================| 100%
Executing SQL took 0.153 secs
Running: insert_condition_occurrence.sql
  |======================================================================| 100%
Executing SQL took 20.2 secs
Running: insert_observation.sql
  |======================================================================| 100%
Executing SQL took 21.7 secs
Running: insert_measurement.sql
  |======================================================================| 100%
Executing SQL took 8.85 mins
Running: insert_procedure_occurrence.sql
  |======================================================================| 100%
Executing SQL took 47.6 secs
Running: insert_drug_exposure.sql
  |======================================================================| 100%
Executing SQL took 1.3 mins
Running: insert_condition_era.sql
  |======================================================================| 100%
Executing SQL took 0.152 secs
Running: insert_drug_era.sql
  |======================================================================| 100%
Executing SQL took 25.2 secs
Warning messages:
1: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
2: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
3: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'synthea_schema' not found in SQL
4: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
5: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'synthea_schema' not found in SQL
6: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'vocab_schema' not found in SQL
7: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema,  :
  Parameter 'synthea_schema' not found in SQL
> 

Conditions getting mapped to 0

Expected behavior

The majority of the source codes are mapped to concepts.

Actual behavior

About 63% of the codes are mapped to 0. It looks like some values are coming across as descriptions rather than SNOMED codes. We need to figure out how to get these mapped.

SELECT '0 RECORDS' AS TYPE, COUNT(*) ROW_COUNT
FROM CONDITION_OCCURRENCE
WHERE CONDITION_CONCEPT_ID =0
UNION ALL
SELECT 'ALL RECORDS' AS TYPE, COUNT(*) ROW_COUNT
FROM CONDITION_OCCURRENCE

0 RECORDS = 4942
ALL RECORDS = 8120
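To see which source values are falling through, it can help to tally the codes that have no entry in the mapping table. This is a hedged Python sketch of that check, not part of the package (the package works in SQL); `vocab_map`, the target concept ID, and the sample rows are illustrative. It mimics the symptom described above: a description string slipping into the code column will never match the map.

```python
from collections import Counter

def unmapped_source_values(rows, vocab_map):
    """Tally condition_source_value frequencies for rows whose source
    code has no entry in the source-to-standard map, i.e. the rows that
    end up with condition_concept_id = 0."""
    return Counter(r["condition_source_value"] for r in rows
                   if r["condition_source_value"] not in vocab_map)

# Illustrative data: one valid SNOMED code, plus two rows where the
# description ("Pneumonia") has slipped into the code column.
vocab_map = {"233604007": 255848}  # hypothetical code -> concept mapping
rows = [
    {"condition_source_value": "233604007"},
    {"condition_source_value": "Pneumonia"},
    {"condition_source_value": "Pneumonia"},
]
print(unmapped_source_values(rows, vocab_map).most_common())
# -> [('Pneumonia', 2)]
```

Sorting the tally by frequency quickly shows whether the failures are a handful of odd codes or a systematic column shift.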


Native measurement values not copied to CDM

Many columns in my cdm_synthea10.measurement table had all null or 0 values at the end of my hybrid ETL (#17)

  • value_as_number
  • value_as_concept_id
  • unit_concept_id
  • range_low
  • range_high
  • provider_id
  • unit_source_value
  • value_source_value

The person, visit and measurement concept IDs were present, so I just copied the raw values and units back from the native schema.

update
	cdm_synthea10.measurement m2
set
	value_source_value = natob.value,
	unit_source_value = natob.units
from
	cdm_synthea10.measurement m
join cdm_synthea10.visit_occurrence vo on
	m.visit_occurrence_id = vo.visit_occurrence_id
join cdm_synthea10.person p on
	p.person_id = vo.person_id
join native.observations natob on
	natob.patient = p.person_source_value
	and natob.encounter = vo.visit_source_value
	and natob.code = m.measurement_source_value
where
	m2.person_id = m.person_id
	and m2.visit_occurrence_id = m.visit_occurrence_id
	and m2.measurement_source_value = m.measurement_source_value;

Then set the unit concept IDs as far as possible

update
	cdm_synthea10.measurement m2
set
	unit_concept_id = sc2stvm.target_concept_id
from
	cdm_synthea10.measurement m
join cdm_synthea10.source_to_standard_vocab_map sc2stvm on
	m.unit_source_value = sc2stvm.source_code
where
	m2.person_id = m.person_id
	and m2.visit_occurrence_id = m.visit_occurrence_id
	and m2.measurement_source_value = m.measurement_source_value;

There are some Synthea units that don't get transformed by that query.

unit_source_value     count
(blank)               22676
mmHg                  20700
{score}               12525
{nominal}              4288
U/L                    3402
mL/min/{1.73_m2}       3117
kU/L                   1200
{count}                 140
{T-score}                82
IU/L                      6
ng/dl                     4
m[IU]/L                   4
M/uL                      2

Fix some manually?

update
	cdm_synthea10.measurement m2
set
	unit_concept_id = 8876
where
	unit_source_value = 'mmHg' 

Then populate value_as_number. This could probably be done much more rigorously.

update
	cdm_synthea10.measurement
set
	value_as_number = cast(value_source_value as decimal)
where
	unit_concept_id > 0

At this point there are incomplete conversions due to the unit dead-ends described above. I haven't done anything with the nominal/descriptive findings that could go in value_as_concept_id. And the following columns remain empty/null/zero:

  • value_as_concept_id
  • range_low
  • range_high
  • provider_id
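Many of the stubborn units above are UCUM curly-brace annotations ({score}, {nominal}, {count}), which the UCUM specification defines as carrying no semantics. Stripping them before the vocabulary lookup, and casting value_source_value only when it is actually numeric, could make both steps more robust. A Python sketch of the idea, illustrative only (whether dropping an annotation like {1.73_m2} is clinically acceptable for a given measurement is a judgment call):

```python
import re

def normalize_ucum(unit: str) -> str:
    """Strip UCUM curly-brace annotations, which are semantically empty
    per the UCUM spec, before looking the unit up in the vocabulary."""
    stripped = re.sub(r"\{[^}]*\}", "", unit)
    return stripped.strip("/").strip()

def safe_decimal(value):
    """Cast value_source_value to a number only when it really is one,
    instead of using unit_concept_id > 0 as a proxy."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print(normalize_ucum("mL/min/{1.73_m2}"))   # -> mL/min
print(normalize_ucum("{score}"))            # -> "" (no unit at all)
print(safe_decimal("120.5"), safe_decimal("Negative"))  # -> 120.5 None
```

Rows whose normalized unit is empty are pure annotations; their values are scores or counts and can still be cast with `safe_decimal` even though no unit_concept_id applies.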

Bulk load script needs documentation for execution.

Script structure is shown as:
# ./bulk-load database native cdm user password /path/to/data /path/to/vocab

With an example of:
# ./bulk-load cdm2 ohdsi ohdsi /synthea/output/csv

The /path/to/vocab is easy to sort out, but what are the native and cdm parameters?

Learnings from using the Synthea data generator for use with ETL-Synthea

I faced a few challenges that I was able to overcome using Synthea which I'll contribute back to the README. For now, I'm adding them to this issue in case others are facing similar issues.

Synthea Setup & Usage

  • After cloning the Synthea repo (synthetichealth/synthea@47d09bf), you need to edit src/main/resources/synthea.properties to update the exporter settings to make sure that CSVs are produced as these are the input into the ETL-Synthea process. During my work, I had to change the following settings:
exporter.fhir.export = false
exporter.fhir.transaction_bundle = false
exporter.hospital.fhir.export = false
exporter.practitioner.fhir.export = false
exporter.csv.export = true

I don't think there is any harm in leaving the FHIR settings set to true, except that you're producing extra data that you won't use with ETL-Synthea.

  • Make sure you have the JAVA_HOME environment variable set. I did not have this set and ran into some Java problems when running ./gradlew build check test.
  • Once the steps above are done, to produce a 1,000 person data set you can run run_synthea.bat -p 1000 which then creates the CSVs in the /output/csv folder in the Synthea directory. Note this location as it is needed for the ETL-Synthea processing.
  • I faced an issue with using the CSVs from the latest Synthea builder as noted in #43 and hope to have these incorporated soon.

Running ETL-Synthea

  • Download the vocabulary files from Athena: http://athena.ohdsi.org/vocabulary/list. You'll need to make sure these files are unzipped and available as CSVs.
  • You'll need to establish an empty database with 2 schemas: cdm will hold the OMOP CDM v5.3.1 tables and native will hold the native Synthea data.
  • ETL-Synthea has functions to create your CDM & Vocabulary tables and load the Vocabulary tables in the cdm schema mentioned above. Alternatively, you can establish your CDM and Vocabulary tables using the v5.3.1 scripts on https://github.com/OHDSI/CommonDataModel.
  • Once my CDM & Vocabulary tables were set and the vocabulary was loaded, I opted to create indices on the vocabulary tables only, using the script on https://github.com/OHDSI/CommonDataModel. The idea was that since the ETL process will use these tables, it would be best if they had some indices to improve performance.
  • I used the R package approach as described in the README: https://github.com/OHDSI/ETL-Synthea#step-by-step-example-r-package.

Post-processing Synthea

Trailing whitespace in "insert_drug_exposure.sql " in LoadCDMTables's queries list

function (connectionDetails, cdmDatabaseSchema, syntheaDatabaseSchema, 
    vocabDatabaseSchema = cdmDatabaseSchema) 
{
    queries <- c("insert_person.sql", "insert_observation_period.sql", 
        "insert_visit_occurrence.sql", "insert_condition_occurrence.sql", 
        "insert_observation.sql", "insert_measurement.sql", "insert_procedure_occurrence.sql", 
        "insert_drug_exposure.sql ", "insert_condition_era.sql", 
        "insert_drug_era.sql")

---snip---

replace with something like

local.LoadCDMTables <- function (connectionDetails, cdmDatabaseSchema, syntheaDatabaseSchema, vocabDatabaseSchema = cdmDatabaseSchema) 
{
    queries <- c("insert_person.sql", "insert_observation_period.sql", 
        "insert_visit_occurrence.sql", "insert_condition_occurrence.sql", 
        "insert_observation.sql", "insert_measurement.sql", "insert_procedure_occurrence.sql", 
        "insert_drug_exposure.sql", "insert_condition_era.sql", 
        "insert_drug_era.sql")
    conn <- DatabaseConnector::connect(connectionDetails)
    for (query in queries) {
        pathToSql <- base::system.file("sql/sql_server", package = "ETLSyntheaBuilder")
        sqlFile <- base::paste0(pathToSql, "/", query)
        sqlQuery <- base::readChar(sqlFile, base::file.info(sqlFile)$size)
        renderedSql <- SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, 
            synthea_schema = syntheaDatabaseSchema, vocab_schema = vocabDatabaseSchema)
        translatedSql <- SqlRender::translate(renderedSql, targetDialect = connectionDetails$dbms)
        writeLines(paste0("Running: ", query))
        DatabaseConnector::executeSql(conn, translatedSql, progressBar = TRUE, 
            reportOverallTime = TRUE)
    }
    on.exit(DatabaseConnector::disconnect(conn))
}

Cleanup mapping & rollup tables used during ETL process

The following R functions are used to create some interim tables that are used during the ETL process:

ETLSyntheaBuilder::CreateVocabMapTables(cd,cdmDatabaseSchema)
ETLSyntheaBuilder::CreateVisitRollupTables(cd,cdmDatabaseSchema,nativeSchema)

After running ETLSyntheaBuilder::LoadEventTables(cd,cdmDatabaseSchema,nativeSchema), I'd expect that these interim tables are removed. I'm guessing that these are currently preserved to debug items during the ETL process.

I'd propose (and I've written) a new function to drop these interim tables. I'll contribute these back in a PR.

LoadVocabFromCsv ERROR: null value in column "concept_name" violates not-null constraint

While following the steps on the Step by Step Example in the read me, I encountered a problem during the ETLSyntheaBuilder::LoadVocabFromCsv(cd,"cdm_synthea10","/tmp/Vocabulary_20181119") step

#13 mentioned this issue and avoided it by "allowing NULLs in concept.concept_name, concept.concept_code, concept_synonym.concept_synonym_name, and concept_relationship.valid_start_date."

How should I allow for these NULLs? Does LoadVocabFromCsv.r need to be adapted?

I am using the latest pull of the master with vocabulary v5

obs_event_field_concept_id

The column obs_event_field_concept_id was added to the observation table but not to insert_observation.sql, which violates its NOT NULL constraint. If unused, it should instead be set to 0.

Example code not working for Master

The basic example to perform the ETL through R did not work correctly (v6); some function and parameter names were inconsistent between readme.md and the actual code.

Made a pull request #64 to fix it.

missing column in synthea ddl

Using the latest synthea version, the ETL scripts fail due to a missing column in the encounters table (organization).

Capture Metadata

Capture the four elements that define a CDM and should reside in the CDM metadata.

Source Data Identification
Vocabulary Version
CDM Version
ETL Version

Broken search paths and bad SQL file names in Linux bulk load

I had trouble creating a Synthea OMOP database with the R scripts (see #13 #14 #15)

So today I tried the Linux bulk loader now on Ubuntu 18 and PostgreSQL 11. I think everything went well up to vocab mapping and core ETL, but then it was all errors after that. See end of post.

I think the search path may not be set correctly in ETL/SQL/AllVisitTable.sql or the ETL steps. Also, the reference to ETL/SQL/AAViTable.sql should be replaced with ETL/SQL/AAVITable.sql

I tried running some of the R scripts individually after that, I think that exposed some other bugs. But it also may have led to duplicate row IDs and mismatched concept realms (see #18)

So finally I set the search path (in DBeaver)

set search_path=cdm_synthea10,native;

and then just copied, pasted, and executed the remaining ETL/SQL/*.sql files in DBeaver, too.

PS there's also a bad filename in

PGOPTIONS="--search_path=$CDM" psql -f "ETL/SQL/cdm_v5.3.constraints_ddl.sql" "postgresql://$USERNAME:$PASSWORD@localhost/$DB"

but maybe you don't care as much about that since it's commented out.

Attempted bulk load session

ubuntu@ip-172-31-88-67:~/ETL-Synthea$ ./bulk-load mimicomop native cdm_synthea10 mimicuser SECRET /home/ubuntu/synthea/output/csv/ /opt/mimicomop/data/vocab
DROP TABLE
CREATE TABLE
---etc---
creating cdm tables
psql:ETL/SQL/drop_cdm_tables.sql:1: ERROR:  table "achilles_analysis" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:2: ERROR:  table "achilles_heel_results" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:3: ERROR:  table "achilles_results" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:4: ERROR:  table "achilles_results_derived" does not exist
psql:ETL/SQL/drop_cdm_tables.sql:5: ERROR:  table "achilles_results_dist" does not exist
DROP TABLE
---etc---
psql:ETL/SQL/drop_cdm_tables.sql:26: ERROR:  table "drug_strength" does not exist
DROP TABLE
---etc---
CREATE TABLE
---etc---
loading data from synthea csv files to relational database
COPY 639
COPY 3096
COPY 8103
COPY 37811
COPY 14695
COPY 8468
COPY 182176
COPY 1152
COPY 31648
loading data from vocab csv files to relational database
COPY 6136766
COPY 60879903
COPY 41501098
COPY 356
COPY 7856597
COPY 46
COPY 2616623
COPY 488
COPY 90
creating vocab maps...
DROP TABLE
SELECT 6136766
CREATE INDEX
CREATE INDEX
DROP TABLE
SELECT 4902384
CREATE INDEX
CREATE INDEX
creating visit logic tables...
psql:ETL/SQL/AllVisitTable.sql:1: NOTICE:  table "ip_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:2: NOTICE:  table "er_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:3: NOTICE:  table "op_visits" does not exist, skipping
DROP TABLE
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:57: ERROR:  relation "encounters" does not exist
LINE 14:    FROM encounters
                 ^
psql:ETL/SQL/AllVisitTable.sql:90: ERROR:  relation "encounters" does not exist
LINE 22:   FROM encounters CL1
                ^
psql:ETL/SQL/AllVisitTable.sql:113: ERROR:  relation "encounters" does not exist
LINE 10:  FROM encounters
               ^
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/AllVisitTable.sql:131: ERROR:  relation "ip_visits" does not exist
LINE 6:    SELECT * FROM IP_VISITS
                         ^
psql:ETL/SQL/AllVisitTable.sql:133: NOTICE:  table "ip_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:134: NOTICE:  table "er_visits" does not exist, skipping
DROP TABLE
psql:ETL/SQL/AllVisitTable.sql:135: NOTICE:  table "op_visits" does not exist, skipping
DROP TABLE
ETL/SQL/AAViTable.sql: No such file or directory
DROP TABLE
SELECT 539
Performing ETL
Loading person
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_person.sql:56: ERROR:  relation "patients" does not exist
LINE 51:  from patients p
               ^
Loading observation_period
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_observation_period.sql:24: ERROR:  relation "encounters" does not exist
LINE 18:   join encounters e
                ^
Loading visit_occurrence
psql:ETL/SQL/insert_visit_occurrence.sql:54: ERROR:  relation "all_visits" does not exist
LINE 48: from all_visits av
              ^
Loading observation
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_observation.sql:96: ERROR:  relation "allergies" does not exist
LINE 46: from allergies a
              ^
Loading condition_occurrence
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_condition_occurrence.sql:56: ERROR:  relation "conditions" does not exist
LINE 44: from conditions c
              ^
Loading procedure_occurrence
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_procedure_occurrence.sql:50: ERROR:  relation "procedures" does not exist
LINE 38: from procedures pr
              ^
Loading measurement
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_measurement.sql:102: ERROR:  relation "procedures" does not exist
LINE 50: from procedures pr
              ^
Loading drug exposure
DROP SEQUENCE
CREATE SEQUENCE
psql:ETL/SQL/insert_drug_exposure.sql:154: ERROR:  relation "conditions" does not exist
LINE 56: from conditions c
              ^
Loading condition_era
DROP SEQUENCE
CREATE SEQUENCE
INSERT 0 0
Loading drug_era
DROP SEQUENCE
CREATE SEQUENCE
INSERT 0 0

Synthea has added additional cost columns to Patient, Medication, and Encounter tables--breaks bulk-load

It appears that the CSV output from Synthea now contains additional columns relating to cost of encounters and medications, and total healthcare expenses to patients. These columns are not created in the native synthea schema by the ETL-Synthea DDL. As a result, bulk-load fails for these three tables, viz.,

ERROR: extra data after last expected column
CONTEXT: COPY encounters, line 2: "1010abf7-397f-8393-9ff0-2c16c1058605,2011-06-20T04:44:41Z,2011-06-20T04:59:41Z,8c2ec300-a8d1-0e5e-88..."
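A quick way to see exactly which columns broke the COPY is to diff the CSV header against the DDL's column list. This is a hedged Python sketch; the DDL list and header string below are illustrative stand-ins for the real encounters DDL and a newer Synthea export:

```python
import csv
import io

def extra_columns(csv_header_line: str, ddl_columns: list) -> list:
    """Report columns present in the Synthea CSV but missing from the
    native-schema DDL -- these cause COPY to fail with
    'extra data after last expected column'."""
    header = next(csv.reader(io.StringIO(csv_header_line)))
    known = {c.lower() for c in ddl_columns}
    return [c for c in header if c.lower() not in known]

# Illustrative: an older encounters DDL vs a newer CSV header.
ddl = ["Id", "Start", "Stop", "Patient", "Provider", "EncounterClass",
       "Code", "Description", "Cost", "ReasonCode", "ReasonDescription"]
header = ("Id,Start,Stop,Patient,Provider,EncounterClass,Code,Description,"
          "Cost,ReasonCode,ReasonDescription,Base_Encounter_Cost,"
          "Total_Claim_Cost,Payer_Coverage")
print(extra_columns(header, ddl))
# -> ['Base_Encounter_Cost', 'Total_Claim_Cost', 'Payer_Coverage']
```

Any columns this reports need to be added to the native DDL (or dropped from the CSV) before the bulk load will succeed.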

readme doesn't acknowledge required host parameter

I just pulled today, and the bulk load script now requires 8 arguments.

DB=$1
HOSTNAME=$2
NATIVE=$3
CDM=$4
USERNAME=$5
PASSWORD=$6
DATA_DIRECTORY=$7
VOCAB_DIRECTORY=$8

The readme still gives a 7 argument example, with no hostname.

./bulk-load synthea10 native cdm_synthea10 postgres lollipop /tmp/synthea/output/csv /tmp/Vocabulary_20181119

Cannot apply foreign keys to CDM after ETL

After running the ETL process, I'd like to add all of the indices and FKs to the resulting CDM. When I attempted to do this, I found that the provider_id in the drug_exposure, measurement and observation tables was set to 0 instead of NULL. Making a note of this here as I have a fix to address this item.
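One way to make the FKs applicable is to turn those placeholder 0s back into NULLs, e.g. UPDATE drug_exposure SET provider_id = NULL WHERE provider_id = 0. Sketched in Python terms (the column list is an illustrative subset, not the actual fix in the package):

```python
FK_COLUMNS = ("provider_id", "visit_detail_id")  # illustrative subset

def nullify_missing_fks(row: dict) -> dict:
    """Replace placeholder 0s in FK columns with None (NULL) so that
    foreign-key constraints can be applied after the ETL; 0 is otherwise
    treated as a real key and fails the constraint check."""
    return {k: (None if k in FK_COLUMNS and v == 0 else v)
            for k, v in row.items()}

row = {"drug_exposure_id": 1, "provider_id": 0, "visit_detail_id": 0}
print(nullify_missing_fks(row))
# -> {'drug_exposure_id': 1, 'provider_id': None, 'visit_detail_id': None}
```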

Parameter 'synthea_schema' not found in SQL

Hi,

I ran into some final issues while running the example R code provided in the following steps:

ETLSyntheaBuilder::CreateVisitRollupTables(cd,"cdm_synthea10","native")

Warning message:
In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema) :
  Parameter 'synthea_schema' not found in SQL

ETLSyntheaBuilder::LoadEventTables(cd,"cdm_synthea10","native")
Error: Error executing SQL: org.postgresql.util.PSQLException: ERROR: column "provider_id" is of type integer but expression is of type text
  Hint: You will need to rewrite or cast the expression.
  Position: 665
An error report has been created at /Users/tine.geldof/errorReport.txt
In addition: Warning messages:
1: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'vocab_schema' not found in SQL
2: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'vocab_schema' not found in SQL
3: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'synthea_schema' not found in SQL
4: In SqlRender::render(sqlQuery, cdm_schema = cdmDatabaseSchema, synthea_schema = syntheaDatabaseSchema, :
  Parameter 'vocab_schema' not found in SQL

I see the provider_id should be fixed in the latest pull request. Any recommendations on how to handle the 'vocab_schema' and 'synthea_schema' that cannot be found?

Thank you

[Error] CONTEXT: COPY xxxxx

What is my problem?
How can I solve the errors below?

CONTEXT: COPY encounters, line 2: "99edd026-a004-4b3f-9e91-23a824a57e52,2012-06-21T16:43:42Z,2012-06-21T17:13:42Z,cd6499a0-18fe-4c2c-a8..."
COPY 113
ERROR: extra data after last expected column
CONTEXT: COPY medications, line 2: "2010-01-04,2010-12-30,244f6a12-dfe3-4fc4-b77e-ddf5275f21a1,7c4411ce-02f1-39b5-b9ec-dfbea9ad3c1a,1e23..."
COPY 1065
ERROR: extra data after last expected column
CONTEXT: COPY patients, line 2: "cd6499a0-18fe-4c2c-a88c-3bd630f8e20a,1984-04-13,,999-49-3625,S99964365,X30889808X,Mr.,Leslie90,Rice9..."

extra data after last expected column

When I run ./bulk-load, it reports some errors like this:

loading data from synthea csv files to relational database
COPY 526
COPY 3521
COPY 8551
ERROR: extra data after last expected column
CONTEXT: COPY encounters, line 2: "465dbc2e-ee30-418b-a3f8-f08f48479804,1977-11-19T06:59:41Z,1977-11-19T07:59:41Z,8ea2abeb-ef06-4691-98..."
COPY 14420
ERROR: extra data after last expected column
CONTEXT: COPY medications, line 2: "1993-02-11,,8ea2abeb-ef06-4691-98f8-fff89980afd7,b1c428d6-4f07-31e0-90f0-68ffa6ff8c76,c39b240b-763b-..."
COPY 282826
ERROR: extra data after last expected column
CONTEXT: COPY patients, line 2: "8ea2abeb-ef06-4691-98f8-fff89980afd7,1974-11-19,,999-91-8917,S99914151,X46842681X,Mr.,Marion502,Thie..."
COPY 36850

I checked the schema's tables and found that the encounters table has gained new cost columns: BASE_ENCOUNTER_COST, TOTAL_CLAIM_COST, PAYER_COVERAGE.

How can I solve this?

observation insert failing on provider_id type

When running the LoadEventTables function, I run into the error below. I think the bolded provider_id should be cast(provider_id as integer)

DBMS:
postgresql

Error:
org.postgresql.util.PSQLException: ERROR: column "provider_id" is of type integer but expression is of type text
  Hint: You will need to rewrite or cast the expression.
  Position: 666

SQL:
insert into synthea_cdm531.observation
(
observation_id,
person_id,
observation_concept_id,
observation_date,
observation_datetime,
observation_type_concept_id,
value_as_number,
value_as_string,
value_as_concept_id,
qualifier_concept_id,
unit_concept_id,
provider_id,
visit_occurrence_id,
visit_detail_id,
observation_source_value,
observation_source_concept_id,
unit_source_value,
qualifier_source_value
)
select row_number()over(order by person_id) observation_id,
person_id,
observation_concept_id,
observation_date,
observation_datetime,
observation_type_concept_id,
value_as_number,
value_as_string,
value_as_concept_id,
qualifier_concept_id,
unit_concept_id,
**provider_id,**
visit_occurrence_id,
visit_detail_id,
observation_source_value,
observation_source_concept_id,
unit_source_value,
qualifier_source_value
from (
select
  p.person_id,
  case when srctostdvm.target_concept_id is NULL then 0 else srctostdvm.target_concept_id end as observation_concept_id,
  a.start observation_date,
  a.start observation_datetime,
  38000280 observation_type_concept_id,
  cast(null as NUMERIC) value_as_number,
  cast(null as varchar) value_as_string,
  0 value_as_concept_id,
  0 qualifier_concept_id,
  0 unit_concept_id,
  NULL provider_id,
  fv.visit_occurrence_id_new visit_occurrence_id,
  0 visit_detail_id,
  a.code observation_source_value,
  coalesce(srctosrcvm.source_concept_id,0) observation_source_concept_id,
  cast(null as varchar) unit_source_value,
  cast(null as varchar) qualifier_source_value
from synthea_native.allergies a
  left join synthea_cdm531.source_to_standard_vocab_map   srctostdvm
on srctostdvm.source_code             = a.code
 and srctostdvm.target_domain_id        = 'Observation'
 and srctostdvm.target_vocabulary_id    = 'SNOMED'
 and srctostdvm.source_vocabulary_id    = 'SNOMED'
 and srctostdvm.target_standard_concept = 'S'
 and srctostdvm.target_invalid_reason IS NULL
left join synthea_cdm531.source_to_source_vocab_map srctosrcvm
  on srctosrcvm.source_code             = a.code
 and srctosrcvm.source_vocabulary_id    = 'SNOMED'
left join synthea_cdm531.final_visit_ids fv
  on fv.encounter_id = a.encounter
join synthea_cdm531.person p
  on p.person_source_value    = a.patient
union all
select
  p.person_id,
  case when srctostdvm.target_concept_id is NULL then 0 else srctostdvm.target_concept_id end as target_concept_id,
  c.start,
  c.start,
  38000276,
  cast(null as NUMERIC),
  cast(null as varchar),
  0,
  0,
  0,
  NULL,
  fv.visit_occurrence_id_new visit_occurrence_id,
  0,
  c.code,
  coalesce(srctosrcvm.source_concept_id,0),
  cast(null as varchar),
  cast(null as varchar)
from synthea_native.conditions c
  left join synthea_cdm531.source_to_standard_vocab_map   srctostdvm
on srctostdvm.source_code             = c.code
 and srctostdvm.target_domain_id        = 'Observation'
 and srctostdvm.target_vocabulary_id    = 'SNOMED'
 and srctostdvm.source_vocabulary_id    = 'SNOMED'
 and srctostdvm.target_standard_concept = 'S'
 and srctostdvm.target_invalid_reason IS NULL
left join synthea_cdm531.source_to_source_vocab_map srctosrcvm
  on srctosrcvm.source_code              = c.code
 and srctosrcvm.source_vocabulary_id     = 'SNOMED'
left join synthea_cdm531.final_visit_ids fv
  on fv.encounter_id                     = c.encounter
join synthea_cdm531.person p
  on p.person_source_value    = c.patient
  ) tmp

Primary key duplicates

The drug_exposure, measurement, and observation insert SQL scripts create duplicate values in the _id fields (e.g. drug_exposure_id) that are supposed to be primary keys.
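If each insert statement (or union branch) numbers its rows independently with row_number(), the IDs collide across statements. One way to avoid that is to draw every _id from a single global sequence. A Python analogue of the idea (illustrative only, not the package's actual SQL):

```python
from itertools import count

def assign_unique_ids(batches, id_field):
    """Re-key rows from several insert batches with one global counter,
    the analogue of drawing every <table>_id from a single database
    sequence instead of restarting row_number() per statement."""
    seq = count(start=1)
    for batch in batches:
        for row in batch:
            row[id_field] = next(seq)
            yield row

batch_a = [{"drug_exposure_id": 1}, {"drug_exposure_id": 2}]
batch_b = [{"drug_exposure_id": 1}]  # would collide with batch_a as-is
rows = list(assign_unique_ids([batch_a, batch_b], "drug_exposure_id"))
print([r["drug_exposure_id"] for r in rows])
# -> [1, 2, 3]  (no duplicates)
```

In SQL the same effect can be had with a sequence/identity column, or by offsetting each statement's row_number() by the table's current max id.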

Load Event Tables

When I execute the Load Event Tables command, ETLSyntheaBuilder::LoadEventTables(cd,"cdm_synthea10","native"),
the error below is output.

I think the attribute of provider_id is wrong.
Could you check it?

Thank you, as always, for your support.

Error:
org.postgresql.util.PSQLException: ERROR: column "provider_id" is of type integer but expression is of type text
ヒント: You will need to rewrite or cast the expression.
位置: 665

'ヒント' means tips
'位置' means line number

Some concept ID columns are all 0

I've run the pipeline using the fork provided by psbrandt at https://github.com/psbrandt/ETL-Synthea. I've used this fork because of the issue described in #47.

I spun up a Postgresql instance and followed the steps as described in the README using RStudio. I obtained all non-license required Athena vocabularies from http://athena.ohdsi.org/vocabulary/list and created native and cdm_synthea10 schemas.

The pipeline ran to completion but the following columns of interest had all of their rows as 0 instead of a relevant OMOP concept id:

  • in table condition_occurrence: condition_concept_id
  • in table drug_exposure: drug_concept_id
  • in table measurement: measurement_concept_id

Is it possible that the code is designed to work with an older version of the Athena vocabularies than what I downloaded two days ago? If not, what would be a useful next step towards debugging?

ETL scripts need both schemas on search path?

I can't get the bulk load script to complete without either

  • replacing the search_path declaration of $CDM with $CDM,$NATIVE in all of the steps after "creating visit logic tables"
  • OR using the same schema for both the native Synthea tables and the OMOP-shaped tables in the bulk loader invocation:

./bulk-load synthea10 127.0.0.1 cdm_synthea10 cdm_synthea10 USERNAME PASSWORD /synthea_data_path /omop_vocab_path
