ohdsi / eunomia


An R package that facilitates access to a variety of OMOP CDM sample data sets.

Home Page: https://ohdsi.github.io/Eunomia/


eunomia's Introduction

Eunomia

Eunomia is part of HADES.

Introduction

Eunomia is a standard dataset manager for sample OMOP (Observational Medical Outcomes Partnership) Common Data Model (CDM) datasets. Eunomia facilitates access to sample datasets from the EunomiaDatasets repository. Eunomia is used for testing and demonstration purposes, including many of the exercises in the Book of OHDSI. For functions that require a schema name, use 'main'.

Features

Downloads selected sample datasets from the EunomiaDatasets repository, which includes a subset of the Standardized Vocabularies.
Interfaces with the DatabaseConnector and SqlRender packages.
No need to set up a database server. Eunomia runs in your R instance (currently using SQLite).
(planned) Support for other databases.

Example

library(Eunomia)
library(DatabaseConnector)  # provides connect(), querySql(), getTableNames() and disconnect()
connectionDetails <- getEunomiaConnectionDetails()
connection <- connect(connectionDetails)
querySql(connection, "SELECT COUNT(*) FROM person;")
#  COUNT(*)
#1     2694

getTableNames(connection, databaseSchema = 'main')
disconnect(connection)

Technology

Eunomia is an R package providing access to sample datasets from the EunomiaDatasets repository.

System Requirements

Requires R. Some of the packages required by Eunomia require Java.

Installation

See the instructions here for configuring your R environment, including Java.
In R, use the following commands to download and install Eunomia:

install.packages("Eunomia")

User Documentation

Documentation can be found on the package website.

PDF versions of the documentation are also available:

Package manual: Eunomia.pdf

Support

Developer questions/comments/feedback: OHDSI Forum
We use the GitHub issue tracker for all bugs/issues/enhancements

License

Eunomia is licensed under Apache License 2.0

Development

Eunomia is being developed in RStudio.

Development status

Ready for use


eunomia's Issues

CRAN and support for multiple CDMs

I think one of the challenges with keeping Eunomia on CRAN is the size of the package. The mini-CDM embedded in Eunomia is 5.1 MB. Additionally, it would be very helpful to allow for the possibility of multiple Eunomia datasets. For example, the current CDM is not very helpful for testing code that requires cancer-related concepts. I have created a cancer-specific mini-CDM to test code that requires cancer cohorts to be non-empty.

What if we thought of Eunomia as a "data access package" and, rather than embedding the CDM inside the package, provided functionality to easily download and connect to numerous mini-CDMs?

The size of the package submitted to CRAN would be much smaller and we could support multiple mini-CDMs for various domains and use cases (like oncology).

The CDMs would be stored on github in this repo but not submitted to CRAN. This setup would be analogous to https://github.com/mlverse/torchdatasets. Eunomia::getEunomiaConnectionDetails("oncology") would ask the user if they want to download the oncology eunomia dataset from github.

The goal would be to 1) keep Eunomia on CRAN and 2) allow for more than one mini-cdm for testing.

Use RSQLite extended types to support DATE and DATETIME natively

RSQLite has added a new extended_types argument to the dbConnect function that adds support for DATE and DATETIME fields. I propose to switch Eunomia to use that. I see two challenges:

This will require removing all the work-around code in DatabaseConnector and SqlRender that emulates DATE and DATETIME fields. These changes will need to be made backwards compatible somehow. I guess we should define a new dialect in SqlRender called 'sqlite_extended_types', and switch to that if the RSQLite connection has extended_types == TRUE.
We'd need to test whether all examples in the Book of OHDSI still work with the updated version.

Also looping in @ablack3 who has been playing with the extended_types for Andromeda.
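
To illustrate what the proposal relies on, here is a minimal sketch that opens an in-memory SQLite database with extended_types = TRUE and reads back a column declared as DATE. The table and its single row are made up for illustration and are not the Eunomia schema or data.

```r
library(DBI)

# In-memory database; extended_types = TRUE asks RSQLite to map columns
# declared as DATE or DATETIME to the corresponding R classes.
con <- dbConnect(RSQLite::SQLite(), ":memory:", extended_types = TRUE)

# Toy table (illustrative only, not the Eunomia schema).
dbExecute(con, "CREATE TABLE cohort (subject_id INTEGER, cohort_start_date DATE)")
dbExecute(con, "INSERT INTO cohort VALUES (1, '2005-09-15')")

cohort <- dbGetQuery(con, "SELECT * FROM cohort")
dbDisconnect(con)

# With extended types enabled the column should come back as a Date,
# without any emulation code in between.
class(cohort$cohort_start_date)
```

Whether every OHDSI-SQL date function then translates correctly is a separate question that this sketch does not cover.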

CSVs with data

It would be useful to provide CSVs with the data, which could be uploaded into any other DBMS.

Error message when installing in R Studio

Hi, I get this message from RStudio when trying to install.

Error: Failed to install 'Eunomia' from GitHub:
(converted from warning) unable to access index for repository https://OHDSI.github.io/drat/bin/windows/contrib/3.6:
cannot open URL 'https://OHDSI.github.io/drat/bin/windows/contrib/3.6/PACKAGES'

It gives a 404 when I put into a browser. It looks like the file is not there.

While installing dependencies I got this error message in RStudio

ERROR: dependency 'Eunomia' is not available for package 'Hades
Warning messages:
1: package ‘Eunomia’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,

Then I tried:
install.packages("drat")
drat::addRepo("OHDSI")
install.packages("Eunomia")

Then I got this error message:

Warning in install.packages :
unable to access index for repository https://OHDSI.github.io/drat/bin/windows/contrib/4.1:
cannot open URL 'https://OHDSI.github.io/drat/bin/windows/contrib/4.1/PACKAGES'

How can I resolve this issue? I am stuck here.

issue with dates in cohort

Hi,

I was testing some PatientLevelPrediction code and I noticed an error when I tried to select from a cohort table where the cohort_start_date is between two user input dates.

The code: "select * from main.cohort where cohort_start_date <= cast('20050101' as date) and cohort_start_date >= cast('19500101' as date) ;" returns an empty data.frame, even though the table contains dates between 1950 and 2005.

I checked SqlRender and the cast('' as date) is kept unchanged when translating to SQLite. I suspect it should be translated to something else, so maybe that is the issue?
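
One possible explanation can be sketched in a few lines, under the assumption that cohort_start_date is stored as seconds since 1970-01-01 (an assumption, not something confirmed in this issue). SQLite has no DATE storage class, so the cast yields a plain number rather than a date.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# DATE is not a storage class in SQLite; the cast falls back to numeric
# affinity, so the text '20050101' becomes the number 20050101.
upper <- dbGetQuery(con, "SELECT CAST('20050101' AS DATE) AS d")$d
dbDisconnect(con)

# ASSUMPTION: the column holds seconds since 1970-01-01 (UTC).
# Read as seconds, the intended 2005 bound lands in August 1970.
upperAsDate <- as.Date(as.POSIXct(upper, origin = "1970-01-01", tz = "UTC"))
upperAsDate
```

Under that assumption both bounds fall within a few days of each other in 1970, which would explain why no cohort rows match.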

results schema, write access?

The examples show how to run SQL code.
I was trying to run Achilles.
I assume there is no write access and that is why it fails.

Assuming we need a results schema to test the HivDescriptive study on Eunomia, we would write it into a real database, right?

When trying that, it works on some tables but fails on others because of the timestamp format.


> tbl='drug_era'
> library(glue)
> sql=glue::glue("SELECT * FROM {tbl};") %>% as.character()
> sql
[1] "SELECT * FROM drug_era;"
> t<-querySql(connection,sql )
> DatabaseConnector::dbWriteTable(con2,tbl,t)



> tbl='visit_occurrence'
> library(glue)
> sql=glue::glue("SELECT * FROM {tbl};") %>% as.character()
> sql
[1] "SELECT * FROM visit_occurrence;"
> t<-querySql(connection,sql )
> DatabaseConnector::dbWriteTable(con2,tbl,t)
Error in rJava::.jcall(connection@jConnection, "V", "setAutoCommit", TRUE) : 
  java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
In addition: Warning messages:
1: In max(nchar(as.character(obj)), na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf
2: In max(nchar(as.character(obj)), na.rm = TRUE) :
  no non-missing arguments to max; returning -Inf

Missing gender concept ID for Male

I executed the following query on concept table:

select * from cdm.concept where concept_id = 8507

Returned zero rows

How should the numbers in date columns be interpreted?

I see REALs in the date columns and am wondering how they can be converted into dates. I know that SQLite doesn't have a DATE type, and that a common practice is to store the number of seconds since the epoch. However, in this case I see negative values in the columns.
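
As a sketch, assuming the REAL values are seconds since 1970-01-01 00:00:00 UTC (an assumption; the sample values below are made up), negative numbers are simply dates before 1970:

```r
# ASSUMPTION: values are seconds since 1970-01-01 00:00:00 UTC.
# These sample values are illustrative, not taken from Eunomia.
rawDates <- c(-189388800, 0, 1446249600)

# Negative numbers count backwards from the epoch, i.e. dates before 1970.
asDates <- as.Date(as.POSIXct(rawDates, origin = "1970-01-01", tz = "UTC"))
asDates
```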

Eunomia for Lung Cancer Data

Hi Everyone,

I generated lung cancer Synthea data for 1K patients. I was wondering whether I could create an ETL for the cancer extension. I was thinking of doing the following. Please let me know if my understanding is right.

Download the latest from Athena.
Create a Postgres schema for the source and the CDM.
Upload the dictionary data into the CDM.
Create an Episode ETL and update the measurement, condition and drug_exposure ETLs (this is going to be really challenging; please provide some tips on the best way forward).
Load the data into the CDM schema.

Allow users to override the EunomiaDatasets URL with an optional environment variable

In the soon-to-be-released version of Eunomia, users will be able to choose from multiple Eunomia datasets. These datasets are hosted in a separate GitHub repo called EunomiaDatasets. It would be helpful if users could override this location and point to a fork of EunomiaDatasets for testing.

Currently the location is hard coded here:

Eunomia/R/EunomiaData.R

Line 59 in 06f0d3e

    
           baseUrl <- "https://raw.githubusercontent.com/OHDSI/EunomiaDatasets/main/datasets"

Concrete use case:
I created a PR on EunomiaDatasets to fix an issue, but I have no way to test that my change works until after it is accepted. Ideally I'd like to be able to point Eunomia to my personal fork and test that my change fixes the issue. I would propose using an environment variable for the location, with the current location being used if the environment variable is unset.
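
A minimal sketch of that fallback logic could look like the following. The variable name EUNOMIA_DATASETS_URL and the helper getDatasetsBaseUrl are hypothetical placeholders, not part of Eunomia.

```r
# HYPOTHETICAL: neither the variable name EUNOMIA_DATASETS_URL nor this
# helper exists in Eunomia; both are placeholders for the proposal.
getDatasetsBaseUrl <- function() {
  defaultUrl <- "https://raw.githubusercontent.com/OHDSI/EunomiaDatasets/main/datasets"
  override <- Sys.getenv("EUNOMIA_DATASETS_URL", unset = "")
  # Treat an unset or empty variable as "use the default location".
  if (nzchar(override)) override else defaultUrl
}

# Usage: point to a fork for one test, then restore the default.
Sys.setenv(EUNOMIA_DATASETS_URL = "https://example.org/my-fork/datasets")
forkUrl <- getDatasetsBaseUrl()
Sys.unsetenv("EUNOMIA_DATASETS_URL")
defaultUrl <- getDatasetsBaseUrl()
```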

What do you think @fdefalco?

GI Cohort (cohortid = 3) has null cohort_end_date

This SQL shows that cohort_end_date is NA (null) for this cohort:

sql <- "
SELECT TOP 5 *
FROM main.cohort
WHERE cohort_definition_id = 3;
"
renderTranslateQuerySql(connection, sql, snakeCaseToCamelCase = TRUE)
#   cohortDefinitionId subjectId cohortStartDate cohortEndDate
# 1                  3       273      2011-10-10          <NA>
# 2                  3        61      2005-09-15          <NA>
# 3                  3       351      2018-06-28          <NA>
# 4                  3       579      1999-11-06          <NA>
# 5                  3       549      1987-12-28          <NA>

Here’s the cohort SQL: https://github.com/OHDSI/Eunomia/blob/main/inst/sql/sql_server/GiBleed.sql

The bug is that cohorts should all have start and end dates, so this sql should be modified to default to some kind of end date, either condition_date +1d or make it end at the end of the containing observation period.
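
The first option (condition date plus one day) can be sketched in R on rows shaped like the output above. This only illustrates the defaulting rule on a toy data frame; the actual fix would belong in the cohort SQL.

```r
# Toy cohort rows shaped like the query output (camelCase column names).
cohort <- data.frame(
  cohortDefinitionId = 3,
  subjectId = c(273, 61),
  cohortStartDate = as.Date(c("2011-10-10", "2005-09-15")),
  cohortEndDate = as.Date(c(NA, NA))
)

# Default a missing end date to the start date plus one day.
missingEnd <- is.na(cohort$cohortEndDate)
cohort$cohortEndDate[missingEnd] <- cohort$cohortStartDate[missingEnd] + 1
cohort
```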

Add concept_id belonging to the gender domain

The Eunomia concept table does not contain the concept IDs used in the gender_concept_id field of the person table.

This makes it unsuitable for testing queries that join on the gender_concept_id field of the person table, e.g. the incidence rate query that stratifies by gender in CohortDiagnostics.

Please add the concept IDs belonging to the Gender domain to the Eunomia concept table.
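
A check for this kind of gap can be sketched with a left join. The two tables below are tiny in-memory stand-ins created for illustration, not the Eunomia tables.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy stand-ins for the person and concept tables (illustrative data).
dbWriteTable(con, "person",
             data.frame(person_id = 1:3, gender_concept_id = c(8507L, 8532L, 8507L)))
dbWriteTable(con, "concept",
             data.frame(concept_id = 8532L, concept_name = "FEMALE"))

# Gender concept IDs used in person that have no row in concept.
missingConcepts <- dbGetQuery(con, "
  SELECT DISTINCT p.gender_concept_id
  FROM person p
  LEFT JOIN concept c ON c.concept_id = p.gender_concept_id
  WHERE c.concept_id IS NULL
")
dbDisconnect(con)

missingConcepts$gender_concept_id
```

Run against a real connection, the same query would list every gender concept ID that needs to be added.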

Option to use duckdb or SQLite

This is just an idea I'd like some feedback on, to see if it is worth implementing.

The OMOP CDM has a lot of date columns. SQLite does not have a date type and thus also does not support the date manipulation functions used in OHDSI-SQL. SqlRender manages to work around this limitation to a large extent, but there tend to be edge cases where the lack of a date type causes frustration for users.

It is not very difficult to find examples of valid date manipulation OHDSI-SQL that produces incorrect results on SQLite.

library(Eunomia)
#> Loading required package: DatabaseConnector

cd <- getEunomiaConnectionDetails()
con <- connect(cd)
#> Connecting using SQLite driver

# valid OHDSI-SQL that produces an incorrect result
renderTranslateQuerySql(con, "select eomonth(condition_start_date) as eomonth from main.condition_occurrence limit 10")
#>       EOMONTH
#> 1  1446249600
#> 2  1320019200
#> 3   446860800
#> 4  1133308800
#> 5   144460800
#> 6   675648000
#> 7   307497600
#> 8   933379200
#> 9   654652800
#> 10  509932800

# I'm assuming the expected result of eomonth is a date not an integer
# The return type of an OHDSI SQL function should be the same across database platforms right?
disconnect(con)

Created on 2022-08-18 with reprex v2.0.2
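
As a side check, the integers above can be decoded by hand. Assuming they are seconds since 1970-01-01 UTC (an assumption), each one does fall on the last day of a month, so the value appears to be right and only the return type is wrong.

```r
# First three values from the output above.
eomonthRaw <- c(1446249600, 1320019200, 446860800)

# ASSUMPTION: the integers are seconds since 1970-01-01 (UTC).
decoded <- as.Date(as.POSIXct(eomonthRaw, origin = "1970-01-01", tz = "UTC"))

# A date is a month end when the following day falls in a different month.
isMonthEnd <- format(decoded, "%m") != format(decoded + 1, "%m")
data.frame(decoded, isMonthEnd)
```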

duckdb is a file-based database like SQLite that has a date type and supports date manipulation functions. Is there interest in using duckdb with Eunomia in place of SQLite?

Tagging @edward-burn since I think he has also experienced issues with dates in Eunomia.

duckdb was recently added as a supported SQL dialect in the development branch of SqlRender.

Exported csvs have corrupted concept IDs

Example in CONCEPT_ANCESTOR.csv:

21604147,40173590,4,5
3.6e+07,438614,4,4
36156195,1119510,0,0

See attached file.
CONCEPT_ANCESTOR.csv
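
A likely cause is that the IDs were held as doubles and written in scientific notation. The sketch below shows one way to avoid that by formatting ID columns explicitly before export; the data frame is illustrative and this is not necessarily how the Eunomia CSVs were produced.

```r
# Illustrative rows; IDs are stored as doubles, as they often are after a query.
conceptAncestor <- data.frame(
  ancestor_concept_id = c(21604147, 36000000),
  descendant_concept_id = c(40173590, 438614)
)

# Doubles such as 36000000 can be written as 3.6e+07.
# Formatting the ID columns explicitly keeps every digit.
idColumns <- c("ancestor_concept_id", "descendant_concept_id")
conceptAncestor[idColumns] <- lapply(
  conceptAncestor[idColumns],
  function(x) format(x, scientific = FALSE, trim = TRUE)
)

csvFile <- tempfile(fileext = ".csv")
write.csv(conceptAncestor, csvFile, row.names = FALSE, quote = FALSE)
readLines(csvFile)
```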

Example in readme relies on DatabaseConnector but doesn't load it (no longer in 'imports')

Super minor, but for any newbies like me trying to work their way through a new package: the following example in the readme won't run as-is because connect() is a DatabaseConnector function (as far as I can tell), but the example doesn't load it since it is no longer a dependency (#49).

library(Eunomia)
connectionDetails <- getEunomiaConnectionDetails()
connection <- connect(connectionDetails)
> Error in connect(connectionDetails) : could not find function "connect"

EunomiaDatasets format

I've built some additional Eunomia datasets available here in duckdb v0.8 format.

I was wondering if we can use parquet files instead of csv since parquet is smaller than csv.

I was also wondering if I can include a full vocabulary as the full vocab tables are very helpful for examples.

All of these CDMs use the same vocab, so it would be more efficient to store the vocab tables just once and allow Eunomia to use two separate links, one for the vocab tables and one for the other CDM tables.

What do you all think?

@fdefalco, @schuemie

SQLite dump contains non-CDM coxibVsNonselVsGiBleed in main schema

Misspelled column name in cost table of eunomia gibleed dataset

This is using the current release of Eunomia (GiBleed dataset).

REVEUE_CODE_SOURCE_VALUE should be REVENUE_CODE_SOURCE_VALUE

library(DatabaseConnector)

con <- connect(Eunomia::getEunomiaConnectionDetails())
#> Connecting using SQLite driver

querySql(con, "select REVEUE_CODE_SOURCE_VALUE from cost limit 3")
#> [1] REVEUE_CODE_SOURCE_VALUE
#> <0 rows> (or 0-length row.names)

disconnect(con)

Created on 2023-07-11 with reprex v2.0.2

testing achilles run against eunomia via duckdb

I'm trying to test Achilles against Eunomia data sources using duckDb and getting the following error.

dataset <- list(cdmVersion = "5.3", cdmSourceName = "MIMIC", cdmSourceSchema = "main")

cdmVersion <- dataset$cdmVersion
cdmDatabaseSchema <- dataset$cdmSourceSchema
resultsDatabaseSchema <- dataset$cdmSourceSchema
vocabDatabaseSchema <- resultsDatabaseSchema
cdmSourceName <- dataset$cdmSourceName

  databaseFile <- Eunomia::getDatabaseFile(
    cdmSourceName, 
    cdmVersion=cdmVersion,pathToData="D:/OHDSI/EunomiaDatasets",
    dbms = "duckdb"
  )
  
  connectionDetails <- DatabaseConnector::createConnectionDetails(
    dbms = "duckdb", 
    server = databaseFile
  )
  
  ParallelLogger::clearLoggers()
  
  Achilles::achilles(
    cdmVersion = cdmVersion,
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = cdmDatabaseSchema,
    resultsDatabaseSchema = cdmDatabaseSchema,
    createTable = TRUE,
    smallCellCount = 0
  )

Connecting using DuckDB driver
|=============================================================================================================| 100%
Executing SQL took 0.0178 secs
An error report has been created at output/errorReportR.txt
Error in dbAppendTable(conn, name, value) :
Column ANALYSIS_ID does not exist in target table.

Any idea @ablack3 ?

Remove Eunomia's dependency on the Java programming language

getEunomiaConnectionDetails is a Eunomia function that returns a connectionDetails object. I'm thinking we could transition to this API instead.

library(DatabaseConnector)

connectionDetails <- createConnectionDetails("sqlite", server = Eunomia::eunomiaDir("GiBleed"))

connection <- connect(connectionDetails)
#> Connecting using SQLite driver

querySql(connection, "select * from main.person limit 2")
#>   PERSON_ID GENDER_CONCEPT_ID YEAR_OF_BIRTH MONTH_OF_BIRTH DAY_OF_BIRTH
#> 1         6              8532          1963             12           31
#> 2       123              8507          1950              4           12
#>   BIRTH_DATETIME RACE_CONCEPT_ID ETHNICITY_CONCEPT_ID LOCATION_ID PROVIDER_ID
#> 1     1963-12-31            8516                    0          NA          NA
#> 2     1950-04-12            8527                    0          NA          NA
#>   CARE_SITE_ID                  PERSON_SOURCE_VALUE GENDER_SOURCE_VALUE
#> 1           NA 001f4a87-70d0-435c-a4b9-1425f6928d33                   F
#> 2           NA 052d9254-80e8-428f-b8b6-69518b0ef3f3                   M
#>   GENDER_SOURCE_CONCEPT_ID RACE_SOURCE_VALUE RACE_SOURCE_CONCEPT_ID
#> 1                        0             black                      0
#> 2                        0             white                      0
#>   ETHNICITY_SOURCE_VALUE ETHNICITY_SOURCE_CONCEPT_ID
#> 1            west_indian                           0
#> 2                italian                           0

disconnect(connection)

Created on 2023-04-07 with reprex v2.0.2

What is the benefit of this API over the current one?

Well it would mean that Eunomia would no longer need to import DatabaseConnector and would no longer require Java. DatabaseConnector would still require Java of course but Eunomia would not. It would make Eunomia available to be used in environments where Java is not installed or not working properly.

We could also still support getEunomiaConnectionDetails and remove the dependency on DatabaseConnector/Java by manually assigning the "ConnectionDetails" class.

Does the order of columns matter in the CDM specification?

Should the column order of a CDM match the order in the specification or can the columns be in any order and still be considered a valid CDM?

In Eunomia GiBleed I see that the column order in the condition_occurrence table does not match the column order in the 5.3 specification. Is this an error or can columns be in any order in a CDM?

https://ohdsi.github.io/CommonDataModel/cdm53.html#CONDITION_OCCURRENCE

library(DatabaseConnector)
con <- connect(Eunomia::getEunomiaConnectionDetails())
#> Connecting using SQLite driver
querySql(con, "select * from main.condition_occurrence limit 0")
#>  [1] CONDITION_OCCURRENCE_ID       PERSON_ID                    
#>  [3] CONDITION_CONCEPT_ID          CONDITION_START_DATE         
#>  [5] CONDITION_START_DATETIME      CONDITION_END_DATE           
#>  [7] CONDITION_END_DATETIME        CONDITION_TYPE_CONCEPT_ID    
#>  [9] STOP_REASON                   PROVIDER_ID                  
#> [11] VISIT_OCCURRENCE_ID           VISIT_DETAIL_ID              
#> [13] CONDITION_SOURCE_VALUE        CONDITION_SOURCE_CONCEPT_ID  
#> [15] CONDITION_STATUS_SOURCE_VALUE CONDITION_STATUS_CONCEPT_ID  
#> <0 rows> (or 0-length row.names)
disconnect(con)
packageVersion("Eunomia")
#> [1] '1.0.2'

Created on 2023-07-19 with reprex v2.0.2

CONDITION_STATUS_CONCEPT_ID comes at the end but in the specification it comes before STOP_REASON.
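
The distinction being asked about (same columns versus same column order) can be made explicit with a small check. The vectors below are a shortened, illustrative subset of the columns, with the expected order following the description in this issue.

```r
# Shortened subset of the order returned by the query above.
actual <- c("CONDITION_TYPE_CONCEPT_ID", "STOP_REASON",
            "PROVIDER_ID", "CONDITION_STATUS_CONCEPT_ID")

# Order as described in this issue: the status concept precedes STOP_REASON.
expected <- c("CONDITION_TYPE_CONCEPT_ID", "CONDITION_STATUS_CONCEPT_ID",
              "STOP_REASON", "PROVIDER_ID")

sameColumns <- setequal(actual, expected)  # same set of columns?
sameOrder <- identical(actual, expected)   # same positions as well?
c(sameColumns = sameColumns, sameOrder = sameOrder)
```

Queries that select columns by name are unaffected by the order; code that relies on position, such as a bulk load without a column list, is not.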

Provide SQLite db

Hi,

Could you share the SQLite databases somewhere ? I would like to explore CDM from sqlitebrowser.

Add function to list available Eunomia datasets

Maybe something like listAvailableDatasets.

Perhaps this function should dynamically query GitHub to see what is available rather than hardcoding dataset names. I'm not sure what the best approach is, though.

Provide an option for caching the database?

I use Eunomia a lot while developing, and with the new version (that downloads the data from the world wide web) it has to download the same database every time (which can be dozens of times in a single development session). Might we support some caching mechanism?

For example, if we detect that the user has specified a EUNOMIA_CACHE_FOLDER environment variable, the first time we can download the database to that folder, then create a copy in temp and hand that to the user. The next time, we can check whether the database already exists in the cache folder and skip the download. We could use the file name in the cache folder to distinguish between different versions or types of data. The function signature could become:

getEunomiaConnectionDetails <- function(databaseFile = tempfile(fileext = ".sqlite"), 
                                        dbms = "sqlite",
                                        cacheFolder = Sys.getenv("EUNOMIA_CACHE_FOLDER")) {
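
The cache-then-copy flow described above can be sketched independently of Eunomia. The helper getCachedDatabase and its downloadFun argument are hypothetical; downloadFun stands in for whatever Eunomia uses to fetch a dataset.

```r
# HYPOTHETICAL helper: `downloadFun` is called only when the cache folder
# does not yet hold a copy of the requested file.
getCachedDatabase <- function(fileName, cacheFolder, downloadFun) {
  cachedFile <- file.path(cacheFolder, fileName)
  if (!file.exists(cachedFile)) {
    dir.create(cacheFolder, recursive = TRUE, showWarnings = FALSE)
    downloadFun(cachedFile)
  }
  # Hand out a temp copy so the cached original is never modified.
  workingCopy <- tempfile(fileext = ".sqlite")
  file.copy(cachedFile, workingCopy)
  workingCopy
}

# Usage with a fake downloader that only writes a placeholder file.
downloads <- 0
fakeDownload <- function(path) {
  downloads <<- downloads + 1
  writeLines("placeholder", path)
}
cacheFolder <- tempfile("eunomia-cache-")
first <- getCachedDatabase("GiBleed.sqlite", cacheFolder, fakeDownload)
second <- getCachedDatabase("GiBleed.sqlite", cacheFolder, fakeDownload)
downloads  # the second call reuses the cached file
```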

A new GiBleed dataset is always downloaded, breaking unit tests on R 4.4.0

A new GiBleed dataset is always downloaded when getting the Eunomia connection details with Eunomia::getEunomiaConnectionDetails(), even though both the zip file and the SQLite file already exist.

When the function is run in an enclosed environment (unit test, GitHub Actions, reprex), it also throws an additional error:

#> Error: table person already exists

as it is trying to overwrite the existing CDM.

file.exists(file.path(Sys.getenv("EUNOMIA_DATA_FOLDER"), "GiBleed_5.3.sqlite"))
#> [1] TRUE

connectionDetails <- Eunomia::getEunomiaConnectionDetails()
#> attempting to download GiBleed
#> attempting to extract and load: D:/Users/mvankessel/Documents/EunomiaCache/GiBleed_5.3.zip to: D:/Users/mvankessel/Documents/EunomiaCache/GiBleed_5.3.sqlite
#> Error: table person already exists

Created on 2024-05-22 with reprex v2.1.0

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24 ucrt)
#>  os       Windows 11 x64 (build 22631)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.utf8
#>  ctype    Dutch_Netherlands.utf8
#>  tz       Europe/Amsterdam
#>  date     2024-05-22
#>  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package           * version date (UTC) lib source
#>    backports           1.4.1   2021-12-13 [1] CRAN (R 4.4.0)
#>    bit                 4.0.5   2022-11-15 [1] CRAN (R 4.4.0)
#>    bit64               4.0.5   2020-08-30 [1] CRAN (R 4.4.0)
#>    blob                1.2.4   2023-03-17 [1] CRAN (R 4.4.0)
#>    cachem              1.1.0   2024-05-16 [1] CRAN (R 4.4.0)
#>    checkmate           2.3.1   2023-12-04 [1] CRAN (R 4.4.0)
#>    cli                 3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
#>    CommonDataModel     0.2.0   2024-02-07 [1] CRAN (R 4.4.0)
#>    DatabaseConnector   6.3.2   2023-12-11 [1] CRAN (R 4.4.0)
#>    DBI                 1.2.2   2024-02-16 [1] CRAN (R 4.4.0)
#>    digest              0.6.35  2024-03-11 [1] CRAN (R 4.4.0)
#>    Eunomia             2.0.0   2024-04-23 [1] CRAN (R 4.4.0)
#>    evaluate            0.23    2023-11-01 [1] CRAN (R 4.4.0)
#>    fansi               1.0.6   2023-12-08 [1] CRAN (R 4.4.0)
#>    fastmap             1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>    fs                  1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>    glue                1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>    hms                 1.1.3   2023-03-21 [1] CRAN (R 4.4.0)
#>    htmltools           0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>    knitr               1.46    2024-04-06 [1] CRAN (R 4.4.0)
#>    lifecycle           1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>    magrittr            2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>    memoise             2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
#>    pillar              1.9.0   2023-03-22 [1] CRAN (R 4.4.0)
#>    pkgconfig           2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>    R6                  2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
#>    readr               2.1.5   2024-01-10 [1] CRAN (R 4.4.0)
#>    reprex              2.1.0   2024-01-11 [1] CRAN (R 4.4.0)
#>  D rJava               1.0-11  2024-01-26 [1] CRAN (R 4.4.0)
#>    rlang               1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
#>    rmarkdown           2.27    2024-05-17 [1] CRAN (R 4.4.0)
#>    RSQLite             2.3.6   2024-03-31 [1] CRAN (R 4.4.0)
#>    rstudioapi          0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
#>    sessioninfo         1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>    SqlRender           1.17.0  2024-03-20 [1] CRAN (R 4.4.0)
#>    tibble              3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>    tzdb                0.4.0   2023-05-12 [1] CRAN (R 4.4.0)
#>    utf8                1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>    vctrs               0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>    withr               3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>    xfun                0.44    2024-05-15 [1] CRAN (R 4.4.0)
#>    yaml                2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#> 
#>  [1] C:/R/R-4.4.0/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

The following code should resolve this:

getEunomiaConnectionDetails <- function(databaseFile = tempfile(fileext = ".sqlite"), dbms = "sqlite") {
  if (interactive() && !("DatabaseConnector" %in% rownames(utils::installed.packages()))) {
    message("The DatabaseConnector package is required but not installed.")
    if (!isTRUE(utils::askYesNo("Would you like to install DatabaseConnector?"))) {
      return(invisible(NULL))
    } else {
      utils::install.packages("DatabaseConnector")
    }
  }

  # Reuse the existing database when it is already present; download only otherwise.
  # Assigning datasetLocation up front keeps it defined on both paths.
  datasetLocation <- file.path(Sys.getenv("EUNOMIA_DATA_FOLDER"), "GiBleed_5.3.sqlite")
  if (!file.exists(datasetLocation)) {
    datasetLocation <- getDatabaseFile(datasetName = "GiBleed", dbms = dbms, databaseFile = databaseFile)
  }
  DatabaseConnector::createConnectionDetails(dbms = dbms, server = datasetLocation)
}

file.exists(file.path(Sys.getenv("EUNOMIA_DATA_FOLDER"), "GiBleed_5.3.sqlite"))
#> [1] TRUE

connectionDetails <- getEunomiaConnectionDetails()

Created on 2024-05-22 with reprex v2.1.0

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.0 (2024-04-24 ucrt)
#>  os       Windows 11 x64 (build 22631)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.utf8
#>  ctype    Dutch_Netherlands.utf8
#>  tz       Europe/Amsterdam
#>  date     2024-05-22
#>  pandoc   3.1.11 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package           * version date (UTC) lib source
#>    bit                 4.0.5   2022-11-15 [1] CRAN (R 4.4.0)
#>    bit64               4.0.5   2020-08-30 [1] CRAN (R 4.4.0)
#>    cli                 3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
#>    DatabaseConnector   6.3.2   2023-12-11 [1] CRAN (R 4.4.0)
#>    DBI                 1.2.2   2024-02-16 [1] CRAN (R 4.4.0)
#>    digest              0.6.35  2024-03-11 [1] CRAN (R 4.4.0)
#>    evaluate            0.23    2023-11-01 [1] CRAN (R 4.4.0)
#>    fastmap             1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>    fs                  1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>    glue                1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>    htmltools           0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>    knitr               1.46    2024-04-06 [1] CRAN (R 4.4.0)
#>    lifecycle           1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>    reprex              2.1.0   2024-01-11 [1] CRAN (R 4.4.0)
#>  D rJava               1.0-11  2024-01-26 [1] CRAN (R 4.4.0)
#>    rlang               1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
#>    rmarkdown           2.27    2024-05-17 [1] CRAN (R 4.4.0)
#>    rstudioapi          0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
#>    sessioninfo         1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>    withr               3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>    xfun                0.44    2024-05-15 [1] CRAN (R 4.4.0)
#>    yaml                2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#> 
#>  [1] C:/R/R-4.4.0/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Add description of how this data is simulated

Please add some comments about how this data is simulated - how many patients, how the conditions/procedures/observations are selected/generated etc. This information is important when considering for which tests one can use this package.

Error in extractLoadData

This happens when I have cleared my Eunomia data directory and I am using the develop branch

> library(Eunomia)
> connectionDetails <- getEunomiaConnectionDetails()
attempting to download GiBleed
trying URL 'https://raw.githubusercontent.com/OHDSI/EunomiaDatasets/main/datasets/GiBleed/GiBleed_5.3.zip'
Content type 'application/zip' length 6863696 bytes (6.5 MB)
==================================================
downloaded 6.5 MB

attempting to extract and load /Users/ginberg/Data/eunomia/GiBleed_5.3.zip
Unzipping /Users/ginberg/Data/eunomia/GiBleed_5.3.zip
Error in extractLoadData(from = archiveLocation, to = datasetLocation, :
Data file does not contain .CSV files to load into the database.

Apparently, when unzipping, the archive folder is added under tempFileLocation, so the CSV files end up in a subdirectory of tempFileLocation and list.files can't find them.
This is where the unzipping happens:
https://github.com/OHDSI/Eunomia/blob/develop/R/EunomiaData.R#L102
When I add junkpaths = TRUE, it works for me.
Have you seen this error before, @fdefalco?
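
The symptom can be reproduced without a zip file. The sketch below recreates a nested extraction folder and shows that a top-level listing misses the CSV while a recursive listing finds it. A recursive listing is an alternative to junkpaths = TRUE, not what the package does; the folder and file names are illustrative.

```r
# Recreate the situation: the CSV ends up one level below the extraction folder.
extractFolder <- tempfile("unzipped-")
dir.create(file.path(extractFolder, "GiBleed_5.3"), recursive = TRUE)
writeLines("person_id", file.path(extractFolder, "GiBleed_5.3", "PERSON.csv"))

# A top-level listing misses the file; a recursive one finds it.
topLevel <- list.files(extractFolder, pattern = "\\.csv$", ignore.case = TRUE)
nested <- list.files(extractFolder, pattern = "\\.csv$", ignore.case = TRUE,
                     recursive = TRUE, full.names = TRUE)

length(topLevel)
basename(nested)
```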

My sessionInfo

R version 4.2.1 (2022-06-23)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Eunomia_2.0.0           DatabaseConnector_6.2.1

loaded via a namespace (and not attached):
 [1] fansi_1.0.4     tzdb_0.3.0      utf8_1.2.3      duckdb_0.6.1    R6_2.5.1        lifecycle_1.0.3 DBI_1.1.3       magrittr_2.0.3  RSQLite_2.3.1   pillar_1.9.0   
[11] rlang_1.1.0     cachem_1.0.7    cli_3.6.1       rstudioapi_0.14 blob_1.2.4      vctrs_0.6.1     tools_4.2.1     bit64_4.0.5     readr_2.1.4     glue_1.6.2     
[21] bit_4.0.5       hms_1.1.3       fastmap_1.1.1   compiler_4.2.1  pkgconfig_2.0.3 rJava_1.0-6     memoise_2.0.1   tibble_3.2.1

Eunomia 2.0 Unit Test Notes

Just making an issue here to note any changes made while attempting to update unit tests to use Eunomia 2.0 (currently on the develop branch).

Unit tests will need to set up the EUNOMIA_DATA_FOLDER environment variable so GitHub Actions (GHA) has a place to store the data. Here is some boilerplate code that could go into the tests/testthat/setup.R file for reference:

dataFolder <- tempfile()
dir.create(dataFolder)
oldEunomiaFolder <- Sys.getenv("EUNOMIA_DATA_FOLDER")
Sys.setenv("EUNOMIA_DATA_FOLDER" = dataFolder)
withr::defer(
  {
    unlink(dataFolder, recursive = TRUE)  # dataFolder is a directory, so remove it recursively
    Sys.setenv("EUNOMIA_DATA_FOLDER" = oldEunomiaFolder)
  },
  testthat::teardown_env()
)

Vignettes that use Eunomia will also need to set the EUNOMIA_DATA_FOLDER environment variable in order for R CMD check to work properly on GHA.
The signature of the function Eunomia::getEunomiaConnectionDetails() has changed. v1.x supported the following signature:

getEunomiaConnectionDetails <- function(databaseFile = tempfile(fileext = ".sqlite"))

v2.x removes the databaseFile parameter from Eunomia::getEunomiaConnectionDetails(). The location of the database file is instead governed by the EUNOMIA_DATA_FOLDER environment variable. Alternatively, you can make use of the Eunomia::getConnectionDetails function for more fine-grained control over the dataset used, etc.

Inconsistencies between `visit_occurrence_id` within records and `visit_occurrence_id` in table `VISIT_OCCURRENCE`

Hello,

First of all, thank you for creating this dataset 👍🏻:

We (@mrueda @palilsonjmr) have been using your synthetic dataset to test an OMOP-CDM instance and have encountered an issue with the consistency of the visit_occurrence_id associated with each person_id.

Specifically, we have noticed a discrepancy between the visit_occurrence_id for each person_id in the VISIT_OCCURRENCE table and the visit_occurrence_id for the same person_id appearing in other tables (CONDITION_OCCURRENCE, DRUG_EXPOSURE, MEASUREMENT, OBSERVATION, and PROCEDURE_OCCURRENCE).

For example, person_id = 1 has only 1 visit_occurrence_id (85) in the VISIT_OCCURRENCE table, but in the MEASUREMENT table, this person_id has multiple visit_occurrence_id (22, 41, 42, 43, 45, ...).
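
A check of this kind can be sketched on toy data frames. The IDs 85 and 22 come from the example above; the remaining values are made up for illustration.

```r
# Toy extracts of the two tables (only the columns needed for the check).
visitOccurrence <- data.frame(person_id = c(1, 2), visit_occurrence_id = c(85, 90))
measurement <- data.frame(person_id = c(1, 1, 2), visit_occurrence_id = c(22, 85, 90))

# Build person/visit keys and flag measurement rows whose key
# does not appear in VISIT_OCCURRENCE.
visitKeys <- paste(visitOccurrence$person_id, visitOccurrence$visit_occurrence_id)
measurementKeys <- paste(measurement$person_id, measurement$visit_occurrence_id)
orphans <- measurement[!(measurementKeys %in% visitKeys), ]
orphans
```

The same comparison could be repeated for each clinical table that carries a visit_occurrence_id.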

Are you guys aware of this issue?

Best,

Manu