
gis's Introduction

OHDSI GIS Workgroup

Introduction

Please visit our webpage for more details and the most up-to-date information and software documentation.

For project management, see the GIS Project Page

Click here to propose a new Use Case.

Quick Start

Instructions to quickly install and start using Gaia are here

Support

Contributing

We are eager to engage with developers who have interest or experience in frontend, backend, or geospatial development.

If you are interested in contributing and don't know where to begin, please join the OHDSI GIS WG or email us at zollovenecek[at]ohdsi[dot]org

gis's People

Contributors

jake-gillberg · kzollove · rtmill · tibbben


gis's Issues

Create documentation on registering a data source

Data sources must be registered in the Postgres backbone.data_source table before they can be used within this framework. Depending on the source dataset, registration can be a confusing, complex task.

This documentation should probably be created as a vignette in the R package.

This documentation should give a thorough explanation of each column in the data source table (geom_spec in particular) and walk a potential user through the process of creating a row in the table, updating the DDL for the backbone schema, and creating a PR for the new DDL.
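As a starting point for that vignette, a minimal sketch of creating a row is below; aside from data_source_uuid and geom_spec, the column names and values are illustrative placeholders rather than the actual backbone DDL, and the connection details are assumptions.

```r
# Sketch only: aside from data_source_uuid and geom_spec, the column names and
# values below are illustrative placeholders, not the actual backbone DDL.
library(DBI)
library(RPostgres)

conn <- DBI::dbConnect(RPostgres::Postgres(), dbname = "gis", host = "localhost")

DBI::dbExecute(conn, "
  INSERT INTO backbone.data_source (data_source_uuid, org_id, dataset_name, geom_spec)
  VALUES ($1, $2, $3, $4)",
  params = list("epa_aqs_2017", "EPA", "Air Quality System annual summary",
                '{"format": "shapefile", "epsg": 4326}'))

DBI::dbDisconnect(conn)
```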

Depends on

Keep dependency in memory if loading into PostGIS

Currently, if a dependency for a variable (attr or geom) is not in the database, the importShapefile function will go through a download-upload-import cycle.

The download is from the web-hosted data source into the R environment, where it is transformed into the standard format.

The standardized data is then uploaded to PostGIS.

Now that the dependency exists in PostGIS, the attr and geom are joined and together they are imported back into the R environment.

We should be able to skip the import step if we already have the dependency in memory. The snag is that attr and geom are joined in PostGIS and then imported together. There are three cases to address (sketched after this list):

  1. neither dependency existed in the DB; both were just downloaded into the R environment
  2. geom didn't exist and was just downloaded; attr needs to be imported from the DB
  3. attr didn't exist and was just downloaded; geom needs to be imported from the DB

This is an important issue; addressing it could ease memory strain and help with #59.
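A rough sketch of how the three cases could be branched on is below; getAttrFromDb(), getGeomFromDb(), and the geom_record_id join key are hypothetical placeholders, not functions or columns in the package.

```r
# Hypothetical sketch; getAttrFromDb(), getGeomFromDb(), and geom_record_id are
# placeholder names, not part of the package.
resolveDependencies <- function(geom, attr, conn) {
  # Case 1: neither dependency was in the DB; both are already in memory
  if (!is.null(geom) && !is.null(attr)) {
    return(merge(attr, geom, by = "geom_record_id"))
  }
  # Case 2: geom was just downloaded; attr still has to come from the DB
  if (!is.null(geom)) {
    attr <- getAttrFromDb(conn)  # placeholder import
    return(merge(attr, geom, by = "geom_record_id"))
  }
  # Case 3: attr was just downloaded; geom still has to come from the DB
  geom <- getGeomFromDb(conn)    # placeholder import
  merge(attr, geom, by = "geom_record_id")
}
```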

Implement a UUID system in the variable_source table

This can only happen after work on #69 is completed.

Each variable comes from a data_source. Multiple variables may come from the same data source.

There is a good chance that we could (and should) implement our own variable_source UUID by taking the FK data_source_uuid and appending an autoincrementing number (<data_source_uuid>_01).
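A minimal sketch of that ID construction, assuming the suffix is simply a zero-padded counter:

```r
# Sketch of deriving a variable_source identifier from its data_source UUID
makeVariableSourceId <- function(data_source_uuid, n) {
  sprintf("%s_%02d", data_source_uuid, n)
}

makeVariableSourceId("epa_aqs_2017", 1)
#> [1] "epa_aqs_2017_01"
```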

Determining provenance in locations

We need a way to determine the provenance of locations, specifically whether a location came from the EHR or from some external data set (e.g. EPA sensor data). OHDSI typically uses a '_type_concept_id' to specify this: location_type_concept_id? Could these ever overlap?

Need implementation of extra metadata table for AQS monitors keyed to attribute table

Currently, (1) EPA monitor metadata is loaded into a generic table created from a downloaded static CSV file, and (2) a foreign key column is added to the attribute table. However, no connection is made between the downloaded metadata and the attribute table. This will likely take place in the load_EPA_AQS_static_attr() function of the EPA_loader.R script, but some refactoring may be necessary.
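One possible shape of that connection, sketched as a SQL update run from R; the table and column names are illustrative rather than the actual schema, and `conn` is assumed to be an open DBI connection.

```r
# Illustrative only; table and column names are not the actual schema.
# Assumes `conn` is an open DBI connection to the PostGIS database.
DBI::dbExecute(conn, "
  UPDATE location_attribute la
  SET    monitor_metadata_id = mm.monitor_metadata_id
  FROM   epa_monitor_metadata mm
  WHERE  la.monitor_id = mm.monitor_id")
```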

GIS for CARE_SITE, or person only

How often do CARE_SITEs (i.e. healthcare institutions) really move? If we decide this is only about a patient's history of locations, we could omit the whole domain_id business and use a simple person_id.

Document steps for setting up R environment

Certain steps must be taken when setting up the R environment to move sizeable geospatial data down from the original source, up to the PostGIS database, and then back down to the user's machine.

Some steps for setting up:

  • increase the allowed Java heap size
  • adjust the R system variable for memory allocation
  • increase the download timeout from 60 to ~600 seconds

Could there be a utility function the user runs to make all these changes?
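A minimal sketch of such a utility, with example values only; the memory ceiling (e.g. R_MAX_VSIZE on macOS) generally has to be set in .Renviron before the session starts, so it is only noted in a comment.

```r
# Example values only; tune for the dataset being loaded.
setupGisEnvironment <- function(java_heap = "-Xmx8g", timeout_seconds = 600) {
  # Must be set before rJava (used by some database drivers) is loaded
  options(java.parameters = java_heap)

  # Raise the download timeout from the 60-second default
  options(timeout = timeout_seconds)

  # Note: the R memory ceiling (e.g. R_MAX_VSIZE on macOS) generally has to be
  # set in .Renviron before the session starts, so it cannot be handled here.
  invisible(NULL)
}

setupGisEnvironment()
```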

How to handle attribute dates

With the data source table, we have the information for when the data set was collected as a whole and what timeframe it represents. When we store a measurement from these data sets, what logic do we use to determine the date of the measurement itself?

For instance, in the EPA data set we have a 'mean value of measurement x for sensor y for the year 2017'. In this circumstance, the data_source contains all needed date information; do we want to replicate it in the attribute table?

As an additional example, there are also columns for 'max value' and 'date of max value'. For that record, which date do we use?

Data driven column names for sql queries

Currently, all column names for EPA site data and EPA attribute data, whether from downloaded static files or API calls, are hard-coded into the SQL queries. Is it worth making this data-driven?
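A sketch of what data-driven column names could look like; the table and column names are made up, and `conn` is assumed to be an open DBI connection.

```r
# Illustrative sketch of data-driven column names; table/column names are made up
# and `conn` is assumed to be an open DBI connection.
epa_site_columns <- c("monitor_id", "latitude", "longitude", "state_code")

sql <- sprintf(
  "SELECT %s FROM epa_site_data WHERE state_code = $1",
  paste(epa_site_columns, collapse = ", ")
)

sites <- DBI::dbGetQuery(conn, sql, params = list("25"))
```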

Which tables should have an FK to data_source?

Currently only LOCATION_ATTRIBUTE and AREA_ATTRIBUTE have one, but there is a similar need in other structures: LOCATION, AREA, and POLYGON_SOURCE could all benefit. POLYGON_SOURCE currently has a string field 'data_source_name' that acts similarly. What changes would be needed to extend this functionality to these tables?

Refactor create_indices to automatically get all uuids

There is currently a function get_uuids() that is used only as an argument to create_indices(), to provide a list of all uuids registered in the backbone.data_source table.

Since get_uuids() serves only one purpose, and since getting all uuids is the "default" way create_indices() is used, create_indices() should be refactored to reflect this, removing the unnecessary function (see the sketch below).
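A sketch of the refactored signature; the per-uuid index-creation body is elided and not the package's actual implementation.

```r
# Sketch of the refactor: with no argument, create_indices() indexes every uuid
# registered in backbone.data_source, so get_uuids() is no longer needed.
create_indices <- function(conn, uuids = NULL) {
  if (is.null(uuids)) {
    uuids <- DBI::dbGetQuery(
      conn, "SELECT data_source_uuid FROM backbone.data_source")$data_source_uuid
  }
  for (uuid in uuids) {
    # per-uuid index creation logic goes here (elided)
  }
}
```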

Integration with ATLAS/WebAPI

The current paradigm in OHDSI is to package cohort definitions into JSON objects that can be translated into singular SQL statements fully defining the cohort. Some of these definitions require calculations (e.g. 'where value is between x and y'), but all of them are completed entirely in SQL. In our circumstance, given the lack of GIS compatibility across all DB flavors, we cannot package everything into SQL statements unless every possible calculation is already precalculated and stored, which seems inadvisable if not impossible.

The question becomes: how could we expand the OHDSI cohort definition to include functionality outside of SQL?

Example use cases:

| id | Use case | Result |
|----|----------|--------|
| 1 | Patients who lived in area x for date y | list of person_id |
| 2 | Patients who live in areas with measurement of x | list of person_id |
| 3 | Visits where the patient traveled more than an hour for a CT scan | list of visit_occurrence_id |
| 4 | Care sites that have 3 or more dialysis clinics in the same county | list of care_site |

How to standardize referencing of custom polygons

For political boundaries we will have an area_concept_id that we could use to uniquely identify an area. For custom (e.g. hospital service area) and derived (e.g. risk map) areas, we need the means to consistently and uniquely identify them. The current way, other than specifying the source polygon, would be to leverage the area_name.

Provenance for attribute tables

For instance, say we have data in our AREA_ATTRIBUTE table from both external data sets and derived information pulled out of the CDM. We would need a way to distinguish between the two. Could data_source_id handle that as well, with a separate record for the CDM itself?

Interested in adding Air Quality data

We at BMC are interested in being able to offer census-tract-level air quality data in our place-based data resources. Temperature and air quality on a daily or weekly basis would be a great start; limiting to a subset of census tracts, or at least to Massachusetts, would help us restrict our data volume. Thanks.

Reassess how connections are made and broken between R and the Postgres database

Originally, connections were made in nearly every function that interacted with the database (though not consistently in all of them, and often with overlapping connections). Currently, a connection is created only at the highest level, and all lower-level functions are assumed to be called from within that function's scope.

  1. Reconsider shifting back towards connect/disconnect at a very granular level. Look into how other HADES packages handle this.
  2. Consider employing on.exit() to call disconnect, ensuring that connections don't persist when functions exit without returning (see the sketch below).
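A sketch of the on.exit() pattern from point 2; the connection details and the table/column names are assumptions, not the package's actual API.

```r
# Sketch of the on.exit() pattern; connection details and table/column names are
# assumptions, not the package's actual API.
importVariable <- function(variable_source_id) {
  conn <- DBI::dbConnect(RPostgres::Postgres(), dbname = "gis")
  # Runs even if the query errors or the function returns early
  on.exit(DBI::dbDisconnect(conn), add = TRUE)
  DBI::dbGetQuery(
    conn,
    "SELECT * FROM backbone.variable_source WHERE variable_source_id = $1",
    params = list(variable_source_id)
  )
}
```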

Do we need person level exposures?

Larger topic.

In order to integrate with the other tools, do we need to persist patient-level records, or can we keep them in their native state (i.e. location and area)? I believe that, with adjustments to WebAPI and ATLAS, we can do cohort discovery without persisting person-level data. What about the prediction packages?

Add distance to care

With Tom Concannon's and Jay Greenfield's guidance, identify an algorithm that calculates distance-from-care metrics (travel distance, travel time) and build a routine that uses the algorithm to calculate those metrics for each patient with sufficient location data.
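As a placeholder while that algorithm is identified, a rough sketch using straight-line (great-circle) distance via the sf package is below; actual travel distance/time would come from a routing service, which is not shown, and the coordinates are made up.

```r
# Rough sketch using straight-line (great-circle) distance as a stand-in for the
# eventual travel-distance/travel-time metrics; coordinates are made up.
library(sf)

patient_locations <- st_as_sf(
  data.frame(person_id = c(1, 2), lon = c(-71.06, -71.10), lat = c(42.36, 42.34)),
  coords = c("lon", "lat"), crs = 4326
)
care_sites <- st_as_sf(
  data.frame(care_site_id = 10, lon = -71.07, lat = 42.33),
  coords = c("lon", "lat"), crs = 4326
)

# Distance in meters from each patient location to each care site
st_distance(patient_locations, care_sites)
```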

Create documentation on registering a variable

Similar to #57, we need robust documentation to orient users to methods for creating and adding variables to the backbone.variable_source registry. This is another prerequisite step to using a data source.

Memory exhausted error

There is a recurring but intermittent error that appears when ingesting a shapefile into R from the database. The error seems somehow related to the connect/disconnect pattern, as it occurs more frequently if connections are not well managed. However, it also appears after multiple consecutive spatial table imports.

More context:

  • Spatial table imports are done in batches of 1000 rows at a time (see the sketch after this list). This was initially put in place to avoid memory problems from spatial imports of ~3000 rows.
  • Even with the batch import, multiple consecutive imports (pulling the same 3k-row table in 3-4 times) will cause the error.
  • Since the problem still occurs after a total of only ~10000 rows imported in a single session, it is worrisome to imagine a larger table failing to load even once.
  • The problem always goes away (memory resets) when the R session is restarted, though of course any imported data is cleared and all packages are unloaded.
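For reference, a sketch of the 1000-row batching described in the first bullet, with an explicit gc() between batches as one possible mitigation; the table name is illustrative and `conn` is assumed to be an open DBI connection to the PostGIS database.

```r
# Sketch of the 1000-row batching with an explicit gc() between batches as one
# possible mitigation; the table name is illustrative and `conn` is assumed to be
# an open DBI connection to the PostGIS database.
batch_size <- 1000
offset <- 0
batches <- list()

repeat {
  batch <- sf::st_read(
    conn,
    query = sprintf("SELECT * FROM gis.geom_table ORDER BY id LIMIT %d OFFSET %d",
                    batch_size, offset)
  )
  if (nrow(batch) == 0) break
  batches[[length(batches) + 1]] <- batch
  offset <- offset + batch_size
  gc()  # encourage R to release memory between batches
}

geom_table <- do.call(rbind, batches)
```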

Geographic attributes - name?

How should we refer to geographic attributes? Currently we are using LOCATION_ATTRIBUTE and AREA_ATTRIBUTE, but would another term be more appropriate? AREA_MEASUREMENT? AREA_OBSERVATION?

Rename feature_index to variable_source

The name feature_index is confusing as "feature" means something different in the context of this project than in the wider GIS community. Instead of the term "feature", we will use the term "variable" which is functionally equivalent in the context of the project. Further, we will address the fact that this table is not exactly an index, but more like a second source table due to the amount of unique data it contains.

Changes should be made in the SQL db and in the R code that touches the SQL db.
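On the SQL side, the rename itself could be as small as the statement below (the schema name is an assumption, and `conn` an open DBI connection); the larger effort is substituting the name throughout the R code.

```r
# Assumes an open DBI connection `conn`; the schema name is an assumption.
DBI::dbExecute(conn, "ALTER TABLE backbone.feature_index RENAME TO variable_source")
```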
