
gis's Introduction

OHDSI GIS Workgroup

Introduction

Please visit our webpage for more details and the most up-to-date information and software documentation.

For project management, see the GIS Project Page

Click here to propose a new Use Case.

Quick Start

Instructions to quickly install and start using Gaia are here

Support

Contributing

We are eager to engage with developers who have interest or experience in frontend, backend, or geospatial development.

If you are interested in contributing and don't know where to begin, please join the OHDSI GIS WG or email us at zollovenecek[at]ohdsi[dot]org

gis's People

Contributors

jake-gillberg · kzollove · rtmill · tibbben


gis's Issues

Create documentation on registering a data source

Data sources must be registered in the Postgres backbone.data_source table before they can be used within this framework. Depending on the source dataset, registration can be a confusing, complex task.

This documentation should probably be created as a vignette in the R package.

This documentation should give a thorough explanation of each column in the data source table (geom_spec in particular) and walk a potential user through the process of creating a row in the table, updating the DDL for the backbone schema, and creating a PR for the new DDL.
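As a starting point for that vignette, a minimal sketch of creating a row is below; aside from data_source_uuid and geom_spec, the column names and values are illustrative placeholders rather than the actual backbone DDL, and the connection details are assumptions.

```r
# Sketch only: aside from data_source_uuid and geom_spec, the column names and
# values below are illustrative placeholders, not the actual backbone DDL.
library(DBI)
library(RPostgres)

conn <- DBI::dbConnect(RPostgres::Postgres(), dbname = "gis", host = "localhost")

DBI::dbExecute(conn, "
  INSERT INTO backbone.data_source (data_source_uuid, org_id, dataset_name, geom_spec)
  VALUES ($1, $2, $3, $4)",
  params = list("epa_aqs_2017", "EPA", "Air Quality System annual summary",
                '{"format": "shapefile", "epsg": 4326}'))

DBI::dbDisconnect(conn)
```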

Depends on

Keep dependency in memory if loading into PostGIS

Currently, if a dependency for a variable (attr or geom) is not in the database, the importShapefile function will go through a download-upload-import cycle.

The download is from the web-hosted data source into the R environment, where it is transformed into the standard format.

The standardized data is then uploaded to PostGIS.

Now that the dependency exists in PostGIS, the attr and geom are joined and together they are imported back into the R environment.

We should be able to skip the import step if we already have the dependency in memory. The snag is that attr and geom are joined in PostGIS and then imported together. There are three cases to address (sketched after this list):

  1. neither dependency existed in the DB; both were just downloaded into the R environment
  2. geom didn't exist and was just downloaded; attr needs to be imported from the DB
  3. attr didn't exist and was just downloaded; geom needs to be imported from the DB

This is an important issue; addressing it could ease memory strain and help with #59.
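A rough sketch of how the three cases could be branched on is below; getAttrFromDb(), getGeomFromDb(), and the geom_record_id join key are hypothetical placeholders, not functions or columns in the package.

```r
# Hypothetical sketch; getAttrFromDb(), getGeomFromDb(), and geom_record_id are
# placeholder names, not part of the package.
resolveDependencies <- function(geom, attr, conn) {
  # Case 1: neither dependency was in the DB; both are already in memory
  if (!is.null(geom) && !is.null(attr)) {
    return(merge(attr, geom, by = "geom_record_id"))
  }
  # Case 2: geom was just downloaded; attr still has to come from the DB
  if (!is.null(geom)) {
    attr <- getAttrFromDb(conn)  # placeholder import
    return(merge(attr, geom, by = "geom_record_id"))
  }
  # Case 3: attr was just downloaded; geom still has to come from the DB
  geom <- getGeomFromDb(conn)    # placeholder import
  merge(attr, geom, by = "geom_record_id")
}
```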

Implement a UUID system in the variable_source table

This can only happen after work on #69 is completed.

Each variable comes from a data_source. Multiple variables may come from the same data source.

There is a good chance that we could (and should) implement our own variable_source UUID by taking the FK data_source_uuid and appending an autoincrementing number (<data_source_uuid>_01).
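A minimal sketch of that ID construction, assuming the suffix is simply a zero-padded counter:

```r
# Sketch of deriving a variable_source identifier from its data_source UUID
makeVariableSourceId <- function(data_source_uuid, n) {
  sprintf("%s_%02d", data_source_uuid, n)
}

makeVariableSourceId("epa_aqs_2017", 1)
#> [1] "epa_aqs_2017_01"
```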

Determining provenance in locations

We need a way to determine the provenance of locations, specifically whether a location came from the EHR or from some external data set (e.g. EPA sensor data). OHDSI typically uses a '_type_concept_id' to specify this: location_type_concept_id? Could these ever overlap?

Need implementation of extra metadata table for AQS monitors keyed to attribute table

Currently, (1) EPA monitor metadata is loaded into a generic table created from a downloaded static CSV file, and (2) a foreign key column is added to the attribute table. However, no connection is made between the downloaded metadata and the attribute table. This will likely take place in the load_EPA_AQS_static_attr() function of the EPA_loader.R script, but some refactoring may be necessary.
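One possible shape of that connection, sketched as a SQL update run from R; the table and column names are illustrative rather than the actual schema, and `conn` is assumed to be an open DBI connection.

```r
# Illustrative only; table and column names are not the actual schema.
# Assumes `conn` is an open DBI connection to the PostGIS database.
DBI::dbExecute(conn, "
  UPDATE location_attribute la
  SET    monitor_metadata_id = mm.monitor_metadata_id
  FROM   epa_monitor_metadata mm
  WHERE  la.monitor_id = mm.monitor_id")
```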

GIS for CARE_SITE, or person only

How often do CARE_SITEs (i.e. healthcare institutions) really move? If we decide this is only about a patient's history of locations, we could omit the whole domain_id business and use a simple person_id.

Document steps for setting up R environment

Certain steps must be taken when setting up the R environment to move sizeable geospatial data down from the original source, up to the PostGIS database, and then back down to the user's machine.

Some steps for setting up:

  • increase the allowed Java heap size
  • adjust the R system variable for memory allocation
  • increase the download timeout from 60 to ~600 seconds

Could there be a utility function the user runs to make all these changes?
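A minimal sketch of such a utility, with example values only; the memory ceiling (e.g. R_MAX_VSIZE on macOS) generally has to be set in .Renviron before the session starts, so it is only noted in a comment.

```r
# Example values only; tune for the dataset being loaded.
setupGisEnvironment <- function(java_heap = "-Xmx8g", timeout_seconds = 600) {
  # Must be set before rJava (used by some database drivers) is loaded
  options(java.parameters = java_heap)

  # Raise the download timeout from the 60-second default
  options(timeout = timeout_seconds)

  # Note: the R memory ceiling (e.g. R_MAX_VSIZE on macOS) generally has to be
  # set in .Renviron before the session starts, so it cannot be handled here.
  invisible(NULL)
}

setupGisEnvironment()
```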

How to handle attribute dates

With the data source table, we have the information for when the data set was collected as a whole and what timeframe it represents. When we store a measurement from these data sets, what logic do we use to determine the date of the measurement itself?

For instance, in the EPA data set we have a 'mean value of measurement x for sensor y for the year 2017'. In this circumstance, the data_source contains all needed date information; do we want to replicate it in the attribute table?

As an additional example, there are also columns for 'max value' and 'date of max value'. For that record, which date do we use?

Data driven column names for sql queries

Currently, all column names for EPA site data and EPA attribute data, whether from downloaded static files or API calls, are hard-coded into the SQL queries. Is it worth making this data-driven?
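A sketch of what data-driven column names could look like; the table and column names are made up, and `conn` is assumed to be an open DBI connection.

```r
# Illustrative sketch of data-driven column names; table/column names are made up
# and `conn` is assumed to be an open DBI connection.
epa_site_columns <- c("monitor_id", "latitude", "longitude", "state_code")

sql <- sprintf(
  "SELECT %s FROM epa_site_data WHERE state_code = $1",
  paste(epa_site_columns, collapse = ", ")
)

sites <- DBI::dbGetQuery(conn, sql, params = list("25"))
```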

Which tables should have an FK to data_source?

Currently only LOCATION_ATTRIBUTE and AREA_ATTRIBUTE have one, but there is a similar need in other structures: LOCATION, AREA, and POLYGON_SOURCE could all benefit. POLYGON_SOURCE currently has a string field 'data_source_name' that acts similarly. What changes would be needed to extend this functionality to these tables?

Refactor create_indices to automatically get all uuids

There is currently a function get_uuids() that is used only as an argument to create_indices(), to provide a list of all uuids registered in the backbone.data_source table.

Since get_uuids() serves only one purpose, and since getting all uuids is the "default" way create_indices() is used, create_indices() should be refactored to reflect this, removing the unnecessary function (see the sketch below).
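A sketch of the refactored signature; the per-uuid index-creation body is elided and not the package's actual implementation.

```r
# Sketch of the refactor: with no argument, create_indices() indexes every uuid
# registered in backbone.data_source, so get_uuids() is no longer needed.
create_indices <- function(conn, uuids = NULL) {
  if (is.null(uuids)) {
    uuids <- DBI::dbGetQuery(
      conn, "SELECT data_source_uuid FROM backbone.data_source")$data_source_uuid
  }
  for (uuid in uuids) {
    # per-uuid index creation logic goes here (elided)
  }
}
```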

Integration with ATLAS/WebAPI

The current paradigm in OHDSI is to package cohort definitions into JSON objects that can be translated into singular SQL statements fully defining the cohort. Some of these definitions require calculations (e.g. 'where value is between x and y'), but all of them are completed entirely in SQL. In our circumstance, given the lack of GIS compatibility across all DB flavors, we cannot package everything into SQL statements unless every possible calculation is already precalculated and stored, which seems inadvisable if not impossible.

The question becomes: how could we expand the OHDSI cohort definition to include functionality outside of SQL?

Example use cases:

| id | Use case | Result |
|----|----------|--------|
| 1 | Patients who lived in area x for date y | list of person_id |
| 2 | Patients who live in areas with measurement of x | list of person_id |
| 3 | Visits where the patient traveled more than an hour for a CT scan | list of visit_occurrence_id |
| 4 | Care sites that have 3 or more dialysis clinics in the same county | list of care_site |

How to standardize referencing of custom polygons

For political boundaries we will have an area_concept_id that we could use to uniquely identify an area. For custom (e.g. hospital service area) and derived (e.g. risk map) areas, we need the means to consistently and uniquely identify them. The current way, other than specifying the source polygon, would be to leverage the area_name.

Provenance for attribute tables

For instance, say we have data in our AREA_ATTRIBUTE table from both external data sets and derived information pulled out of the CDM. We would need a way to distinguish between the two. Could data_source_id handle that as well, with a separate record for the CDM itself?

Interested in adding Air Quality data

We at BMC are interested in being able to offer census-tract-level air quality data in our place-based data resources. Temperature and air quality on a daily or weekly basis would be a great start; limiting to a subset of census tracts, or at least to Massachusetts, would help us restrict our data volume. Thanks.

Reassess how connections are made and broken between R and the Postgres database

Originally, connections were made in nearly every function that interacted with the database (though not consistently in all of them, and often with overlapping connections). Currently, a connection is created only at the highest level, and all lower-level functions are assumed to be called from within that function's scope.

  1. Reconsider shifting back towards connect/disconnect at a very granular level. Look into how other HADES packages handle this.
  2. Consider employing on.exit() to call disconnect, ensuring that connections don't persist when functions exit without returning (see the sketch below).
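A sketch of the on.exit() pattern from point 2; the connection details and the table/column names are assumptions, not the package's actual API.

```r
# Sketch of the on.exit() pattern; connection details and table/column names are
# assumptions, not the package's actual API.
importVariable <- function(variable_source_id) {
  conn <- DBI::dbConnect(RPostgres::Postgres(), dbname = "gis")
  # Runs even if the query errors or the function returns early
  on.exit(DBI::dbDisconnect(conn), add = TRUE)
  DBI::dbGetQuery(
    conn,
    "SELECT * FROM backbone.variable_source WHERE variable_source_id = $1",
    params = list(variable_source_id)
  )
}
```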

Do we need person level exposures?

Larger topic.

In order to integrate with the other tools, do we need to persist patient-level records, or can we keep them in their native state (i.e. location and area)? I believe that, with adjustments to WebAPI and ATLAS, we can do cohort discovery without persisting person-level data. What about the prediction packages?

Add distance to care

With Tom Concannon's and Jay Greenfield's guidance, identify an algorithm that calculates distance-from-care metrics (travel distance, travel time) and build a routine that uses the algorithm to calculate those metrics for each patient with sufficient location data.
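As a placeholder while that algorithm is identified, a rough sketch using straight-line (great-circle) distance via the sf package is below; actual travel distance/time would come from a routing service, which is not shown, and the coordinates are made up.

```r
# Rough sketch using straight-line (great-circle) distance as a stand-in for the
# eventual travel-distance/travel-time metrics; coordinates are made up.
library(sf)

patient_locations <- st_as_sf(
  data.frame(person_id = c(1, 2), lon = c(-71.06, -71.10), lat = c(42.36, 42.34)),
  coords = c("lon", "lat"), crs = 4326
)
care_sites <- st_as_sf(
  data.frame(care_site_id = 10, lon = -71.07, lat = 42.33),
  coords = c("lon", "lat"), crs = 4326
)

# Distance in meters from each patient location to each care site
st_distance(patient_locations, care_sites)
```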

Create documentation on registering a variable

Similar to #57, we need robust documentation to orient users to methods for creating and adding variables to the backbone.variable_source registry. This is another prerequisite step to using a data source.

Memory exhausted error

There is a recurring but intermittent error that appears when ingesting a shapefile into R from the database. The error seems somehow related to the connect/disconnect pattern, as it occurs more frequently if connections are not well managed. However, it also appears after multiple consecutive spatial table imports.

More context:

  • Spatial table imports are done in batches of 1000 rows at a time (see the sketch after this list). This was initially put in place to avoid memory problems from spatial imports of ~3000 rows.
  • Even with the batch import, multiple consecutive imports (pulling the same 3k-row table in 3-4 times) will cause the error.
  • Since the problem still occurs after a total of only ~10000 rows imported in a single session, it is worrisome to imagine a larger table failing to load even once.
  • The problem always goes away (memory resets) when the R session is restarted, though of course any imported data is cleared and all packages are unloaded.
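For reference, a sketch of the 1000-row batching described in the first bullet, with an explicit gc() between batches as one possible mitigation; the table name is illustrative and `conn` is assumed to be an open DBI connection to the PostGIS database.

```r
# Sketch of the 1000-row batching with an explicit gc() between batches as one
# possible mitigation; the table name is illustrative and `conn` is assumed to be
# an open DBI connection to the PostGIS database.
batch_size <- 1000
offset <- 0
batches <- list()

repeat {
  batch <- sf::st_read(
    conn,
    query = sprintf("SELECT * FROM gis.geom_table ORDER BY id LIMIT %d OFFSET %d",
                    batch_size, offset)
  )
  if (nrow(batch) == 0) break
  batches[[length(batches) + 1]] <- batch
  offset <- offset + batch_size
  gc()  # encourage R to release memory between batches
}

geom_table <- do.call(rbind, batches)
```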

Geographic attributes - name?

How should we refer to geographic attributes? Currently we are using LOCATION_ATTRIBUTE and AREA_ATTRIBUTE, but would another term be more appropriate? AREA_MEASUREMENT? AREA_OBSERVATION?

Rename feature_index to variable_source

The name feature_index is confusing as "feature" means something different in the context of this project than in the wider GIS community. Instead of the term "feature", we will use the term "variable" which is functionally equivalent in the context of the project. Further, we will address the fact that this table is not exactly an index, but more like a second source table due to the amount of unique data it contains.

Changes should be made in the SQL db and in the R code that touches the SQL db.
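On the SQL side, the rename itself could be as small as the statement below (the schema name is an assumption, and `conn` an open DBI connection); the larger effort is substituting the name throughout the R code.

```r
# Assumes an open DBI connection `conn`; the schema name is an assumption.
DBI::dbExecute(conn, "ALTER TABLE backbone.feature_index RENAME TO variable_source")
```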
