gompertzmakeham / hazardrates


License: GNU General Public License v3.0

Languages: PLSQL 38.95%, PLpgSQL 14.93%, TSQL 46.11%
Topics: gompertz makeham survival longitudinal hazard analysis mortality morbidity actuarial statistical


Hazard Rates

Declaration and Land Acknowledgement

This project, its materials, resources, and manpower are wholly funded by Alberta Health Services for the purpose of informing health system performance improvement and health care quality improvement. Alberta Health Services is the single public health authority of the province of Alberta, within the boundaries of the traditional lands of the nations of Treaties 6, 7, 8, and 10, and the peoples of Metis Regions 1, 2, 4, 5, and 6, including the Cree, Dene, Inuit, Kainai, Metis, Nakota Sioux, Piikani, Saulteaux, Siksika, Tsek’ehne, and Tsuut’ina. The author is best reached through the project issue log.

Introduction

Proactive methodological disclosure of a high resolution precision calibrated estimate of the Gompertz-Makeham Law of Mortality and general utilization hazard rates through lifespan interferometry against annual census data consolidated from the administrative data of all publicly funded healthcare provided in a single geopolitical jurisdiction. This repository only contains the source code, and only for the purpose of peer review, validation, and replication. This repository does not contain any data, results, findings, figures, conclusions, or discussions.

This code base is under active development, and is currently being tested against a data store compiled from 19 distinct administrative data sets containing nearly 3 billion healthcare utilization events, each with a couple hundred features, covering more than 6 million individual persons and two decades of surveillance. The application generates approximately 175 million time intervals, each with four dozen features, by Measure Theory consistent Temporal Joins and dimensional reduction, implemented in ad hoc map-reduce steps.

High resolution estimation of mortality and utilization hazard rates is accomplished by measuring the person-time denominator of the hazard rates to the single person-day, without any rounding or truncation to coarser time scales. The precision of the hazard rate estimators is calibrated against the main source of epistemic uncertainty: clerical equivocation in the measurement, recording, and retention of the life events of birth, death, immigration, and emigration. The aleatory uncertainty is estimated using standard errors formally derived by applying the Delta Method to the formal representation of the hazard rate estimators as equations of random variables. The existence, uniqueness, and consistency of the standard errors are left unproven, although a straightforward application of the usual asymptotic Maximum Likelihood theory should suffice.
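
The Delta Method step can be summarized in its standard form; this is the generic statement of the method, not a transcription of the estimator equations in the SQL. For an estimator written as a smooth function g of random variables X = (X1, ..., Xk) with mean vector mu and covariance matrix Sigma, a first order Taylor expansion gives

  \operatorname{Var}\left[ g(X) \right] \approx \nabla g(\mu)^{\mathsf{T}} \, \Sigma \, \nabla g(\mu),
  \qquad
  \operatorname{SE}\left[ g(X) \right] \approx \sqrt{ \nabla g(\mu)^{\mathsf{T}} \, \Sigma \, \nabla g(\mu) }.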

Overview

The construction of the denominators and numerators of the hazard rate analysis broadly proceeds in 11 steps of ad hoc map-reduce and dynamic reconstitution, to produce records of person census intervals:

  1. Ingest independently and in parallel the external administrative data sources, mapping the clerical records of life events and demographic information.
  2. Digest independently and in parallel the mapped data sources from step 1, reducing each source to one record per person. The files in the survey folder contain steps 1 and 2.
  3. Ingest sequentially the reduced data sources from step 2, mapping into a common structure.
  4. Digest sequentially the mapped common structure from step 3, reducing to one master record per person, containing the extremums of life event dates (a rough sketch of this reduce appears after this list). The file persondemographic.sql contains steps 3 and 4.
  5. Dynamically reconstitute the pair of surveillance extremums for each person from step 4. This process is contained in the file personsurveillance.sql.
  6. Dynamically reconstitute the census intervals for each surveillance extremum from step 5. This process is contained in the file personcensus.sql.
  7. Ingest independently and in parallel the external administrative data sources, mapping the transactional records of utilization events and the event details.
  8. Digest independently and in parallel the mapped data sources from step 7, reducing each source to one record per person per census interval, using the dynamically generated census intervals from step 6. The files in the census folder contain steps 7 and 8.
  9. Ingest sequentially the reduced records per person per census interval from step 8, mapping to a common data structure.
  10. Digest sequentially the mapped common data structure from step 9, reducing by Temporal Joins to one record per person per census interval, containing the utilization in that census interval. The file personutilization.sql contains steps 9 and 10.
  11. Dynamically reconstitute a columnar list of utilization measures, eliding trivial measures. This is contained in the file personmeasure.sql.
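
As a rough illustration of the reduce in step 4 (the table and column names below are placeholders, not the identifiers used in persondemographic.sql), the mapped common structure is collapsed to one master record per person by taking the extremums of the life event dates:

  -- Illustrative sketch only; table and column names are placeholders.
  SELECT
    phn,                                -- Person identifier
    MIN(birthdate) leastbirth,          -- Earliest clerically recorded birth date
    MAX(birthdate) greatestbirth,       -- Latest clerically recorded birth date
    MIN(deceaseddate) leastdeceased,    -- Earliest clerically recorded death date
    MAX(deceaseddate) greatestdeceased  -- Latest clerically recorded death date
  FROM
    commonlifeevents                    -- Mapped common structure from step 3
  GROUP BY
    phn;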

Currently the build process is contained in refresh.sql, which for the time being will remain partly manual because of idiosyncratic crashes that occur during table builds, possibly due to locking of the underlying table sources. An example of querying the terminal assets of this analysis is contained in the files dense.sql and columnar.sql.

Temporal Joins

In keeping with declarative languages, a Measure Theory consistent Temporal Join on a longitudinal data set is defined by the global characteristics of the resulting data set. Specifically, a relational algebra join is a Measure Theory consistent Temporal Join if the resulting data set represents a totally ordered partition of a bounded time span under the absolute set ordering. Furthermore, the time intervals represented by any two records produced by a Temporal Join must intersect trivially, either being disjoint, or equal. Put more simply, a Temporal Join takes a time span,

|--------------------------------------------------------------------------------------|

and partitions it into compact contiguous intervals, of possibly unequal lengths,

|-------|--------|-------------|-----|--------|--|--|--------------|-----------|-------|

such that the produced data set contains at least one, and possibly arbitrarily more, records for each interval. A Temporal Join unambiguously ascribes a definite set of features, from one or more records, to each moment in a time span, because there are neither gaps in the representation of time, nor non-trivial intersections between intervals. Temporal Joins are Category Theory closed, in that the composition of Temporal Joins is a Temporal Join, because successive Temporal Joins are Measure Theory (finite) refinements of the (finite) minimal sigma algebra to which the partition belongs.
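
In symbols (a paraphrase of the definition above, not notation taken from the source code): a join whose output records carry the intervals I1, ..., In over the bounded time span from t0 to t1 is a Temporal Join when

  \bigcup_{i=1}^{n} I_i = [t_0, t_1],
  \qquad
  I_i \cap I_j = \varnothing \;\; \text{or} \;\; I_i = I_j \quad \text{for all } i \neq j.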

Given two intervals a Temporal Join will generate one, two, or three records. If the intervals have identical boundaries the result will be a single record containing the characteristics of the two source intervals:

|--------------------|
           a                 

|--------------------|
           b

|--------------------|
         a & b

If the boundaries of the intervals are exactly contiguous the result will be two records:

|-------|
    a

        |------------|
               b

|-------|------------|
    a          b

If the intervals intersect non-trivially the result will be three records:

|-----------------|
       a

          |-----------------|
                b

|---------|-------|---------|
     a      a & b      b

Finally, if the intervals are fully disjoint and not contiguous the result will also be three records:

|---------|
     a

                     |-------------------|
                              b

|---------|----------|-------------------|                       
      a     ~(a & b)           b

This construction can be composed by iteration on any (finite) number of intervals, because the Category of Temporal Joins is Measure Theory closed with respect to refinement. Fortunately, much faster techniques than naive iteration can be found, either by sorting on the boundary dates and then back searching, or, as in the methods of this analysis, by explicitly constructing the intervals from the bounds of the events.
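
For intuition, the naive refinement of two interval tables can be written directly in SQL. This is an illustrative sketch with hypothetical tables a and b, each holding intervalstart, intervalend, and one feature column; it is not the construction used in this analysis, which builds the intervals from the event bounds instead:

  -- Illustrative naive refinement only; the tables and columns are hypothetical.
  WITH boundaries AS
  (
    SELECT intervalstart boundarydate FROM a
    UNION
    SELECT intervalend FROM a
    UNION
    SELECT intervalstart FROM b
    UNION
    SELECT intervalend FROM b
  ),
  refined AS
  (
    -- Each boundary date opens a sub-interval that closes at the next boundary date
    SELECT
      boundarydate intervalstart,
      LEAD(boundarydate) OVER (ORDER BY boundarydate) intervalend
    FROM
      boundaries
  )
  SELECT
    refined.intervalstart,
    refined.intervalend,
    a.afeature, -- Null where the sub-interval falls outside of a
    b.bfeature  -- Null where the sub-interval falls outside of b
  FROM
    refined
    LEFT JOIN a ON a.intervalstart <= refined.intervalstart AND refined.intervalend <= a.intervalend
    LEFT JOIN b ON b.intervalstart <= refined.intervalstart AND refined.intervalend <= b.intervalend
  WHERE
    refined.intervalend IS NOT NULL; -- Drop the open-ended record after the last boundary

The sketch glosses over whether interval ends are treated as closed or half-open, and over partitioning by person.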

Concretely, in the context of this project, for each surveillance time span during which a person's healthcare utilization was observed, we divide the time span into fiscal years, starting on April 1, and further subdivide each fiscal year on the person's birthday within that fiscal year; if the birthday falls on April 1 the fiscal year is not subdivided. This is precisely what the function hazardutilities.generatecensus implements, taking three dates: a start date, an end date, and a date of birth.
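
For example, a single surveillance span could be expanded along the following lines; only the three date parameters are described above, so the assumption that generatecensus is exposed as a pipelined table function, and the shape of what it returns, are illustrative guesses rather than documentation:

  -- Hypothetical invocation; the parameter order follows the description above,
  -- the rest is an assumption about the interface.
  SELECT
    *
  FROM
    TABLE
    (
      hazardutilities.generatecensus
      (
        DATE '2011-06-15', -- Start of the surveillance span
        DATE '2014-02-03', -- End of the surveillance span
        DATE '1950-09-21'  -- Date of birth
      )
    );

Under the rule above this span would split at each April 1 and at each September 21 birthday falling inside the span.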

Events

The administrative data sources we work with have three modalities of recording information about events: existential, clerical, and transactional. Transactional recording of events occurs at the time of the event, is generally completed by the person delivering the service recorded in the transaction, and may record information about prior events during the process of collecting information about the current event; examples include visits to inpatient care, dispensing of prescribed pharmaceuticals at community pharmacies, and delivery of home care. Clerical recording of events occurs after the event has occurred, requiring recall on the part of the participants of the event, usually in the form of self-report by the recipient of services; examples include symptom onset, birth dates, and provincial migration dates. Finally, existential records record a broad time interval during which an event was known to have occurred; these are usually recorded in the contexts of administrative registrations in programs, and time intervals of data capture by administrative information systems; examples include year of coverage start, quarter of inpatient record capture, and registration for home care.

Differentiation between:

  • Transactional recording of events.
  • Clerically reflective recording of events.
  • Inferential existential bounds on events.

The impact of:

  • Censoring is what you do not know about observed patients because you cannot see into the future.
  • Survivorship bias is what you do not know about the patients you never observed because they did not live long enough to be included.
  • Immortal time bias is what you do not know about observed patients because you cannot see into the past.
  • Sampling bias is what you do not know about patients you did not observe.

Epistemic Uncertainty

We measure epistemic uncertainty using Clerical Equivocation Interferometry against the clerical recording of the lifespan events of birth, death, immigration, and emigration. This technique begins by identifying, for each person, the shortest and longest lifespan possible given all the clerical events. Given entry events O, and exit events X, we generate two lifespans, the longest and the shortest:

O-----O--O---O----O----------------------------------X--X------X-X-X----------X

|-----------------------------------------------------------------------------|

                  |----------------------------------|

We then identify the shortest possible surveillance interval within the shortest lifespan, and the longest possible surveillance interval within the longest lifespan.

  • Transactional dates are fixed, but age may change due to clerical uncertainty, moving events to different age buckets.
  • The shortest is not an upper bound, and the longest is not a lower bound; they can even cross.
  • Neither is a uniform norm bound.
  • The envelope is a mini-maxi estimator: minimum covariance, maximum variance.

It is a reasonable estimate of epistemic uncertainty. It is not the maximum variance possible due to clerical equivocation, but it is a reasonable amount. Combinatoric methods could provide broader bounds, but the computational trade-offs in terms of expediency of the analysis were not worth it at this time.
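
A minimal sketch of the lifespan envelope, assuming a placeholder table of clerically recorded entry and exit events (not the structures used in the project's SQL): the longest lifespan runs from the earliest entry event to the latest exit event, and the shortest lifespan from the latest entry event to the earliest exit event.

  -- Illustrative sketch only; table and column names are placeholders.
  SELECT
    phn,
    MIN(CASE WHEN eventtype = 'entry' THEN eventdate END) longeststart,  -- Earliest entry event
    MAX(CASE WHEN eventtype = 'exit' THEN eventdate END) longestend,     -- Latest exit event
    MAX(CASE WHEN eventtype = 'entry' THEN eventdate END) shorteststart, -- Latest entry event
    MIN(CASE WHEN eventtype = 'exit' THEN eventdate END) shortestend     -- Earliest exit event
  FROM
    clericalevents
  GROUP BY
    phn;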

Aleatory Uncertainty

We measure aleatory uncertainty using a non-parametric standard error of the hazard rate estimator.
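
One common non-parametric form, which may differ from the exact estimator coded here, treats the event count D observed over T person-days at risk as the dominant source of sampling variation:

  \hat{\lambda} = \frac{D}{T},
  \qquad
  \operatorname{Var}\left[ \hat{\lambda} \right] \approx \frac{D}{T^{2}},
  \qquad
  \operatorname{SE}\left[ \hat{\lambda} \right] \approx \frac{\sqrt{D}}{T}.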

Data Sources

Of the 19 data sources that currently feed into this hazard rates analysis, a number either partially or completely publish their data collection methodology, definitions, and standards:

  • Ambulatory care Canadian Institute for Health Information National Ambulatory Care Reporting System.
    • 2 data sources from 2002 onward; currently approximately 126 000 000 events.
  • Inpatient care Canadian Institute for Health Information Discharge Abstract Database.
    • 1 data source from 2002 onward; currently approximately 7 000 000 events.
  • Long Term Care Canadian Institute for Health Information Resident Assessment Instrument.
    • 1 data source from 2010 onward; currently approximately 560 000 events.
  • Primary Care Alberta Health Schedule of Medical Benefits.
    • 1 data source from 2001 onward; currently approximately 656 000 000 events.
  • Community Pharmacy Dispensing Alberta Health Pharmacy Information Network.
    • 1 data source from 2008 onward; currently approximately 605 000 000 events.
  • Annual Registry Alberta Health Alberta Health Care Insurance Plan.
    • 1 data source from 1993 onward; currently approximately 90 000 000 events.
  • Continuing Care Registrations proprietary direct access (Civica, Meditech, StrataHealth).
    • 3 data sources, phased adoption 2008, 2010, and 2012 onward; currently approximately 520 000 events.
  • Community Laboratory Collections proprietary direct access (Fusion, Meditech, Millennium, SunQuest).
    • 4 data sources, phased adoption 2008, 2009, 2012, and 2014 onward; currently approximately 1 500 000 000 events.
  • Care Management proprietary direct access (Civica, Meditech).
    • 2 data sources, phased adoption 2008, 2010, and 2012 onward; currently approximately 2 800 000 events.
  • Home Care Activity proprietary direct access (Civica, Meditech, StrataHealth).
    • 3 data sources, phased adoption 2008, 2010, and 2012 onward; currently approximately 70 000 000 events.
  • Diagnostic Imaging proprietary direct access (in staging).
    • Not calibrated yet.
  • Emergency Medical Services proprietary direct access (in staging).
    • Not calibrated yet.
  • Health Link proprietary direct access (in staging).
    • Not calibrated yet.

Disclaimer

The aggregate provincial data presented in this project are compiled in accordance with the Health Information Act of Alberta under the provision of Part 4, Section 27, Sub-section 1(g) for the purpose of health system quality improvement, monitoring, and evaluation. Further to this the aggregate provincial data are released under all the provisions of Part 4, Section 27, Subsection 2 of the Health Information Act of Alberta.

This material is intended for general information only, and is provided on an "as is", or "where is" basis. Although reasonable efforts were made to confirm the accuracy of the information, Alberta Health Services does not make any representation or warranty, express, implied, or statutory, as to the accuracy, reliability, completeness, applicability, or fitness for a particular purpose of such information. This material is not a substitute for the advice of a qualified health professional. Alberta Health Services expressly disclaims all liability for the use of these materials, and for any claims, actions, demands, or suits arising from such use.


hazardrates's Issues

Incorporate Immunizations and Vaccinations

Due to the lack of digitization, historical trending of vaccinations and immunizations will not be possible. However, with the province-wide adoption of Meditech for public health in the fall of 2018, it should be possible to report rates from 2019 onwards. The data would have to be extracted directly from the Meditech SQL Server data warehouse.

Subtotal by ObGyn utilization

Currently ObGyn utilization is included in the subtotals of specialist utilization. However, ObGyn utilization is not predominantly driven by the exponential development of co-morbidity. This both obscures the characterization of the exponential utilization of specialists with age, and misrepresents the use of ObGyn. The same fields that exist for other primary care need to be added for ObGyn:

  • Count of procedures in the time interval
  • Unique ObGyn days (i.e. visits)
  • Unique days of at least one visit to an ObGyn.

Utilization among males should be interesting.

Live Birth Rates per Female Demographic

Live birth rates per female demographic will provide critical insights into fertility, fecundity, and in general population dynamics (quadratic or greater terms in the male-age/female-age scattering cross-section). There are three potential sources for live birth data:

  • Vital statistics
  • Inpatient care
  • Primary care

Vital statistics explicitly links mother and infant PHNs, but only has data after 2010. Inpatient and primary care contain pairs of records for mother and infant, but the records can only be linked on secondary fields, like chart numbers, provider identifiers, location identifiers, and service dates. The strategy would then be to find all distinct pairs of mother-infant PHNs and then link those back to the derived demographic tables to assign delivery counts to the mother's age intervals, as well as filter out implausible records.

After further investigation, self-referential linking within primary care is going to be involved, because constraining on provider identifiers and service dates is not sufficient to narrow the possible matches. So the algorithm will be:

  1. First check vital statistics for maternal-infant PHN pairs.
  2. Next check inpatient care, cross-linking the first newborn on the reciprocally recorded chart numbers, then cross-referencing all newborn chart numbers that have not been reciprocally referenced back against the successful cross-references. Preferentially select the vital statistics infant data, and use inpatient data only for infants not in vital statistics.
  3. Finally check primary care, find the non-newborn claim immediately before the infant claim, only use this pair for infants not in the previous step.

A small nuance: there are about two dozen inpatient admissions that are actually transfers after live birth, so sorting on admission date needs to account for this.

It turns out that there has been historical heterogeneity in physician billing practices around deliveries, with various combinations of submitting claims for just the mother, just the newborn, or both. So that knocks primary care out.
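
A minimal sketch of the first step, with all table and column names as placeholders: distinct maternal-infant PHN pairs are linked back to the mother's census intervals to count deliveries per interval.

  -- Illustrative sketch of step 1 only; all identifiers are placeholders.
  SELECT
    census.phn,                                -- Mother's person identifier
    census.intervalstart,
    census.intervalend,
    COUNT(DISTINCT pairs.infantphn) deliveries -- Live births in the interval
  FROM
    (
      SELECT DISTINCT motherphn, infantphn, deliverydate FROM vitalstatistics
    ) pairs
    INNER JOIN personcensusintervals census
    ON census.phn = pairs.motherphn
    AND pairs.deliverydate BETWEEN census.intervalstart AND census.intervalend
  GROUP BY
    census.phn,
    census.intervalstart,
    census.intervalend;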

Reconcile Prescriber Identifiers

Select submitting organizations will leave the prescriber identifier blank when re-filling a prescription. These records need to be characterized and a strategy to fill in the missing prescriber identifiers through back referencing needs to be developed.

Incorporate Geographic Identifiers and Geometries

Incorporating geographic identifiers requires dealing with three thorny data volume problems:

  1. Finding a postal code for each interval for each person. It is not entirely clear what heuristic to use to choose a postal code in the case of multiple choices, particularly because heuristics based on time sorting will be computationally expensive to do at scale.
  2. Finding the local geography code for each postal code requires a non-trivial look-up against a large reference table. Again running this at scale will be computationally punishing.
  3. Finally mapping the local geography code to a geometry requires only a look-up against a relatively small reference table, but then requires incorporating a geometry field.

Pre-computation of Incidence Timestamps

Utilization relative to the onset of critical stages in care, like before and after first entrance to continuing care, is a strong statistical estimator of utilization process dynamics. With the current data set, determining incidence and onset timestamps is computationally intensive, because they have to be calculated for each patient. To speed up this process the incidence timestamps can be pre-computed and stored as one record per patient.

Subtotal laboratory utilization

The community laboratory utilization could be broken down and subtotaled by broad categories of types, or functions, of assays. A few possible categories to consider:

  • microbiology assays, i.e. cultures
  • blood chemistry, gases, counts, and composition
  • immunochemistry assays, antibody assays, amplification, SNPs, virology, and molecular biology assays
  • all other assays

Once 2-4 categories are identified, the subtotaling can follow the pattern set by the pharmacy dispensing subtotals.
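
Following that pattern, the subtotals would presumably be conditional sums over an assay category within each person-interval; a rough sketch with placeholder identifiers and category labels:

  -- Illustrative sketch only; identifiers and category labels are placeholders.
  SELECT
    phn,
    intervalstart,
    intervalend,
    SUM(CASE WHEN assaycategory = 'microbiology' THEN 1 ELSE 0 END) microbiologyassays,
    SUM(CASE WHEN assaycategory = 'bloodchemistry' THEN 1 ELSE 0 END) bloodchemistryassays,
    SUM(CASE WHEN assaycategory = 'immunochemistry' THEN 1 ELSE 0 END) immunochemistryassays,
    SUM(CASE WHEN assaycategory NOT IN ('microbiology', 'bloodchemistry', 'immunochemistry') THEN 1 ELSE 0 END) otherassays
  FROM
    laboratorycollections
  GROUP BY
    phn,
    intervalstart,
    intervalend;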

Further Subtotal Dispensed Standard Therapeutics

The subtotal of all community pharmacy dispensed non-criminal code scheduled therapeutics (not triple pads) currently combines everything, including antibiotics and contraceptives. However, contraceptives and antibiotics in and of themselves each represent very different prescribing regimens when compared to other standard therapeutics: contraceptives because they are not treating a morbidity, and antibiotics because they are explicitly treating a (hopefully) transient infection. For these reasons the subtotal of standard therapeutics should actually be four subtotals:

  • Antimicrobials explicitly prescribed to resolve an infection, according to ATC code.
  • Contraceptives explicitly prescribed for birth control, according to ATC code.
  • Diabetic treatments explicitly prescribed for diabetes, according to ATC code.
  • All other standard therapeutics.

Which ATC codes are appropriate for delineating these subtotals requires further investigation.

Incorporate Health Link-811

This is a stub to remind me to do further investigation into pulling health link-811, primarily calls linked to PHN, into the utilization measures.

Refine Time Partitions by Delivery Setting

Currently for each person the surveillance interval is divided into fiscal years (starting on April 1), and then subdivided on the birthday in the fiscal year. For coinciding utilization, like emergency ambulatory visits while in supportive living, this partitioning gives a rough estimate. However, the resolution available in the underlying data is to the single day. To improve the precision of the estimate we could further subdivide each interval on the time spans of unique combinations of the following four flags:

  • Person has any emergency inpatient stays
  • Person has any long term care stays
  • Person has any designated supportive living stays
  • Person has any case managers allocated

Computationally, the exponentially greater number of persons with refined intervals at older ages should be offset by the exponential increase in mortality.

Incorporate Emergency Medical Services

This is a stub to remind me to do further investigation into pulling emergency medical services, primarily transport and community attendance, into the utilization measures.

Incorporate Home Care Activity

While the sampling of brokered support services is heterogeneous across geography and time, the recording of days when professional staff directly attended to clients, either for home care or transition services, is sampled relatively uniformly. A few utilization measures to consider, under two new census tables:

  • censuscaremanagement
    • Intersecting assigned case manager days (including simultaneous assignments).
    • Number of case manager assignments intersecting with interval.
    • Number of case manager assignments starting in the interval.
    • Number of case manager assignments ending in the interval.
  • censushomecare
    • Home care professional staff visit days.
    • Home care professional visit days.
    • Home care professional activities.
    • Transition services and placement coordination professional staff visit days.
    • Transition services and placement coordination professional visit days.
    • Transition services and placement coordination professional activities.

Summary Statistics for Tableau Public

For both privacy and performance when publishing to Tableau Public the hazard rates need to be aggregated to:

  • Fiscal year.
  • Age in years.
  • Biological Sex.
  • Rate measure.

Each record must at least contain (a rough aggregation sketch follows this list):

  • Identifier and description of the specific rate measure.
  • Identifier, description, lower and upper lifespan summed value of the denominator of the rate measure.
  • Identifier, description, lower and upper lifespan summed value of the numerator of the rate measure.
  • Number of persons in the sample represented by the record.
  • Summed squared sampling discount of the numerator and denominator.
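
A rough sketch of the aggregation, with placeholder identifiers for the columnar measures; the sampling discount term is left out of this sketch since its construction is not spelled out here.

  -- Illustrative sketch only; table and column names are placeholders.
  SELECT
    fiscalyear,
    ageyears,
    sex,
    measureidentifier,
    measuredescription,
    SUM(leastdenominator) leastdenominator,       -- Lower lifespan summed denominator
    SUM(greatestdenominator) greatestdenominator, -- Upper lifespan summed denominator
    SUM(leastnumerator) leastnumerator,           -- Lower lifespan summed numerator
    SUM(greatestnumerator) greatestnumerator,     -- Upper lifespan summed numerator
    COUNT(DISTINCT phn) persons                   -- Persons represented by the record
  FROM
    personmeasures
  GROUP BY
    fiscalyear,
    ageyears,
    sex,
    measureidentifier,
    measuredescription;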
