
opensafely-core / ehrql


ehrQL: the electronic health record query language for OpenSAFELY

Home Page: https://docs.opensafely.org/ehrql/

License: Other

Dockerfile 0.33% Shell 0.21% Python 98.71% TSQL 0.01% Just 0.73%

ehrql's People

Contributors

benbc, bloodearnest, carolinemorton, dependabot[bot], evansd, ghickman, iaindillingham, inglesp, jongmassey, lfisher7, madwort, mikedferris, milanwiedemann, opensafely-github-bot, rebkwok, stevenmaude, wjchulme


ehrql's Issues

Add support for codelists

One major feature still missing, which we will need for any study, is codelists. We should add it. :-)

We have an opportunity to simplify the use of codelists for researchers, for example by allowing them to specify just a URL which we then retrieve the codelist from, rather than having to deal with files in the workspace. Before implementing this we should discuss with the wider team.

Deal with null values

All of the basic filters will ignore null values; we should have some way of customising how nulls are handled, and/or options to include or exclude them.

e.g. additional filters?

table("clinical_events").filter("value", equals_or_null="foo")

or parameters?

table("clinical_events").filter(value="foo", include_null=True)

This would probably involve some use of the SQLAlchemy or_ conjunction.
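The intended semantics can be sketched in plain Python (illustrative only: the filter_rows helper and its include_null parameter mirror the suggestion above; the real implementation would build a SQLAlchemy or_() clause instead):

```python
# Sketch of the proposed include_null semantics. This is a plain-Python
# illustration, not the actual query-engine implementation.
def filter_rows(rows, column, value, include_null=False):
    """Keep rows whose `column` equals `value`; optionally keep NULLs too."""
    def matches(row):
        cell = row.get(column)
        if cell is None:
            return include_null  # NULLs are excluded unless opted in
        return cell == value
    return [row for row in rows if matches(row)]

events = [{"value": "foo"}, {"value": "bar"}, {"value": None}]
print(filter_rows(events, "value", "foo"))
# [{'value': 'foo'}]
print(filter_rows(events, "value", "foo", include_null=True))
# [{'value': 'foo'}, {'value': None}]
```

In SQL terms the second call corresponds to `value = 'foo' OR value IS NULL`, which is where or_ would come in.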

Update manual, tutorials, other documentation.

As much as possible of the documentation should be generated from the code. The new DSL should help with that for the query language; for the backend documentation, if we can formalize the backend definition, we may be able to generate some of that documentation too.

Update long covid V2 study definition

Update full long covid study definition to V2

Format for the IMD variable will depend on #96

BMI to be implemented as explicitly recorded BMI values only (no inferred calculations from height/weight)

Allow code collation to vary per backend, per system

We currently hardcode the collation for codes to Latin1_General_BIN, because that's what's used in TPP for CTV3, which is what we need at the moment. However, it needs to be allowed to vary as appropriate. We can consider this as part of the backend definition and do something like this:

collation = self.backend.collation_for_coding_system(codelists.system)
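A minimal sketch of what such a backend hook might look like. The method name comes from the snippet above; the class, the mapping, and the SNOMED CT entry are invented for illustration (only the CTV3 collation is stated in this issue):

```python
# Hypothetical per-backend collation lookup. Only the CTV3 value below is
# taken from the issue text; everything else is an assumption.
class TPPBackend:
    # Mapping from coding system to the collation its code column uses.
    _collations = {
        "ctv3": "Latin1_General_BIN",
        "snomedct": "Latin1_General_CI_AS",  # assumed value, for illustration
    }

    def collation_for_coding_system(self, system):
        return self._collations[system]

backend = TPPBackend()
print(backend.collation_for_coding_system("ctv3"))
# Latin1_General_BIN
```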

Implement Index of Multiple Deprivation

IMD queries in the old cohort extractor are calculated by finding the patients with addresses registered on the index_date, and then ordering them by

ORDER BY
              StartDate DESC,
              EndDate DESC,
              IIF(MSOACode = 'NPC', 1, 0),
              PatientAddress_ID

This orders by most recent start date, then latest end date, then prefers addresses which are not marked "NPC" (for "No Postcode") and finally uses the address ID as a tie-breaker.

The IMD query in the old cohort extractor could almost be implemented as a last_by method in the new version, like:

imd_value = (
        table("patient_address")
        .date_in_range(index_date)
        .last_by("date_start", "date_end", "has_postcode", "patientaddress_id")
        .get("index_of_multiple_deprivation")
    )

Except that has_postcode is not a field, it's the IIF statement on MSOACode.

A Row could accept an expression as a column (with relevant DSL wrapping for defining the expression). We probably still want a custom method of some sort so that researchers don't have to type the column ordering explicitly though.
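The ranking performed by the ORDER BY clause above can be sketched in plain Python (illustrative only: the field names follow the SQL, the data is invented, and ISO date strings are assumed so that lexicographic comparison matches chronological order):

```python
# Plain-Python sketch of the address ranking described above; this is not
# the cohort-extractor implementation, just an illustration of the ordering.
def pick_address(addresses):
    """Return the address the ORDER BY clause above would rank first."""
    return max(
        addresses,
        key=lambda a: (
            a["StartDate"],                       # most recent start date wins
            a["EndDate"],                         # then latest end date
            0 if a["MSOACode"] == "NPC" else 1,   # prefer addresses with a postcode
            -a["PatientAddress_ID"],              # lowest address ID as tie-breaker
        ),
    )

addresses = [
    {"StartDate": "2020-01-01", "EndDate": "2021-01-01",
     "MSOACode": "NPC", "PatientAddress_ID": 1},
    {"StartDate": "2020-01-01", "EndDate": "2021-01-01",
     "MSOACode": "E02000001", "PatientAddress_ID": 2},
]
print(pick_address(addresses)["PatientAddress_ID"])
# 2 (the non-NPC address is preferred despite the higher ID)
```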

Categorizing for bucketed values

  • simple single comparisons (x > y)
  • and/or conditions ((x > y) & (y == z))
  • nesting conditions (((x > y) & (y == z)) | (z == a))
  • inverting conditions (~((x > y) & (y == z)))
  • True/False comparisons
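The combinations listed above can be supported by overloading Python's `&`, `|` and `~` operators on a condition object. A minimal sketch (the Condition class and its methods are invented; a real implementation would build SQL expressions rather than evaluate rows directly):

```python
# Minimal sketch of combinable conditions via operator overloading.
# Class and method names are invented for illustration.
class Condition:
    def __init__(self, predicate):
        self.predicate = predicate

    def __and__(self, other):
        return Condition(lambda row: self.predicate(row) and other.predicate(row))

    def __or__(self, other):
        return Condition(lambda row: self.predicate(row) or other.predicate(row))

    def __invert__(self):
        return Condition(lambda row: not self.predicate(row))

    def evaluate(self, row):
        return self.predicate(row)

# ~((x > y) & (y == z)) from the list above
x_gt_y = Condition(lambda row: row["x"] > row["y"])
y_eq_z = Condition(lambda row: row["y"] == row["z"])
combined = ~(x_gt_y & y_eq_z)
print(combined.evaluate({"x": 2, "y": 1, "z": 1}))
# False
```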

Correctly format dates for MSSQL

For reasons best known to itself, MSSQL will change its interpretation of ISO-format dates (YYYY-MM-DD) according to its localisation settings and so will sometimes treat them as a bizarre (and seen nowhere else) hybrid of ISO and US formats: YYYY-DD-MM. (See opensafely-core/cohort-extractor#80)

However, it will consistently and correctly interpret dates in YYYYMMDD format (i.e. ISO without the hyphens).

We obviously don't want to leak this monstrous detail out to study authors and so our approach thus far has been to use ISO format everywhere and then convert to MSSQL-format at the last moment when building the SQL.

We do this by passing every string through this function, which reformats it if it looks like an ISO date and leaves it alone otherwise:
https://github.com/opensafely-core/cohort-extractor/blob/bb5d2e9faa8e3e5aa35824edcfe2e48d088be59c/cohortextractor/tpp_backend.py#L2421-L2434

This has the advantage of being comprehensive. Accidentally passing an ISO date through unconverted would lead to very subtle bugs because it's unlikely to trigger an error, just a subtly wrong date condition. So we really, really don't want to do that. On the other hand, I think there have been one or two cases where this has bitten us because someone has wanted to use date-like strings as categories and then got very confused.
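The pass-through behaviour described above can be sketched as follows (illustrative only; the function name and regex are assumptions, not the actual cohort-extractor code linked above):

```python
import re

# Sketch of a pass-through date reformatter: strip hyphens from strings
# that look like ISO dates, leave everything else untouched. Illustrative
# only; not the linked tpp_backend.py implementation.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def mssql_quote(value):
    """Convert YYYY-MM-DD to YYYYMMDD; pass other values through unchanged."""
    if isinstance(value, str) and ISO_DATE.match(value):
        return value.replace("-", "")
    return value

print(mssql_quote("2020-01-31"))
# 20200131
print(mssql_quote("not-a-date"))
# not-a-date
```

Note the downside mentioned below: a category string that happens to match the date pattern would also be rewritten.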

Theoretically, the new cohortextractor has access to enough type information to know whether a value is supposed to be a date or not and format it accordingly. However, we're not yet propagating this type information through, and this problem needs solving now, so I suggest we adopt the same approach as the old cohortextractor, hopefully implemented in such a way that we can surgically swap it out for a "proper" solution when the time comes.

It would also be good to work out how to put the test database into the right localisation mode to trigger this behaviour so we can increase our chances of catching problems early.

Proposal: new approach to codelist management

Current workflow

The current workflow for using codelists in an OpenSAFELY study definition looks something like this:

  • Create (or discover) a codelist on OpenCodelists
  • Add (part of) its URL to a file called codelists.txt
  • Run opensafely codelists update to download it
  • In a study definition, import the codelist_from_csv function
  • Load the codelist using its filename and specify the columns to read and the coding system in use.

As a bonus step, it looks like a pattern has emerged whereby users do the codelist_from_csv call in a file called codelists.py and then do from codelists import * in the study definition.

I think there are a few issues with this workflow:

  • It's a bit clunky for the user.
  • It makes it hard for someone looking at a study to jump quickly from the use of a codelist to the corresponding page on OpenCodelists. (And the from codelists import * pattern makes this even harder.)
  • Having to specify the coding system and columns again is not only duplicated work but also adds the possibility of user error.

Proposal

  1. We no longer have a codelists.txt file.

  2. Study authors should reference codelists in a study directly by URL e.g.

    from cohortextractor import codelist
    
    pulse_oximetry_codes = codelist(
        'https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/'
    )
  3. The opensafely codelists update command would get the URLs out of the study definitions and then download as normal (exact mechanics of this TBC, but I have some ideas).

  4. We already store some codelist metadata in a codelists.json file. We extend this to include details of the coding system used and relevant columns so the user doesn't have to specify these again.

Additional notes:

  • We can continue to support codelists sourced from elsewhere by making the codelist function accept a path to a local CSV file, in which case the coding system and column arguments would be required.
  • It might be possible to have something like the opensafely codelists update step run automatically when running the cohortextractor locally to reduce the steps required even further.
  • Some of this might require extra work on the OpenCodelists side as I'm not 100% sure that this currently provides all the metadata we need in an easily consumable way.
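One way the proposed codelist function might resolve a URL into something downloadable (everything here is an assumption: the Codelist class, the field names, and the exact URL structure beyond the example given above):

```python
from urllib.parse import urlparse

# Hypothetical sketch of how codelist() might parse an OpenCodelists URL.
# Class and field names are invented; only the example URL shape is taken
# from the proposal above.
class Codelist:
    def __init__(self, organisation, slug, version):
        self.organisation = organisation
        self.slug = slug
        self.version = version

def codelist(url):
    """Split an OpenCodelists URL into organisation, slug, and version."""
    parts = urlparse(url).path.strip("/").split("/")
    # Expected path shape: codelist/<organisation>/<slug>/<version>
    if len(parts) != 4 or parts[0] != "codelist":
        raise ValueError(f"Unrecognised codelist URL: {url}")
    return Codelist(parts[1], parts[2], parts[3])

pulse_oximetry = codelist(
    "https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/"
)
print(pulse_oximetry.slug)
# pulse-oximetry
```

The opensafely codelists update command could then collect these parsed references from the study definitions and download each one.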

Add support for basic filtering on tables

"Filtering" means taking a table and removing some of the rows it contains according to certain criteria. This should be a chainable operation so that further filters can be applied to an already filtered table.

The most basic kind of filter compares a column in the table with a single, static value e.g. date_recorded >= '2020-01-01' but there are various more complex cases we'll need to support as well including:

  • supporting IN/NOT IN queries (or their semantic equivalents) whereby the comparison is not with a single value but an array.
  • supporting the above but where the array-like value is a codelist where we need to retrieve not just the fact that there was a match but what the specific matching code was (and potentially the code "category" if the codelist contains these)
  • using the value from another query as the comparison value (e.g. what are called "dynamic dates" in the old cohortextractor)
  • using an array of values from another query as the comparison value e.g. where date_recorded NOT IN <list_of_dates_where_event_X_happened>

The most basic example looks something like this:

class Cohort:
    has_event_x = table("coded_events").filter(code="code_X").exists()

And the expected output looks something like this:

patient_id,has_event_x
10001,False
10002,True
10003,False
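The filter-then-exists semantics can be illustrated in plain Python (the event data is invented; this is a sketch of the expected behaviour, not the query engine itself):

```python
# Illustrative pure-Python version of filter(code=...).exists() evaluated
# per patient. Invented data; matches the expected output above.
def has_event(events, patient_ids, code):
    """Map each patient ID to True if they have at least one matching event."""
    matched = {e["patient_id"] for e in events if e["code"] == code}
    return {pid: pid in matched for pid in patient_ids}

events = [
    {"patient_id": 10002, "code": "code_X"},
    {"patient_id": 10003, "code": "code_Y"},
]
print(has_event(events, [10001, 10002, 10003], "code_X"))
# {10001: False, 10002: True, 10003: False}
```

Note that patients with no matching events still appear in the output with False, mirroring the CSV above.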

Disable default logging of executed SQL queries

It's useful to have this in development but it makes the logs too noisy for users.

If we need to see the SQL for debugging purposes we can always regenerate the exact SQL given a study definition and the version of the cohortextractor used.

Work with researchers to find a simple initial study

We'd like to identify a real study that we can aim to support. We'll make it a major milestone to have this study actually running on the new cohort extractor in the production system.

Characteristics we're looking for:

  • TPP only
  • minimal use of cohort features beyond those we already have

Publish the new image.

We need to pick a versioning scheme to indicate that it's work in progress and the API is not stable. I suggest something with "alpha" in it for the moment.

Colons in test recording names break checkout on Windows

I just tried to clone this repo on the TPP server and hit this issue. It's possible to work around this by just checking out the directories you need but not ideal.

This may be the only time we ever care about cloning this repo on Windows, but it's not impossible that we'll want to do it again in future and hopefully it's an easy fix via changing the naming scheme.
