ehrQL: the electronic health record query language for OpenSAFELY

Home Page: https://docs.opensafely.org/ehrql/
License: Other
One major feature still missing, which we will need for any study, is codelists. We should add it. :-)
We have an opportunity to simplify the use of codelists for researchers, for example by allowing them to specify just a URL from which we then retrieve the codelist, rather than having to deal with files in the workspace. Before implementing this we should discuss it with the wider team.
All of the basic filters will ignore null values; we should have some way of customising how we deal with nulls, and/or options to include/exclude nulls, e.g. additional filters:

```python
table("clinical_events").filter("value", equals_or_null="foo")
```

or parameters:

```python
table("clinical_events").filter(value="foo", include_null=True)
```

Probably implemented with SQLAlchemy's `or_` conjunction.
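An `equals_or_null` style filter could be built with SQLAlchemy's `or_` roughly like this (a sketch only, using an illustrative Core table, not the actual backend schema):

```python
from sqlalchemy import column, or_, select, table

# Illustrative table definition; the real schema is backend-specific.
clinical_events = table("clinical_events", column("patient_id"), column("value"))

# Match rows where value equals "foo" OR is NULL; a plain equality
# comparison would silently exclude the NULL rows.
query = select(clinical_events).where(
    or_(clinical_events.c.value == "foo", clinical_events.c.value.is_(None))
)
```

Compiling `query` produces a `WHERE value = :value_1 OR value IS NULL` condition, which is the semantics the `equals_or_null` filter above would need.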
As much as possible of the documentation should be generated from the code. The new DSL should help with that for the query language; for the backend documentation, if we can formalize the backend definition, we may be able to generate some of that documentation too.
Update full long covid study definition to V2
Format for the IMD variable will depend on #96
BMI to be implemented as explicitly recorded BMI values only (no inferred calculations from height/weight)
We currently hardcode the collation for codes to `Latin1_General_BIN`, because that's what's used in TPP for CTV3, which is what we need at the moment. However, it needs to be allowed to vary as appropriate. We can consider this as part of the backend definition and do something like this:

```python
collation = self.backend.collation_for_coding_system(codelists.system)
```
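A backend hook along these lines might look like the following (a hypothetical sketch; the mapping, method name, and default value are assumptions, with the CTV3 entry taken from the note above):

```python
class TPPBackend:
    # Per-coding-system collations; only the CTV3 entry is taken from the
    # note above, the rest of this mapping is illustrative.
    _collations = {"ctv3": "Latin1_General_BIN"}

    def collation_for_coding_system(self, system, default="Latin1_General_CI_AS"):
        """Return the collation to use for codes in the given coding system,
        falling back to a backend-wide default."""
        return self._collations.get(system, default)
```

The query builder could then apply the returned collation when comparing code columns, rather than hardcoding `Latin1_General_BIN` everywhere.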
IMD queries in the old cohort extractor are calculated by finding the patients with addresses registered on the index_date, and then ordering them by:

```sql
ORDER BY
    StartDate DESC,
    EndDate DESC,
    IIF(MSOACode = 'NPC', 1, 0),
    PatientAddress_ID
```

This orders by most recent start date, then latest end date, then prefers addresses which are not marked "NPC" (for "No Postcode"), and finally uses the address ID as a tie-breaker.
The IMD query in the old cohort extractor could almost be implemented as a `last_by` method in the new version, like:

```python
imd_value = (
    table("patient_address")
    .date_in_range(index_date)
    .last_by("date_start", "date_end", "has_postcode", "patientaddress_id")
    .get("index_of_multiple_deprivation")
)
```

Except that `has_postcode` is not a field, it's the `IIF` statement on `MSOACode`. A `Row` could accept an expression as a column (with relevant DSL wrapping for defining the expression). We probably still want a custom method of some sort so that researchers don't have to type the column ordering explicitly though.
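In SQLAlchemy terms, the `has_postcode` ordering key is just a `case` expression used as a sort column; a sketch of how the query builder might produce the old `ORDER BY` (illustrative table definition, not the actual backend code):

```python
from sqlalchemy import case, column, desc, select, table

# Illustrative table definition mirroring the columns used in the ORDER BY.
patient_address = table(
    "PatientAddress",
    column("StartDate"),
    column("EndDate"),
    column("MSOACode"),
    column("PatientAddress_ID"),
)

# Addresses without a real postcode carry the sentinel MSOA code 'NPC';
# mapping those to 1 and everything else to 0 sorts them last.
has_no_postcode = case((patient_address.c.MSOACode == "NPC", 1), else_=0)

query = select(patient_address).order_by(
    desc(patient_address.c.StartDate),
    desc(patient_address.c.EndDate),
    has_no_postcode,
    patient_address.c.PatientAddress_ID,
)
```

This is the sense in which a `Row` ordering key could be an expression rather than a plain field name: SQLAlchemy already accepts arbitrary expressions in `order_by`, so the DSL only needs a way to define and wrap them.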
Examples of the kinds of expressions we'll need to support:

```python
x > y
(x > y) & (y == z)
((x > y) & (y == z)) | (z == a)
~((x > y) & (y == z))
```
Create a checklist on #38. Create cards for everything we don't already have cards for.
There are some notes here: https://ebmdatalab.slack.com/archives/C026C0ZCHGR/p1626786014371600.
For reasons best known to itself, MSSQL will change its interpretation of ISO format dates (YYYY-MM-DD) according to its localisation settings, and so will sometimes treat them as a bizarre (and seen nowhere else) hybrid of ISO and US formats: YYYY-DD-MM. (See opensafely-core/cohort-extractor#80)
However, it will consistently and correctly interpret dates in YYYYMMDD format (i.e. ISO without the hyphens).
We obviously don't want to leak this monstrous detail out to study authors, and so our approach thus far has been to use ISO format everywhere and then convert to MSSQL format at the last moment when building the SQL.
We do this by passing every string through this function, which reformats it if it looks like an ISO date and leaves it alone otherwise:
https://github.com/opensafely-core/cohort-extractor/blob/bb5d2e9faa8e3e5aa35824edcfe2e48d088be59c/cohortextractor/tpp_backend.py#L2421-L2434
This has the advantage of being comprehensive. Accidentally letting an ISO date through unconverted would lead to very subtle bugs, because it's unlikely to trigger an error, just a subtly wrong date condition. So we really, really don't want to do that. On the other hand, I think there have been one or two cases where this has bitten us, because someone wanted to use date-like strings as categories and then got very confused.
Theoretically, the new cohortextractor has access to enough type information to know whether a value is supposed to be a date or not and format it accordingly. However, we're not yet propagating this type information through, and this problem needs solving now, so I suggest we adopt the same approach as the old cohortextractor, but hopefully implemented in such a way that we can surgically swap it out for a "proper" solution when the time comes.
It would also be good to work out how to put the test database into the right localisation mode to trigger this behaviour so we can increase our chances of catching problems early.
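The conversion described above can be sketched as follows (a minimal sketch of the approach, not the linked function itself; the name is illustrative):

```python
import re

# Matches strings that look exactly like ISO dates: YYYY-MM-DD.
ISO_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def to_mssql_date(value):
    """If `value` looks like an ISO date, strip the hyphens so MSSQL parses
    it unambiguously as YYYYMMDD; leave any other string alone."""
    match = ISO_DATE_RE.match(value)
    if match:
        return "".join(match.groups())
    return value
```

For example, `to_mssql_date("2020-01-01")` returns `"20200101"`, while non-date strings pass through unchanged. Note this illustrates the footgun mentioned above: a category value that happens to look like an ISO date would also be rewritten.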
The current workflow for using codelists in an OpenSAFELY study definition looks something like this:

1. Add the codelist to a `codelists.txt` file.
2. Run `opensafely codelists update` to download it.
3. Reference the downloaded CSV via the `codelist_from_csv` function.

As a bonus step, it looks like a pattern has emerged whereby users do the `codelist_from_csv` call in a file called `codelists.py` and then do `from codelists import *` in the study definition.

I think there are a few issues with this workflow. (The `from codelists import *` pattern makes this even harder.)

Proposal: we no longer have a `codelists.txt` file.
Study authors should reference codelists in a study directly by URL, e.g.

```python
from cohortextractor import codelist

pulse_oximetry_codes = codelist(
    'https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/'
)
```
The `opensafely codelists update` command would get the URLs out of the study definitions and then download as normal (exact mechanics of this TBC, but I have some ideas).
We already store some codelist metadata in a `codelists.json` file. We extend this to include details of the coding system used and the relevant columns so the user doesn't have to specify these again.
Additional notes:

- Have the `codelist` function accept a path to a local CSV file, in which case the coding system and column arguments would be required.
- Have the `opensafely codelists update` step run automatically when running the cohortextractor locally, to reduce the steps required even further.

"Filtering" means taking a table and removing some of the rows it contains according to certain criteria. This should be a chainable operation so that further filters can be applied to an already filtered table.
The most basic kind of filter compares a column in the table with a single, static value, e.g. `date_recorded >= '2020-01-01'`, but there are various more complex cases we'll need to support as well, including:

- `IN`/`NOT IN` queries (or their semantic equivalents) whereby the comparison is not with a single value but an array, e.g. `date_recorded NOT IN <list_of_dates_where_event_X_happened>`
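Chainability falls out naturally if each filter returns a new, immutable table object rather than mutating in place; a minimal sketch (hypothetical names, not the actual ehrQL API):

```python
class Table:
    """Minimal sketch of a chainable, immutable filter API (illustrative only)."""

    def __init__(self, name, filters=()):
        self.name = name
        self.filters = tuple(filters)

    def filter(self, **criteria):
        # Return a new Table instead of mutating self, so an already
        # filtered table can be filtered again (or reused) safely.
        return Table(self.name, self.filters + (criteria,))


events = Table("clinical_events")
recent_foo = events.filter(code="foo").filter(date__gte="2020-01-01")
```

Here `recent_foo` accumulates both criteria while `events` is untouched, which is the property "further filters can be applied to an already filtered table" requires.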
The most basic example looks something like this:

```python
class Cohort:
    has_event_x = table("coded_events").filter(code="code_X").exists()
```

And the expected output looks something like this:

```
patient_id,has_event_x
10001,False
10002,True
10003,False
```
It's useful to have this in development but it makes the logs too noisy for users.
If we need to see the SQL for debugging purposes we can always regenerate the exact SQL given a study definition and the version of the cohortextractor used.
We'd like to identify a real study that we can aim to support. We'll make it a major milestone to have this study actually running on the new cohort extractor in the production system.
Characteristics we're looking for:
We need to pick a versioning scheme to indicate that it's work in progress and the API is not stable. I suggest something with "alpha" in it for the moment.
I just tried to clone this repo on the TPP server and hit this issue. It's possible to work around this by just checking out the directories you need but not ideal.
This may be the only time we ever care about cloning this repo on Windows, but it's not impossible that we'll want to do it again in future and hopefully it's an easy fix via changing the naming scheme.
Add the `exists` aggregation first, as it's needed for the simplest population selection.