ehrQL: the electronic health record query language for OpenSAFELY

Home Page: https://docs.opensafely.org/ehrql/
License: Other
One major feature still missing, which we will need for any study, is codelists. We should add it. :-)
We have an opportunity to simplify the use of codelists for researchers, for example by allowing them to specify just a URL from which we then retrieve the codelist, rather than having to deal with files in the workspace. Before implementing this we should discuss it with the wider team.
All of the basic filters will ignore null values; we should have some way of customising how we deal with nulls, and/or options to include/exclude nulls, e.g. additional filters:

```python
table("clinical_events").filter("value", equals_or_null="foo")
```

or parameters:

```python
table("clinical_events").filter(value="foo", include_null=True)
```

Probably implemented with SQLAlchemy's `or_` conjunction.
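An `equals_or_null` style filter could be built with SQLAlchemy's `or_` roughly like this (a sketch only, using an illustrative Core table, not the actual backend schema):

```python
from sqlalchemy import column, or_, select, table

# Illustrative table definition; the real schema is backend-specific.
clinical_events = table("clinical_events", column("patient_id"), column("value"))

# Match rows where value equals "foo" OR is NULL; a plain equality
# comparison would silently exclude the NULL rows.
query = select(clinical_events).where(
    or_(clinical_events.c.value == "foo", clinical_events.c.value.is_(None))
)
```

Compiling `query` produces a `WHERE value = :value_1 OR value IS NULL` condition, which is the semantics the `equals_or_null` filter above would need.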
As much as possible of the documentation should be generated from the code. The new DSL should help with that for the query language; for the backend documentation, if we can formalize the backend definition, we may be able to generate some of that documentation too.
Update full long covid study definition to V2
Format for the IMD variable will depend on #96
BMI to be implemented as explicitly recorded BMI values only (no inferred calculations from height/weight)
We currently hardcode the collation for codes to `Latin1_General_BIN`, because that's what's used in TPP for CTV3, which is what we need at the moment. However, it needs to be allowed to vary as appropriate. We can consider this as part of the backend definition and do something like this:

```python
collation = self.backend.collation_for_coding_system(codelists.system)
```
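A backend hook along these lines might look like the following (a hypothetical sketch; the mapping, method name, and default value are assumptions, with the CTV3 entry taken from the note above):

```python
class TPPBackend:
    # Per-coding-system collations; only the CTV3 entry is taken from the
    # note above, the rest of this mapping is illustrative.
    _collations = {"ctv3": "Latin1_General_BIN"}

    def collation_for_coding_system(self, system, default="Latin1_General_CI_AS"):
        """Return the collation to use for codes in the given coding system,
        falling back to a backend-wide default."""
        return self._collations.get(system, default)
```

The query builder could then apply the returned collation when comparing code columns, rather than hardcoding `Latin1_General_BIN` everywhere.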
IMD queries in the old cohort extractor are calculated by finding the patients with addresses registered on the index_date, and then ordering them by:

```sql
ORDER BY
    StartDate DESC,
    EndDate DESC,
    IIF(MSOACode = 'NPC', 1, 0),
    PatientAddress_ID
```

This orders by most recent start date, then latest end date, then prefers addresses which are not marked "NPC" (for "No Postcode"), and finally uses the address ID as a tie-breaker.
The IMD query in the old cohort extractor could almost be implemented as a `last_by` method in the new version, like:

```python
imd_value = (
    table("patient_address")
    .date_in_range(index_date)
    .last_by("date_start", "date_end", "has_postcode", "patientaddress_id")
    .get("index_of_multiple_deprivation")
)
```

Except that `has_postcode` is not a field, it's the `IIF` statement on `MSOACode`. A `Row` could accept an expression as a column (with relevant DSL wrapping for defining the expression). We probably still want a custom method of some sort so that researchers don't have to type the column ordering explicitly though.
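In SQLAlchemy terms, the `has_postcode` ordering key is just a `case` expression used as a sort column; a sketch of how the query builder might produce the old `ORDER BY` (illustrative table definition, not the actual backend code):

```python
from sqlalchemy import case, column, desc, select, table

# Illustrative table definition mirroring the columns used in the ORDER BY.
patient_address = table(
    "PatientAddress",
    column("StartDate"),
    column("EndDate"),
    column("MSOACode"),
    column("PatientAddress_ID"),
)

# Addresses without a real postcode carry the sentinel MSOA code 'NPC';
# mapping those to 1 and everything else to 0 sorts them last.
has_no_postcode = case((patient_address.c.MSOACode == "NPC", 1), else_=0)

query = select(patient_address).order_by(
    desc(patient_address.c.StartDate),
    desc(patient_address.c.EndDate),
    has_no_postcode,
    patient_address.c.PatientAddress_ID,
)
```

This is the sense in which a `Row` ordering key could be an expression rather than a plain field name: SQLAlchemy already accepts arbitrary expressions in `order_by`, so the DSL only needs a way to define and wrap them.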
Examples of the kinds of expressions we'll need to support:

```python
x > y
(x > y) & (y == z)
((x > y) & (y == z)) | (z == a)
~((x > y) & (y == z))
```
Create a checklist on #38. Create cards for everything we don't already have cards for.
There are some notes here: https://ebmdatalab.slack.com/archives/C026C0ZCHGR/p1626786014371600.
For reasons best known to itself, MSSQL will change its interpretation of ISO format dates (YYYY-MM-DD) according to its localisation settings, and so will sometimes treat them as a bizarre (and seen nowhere else) hybrid of ISO and US formats: YYYY-DD-MM. (See opensafely-core/cohort-extractor#80)
However, it will consistently and correctly interpret dates in YYYYMMDD format (i.e. ISO without the hyphens).
We obviously don't want to leak this monstrous detail out to study authors, and so our approach thus far has been to use ISO format everywhere and then convert to MSSQL format at the last moment when building the SQL.
We do this by passing every string through this function, which reformats it if it looks like an ISO date and leaves it alone otherwise:
https://github.com/opensafely-core/cohort-extractor/blob/bb5d2e9faa8e3e5aa35824edcfe2e48d088be59c/cohortextractor/tpp_backend.py#L2421-L2434
This has the advantage of being comprehensive. Accidentally letting an ISO date through unconverted would lead to very subtle bugs, because it's unlikely to trigger an error, just a subtly wrong date condition. So we really, really don't want to do that. On the other hand, I think there have been one or two cases where this has bitten us, because someone wanted to use date-like strings as categories and then got very confused.
Theoretically, the new cohortextractor has access to enough type information to know whether a value is supposed to be a date or not and format it accordingly. However, we're not yet propagating this type information through, and this problem needs solving now, so I suggest we adopt the same approach as the old cohortextractor, but hopefully implemented in such a way that we can surgically swap it out for a "proper" solution when the time comes.
It would also be good to work out how to put the test database into the right localisation mode to trigger this behaviour so we can increase our chances of catching problems early.
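The conversion described above can be sketched as follows (a minimal sketch of the approach, not the linked function itself; the name is illustrative):

```python
import re

# Matches strings that look exactly like ISO dates: YYYY-MM-DD.
ISO_DATE_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})$")


def to_mssql_date(value):
    """If `value` looks like an ISO date, strip the hyphens so MSSQL parses
    it unambiguously as YYYYMMDD; leave any other string alone."""
    match = ISO_DATE_RE.match(value)
    if match:
        return "".join(match.groups())
    return value
```

For example, `to_mssql_date("2020-01-01")` returns `"20200101"`, while non-date strings pass through unchanged. Note this illustrates the footgun mentioned above: a category value that happens to look like an ISO date would also be rewritten.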
The current workflow for using codelists in an OpenSAFELY study definition looks something like this:

1. Add the codelist to a `codelists.txt` file.
2. Run `opensafely codelists update` to download it.
3. Reference the downloaded CSV via the `codelist_from_csv` function.

As a bonus step, it looks like a pattern has emerged whereby users do the `codelist_from_csv` call in a file called `codelists.py` and then do `from codelists import *` in the study definition.

I think there are a few issues with this workflow. (The `from codelists import *` pattern makes this even harder.)

Proposal: we no longer have a `codelists.txt` file.
Study authors should reference codelists in a study directly by URL, e.g.

```python
from cohortextractor import codelist

pulse_oximetry_codes = codelist(
    'https://www.opencodelists.org/codelist/opensafely/pulse-oximetry/72ce1380/'
)
```
The `opensafely codelists update` command would get the URLs out of the study definitions and then download as normal (exact mechanics of this TBC, but I have some ideas).
We already store some codelist metadata in a `codelists.json` file. We extend this to include details of the coding system used and the relevant columns so the user doesn't have to specify these again.
Additional notes:

- Have the `codelist` function accept a path to a local CSV file, in which case the coding system and column arguments would be required.
- Have the `opensafely codelists update` step run automatically when running the cohortextractor locally, to reduce the steps required even further.

"Filtering" means taking a table and removing some of the rows it contains according to certain criteria. This should be a chainable operation so that further filters can be applied to an already filtered table.
The most basic kind of filter compares a column in the table with a single, static value, e.g. `date_recorded >= '2020-01-01'`, but there are various more complex cases we'll need to support as well, including:

- `IN`/`NOT IN` queries (or their semantic equivalents) whereby the comparison is not with a single value but an array, e.g. `date_recorded NOT IN <list_of_dates_where_event_X_happened>`
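Chainability falls out naturally if each filter returns a new, immutable table object rather than mutating in place; a minimal sketch (hypothetical names, not the actual ehrQL API):

```python
class Table:
    """Minimal sketch of a chainable, immutable filter API (illustrative only)."""

    def __init__(self, name, filters=()):
        self.name = name
        self.filters = tuple(filters)

    def filter(self, **criteria):
        # Return a new Table instead of mutating self, so an already
        # filtered table can be filtered again (or reused) safely.
        return Table(self.name, self.filters + (criteria,))


events = Table("clinical_events")
recent_foo = events.filter(code="foo").filter(date__gte="2020-01-01")
```

Here `recent_foo` accumulates both criteria while `events` is untouched, which is the property "further filters can be applied to an already filtered table" requires.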
The most basic example looks something like this:

```python
class Cohort:
    has_event_x = table("coded_events").filter(code="code_X").exists()
```

And the expected output looks something like this:

```
patient_id,has_event_x
10001,False
10002,True
10003,False
```
It's useful to have this in development but it makes the logs too noisy for users.
If we need to see the SQL for debugging purposes we can always regenerate the exact SQL given a study definition and the version of the cohortextractor used.
We'd like to identify a real study that we can aim to support. We'll make it a major milestone to have this study actually running on the new cohort extractor in the production system.
Characteristics we're looking for:
We need to pick a versioning scheme to indicate that it's work in progress and the API is not stable. I suggest something with "alpha" in it for the moment.
I just tried to clone this repo on the TPP server and hit this issue. It's possible to work around this by just checking out the directories you need but not ideal.
This may be the only time we ever care about cloning this repo on Windows, but it's not impossible that we'll want to do it again in future and hopefully it's an easy fix via changing the naming scheme.
Add the `exists` aggregation first, as it's needed for the simplest population selection.