vmenger / deduce
Deduce: de-identification method for Dutch medical text
License: GNU General Public License v3.0
The telephone number format 06 18 34 56 78 is currently not de-identified by DEDUCE. I suspect this format is not uncommon, so it might be worth adding support for it to DEDUCE.
import deduce
text = u"De patient J. Jansen (e: [email protected], t: 06 18 34 56 78) is 64 jaar oud."
print(deduce.deidentify_annotations(deduce.annotate_text(text)))
De <PERSOON-1> (e: <URL-1>, t: 06 18 34 56 78) is <LEEFTIJD-1> jaar oud.
Interestingly, the first four numbers can be recognized as a date:
import deduce
text = u"De patient J. Jansen (e: [email protected], t: 06 12 34 56 78) is 64 jaar oud."
print(deduce.deidentify_annotations(deduce.annotate_text(text)))
De <PERSOON-1> (e: <URL-1>, t: <DATUM-1> 56 78) is <LEEFTIJD-1> jaar oud.
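A pattern along these lines could catch the spaced format. This is a sketch, not DEDUCE's actual phone regex; the exact grouping into digit pairs and the allowed separators are assumptions:

```python
import re

# Hypothetical pattern, not DEDUCE's own: the Dutch mobile prefix "06"
# followed by four pairs of digits, optionally separated by spaces or dashes.
mobile = re.compile(r"\b06(?:[ -]?\d{2}){4}\b")

print(mobile.search("t: 06 18 34 56 78").group())  # -> 06 18 34 56 78
```

The same pattern also matches the unspaced form 0618345678, since the separator is optional.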
The following URL: https://keuzehulp‐vruchtbaarheidsbehoud‐transmannen.nl
takes about 500 seconds to de-identify, for no apparent reason. It's the first regexp in annotate_url that causes the delay. It's probably doing some absurd and epic search to match, but perhaps it can be optimized a little.
Simplest reproduction I found so far:
https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa a.nl
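The hang has the signature of catastrophic backtracking: nested quantifiers can partition a long token in exponentially many ways before the match fails. The pattern below is an illustration of the failure mode and its fix, not the actual regexp in annotate_url. Forcing each repetition to consume a literal dot leaves the engine only one way to split the input, so matching stays fast:

```python
import re

# A structure like (?:[\w-]+\.?)+ backtracks exponentially on a long token
# with no final match (illustrative; not DEDUCE's actual URL pattern).
# Requiring a dot in every repetition removes the ambiguity:
safe = re.compile(r"[\w-]+(?:\.[\w-]+)+")

text = "https://" + "a" * 40 + " a.nl"
print(safe.findall(text))  # -> ['a.nl']
```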
The logic for detecting housenumber letters (1a) and housenumber additions (1-bis) is not very complete, so Deduce does not always correctly annotate the information following the housenumber. There seem to be no strict rules on what can follow a housenumber. Currently Deduce only detects a single letter (case-insensitive) after a housenumber, optionally separated by whitespace (so 1a, 1A, 1 a and 1 A). The information is usually not very privacy-sensitive, so this is not high priority, but it would still be nice to handle it a bit more cleanly.
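One direction is to extend the pattern with a branch for hyphenated additions. The set of accepted additions below is an assumption (since there are no strict rules), and this is not DEDUCE's current rule:

```python
import re

# Hypothetical extension of the housenumber rule: a single optional letter
# ("1a", "1 A") as now, plus hyphenated additions such as "1-bis" or "22-20".
housenumber = re.compile(r"\b\d{1,4}(?:\s?[a-zA-Z]\b|-(?:bis|[a-zA-Z0-9]{1,4}))")

for s in ["1a", "1 A", "1-bis", "22-20"]:
    print(housenumber.match(s).group())
```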
See:
from deduce.tokenizer import tokenize_split
print(tokenize_split("v.d. "))
Expected output: ['v.d.', ' ']
Real output: ['v.', 'd', '. ']
As a result, a name like v.d. Valk will not be detected.
While setting up continuous integration, so that unit tests run automatically, I discovered that utility.py uses two names that are not imported anywhere: unicodedata and unicode. This usually goes unnoticed, I guess, because we don't normally call utility.read_list with normalize='ascii'. But it raises an error in the CI action, which would block merging, so I think we should fix this first.
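For reference, a self-contained sketch of what the normalize='ascii' path presumably intends (the function name here is illustrative, not DEDUCE's API): in Python 3 the built-in unicode no longer exists, so str plus an explicit unicodedata import is needed.

```python
import unicodedata

def normalize_ascii(value: str) -> str:
    # Python 3 has no built-in `unicode`; operate on `str` instead, and
    # strip diacritics via NFKD decomposition before encoding to ASCII.
    return (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )

print(normalize_ascii("café"))  # -> cafe
```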
Deduce does not recognize names containing initials of more than one letter, like Ph. or Th.:
A.Ph. de Visser
A.Th. de Visser
These cases are fine though:
ABC de Visser
A.B.C. de Visser
Perhaps there are more/similar names with initials that end in an h.
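A sketch of a pattern that would also accept two- or three-letter initials like Ph. and Th. (illustrative only; this is not DEDUCE's actual name logic):

```python
import re

# Allow initials of one capital plus up to two lowercase letters, each
# followed by a period: matches "A.", "Ph.", "Th.", "A.B.C." (hypothetical rule).
initial = re.compile(r"\b(?:[A-Z][a-z]{0,2}\.)+")

for name in ["A.Ph. de Visser", "A.Th. de Visser", "A.B.C. de Visser"]:
    print(initial.match(name).group())
```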
While working on the branch structured_annotations I came across the method deduce.annotate.annotate_names_context, but I can't find what it is used for. I've tried inputting texts in many formats, but I never found a case in which it did anything. Can you provide an example and/or tell me what the expected behaviour of this method is?
Currently DEDUCE can only accept data via a Flask-based REST API web service. Ideally, a level of indirection should be added to decouple the data source, making DEDUCE adaptable to different input sources or formats (e.g. CSV file, TSV file, DB connector (ODBC/JDBC), SOAP). In Enterprise Service Bus terms, this is known as an adapter.
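As a sketch of what such an adapter could look like in Python (all names here are illustrative, not an existing DEDUCE API):

```python
import csv
from typing import Iterable, Iterator, Protocol

class TextSource(Protocol):
    """Hypothetical adapter interface: anything that can yield texts to
    de-identify, regardless of where they come from."""
    def texts(self) -> Iterable[str]: ...

class CsvSource:
    """One concrete adapter: reads the texts from a column of a CSV file."""
    def __init__(self, path: str, column: int = 0) -> None:
        self.path = path
        self.column = column

    def texts(self) -> Iterator[str]:
        with open(self.path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                yield row[self.column]
```

A TSV source, DB connector or SOAP client would then be additional implementations of the same Protocol, leaving the de-identification core untouched.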
This text:
text = "Patient krijgt 1000mg van het middel, tot 1250mg verhogen"
Gets annotated (by annotate_postalcode) to:
Patient krijgt 1000mg> van het middel, tot <LOCATIE 1250mg verhogen
Which, since 1.0.4, raises a:
ValueError: Incorrectly formatted string
The bug has probably been around much longer, but only came to light now. Fixing should be straightforward; working on it.
There are still lots of false positives regarding ages, like:
sinds 5 jaar
ongeveer 5 jaar
< 5 jaar
Unfortunately, the current regexp method is limited here: creating many negative lookbehinds is neither scalable nor efficient. An approach that takes negative/pseudo-triggers before or after the age number into account would be a nice improvement.
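A sketch of the trigger-based idea (the trigger list and pattern are assumptions, not DEDUCE's implementation): match every age candidate first, then veto candidates preceded by a pseudo-trigger, instead of encoding each veto as a lookbehind.

```python
import re

# Hypothetical pseudo-triggers: words that indicate a duration, not an age.
PSEUDO_TRIGGERS = {"sinds", "ongeveer", "<", "al"}

# Capture the token before the number (if any) together with the number.
candidate = re.compile(r"(\S+)?\s*\b(\d{1,3}) jaar\b")

def find_ages(text: str) -> list[str]:
    ages = []
    for m in candidate.finditer(text):
        preceding = (m.group(1) or "").lower()
        if preceding not in PSEUDO_TRIGGERS:
            ages.append(m.group(2))
    return ages

print(find_ages("Patient is 64 jaar oud, klachten sinds 5 jaar"))  # -> ['64']
```

Vetoes after the number (e.g. "5 jaar geleden") could be handled the same way with a second captured token.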
Deduce includes a list of medical terms, but it's quite short. It would be nice to extend it with some more terms.
Some specific examples of false positives:
A. Femoralis
N. Fibularis
M. Hodgkin
Currently, for annotation purposes, DEDUCE only supports single-token medical terms. Multi-token medical terms could also be added; possibly this is already possible.
From #115, Deduce includes a long list of healthcare institutions. There is, however, a mismatch between the names of the institutions on the list and the actual names written in text. For example, an institution like 'De Binnentuin, zorgboerderij en dagbesteding' would almost certainly be written as 'De Binnentuin'. For hospitals, some optimizations have already been done, but for non-hospitals (healthcare_institutions.txt) there are probably more optimizations to be done, without impacting performance and false positives too much.
In medical texts, tokens occur that are long numbers, such as the Dutch citizen service number (Burgerservicenummer, BSN), BIG and AGB numbers, and bank accounts. Some of these adhere to a specific format (e.g. IBAN) and/or to a specific mathematical algorithm (e.g. the BSN and the 11-proof).
To enhance the readability of anonymized texts, numbers conforming to a known format could be annotated with a semantically meaningful annotation like [BSN] or [BANKACCOUNTNUMBER]. Numbers not conforming to any known format could be annotated with a general annotation like [LONGNUMBER].
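The BSN's 11-proof is simple to implement; a sketch (111222333 is a nine-digit number that happens to pass the check, used here purely as test data):

```python
def is_valid_bsn(number: str) -> bool:
    """Dutch BSN '11-proof': the weighted digit sum must be divisible by 11.
    Weights are 9 down to 2 for the first eight digits, -1 for the last."""
    if len(number) != 9 or not number.isdigit():
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    return sum(int(d) * w for d, w in zip(number, weights)) % 11 == 0

print(is_valid_bsn("111222333"))  # -> True
print(is_valid_bsn("111222334"))  # -> False
```

A nine-digit number that passes this check is very likely a BSN, which would justify the more specific [BSN] annotation over [LONGNUMBER].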
E.g.:
Ziekte van Lyme
Here's a list in English, would be nice to incorporate this somewhere in logic or filter them out of lookup lists: https://en.wikipedia.org/wiki/List_of_eponymous_diseases
An uncommon pattern, but it does exist:
de Vries, B.
de Vries, Berend
Vries, de
Adding a contribute.md is always a good option. Let me know if I can work on it.
If you input this text into Deduce:
'ADHD Adres: Naamlaan 100 Woonplaats: 3512AB Apeldoorn Tel: 088-1234567'
and run deduce.annotate_text, you will obtain:
'ADHD <PERSOON Adres: <LOCATIE Naamlaan 100> Woonplaats: 3512AB Apeldoorn Tel>: <TELEFOONNUMMER 088-1234567>'
which includes nested tags. Obviously there is a problem whereby the entire string from "Adres" to "Tel" is being detected as a person's name. However, the problem I'm pointing out here is that, having detected that, Deduce then finds a LOCATIE tag within the PERSOON tag, so the final output contains nested tags, which should not be allowed.
This should be fixed easily by moving the flatten_text call, currently happening within the "names" deidentification, to the very end of the annotate_text method, right before returning the final text. Do you agree with this?
E.g.:
Jan-Willem
Bruins Slot
are currently never detected, at least not via the lookup lists, as those only match single tokens.
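One way to support multi-token entries is a sliding n-gram window over the token list; a sketch (the lookup set and function are illustrative, not DEDUCE's code):

```python
# Hypothetical multi-token lookup set; real lists would come from files.
SURNAMES = {("Bruins", "Slot"), ("de", "Vries")}
MAX_LEN = max(len(entry) for entry in SURNAMES)

def find_multitoken(tokens: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) token spans matching a multi-token lookup entry,
    preferring the longest match at each position."""
    matches = []
    for i in range(len(tokens)):
        for n in range(MAX_LEN, 0, -1):
            if tuple(tokens[i:i + n]) in SURNAMES:
                matches.append((i, i + n))
                break
    return matches

print(find_multitoken(["Dhr.", "Bruins", "Slot", "is", "opgenomen"]))  # -> [(1, 3)]
```

For large lists, a trie over the tokens would avoid probing every window length, but the set-of-tuples version already shows the idea.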
It would be nice to update the lists of first names and surnames with current data, as the existing lists were compiled 5-6 years ago. Perhaps there are new/better lists available.
Currently the residence (street name and house number) is annotated using a regular expression following the Dutch convention (Dorpstraat 2), with suffixes common in Dutch street names, such as -straat, -hof or -laan. Note, however, that there are a large number of addresses in the Netherlands that do not match that pattern, e.g. Nijenheim 22-20. Moreover, foreign (i.e. non-Dutch) addresses are not found by such a regular expression, e.g. '221b Baker Street'.
A proposed solution is to replace the regexp with a lookup list. This list could consist of the intersection of a list generated from an EHR (Electronic Health Record) system and a list of all street names in the geographical region where patients live (e.g. the Netherlands or Europe).
A possible source of Dutch street names is the Kadaster's Basisregistratie Adressen en Gebouwen (BAG), or PostNL.
In #92 we removed detection of day-month combinations, and only detect combinations of day, month and year as date. Day-month combinations may still be considered PHI in some cases, however, detection is harder than a simple regexp. See e.g. these real-world examples:
Patient werd op 5/5 opgenomen
Patient werd op 2.5 opgenomen
versus:
Motoriek: Deltoideus 5/5; Biceps 5/5
Start met bisoprolol 2.5 mg
Perhaps a different rule-based approach can be used here?
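One possible rule, sketched below with an assumed trigger word ("op") and a unit veto; both lists would need tuning on real data, and this is not a proposal for the final implementation:

```python
import re

# Treat "d/m" or "d.m" as a date only when preceded by a date-ish trigger
# ("op") and not followed by a unit like "mg" (hypothetical rule).
DATE_TRIGGER = re.compile(r"\bop\s+(\d{1,2}[./]\d{1,2})\b(?!\s*mg)")

def find_day_month(text: str) -> list[str]:
    return DATE_TRIGGER.findall(text)

print(find_day_month("Patient werd op 5/5 opgenomen"))         # -> ['5/5']
print(find_day_month("Motoriek: Deltoideus 5/5; Biceps 5/5"))  # -> []
print(find_day_month("Start met bisoprolol 2.5 mg"))           # -> []
```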
IBAN numbers are not currently detected by Deduce. It has been reported that they sometimes occur in text, so this might be useful to add. I have not encountered them in any dataset I've worked with, so I'm hoping somebody who does encounter them is willing to pick it up.
See also: #74
Can I use the same implementation, with some tweaks, for other languages by changing the static data and regexes accordingly? For example, for English.
When there are multiple consecutive spaces present in the patient identifiers (e.g. patient_first_names, patient_last_names), an IndexError occurs during processing; see the stacktrace below. The error occurred with the input "Peter  Fiver" (with two spaces) as the patient's first names.
self = <deduce.pattern.name_patient.PersonInitialFromNamePattern object at 0x000001A6664C1AC0>
token = Token(text='Peter', start_char=0, end_char=5)
metadata = <docdeid.document.MetaData object at 0x000001A667603340>
def match(self, token: dd.Token, metadata: dd.MetaData) -> Optional[tuple[dd.Token, dd.Token]]:
for _, first_name in enumerate(metadata["patient"].first_names):
if str_match(token.text, first_name[0]):
E IndexError: string index out of range
..\venv\lib\site-packages\deduce\pattern\name_patient.py:47: IndexError
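The crash comes from first_name[0] hitting the empty string that a double space produces when the names are split on single spaces. A minimal sketch of the fix (helper name is illustrative):

```python
def split_names(raw: str) -> list[str]:
    # "Peter  Fiver".split(" ") -> ["Peter", "", "Fiver"], and indexing the
    # empty string with [0] raises IndexError. split() without an argument
    # collapses consecutive whitespace instead of producing empty strings.
    return raw.split()

print(split_names("Peter  Fiver"))  # -> ['Peter', 'Fiver']
```

Alternatively, the matching loop could simply skip empty strings defensively.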
Recently, interest in DEDUCE is increasing: I get more messages/requests/questions, downloads are increasing, the repository has become somewhat alive. This must mean more people are using clinical text data, which is nice news.
However, DEDUCE was created a long time ago with a specific problem in mind, by someone without much in depth python/dev experience (me). I now find the implementation of DEDUCE lacking in several aspects, and this in turn prevents making improvements to the algorithm itself. I intend to start fixing that, first improving just the implementation, and later potentially the algorithm too. Here I want to write out some of my ideas for improvements in the (near) future.
Roughly I'm thinking of the following implementation changes (in this order), keeping the algorithm the same:
introducing proper Annotator, Pipeline, Tokenizer and Deduce objects, with proper abstract base classes.
After that, we can start creating more building blocks (e.g. for new PHI categories we find interesting) and improving the building blocks we have, and thereby make DEDUCE more useful for cases outside psychiatry. Ideally, we would have a bit of an ecosystem developing around the rule-based de-identification problem, perhaps also outside medicine and in other languages. But this is really still stuff for the future.
This is all a bit stream of thought -- but please don't hesitate to add any comments or ideas! In a later stage I will try to structure and plan this a bit more in separate issues.
Try entering 'Vandaag is het 14-05' and 'Vandaag is het 14-05.' on tdslab.com/deduce
You obtain 'Vandaag is het 14-05' and 'Vandaag is het <DATUM-1>.', respectively
The method deduce.annotate_text_structured returns structured annotations corresponding to the annotations embedded in the original text. It does so by reconstructing the original text from the annotated one, keeping track of the position at which the original text of each annotation would be placed. For example, in the text "I'm in love with Jane", annotated as "I'm in love with <PERSOON-1>", the method finds the tag "<PERSOON-1>", computes the index of the opening angle bracket, and assumes that the word "Jane" occurs at that same location in the original text.
The trouble is that Deduce sometimes (accidentally) inserts or deletes a blank space, which messes with the indexing of the annotations in the original text. To mitigate this, we could access the original text to make sure that the annotation text matches what is present in the original text at that index.
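A minimal sketch of that mitigation (function name and search window are illustrative): verify that the annotated span really occurs at the computed index in the original text, and search a few characters around it when whitespace drift has shifted the position.

```python
from typing import Optional

def align(original: str, span: str, guess: int, window: int = 3) -> Optional[int]:
    """Return the index in `original` where `span` actually occurs, trying
    `guess` plus/minus `window` characters; None if it is not found."""
    for offset in range(-window, window + 1):
        i = guess + offset
        if 0 <= i <= len(original) - len(span) and original[i:i + len(span)] == span:
            return i
    return None

print(align("I'm in love with Jane", "Jane", 17))   # -> 17
print(align("I'm in love with  Jane", "Jane", 17))  # -> 18 (shifted by one space)
```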
Using the string below as input for annotate_text gives me an IndexError: list index out of range
import deduce
text_input = "https://www.rtl.nl/components/financien/rtlz/2004/06_juni/crucell_sars.pdf"
deduce.annotate_text(text_input)
The strings "https://www.rtl.nl/components/financien/rtlz/2004/crucell_sars.pdf" and "https://www.rtl.nl/components/financien/rtlz/06_juni/crucell_sars.pdf" seem to work, however.