The deduce from vmenger

Telephone number format not de-identified

The telephone number format 06 18 34 56 78 is currently not de-identified with DEDUCE. I suspect this format is not uncommon, so it might be worth adding functionality for it to DEDUCE.

import deduce
text = u"De patient J. Jansen (e: [email protected], t: 06 18 34 56 78) is 64 jaar oud."
print(deduce.deidentify_annotations(deduce.annotate_text(text)))

De <PERSOON-1> (e: <URL-1>, t: 06 18 34 56 78) is <LEEFTIJD-1> jaar oud.

Interestingly, the first 4 numbers can be recognized as date:

import deduce
text = u"De patient J. Jansen (e: [email protected], t: 06 12 34 56 78) is 64 jaar oud."
print(deduce.deidentify_annotations(deduce.annotate_text(text)))

De <PERSOON-1> (e: <URL-1>, t: <DATUM-1> 56 78) is <LEEFTIJD-1> jaar oud.

Specific URL with two hypens, a whitespace and many characters breaks deduce

The following URL: https://keuzehulp‐vruchtbaarheidsbehoud‐ transmannen.nl takes about 500 seconds to be deidentified, for no particular reason. It's the first regexp in annotate_url that causes the delay. It's probably doing some absurd and epic search to match, but perhaps it can be optimized a little.

Simplest reproduction I found so far:

https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa a.nl

Improved detection of housenumber letters and housenumber additions

The logic for detecting housenumber letters (1a) and housenumber additions (1-bis) is not very complete, and Deduce thus not always correctly annotates the information following the housenumber. There seem to be no strict rules on what can follow a housenumber. Currently Deduce only detects a single letter (case insensitive) after a housenumber, optionally separated by a whitespace (so 1a, 1A, 1 a and 1 A). The information is usually not very privacy-sensitive, so not very high priority, but would still be nice to do it a bit cleaner.

Tokenizer incorrectly tokenizes interfixes with non-alpha characters

See:

from deduce.tokenizer import tokenize_split
print(tokenize_split("v.d. "))

Expected output: ['v.d.', ' ']
Real output: ['v.', 'd', '. ']

This makes that a name like v.d. Valk will not be detected.

Unknown variables/modules in utility.py

While attempting to create continuous integration, such that unit tests would be run automatically, I discovered that utility.py uses two keywords that are not imported anywhere. Those are unicodedata and unicode. These are not usually used, I guess, because we usually don't call utility.read_list with normalize='ascii'. But that raises an error in the "action", so it would lead to not being able to merge. I think we should fix this first.

Detection of Ph./Th. initials

Deduce does not recognize names that contain initials that are not one letter, like Ph. or Th.:

A.Ph. de Visser
A.Th. de Visser

These cases are fine though:

ABC de Visser
A.B.C. de Visser

Perhaps there are more/similar names that are initialized with the following h.

annotate_names_context

While working on the branch structured_annotations I came across the method deduce.annotate.annotate_names_context, but I can't find what it is used for. I've tried inputting texts in many formats, but I never found a case in which it did anything. Can you provide an example and/or tell me what the expected behaviour of this method is?

Add the possibilty to use DEDUCE with multiple data sources and formats using a kind of adapter

Currently DEDUCE can only accept data from a webservice REST-API (Flask based). Ideally a level of indirection should be added to decouple the data-source to make DEDUCE adaptable to different input sources or formats (e.g. CSV-file, TSV-file, DB-connector (ODBC / JDBC), SOAP). This is known the Enterprise Service Bus terms as an adapter

test1

Incorrect handling of multiple 4-digit mg dosages

This text:

text = "Patient krijgt 1000mg van het middel, tot 1250mg verhogen"

Gets annotated (annotate_postalcode) to:

Patient krijgt 1000mg> van het middel, tot <LOCATIE 1250mg verhogen

Which, since 1.0.4 raises a:

ValueError: Incorrectly formatted string

Bug has probably been around for much longer, but only came to light now. Fixing should be straightforward, working on it.

Improve age detection with pseudo triggers

There are still lots of false positives regarding ages, like:

sinds 5 jaar
ongeveer 5 jaar
< 5 jaar

Unfortunately, the current regexp method is limited, as creating many negative lookbehinds is not scalable and also very inefficient. An improvement that takes into account negative/pseudotriggers before or after the age number would be a nice improvement.

Extend list of medical terms

Deduce includes a list of medical terms, but it's quite short. It would be nice to extend it with some more terms.

Muscles
Bones
Organs
Diseases
Nerves
Arteries
Veins (?)
...?

Some specific examples of false positives:

A. Femoralis
N. Fibularis
M. Hodgkin

Multi-token medical terms

Currently for annotation purposes DEDUCE only supports single token medical terms. Multi-token medical terms could also be added. Possilby this is already possible.

Fuzzy detection / preprocessing of institution names

From #115, Deduce includes a long list of healthcare institutions. There is however a mismatch between the names of the institutions on the list, and the actual name that is written in text. For example, an institution like 'De Binnentuin, zorgboerderij en dagbesteding' would amost certainly be written as 'De Binnentuin'. For hospitals, some optimizations have been done already, but for non-hospitals (healthcare_institutions.txt) there are probably some more optimizations to be done, without impacting performance and false positives too much.

Addition of annotations for long numbers conforming or not to a specific format

In medical texts tokens occur of long numbers such as the Dutch Social Number (Burger Service Number), BIG, AGB, bankaccounts. Some of these adhere to a specific format (e.g. IBAN) and/or to a specific mathematical algoritme (e.g. BSN and the 11-proof).
To enhance the 'readablity' of anonymized texts number conforming to a format, could be annotated with a semantic meaningful annotation like [BSN] or [BANKACCOUTNUMBER]. On the other hand numbers not conforming to a format but could be annoated with a general annotation like [LONGNUMBER].

Stop detecting names in eponymous diseases (false positives)

E.g.:

Ziekte van Lyme

Here's a list in English, would be nice to incorporate this somewhere in logic or filter them out of lookup lists: https://en.wikipedia.org/wiki/List_of_eponymous_diseases

Detection of names in reversed order

An uncommon pattern, but it does exist:

de Vries, B.
de Vries, Berend
Vries, de

Add CONTRIBUTE.md to make contribution more accessible.

Adding a contribute.md is always a good option as it

Makes clear what the maintainer is looking for.
Promotes a uniformity in the code
Helps manage things better and efficiently.

Lemme know if I can work on it

Can obtain nested tags

If you input this text into Deduce:
'ADHD Adres: Naamlaan 100 Woonplaats: 3512AB Apeldoorn Tel: 088-1234567'
and run deduce.annotate_text, you will obtain:
'ADHD <PERSOON Adres: <LOCATIE Naamlaan 100> Woonplaats: 3512AB Apeldoorn Tel>: <TELEFOONNUMMER 088-1234567>'
which includes nested tags. Obviously there is a problem whereby the entired string from "Adres" to "Tel" is being detected as a person's name. However, the problem I'm pointing out here is that, having detected that, it then finds a LOCATIE tag within the PERSOON tag, which means that the final output contains nested tags, which should not be allowed.

This should be fixed easily by moving the flatten_text call, currently happening within the "names" deidentification, to the very end of the annotate_text method, right before returning the final text. Do you agree with this?

Detect multi-token names

E.g.:

Jan-Willem
Bruins Slot

Are currently never detected, at least not from the lookup lists, as they only match single tokens.

Update/extend name lookup lists

It would be nice to update the lists of first names and surnames with current data, as the existing lists were compiled 5-6 years ago. Perhaps there are new/better lists available.

Change residence annotation to a lookup list

Currently the residence (Street name and house number) is annotated using a regular expression according the Dutch convention (Dorpstraat 2) and with a postfix of common Dutch locations; such as -straat, -hof or -laan. Note however that there are a large number of residences in the Netherlands which do not match that pattern e.g. Nijenheim 22-20. Moreover foreign (ie. non-Dutch) residences are not found using such a regurlar expression e.g. '221b Baker Street '.
A proposed solution is to replace the reg-exp with a lookup list. This list can consist of the intersection of a list generated from an EHR (Electronic Heath Record) system and a list of all street names in the geographical region where patients live (e.g. the Netherlands or Europe).
A possible source of Dutch street names is the Kadaster - Basisregistratie Adressen en Gebouwen (BAG) or PostNL

Reliable detection of day-month combinations

In #92 we removed detection of day-month combinations, and only detect combinations of day, month and year as date. Day-month combinations may still be considered PHI in some cases, however, detection is harder than a simple regexp. See e.g. these real-world examples:

Patient werd op 5/5 opgenomen
Patient werd op 2.5 opgenomen

versus:

Motoriek:  Deltoideus 5/5; Biceps 5/5
Start met bisoprolol 2.5 mg

Perhaps a different rule-based approach can be used here?

Detection of IBAN numbers

IBAN numbers are not currently detected by Deduce. It has been reported they sometimes do occur in text, so this might be a useful example. I have not encountered them in any dataset I've worked with, so hoping somebody who does encounter is willing to pick it up.

Can I use the same implementation with some tweaks for different languages ?

Can I use the same implementation with some tweaks for different languages by changing the static data it uses and regex according to the data?
For example for the English language.

Multiple consecutive spaces in patient identifers causes an IndexError

When there are multiple consecutive spaces present in the patient identifiers (e.g. patient_first_names, patient_last_names) then an
IndexError occurs duing the processing; see the stacktrace down below. The error occured with the input "Peter Fiver" (with two spaces) as patient first names

self = <deduce.pattern.name_patient.PersonInitialFromNamePattern object at 0x000001A6664C1AC0>
token = Token(text='Peter', start_char=0, end_char=5)
metadata = <docdeid.document.MetaData object at 0x000001A667603340>
def match(self, token: dd.Token, metadata: dd.MetaData) -> Optional[tuple[dd.Token, dd.Token]]:

for _, first_name in enumerate(metadata["patient"].first_names):

if str_match(token.text, first_name[0]):
E IndexError: string index out of range
..\venv\lib\site-packages\deduce\pattern\name_patient.py:47: IndexError

Long-term repository improvement plan

Recently, interest in DEDUCE is increasing: I get more messages/requests/questions, downloads are increasing, the repository has become somewhat alive. This must mean more people are using clinical text data, which is nice news.

However, DEDUCE was created a long time ago with a specific problem in mind, by someone without much in depth python/dev experience (me). I now find the implementation of DEDUCE lacking in several aspects, and this in turn prevents making improvements to the algorithm itself. I intend to start fixing that, first improving just the implementation, and later potentially the algorithm too. Here I want to write out some of my ideas for improvements in the (near) future.

Roughly I'm thinking of the following implementation changes (in this order), keeping the algorithm the same:

After that, we can start creating more building blocks (e.g. for new PHI categories we find interesting), and improving the building blocks we have, and by that make DEDUCE a more useful for cases outside Psychiatry. Ideally we would have a bit of an ecosystem developing around the rule-based deidentification problem, perhaps also outside medicine and in other languages. But, this is really still stuff for the future.

This is all a bit stream of thought -- but please don't hesitate to add any comments or ideas! In a later stage I will try to structure and plan this a bit more in separate issues.

bug: dates at the ends of texts not matched if there are no punctuation symbols afterwards

Try entering 'Vandaag is het 14-05' and 'Vandaag is het 14-05.' on tdslab.com/deduce
You obtain 'Vandaag is het 14-05' and 'Vandaag is het ', respectively

Structured annotations: check for text match

The method deduce.annotate_text_structured returns structured annotations corresponding to the annotations embedded in the original text. It does so by reconstructing the original text from the annotated one, and keeping track of at which position the original text in the annotation would be placed. For example, in the text "I'm in love with Jane", annotated as "I'm in love with ", the method finds the tag "", then computes the index of the opening angular bracket, and assumes that the word "Jane" in the original text occurs at the same location.

The trouble is that Deduce sometimes (accidentally) inserts or deletes a blank space, which messes with the indexing of the annotations in the original text. To mitigate this, we could access the original text to make sure that the annotation text matches what is present in the original text at that index.

bug: IndexError: list index out of range for annotate_text with input string containing a URL with specific type of date

Using the string below as input for annotate_text gives me an IndexError: list index out of range

import deduce
text_input = "https://www.rtl.nl/components/financien/rtlz/2004/06_juni/crucell_sars.pdf"
deduce.annotate_text(text_input)

The strings "https://www.rtl.nl/components/financien/rtlz/2004/crucell_sars.pdf"
and "https://www.rtl.nl/components/financien/rtlz/06_juni/crucell_sars.pdf" seem to work however.

vmenger / deduce Goto Github PK

deduce's People

Contributors

Stargazers

Watchers

Forkers

deduce's Issues

Recommend Projects

Recommend Topics

Recommend Org