vmenger / deduce

Deduce: de-identification method for Dutch medical text

License: GNU General Public License v3.0

Python 99.67% Makefile 0.33%
deidentification dutch dutch-clinical-nlp information-extraction nlp python python-library text-mining text-processing

deduce's People

Contributors: dependabot[bot], j535d165, jantrienes, pablomosuu, sandertan, vmenger

deduce's Issues

Telephone number format not de-identified

The telephone number format 06 18 34 56 78 is currently not de-identified with DEDUCE. I suspect this format is not uncommon, so it might be worth adding functionality for it to DEDUCE.

import deduce
text = u"De patient J. Jansen (e: [email protected], t: 06 18 34 56 78) is 64 jaar oud."
print(deduce.deidentify_annotations(deduce.annotate_text(text)))

De <PERSOON-1> (e: <URL-1>, t: 06 18 34 56 78) is <LEEFTIJD-1> jaar oud.

Interestingly, the first 4 numbers can be recognized as date:

import deduce
text = u"De patient J. Jansen (e: [email protected], t: 06 12 34 56 78) is 64 jaar oud."
print(deduce.deidentify_annotations(deduce.annotate_text(text)))

De <PERSOON-1> (e: <URL-1>, t: <DATUM-1> 56 78) is <LEEFTIJD-1> jaar oud.
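A pattern along these lines could catch grouped mobile numbers. The regex below is purely illustrative, not Deduce's actual implementation:

```python
import re

# Illustrative pattern (not Deduce's actual regex): Dutch mobile numbers written
# either as one run of digits (0612345678) or grouped in pairs (06 18 34 56 78).
PHONE_RE = re.compile(r"(?<!\d)0\d(?:[ -]?\d{2}){4}(?!\d)")

def find_phone_numbers(text: str) -> list[str]:
    """Return all candidate phone numbers found in the text."""
    return [m.group(0) for m in PHONE_RE.finditer(text)]

print(find_phone_numbers("t: 06 18 34 56 78, of 0612345678"))
# -> ['06 18 34 56 78', '0612345678']
```

The lookarounds prevent matching inside longer digit runs, so dosages like 1000mg are left alone.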

Specific URL with two hyphens, a space and many characters breaks deduce

The following URL: https://keuzehulp‐vruchtbaarheidsbehoud‐ transmannen.nl takes about 500 seconds to de-identify. It's the first regexp in annotate_url that causes the delay, most likely through catastrophic backtracking, so perhaps it can be optimized a little.

Simplest reproduction I found so far:

https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa a.nl
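The hang is consistent with catastrophic backtracking: when a pattern contains nested, ambiguous quantifiers, a near-miss input forces the engine to try exponentially many ways of splitting the text. A sketch of a URL pattern shaped to avoid that ambiguity (this is not Deduce's actual annotate_url regex):

```python
import re
import time

# Hypothetical URL pattern, NOT Deduce's actual regex: each repetition of the
# domain-label group must consume a literal dot, so there is only one way to
# split the input and the engine cannot backtrack exponentially.
SAFE_URL_RE = re.compile(r"https?://[\w-]+(?:\.[\w-]+)+(?:/\S*)?")

text = "https://aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa a.nl"

start = time.perf_counter()
match = SAFE_URL_RE.search(text)  # no match here, but it fails fast
elapsed = time.perf_counter() - start

print(match, f"{elapsed:.4f}s")
```

On the repro string this returns immediately, while still matching ordinary URLs such as https://example.nl/page.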

Improved detection of housenumber letters and housenumber additions

The logic for detecting housenumber letters (1a) and housenumber additions (1-bis) is incomplete, so Deduce does not always correctly annotate the information following the housenumber. There seem to be no strict rules on what can follow a housenumber. Currently Deduce only detects a single letter (case-insensitive) after a housenumber, optionally separated by whitespace (so 1a, 1A, 1 a and 1 A). This information is usually not very privacy-sensitive, so this is not high priority, but it would still be nice to handle it a bit more cleanly.
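A broader pattern could cover single or double letters plus some common additions. The set of additions below is an assumption about what occurs in practice, not an exhaustive rule:

```python
import re

# Illustrative pattern (an assumption, not Deduce's implementation): a house
# number optionally followed by a letter ("1a", "1 A"), a dash-separated
# addition ("1-bis", "22-20"), or a slash-separated one ("10/2").
HOUSENUMBER_RE = re.compile(
    r"\b(\d{1,4})"              # the house number itself
    r"(?:\s?[A-Za-z]{1,2}"      # housenumber letter(s)
    r"|[-/]\s?\w{1,4})?"        # addition such as -bis, -20, /2
    r"\b"
)

for example in ["1a", "1 A", "1-bis", "22-20", "10/2"]:
    print(example, "->", HOUSENUMBER_RE.match(example).group(0))
```

A pattern this broad would of course also swallow things like "1000mg", so in practice it would need the same context checks as the rest of the address logic.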

Unknown variables/modules in utility.py

While setting up continuous integration, so that unit tests run automatically, I discovered that utility.py uses two names that are never imported or defined: unicodedata and unicode (the latter is a Python 2 builtin that no longer exists in Python 3). This usually goes unnoticed because we rarely call utility.read_list with normalize='ascii'. But it raises an error in the CI action, which would block merging, so I think we should fix this first.
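For reference, on Python 3 the ascii normalization that utility.read_list presumably intends can be written with the stdlib unicodedata module and str alone, roughly:

```python
import unicodedata

def to_ascii(value: str) -> str:
    """Strip accents by decomposing to NFKD and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", value)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("Curaçao"))  # -> Curacao
```

This is a sketch of the intended behaviour, not necessarily a drop-in replacement for the existing code path.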

Detection of Ph./Th. initials

Deduce does not recognize names that contain initials that are not one letter, like Ph. or Th.:

A.Ph. de Visser
A.Th. de Visser

These cases are fine though:

ABC de Visser
A.B.C. de Visser

Perhaps there are more similar names with initials of this form, i.e. a capital letter followed by a lowercase h.
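A pattern that treats an uppercase letter optionally followed by one lowercase letter as a single initial would cover these cases. This regex is illustrative only, not Deduce's actual tokenization:

```python
import re

# Illustrative (not Deduce's actual pattern): one or more initials, where an
# initial is an uppercase letter optionally followed by a lowercase letter
# (covering Ph., Th., and e.g. Ch.), each terminated by a period.
INITIALS_RE = re.compile(r"(?:[A-Z][a-z]?\.)+")

for name in ["A.Ph. de Visser", "A.Th. de Visser", "A.B.C. de Visser"]:
    print(INITIALS_RE.match(name).group(0))
```

Allowing any lowercase letter (rather than just h) may be too permissive; restricting to [h] is the safer variant if false positives appear.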

annotate_names_context

While working on the branch structured_annotations I came across the method deduce.annotate.annotate_names_context, but I can't find what it is used for. I've tried inputting texts in many formats, but I never found a case in which it did anything. Can you provide an example and/or tell me what the expected behaviour of this method is?

Incorrect handling of multiple 4-digit mg dosages

This text:

text = "Patient krijgt 1000mg van het middel, tot 1250mg verhogen"

Gets annotated (annotate_postalcode) to:

Patient krijgt 1000mg> van het middel, tot <LOCATIE 1250mg verhogen

Which, since 1.0.4, raises a:

ValueError: Incorrectly formatted string

The bug has probably been around much longer, but only came to light now. The fix should be straightforward; working on it.

Improve age detection with pseudo triggers

There are still lots of false positives regarding ages, like:

sinds 5 jaar
ongeveer 5 jaar
< 5 jaar

Unfortunately, the current regexp method is limited, as creating many negative lookbehinds is not scalable and also very inefficient. An improvement that takes into account negative/pseudotriggers before or after the age number would be a nice improvement.
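One possible shape: first find age candidates with a simple regexp, then veto a candidate when a pseudo trigger precedes the number. The trigger list here is a guess for illustration; the real list would need to be curated from data:

```python
import re

# Assumed pseudo triggers (illustrative, not a curated list).
PSEUDO_TRIGGERS = {"sinds", "ongeveer", "<", ">", "al", "ruim"}

AGE_RE = re.compile(r"(\d{1,3}) jaar")

def find_ages(text: str) -> list[str]:
    """Return age candidates whose preceding token is not a pseudo trigger."""
    ages = []
    for m in AGE_RE.finditer(text):
        preceding = text[: m.start()].split()
        if preceding and preceding[-1].lower() in PSEUDO_TRIGGERS:
            continue  # e.g. "sinds 5 jaar" is a duration, not an age
        ages.append(m.group(1))
    return ages

print(find_ages("Patient is 64 jaar oud, klachten sinds 5 jaar"))  # -> ['64']
```

Checking a token window instead of writing lookbehinds keeps the trigger list easy to extend without regex blow-up.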

Extend list of medical terms

Deduce includes a list of medical terms, but it's quite short. It would be nice to extend it with some more terms.

  • Muscles
  • Bones
  • Organs
  • Diseases
  • Nerves
  • Arteries
  • Veins (?)
  • ...?

Some specific examples of false positives:

A. Femoralis
N. Fibularis
M. Hodgkin

Multi-token medical terms

Currently, for annotation purposes, DEDUCE only supports single-token medical terms. Multi-token medical terms could also be added. Possibly this is already possible.
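If not, a greedy longest-match over the token stream would be one way to support it. The term set below is purely illustrative:

```python
# Example multi-token lookup (illustrative terms): store each term as a tuple
# of lowercase tokens, and match greedily from left to right, preferring the
# longest term starting at each position.
TERMS = {("ziekte", "van", "hodgkin"), ("morbus", "hodgkin"), ("femur",)}
MAX_LEN = max(len(t) for t in TERMS)

def match_terms(tokens: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) token spans of matched terms."""
    spans = []
    i = 0
    while i < len(tokens):
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i : i + length]) in TERMS:
                spans.append((i, i + length))
                i += length
                break
        else:
            i += 1
    return spans

tokens = "patient met ziekte van Hodgkin".split()
print(match_terms(tokens))  # -> [(2, 5)]
```

Capping the lookahead at the longest term in the list keeps this linear in practice even for large term lists.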

Fuzzy detection / preprocessing of institution names

From #115, Deduce includes a long list of healthcare institutions. There is however a mismatch between the names of the institutions on the list and the actual names written in text. For example, an institution like 'De Binnentuin, zorgboerderij en dagbesteding' would almost certainly be written as 'De Binnentuin'. For hospitals, some optimizations have been done already, but for non-hospitals (healthcare_institutions.txt) there are probably some more optimizations to be done, without impacting performance and false positives too much.

Add annotations for long numbers that do or do not conform to a specific format

In medical texts, long numbers occur such as the Dutch Burgerservicenummer (BSN), BIG and AGB registration numbers, and bank account numbers. Some of these adhere to a specific format (e.g. IBAN) and/or to a specific mathematical check (e.g. the BSN and the 11-proof).
To enhance the readability of anonymized texts, numbers conforming to a known format could be annotated with a semantically meaningful annotation like [BSN] or [BANKACCOUNTNUMBER]. On the other hand, numbers not conforming to any known format could be annotated with a general annotation like [LONGNUMBER].
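For the BSN, the 11-proof is straightforward to check; a sketch:

```python
def is_valid_bsn(number: str) -> bool:
    """BSN 11-proof: weighted digit sum, with weight -1 for the last digit,
    must be divisible by 11 (and nonzero, to exclude the all-zero string)."""
    if len(number) != 9 or not number.isdigit():
        return False
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(w * int(d) for w, d in zip(weights, number))
    return total > 0 and total % 11 == 0

print(is_valid_bsn("111222333"))  # -> True (a commonly used test BSN)
print(is_valid_bsn("111222334"))  # -> False
```

Running such a check on 9-digit candidates before tagging them [BSN] would keep false positives low.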

Can obtain nested tags

If you input this text into Deduce:
'ADHD Adres: Naamlaan 100 Woonplaats: 3512AB Apeldoorn Tel: 088-1234567'
and run deduce.annotate_text, you will obtain:
'ADHD <PERSOON Adres: <LOCATIE Naamlaan 100> Woonplaats: 3512AB Apeldoorn Tel>: <TELEFOONNUMMER 088-1234567>'
which includes nested tags. Obviously there is a problem whereby the entire string from "Adres" to "Tel" is being detected as a person's name. However, the problem I'm pointing out here is that, having detected that, Deduce then finds a LOCATIE tag within the PERSOON tag, so the final output contains nested tags, which should not be allowed.

This should be fixed easily by moving the flatten_text call, currently happening within the "names" deidentification, to the very end of the annotate_text method, right before returning the final text. Do you agree with this?

Detect multi-token names

E.g.:

Jan-Willem
Bruins Slot

Are currently never detected, at least not from the lookup lists, as they only match single tokens.
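One way to make the lookup lists phrase-aware without scanning every n-gram is a small prefix index: only attempt a multi-token match when the first token is a known prefix. The names below are illustrative, and "Jan-Willem" matching as one token assumes the tokenizer keeps hyphenated words together:

```python
# Illustrative lookup list containing multi-token names, lowercase.
NAMES = {("jan-willem",), ("bruins", "slot"), ("de", "vries")}

# Index: first token of each name -> the full tuples starting with it.
PREFIX: dict[str, list[tuple[str, ...]]] = {}
for name in NAMES:
    PREFIX.setdefault(name[0], []).append(name)

def find_names(tokens: list[str]) -> list[str]:
    """Return the surface forms of all lookup-list names found in tokens."""
    found = []
    lowered = [t.lower() for t in tokens]
    for i, tok in enumerate(lowered):
        for name in PREFIX.get(tok, []):
            if tuple(lowered[i : i + len(name)]) == name:
                found.append(" ".join(tokens[i : i + len(name)]))
    return found

print(find_names("De heer Bruins Slot en Jan-Willem".split()))
# -> ['Bruins Slot', 'Jan-Willem']
```

The per-token cost stays close to a plain set lookup: the inner loop only runs for tokens that actually start a known multi-token name.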

Update/extend name lookup lists

It would be nice to update the lists of first names and surnames with current data, as the existing lists were compiled 5-6 years ago. Perhaps there are new/better lists available.

Change residence annotation to a lookup list

Currently the residence (street name and house number) is annotated using a regular expression following the Dutch convention (Dorpstraat 2), with a suffix drawn from common Dutch location words such as -straat, -hof or -laan. Note however that a large number of residences in the Netherlands do not match that pattern, e.g. Nijenheim 22-20. Moreover, foreign (i.e. non-Dutch) addresses are not found by such a regular expression, e.g. '221b Baker Street'.
A proposed solution is to replace the regexp with a lookup list. This list can consist of the intersection of a list generated from an EHR (Electronic Health Record) system and a list of all street names in the geographical region where patients live (e.g. the Netherlands or Europe).
A possible source of Dutch street names is the Kadaster - Basisregistratie Adressen en Gebouwen (BAG) or PostNL.

Reliable detection of day-month combinations

In #92 we removed detection of day-month combinations, and only detect combinations of day, month and year as date. Day-month combinations may still be considered PHI in some cases, however, detection is harder than a simple regexp. See e.g. these real-world examples:

Patient werd op 5/5 opgenomen
Patient werd op 2.5 opgenomen

versus:

Motoriek:  Deltoideus 5/5; Biceps 5/5
Start met bisoprolol 2.5 mg

Perhaps a different rule-based approach can be used here?
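A context-sensitive rule could require a date-like trigger word before the number pair, plus a valid day/month range and no unit after it. The trigger list and the mg exclusion below are assumptions for illustration:

```python
import re

# Assumed date triggers: in the examples above "op" precedes the real dates,
# while "Deltoideus 5/5" and "bisoprolol 2.5 mg" lack such a trigger.
DAY_MONTH_RE = re.compile(
    r"\b(?:op|per|vanaf|sinds) (\d{1,2})[./-](\d{1,2})\b(?! ?mg)"
)

def find_day_month(text: str) -> list[str]:
    """Return day-month candidates with a trigger word and plausible ranges."""
    dates = []
    for m in DAY_MONTH_RE.finditer(text):
        day, month = int(m.group(1)), int(m.group(2))
        if 1 <= day <= 31 and 1 <= month <= 12:
            dates.append(text[m.start(1) : m.end(2)])
    return dates

print(find_day_month("Patient werd op 5/5 opgenomen"))         # -> ['5/5']
print(find_day_month("Motoriek: Deltoideus 5/5; Biceps 5/5"))  # -> []
print(find_day_month("Start met bisoprolol 2.5 mg"))           # -> []
```

This is precision-oriented: day-month pairs without a trigger word are missed, which may be an acceptable trade-off given how ambiguous these patterns are.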

Detection of IBAN numbers

IBAN numbers are not currently detected by Deduce. It has been reported that they sometimes occur in text, so detecting them might be useful. I have not encountered them in any dataset I've worked with, so I'm hoping somebody who does encounter them is willing to pick this up.

See also: #74
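IBAN validation is well-defined (the ISO 13616 mod-97 check), which would keep false positives low once a candidate string is found; a sketch:

```python
def is_valid_iban(candidate: str) -> bool:
    """ISO 13616 check: move the first 4 chars to the end, map letters to
    numbers (A=10 ... Z=35), and require the resulting integer mod 97 == 1."""
    iban = candidate.replace(" ", "").upper()
    if not (15 <= len(iban) <= 34) or not iban.isalnum():
        return False
    rearranged = iban[4:] + iban[:4]
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1

print(is_valid_iban("NL91 ABNA 0417 1643 00"))  # -> True (well-known example)
print(is_valid_iban("NL92 ABNA 0417 1643 00"))  # -> False
```

A per-country length table would tighten this further, but even the generic check rejects nearly all random digit/letter runs.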

Multiple consecutive spaces in patient identifiers cause an IndexError

When multiple consecutive spaces are present in the patient identifiers (e.g. patient_first_names, patient_last_names), an IndexError occurs during processing; see the stacktrace below. The error occurred with "Peter  Fiver" (with two spaces) as patient first names.

self = <deduce.pattern.name_patient.PersonInitialFromNamePattern object at 0x000001A6664C1AC0>
token = Token(text='Peter', start_char=0, end_char=5)
metadata = <docdeid.document.MetaData object at 0x000001A667603340>

    def match(self, token: dd.Token, metadata: dd.MetaData) -> Optional[tuple[dd.Token, dd.Token]]:
        for _, first_name in enumerate(metadata["patient"].first_names):
            if str_match(token.text, first_name[0]):
E           IndexError: string index out of range

..\venv\lib\site-packages\deduce\pattern\name_patient.py:47: IndexError
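The root cause is that splitting the name field on a single space yields empty strings, and first_name[0] then indexes into an empty string. A defensive fix is to drop empties when parsing the field (the helper name is hypothetical):

```python
def parse_names(raw: str) -> list[str]:
    """Split a raw name field into parts, dropping empties caused by
    consecutive (or leading/trailing) spaces."""
    # str.split() without an argument collapses any run of whitespace,
    # unlike raw.split(" ") which yields '' for each extra space.
    return raw.split()

names = parse_names("Peter  Fiver")
print(names)  # -> ['Peter', 'Fiver']
# Every entry is now non-empty, so taking name[0] (the initial) is safe.
print([name[0] for name in names])  # -> ['P', 'F']
```

Filtering at parse time fixes all downstream patterns at once, rather than guarding each first_name[0] access individually.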

Long-term repository improvement plan

Recently, interest in DEDUCE is increasing: I get more messages/requests/questions, downloads are increasing, the repository has become somewhat alive. This must mean more people are using clinical text data, which is nice news.

However, DEDUCE was created a long time ago with a specific problem in mind, by someone without much in-depth Python/dev experience (me). I now find the implementation of DEDUCE lacking in several respects, and this in turn prevents making improvements to the algorithm itself. I intend to start fixing that, first improving just the implementation, and later potentially the algorithm too. Here I want to write out some of my ideas for improvements in the (near) future.

Roughly I'm thinking of the following implementation changes (in this order), keeping the algorithm the same:

  • Making everything pep8 compliant, maybe use black formatting, and write some basic contributing guidelines #23
  • Refactor into a decent modular and object-oriented design (e.g. Annotator, Pipeline, Tokenizer, Deduce objects), with proper abstract base classes.
  • Split some of the rule-based deidentification logic into a separate meta-package, keeping the deduce-specific logic here. This should help greatly to provide building blocks for creating new/derived de-identification methods, for instance with more/other PHIs, without needing too much of a dev environment/background.
  • Adding proper unit testing, integration testing and coverage
  • Replacing the current in-text annotation logic with structured annotation logic. This one will be tricky to do without introducing changes to the algorithm, but we can go PHI by PHI.
  • Replacing the current de-identification logic based on the structured annotations. This one should be more doable, but will still require some thought.
  • Find out where there are still idiosyncrasies (like nested tags, annotation order) and other obscurities preventing future changes, and mitigating them.
  • Check if any speed optimizations are possible (plus perhaps built-in multiprocessing)
  • Add documentation and guidelines for documentation
  • Separate configuration in a separate file/step (e.g. like this)
  • Add some good tutorials with examples, that should allow people to make their own rule-based de-identification models with the blocks coming from DEDUCE and/or the new meta-package.
  • If possible, have a proper test on an (open) annotated dataset to monitor performance
  • Adding error handling and exceptions, including specification of what to do when an error occurs (raise warning / ignore / redact all text)
  • Do a comparison with nedap/deidentify on (1) a UMCU dataset and if possible (2) a 'neutral' dataset?

After that, we can start creating more building blocks (e.g. for new PHI categories we find interesting) and improving the ones we have, and by that make DEDUCE more useful for cases outside psychiatry. Ideally an ecosystem would develop around the rule-based de-identification problem, perhaps also outside medicine and in other languages. But this is really still stuff for the future.

This is all a bit stream of thought -- but please don't hesitate to add any comments or ideas! In a later stage I will try to structure and plan this a bit more in separate issues.

Structured annotations: check for text match

The method deduce.annotate_text_structured returns structured annotations corresponding to the annotations embedded in the original text. It does so by reconstructing the original text from the annotated one, and keeping track of the position at which the original text in the annotation would be placed. For example, in the text "I'm in love with Jane", annotated as "I'm in love with <PERSOON Jane>", the method finds the tag "<PERSOON Jane>", then computes the index of the opening angular bracket, and assumes that the word "Jane" in the original text occurs at the same location.

The trouble is that Deduce sometimes (accidentally) inserts or deletes a blank space, which messes with the indexing of the annotations in the original text. To mitigate this, we could access the original text to make sure that the annotation text matches what is present in the original text at that index.
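Concretely, the check could compare each annotation's text against the slice of the original at the computed offsets, and flag or drop mismatches. The (start, end, tag, text) annotation shape below is an assumption for illustration, not docdeid's actual API:

```python
# Hypothetical annotation shape: (start, end, tag, text); not docdeid's API.
Annotation = tuple[int, int, str, str]

def verify_annotations(original: str, annotations: list[Annotation]) -> list[Annotation]:
    """Return only the annotations whose text matches the original at the
    claimed position; a mismatch indicates a shifted index."""
    verified = []
    for start, end, tag, text in annotations:
        if original[start:end] == text:
            verified.append((start, end, tag, text))
    return verified

original = "I'm in love with Jane"
print(verify_annotations(original, [(17, 21, "PERSOON", "Jane"),
                                    (16, 20, "PERSOON", "Jane")]))
# -> [(17, 21, 'PERSOON', 'Jane')]
```

Instead of dropping a mismatch, one could also search a small window around the claimed index to repair off-by-one shifts caused by inserted or deleted spaces.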

bug: IndexError: list index out of range for annotate_text with input string containing a URL with specific type of date

Using the string below as input for annotate_text gives me an IndexError: list index out of range

import deduce
text_input = "https://www.rtl.nl/components/financien/rtlz/2004/06_juni/crucell_sars.pdf"
deduce.annotate_text(text_input)

The strings "https://www.rtl.nl/components/financien/rtlz/2004/crucell_sars.pdf"
and "https://www.rtl.nl/components/financien/rtlz/06_juni/crucell_sars.pdf" seem to work however.
