Pseudo-anonymization of French legal cases

Capture the domain knowledge embedded in the existing rule-based legal case pseudo-anonymization system and enhance it with machine learning algorithms.

The project builds a Named Entity Recognition (NER) training dataset and learns a model dedicated to French legal case anonymization by leveraging the existing rule-based system and adding noise and synthetic data.
It goes beyond the scope covered by the rule-based system, which was limited to addresses and natural person names.
The resulting model can be used in a pseudo-anonymization system.
The input format is the one produced by the rule-based system's skill cartridges.

Measures computed over manually annotated data show strong performance, in particular on the names of natural persons and legal professionals.

Scope

The only French legal cases massively acquired by Lefebvre Sarrut that are not pseudo-anonymized are those from the appeal courts (the Jurica database).
The input data are XML files from Jurica, as generated by the skill cartridges, covering the period 2008-2019.

The project focuses on finding mentions of entities and guessing their types.
It does not manage the pseudo-anonymization step itself, i.e. replacing the entities found in the previous step with another representation.

Challenges

Many state-of-the-art NER algorithms are available as open source projects.
Therefore, developing a new NER algorithm is not in the scope of this project.

The main focus of this work is to generate a large, high-quality training set that leverages all the knowledge captured in the existing rule-based system.
Learning on a dataset made only of rule outputs may just build a weak model that repeats these rules.
Therefore, we have included as many tricks as needed to catch or create more complex patterns.
With those, we have been able to produce a robust model, able to find many more entities than the initial rules!

The strategies used are listed below:

  • Skill cartridges
    • leveraging the extractions performed by the skill cartridges, which embed the many customizations and the domain knowledge of the Lefebvre Sarrut teams
  • Rules
    • using easy-to-describe patterns to catch some entities (with regexes)
    • finding other entities using dictionaries (e.g., city names)
  • Name extension
    • extending any discovered entity to neighboring words when it makes sense
      • done carefully, otherwise there is a risk of lowering the quality of the training set
  • Finding all occurrences of caught entities
    • looking for all occurrences of each entity already found in a document (a 2-pass process)
    • building dictionaries of frequent names over all documents and looking for them in each document (a 2-pass process)
  • Dataset augmentation (see the first sketch after this list)
    • creating variations of the discovered entities and searching for them
      • by removing the first or last name, changing the case of one or more words in the entity, removing key words (M., Mme, la société, ...), etc.
      • transformations are applied randomly (20% of entities are transformed)
      • this makes the model more robust to errors in the text
      • these variations cannot be discovered easily with patterns
        • e.g., changing the case is an easy way to avoid writing dedicated patterns for entities written in lower case
  • Miscellaneous tricks
    • removing from the train set all paragraphs containing no entity
      • paragraphs without entities may be due to too-simplistic patterns
    • applying priority rules over the source of each entity offset when there is a type conflict
      • some candidate generators are safer than others
      • a _1 suffix is added to the tag label when the source is safe; it is removed during the offset normalization step
    • looking for doubtful multi-word expression (MWE) candidates and declaring them as doubtful (see the second sketch below)
      • doubtful MWE candidates are any sequence of words starting with an upper-case letter
      • a filter is then applied to keep only those containing a first name (based on a dictionary)
      • no loss is computed on these entities, meaning they don't influence the model during training
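
To make the augmentation step concrete, here is a minimal sketch. The function names, the honorific list and the exact mix of transformations are assumptions for the example, not the project's actual code:

import random

# Hypothetical list of key words stripped during augmentation.
HONORIFICS = ("M. ", "Mme ", "la société ")

def make_variant(entity: str) -> str:
    choice = random.random()
    if choice < 0.33:
        # drop a leading key word such as "M." or "Mme"
        for prefix in HONORIFICS:
            if entity.startswith(prefix):
                return entity[len(prefix):]
        return entity
    if choice < 0.66:
        # lower-case one word to mimic case errors found in real decisions
        words = entity.split()
        i = random.randrange(len(words))
        words[i] = words[i].lower()
        return " ".join(words)
    # drop the first token (e.g. remove the first name)
    return entity.split(" ", 1)[-1]

def augment(entities, ratio=0.2):
    # keep the originals and add a variant for ~20% of them
    variants = [make_variant(e) for e in entities if random.random() < ratio]
    return entities + variants

print(augment(["Mme Martine DUPONT", "la société ACME"], ratio=1.0))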

The purpose of the ML layer is to smooth the rules and the other tricks, making the whole system much more robust to hard-to-catch entities. Data augmentation in particular has proved to be very effective.
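
To illustrate the doubtful MWE trick, here is a second minimal sketch; the regex, the tiny first-name set and the overlap test are assumptions for the example, not the project's actual code:

import re

# In the project, first names come from resources/first_names;
# this tiny set is only for the example.
FIRST_NAMES = {"martine", "paul"}

# one or more consecutive words starting with an upper-case letter
MWE_PATTERN = re.compile(r"\b[A-ZÉÈÀ][\w'-]*(?:\s+[A-ZÉÈÀ][\w'-]*)*")

def doubtful_spans(text, known_offsets):
    spans = []
    for match in MWE_PATTERN.finditer(text):
        start, end = match.span()
        # skip candidates already covered by a safer generator
        if any(s <= start < e or s < end <= e for s, e in known_offsets):
            continue
        # keep only candidates containing a known first name
        if any(w.lower() in FIRST_NAMES for w in match.group().split()):
            spans.append((start, end, "UNKNOWN"))
    return spans

text = "Le conseil de Martine Durand a conclu au rejet."
print(doubtful_spans(text, known_offsets=[]))  # [(14, 28, 'UNKNOWN')]

Spans tagged UNKNOWN are then encoded as missing labels in the training data, so the model receives no gradient for them.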

Recognized entity types

The original rule-based system only managed the PERS, ADDRESS and RG types.

  • Persons:
    • PERS: natural persons (includes the first name, unlike the skill cartridges), source: skill cartridges + name extension + other occurrences
    • ORGANIZATION: organizations, source: skill cartridges + rules + name extension + other occurrences
    • PHONE_NUMBER: phone numbers, source: rules
    • LICENCE_PLATE: licence plate numbers, source: rules
  • Lawyers:
    • LAWYER: lawyers, source: rules + other occurrences
    • BAR: the bar where a lawyer is registered (not handled by the skill cartridges), source: rules + other occurrences
  • Courts:
    • COURT: names of French courts, source: rules + other occurrences
    • JUDGE_CLERK: judges and court clerks, source: rules + other occurrences
  • Miscellaneous:
    • ADDRESS: addresses (poorly handled by the skill cartridges), source: rules + other occurrences + dictionary
      • there is no way to always guess whether the address owner is a PERS or an ORGANIZATION, so this distinction is not managed
    • DATE: any date, in digits or letters, source: rules + other occurrences
    • RG: ID of the legal case, source: skill cartridges + rules
    • UNKNOWN: train set only; indicates that no loss should be applied to the word, whatever the prediction is, source: rules + dictionary

Dataset augmentation and the miscellaneous tricks above have been applied to each type.
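
For reference, annotations of this kind are typically represented as character offsets in spaCy's training format; the sentence below is invented and the span conventions are only illustrative:

# Hypothetical training example (offsets computed for this exact string).
TRAIN_DATA = [
    (
        "Mme Martine DUPONT, représentée par Me Paul MARTIN",
        {"entities": [(4, 18, "PERS"), (39, 50, "LAWYER")]},
    ),
]

text, annotations = TRAIN_DATA[0]
for start, end, label in annotations["entities"]:
    print(label, "->", text[start:end])
# PERS -> Martine DUPONT
# LAWYER -> Paul MARTIN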

Model

The main NER model is the one from the spaCy library and is best described in this video.

Basically, it is a CNN over hashed embeddings (the hashing trick, Bloom filter style) with a learn-to-search (L2S) layer on top.
The L2S part is very similar to a classical transition-based dependency parsing algorithm (a stack plus actions).
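
The hashing trick can be summarized in a few lines of Python; this is an illustrative sketch, not spaCy's actual implementation (which lives in thinc and hashes several token attributes, not just the raw string):

import numpy as np

NUM_ROWS, DIM, NUM_HASHES = 10_000, 96, 4
table = np.random.randn(NUM_ROWS, DIM).astype("float32")

def embed(word: str) -> np.ndarray:
    # Hash the same word with several seeds and sum the rows: two words may
    # collide on one hash but rarely on all of them, so a small fixed table
    # can host an unbounded vocabulary. (Python's built-in hash is salted
    # per process; a real system would use a stable hash function.)
    rows = [hash((seed, word)) % NUM_ROWS for seed in range(NUM_HASHES)]
    return table[rows].sum(axis=0)

print(embed("tribunal").shape)  # (96,)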

Advantages of the spaCy approach:

  • no manual feature extraction (done by spaCy: suffix and prefix, 3 letters each, plus the word shape)
  • quite fast on CPU (eases deployment)
  • low memory footprint (eases deployment)
  • off-the-shelf algorithm (documented, maintained, large community, etc.)
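
As a usage illustration, here is a minimal training loop written against the spaCy v2-style API, which matches the period of this project (spaCy v3 replaced it with Example objects); the project's actual entry point is make train, with its settings in resources/config.ini:

import random
import spacy

TRAIN_DATA = [
    ("Mme Martine DUPONT", {"entities": [(4, 18, "PERS")]}),
]

nlp = spacy.blank("fr")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for epoch in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, ann in TRAIN_DATA:
        # drop=0.35 is an illustrative dropout value
        nlp.update([text], [ann], drop=0.35, sgd=optimizer, losses=losses)
    print(epoch, losses)

nlp.to_disk("resources/model")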

The project is fully written in Python and cannot be rewritten in another language, because spaCy only exists for Python.

Resources

No language-related resources are used, apart from a few open data dictionaries:

  • a dictionary of French first names (open data)
  • a dictionary of French postal codes and cities (open data)

Both resources are stored in the Git repository (resources/ folder).
Neither is critical to the success of the learning, but they provide a little help.

Data and model paths

Paths listed below can be modified in the config file resources/config.ini.

XML

  • Cases have to be provided as XML files in the format used by the skill cartridges (an example is provided in the resources folder).
  • One XML file represents one week of legal cases.
  • XML files should be put in the folder resources/training_data/.
  • The case used for inference has to be placed in resources/dev_data/.
  • The folder resources/test/ contains an XML file used by the unit tests.

Other resources

  • Resources are to be put in the folders resources/courts, resources/postal_codes and resources/first_names.

Model

  • The folder resources/model/ will contain the spaCy model.

Commands to use the code

This project uses a Python virtual environment to manage dependencies without interfering with those installed on the machine.
pip3 and python3 are the only requirements.
To set up a virtual environment on the machine, install virtualenv from pip3 and install the project dependencies (from the requirements.txt file).

These steps are scripted in the Makefile (tested only on Ubuntu) and can be performed with the following command:

make setup

The variable VIRT_ENV_FOLDER can be changed in the Makefile to choose where the Python dependencies are installed.

Then you can use the project by running one of the following actions:

  • train a model:
make train
  • find and export frequent entities (these entities are caught across all documents during training set creation):
make export_frequent_entities
  • display the entities found by the spaCy model:
make show_spacy_entities
  • display the entities found by the rule-based system:
make show_rule_based_entities
  • view differences between spaCy and the skill cartridges (only for shared entity types):
make list_differences
  • run the unit tests:
make test

Most of the project configuration is done in the resources/config.ini file.

Setup PyCharm

To run the tests from PyCharm, you need to create a pytest run configuration.
By default, the (implicit) working directory is the test folder.
It has to be explicitly set to the project root folder.

License

This project is licensed under the Apache 2.0 License (see the LICENSE file in the root directory).
