awsaf49 / pii-data-detection


The Learning Agency Lab - PII Data Detection || Develop automated techniques to detect and remove PII from educational data.

Home Page: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/

License: MIT License

Jupyter Notebook 100.00%
jax kaggle keras keras-nlp named-entity-recognition nlp pytorch scratch-implementation tensorflow token-classification

This starter notebook is provided by the Keras team.

PII Data Detection with KerasNLP and Keras

The objective of this competition is to detect and remove personally identifiable information (PII) from student writing.

PII Data Detection

The task in this competition is Token Classification (not Text Classification!), sometimes known as Named Entity Recognition (NER). This notebook walks through the task from scratch. That is what sets it apart: most public notebooks rely on HuggingFace for modeling and data processing, which handles many steps under the hood, so you have to dig into the library itself to understand what is happening. This notebook instead goes step by step and shows exactly how Token Classification works. A cherry on top: it uses Mixed Precision and Distributed (multi-GPU) Training/Inference to turbocharge performance!

🔗 Notebook (Train + Inference): PII Data Detection: KerasNLP Starter Notebook. You can also find it in the /notebooks folder of this repository.

Fun fact: This notebook is backend-agnostic, supporting TensorFlow, PyTorch, and JAX. Utilizing KerasNLP and Keras allows us to choose our preferred backend. Explore more details on Keras.

In this notebook, you will learn how to:

  • Design a data pipeline for token classification.
  • Create a model for token classification with KerasNLP.
  • Load the data efficiently using tf.data.
  • Perform Mixed Precision and Distributed Training/Inference with Keras 3.
  • Make a submission on the test data.

Note: For a more in-depth understanding of KerasNLP, refer to the KerasNLP guides.

Data

The competition dataset contains $22,000$ student essays, of which $70\%$ are reserved for testing, leaving $30\%$ for training and validation.

Data Overview:

  • All essays were written in response to the same prompt, applying course material to a real-world problem.
  • The dataset includes 7 types of PII: NAME_STUDENT, EMAIL, USERNAME, ID_NUM, PHONE_NUM, URL_PERSONAL, STREET_ADDRESS.
  • Labels are given in BIO (Beginning, Inside, Outside) format.
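
To make the label space concrete, here is a minimal sketch of how the seven PII types above could be expanded into BIO classes and mapped to integer IDs for a token classifier. The names `PII_TYPES`, `build_label_maps`, `label2id`, and `id2label` are illustrative assumptions, not taken from the notebook, and the ordering of IDs is arbitrary. The sketch also assumes every type gets both a `B-` and an `I-` tag, which the real data may not fully use.

```python
# The seven PII types listed in the data overview.
PII_TYPES = [
    "NAME_STUDENT", "EMAIL", "USERNAME", "ID_NUM",
    "PHONE_NUM", "URL_PERSONAL", "STREET_ADDRESS",
]


def build_label_maps(pii_types):
    """Return (label2id, id2label) for a BIO tagging scheme.

    Hypothetical helper: "O" gets ID 0, then each type gets a B- and an I- tag.
    """
    labels = ["O"]
    for pii in pii_types:
        labels.append(f"B-{pii}")
        labels.append(f"I-{pii}")
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(PII_TYPES)
print(len(label2id))  # 15 classes: 1 "O" + 7 types x 2 prefixes
```

A model's output layer would then need one unit per entry in `label2id`, and `id2label` turns predicted class IDs back into tag strings.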

Example of BIO format label:

Consider the sentence: "The email address of Michael Jordan is [email protected]". In BIO format, the personally identifiable information (PII) in this sentence is labeled as follows:

| Word  | The | email | address | of | Michael        | Jordan         | is | [email protected] |
|-------|-----|-------|---------|----|----------------|----------------|----|-------------------|
| Label | O   | O     | O       | O  | B-NAME_STUDENT | I-NAME_STUDENT | O  | B-EMAIL           |

In the example above, B- marks the first token of a PII entity, I- marks a subsequent token inside a multi-token PII entity, and O marks tokens that do not belong to any PII entity.
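
The tagging scheme above can also be read in reverse: given tokens and their BIO labels, adjacent B-/I- tags of the same type can be grouped back into entities. The following is a minimal sketch of that grouping, not the notebook's own post-processing. `bio_to_spans` is a hypothetical helper, the email address is a made-up placeholder (the original is redacted), and treating a stray I- tag as the start of a new entity is a design choice of this sketch.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-tagged tokens into (type, start, end, text) tuples.

    `start` is inclusive and `end` is exclusive, both as token indices.
    """
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        # An I- tag of the same type extends the entity that is already open.
        if label.startswith("I-") and current == label[2:]:
            continue
        # Anything else closes the open entity, if there is one.
        if current is not None:
            spans.append((current, start, i, " ".join(tokens[start:i])))
            start, current = None, None
        # A B- tag (or a stray I- tag) opens a new entity.
        if label != "O":
            start, current = i, label[2:]
    if current is not None:
        spans.append((current, start, len(tokens), " ".join(tokens[start:])))
    return spans


# Placeholder address used for illustration only.
tokens = ["The", "email", "address", "of", "Michael", "Jordan", "is", "user@example.com"]
labels = ["O", "O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O", "B-EMAIL"]

for span in bio_to_spans(tokens, labels):
    print(span)
# ('NAME_STUDENT', 4, 6, 'Michael Jordan')
# ('EMAIL', 7, 8, 'user@example.com')
```

Joining tokens with a single space is a simplification; reconstructing the exact original text would require whitespace information that this sketch does not use.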

Data Format:

  • The train/test data is stored in {test|train}.json files.
  • Each JSON file has the following fields:
    • document: unique ID (integer)
    • full_text: essay content (string)
    • tokens: individual words in the essay (list of strings)
    • labels (training data only): BIO labels for each token (list of strings)
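
As a rough illustration of this format, the sketch below loads such a file with the standard library and checks that `labels`, when present, has one entry per token. It assumes each file is a JSON array of objects with the fields listed above. `load_documents` is a hypothetical helper, and the sample record is invented for the example rather than taken from the dataset.

```python
import json
import tempfile
from pathlib import Path


def load_documents(path):
    """Load essays and check that labels, when present, align with tokens."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)  # assumed layout: a JSON array of objects
    for doc in docs:
        labels = doc.get("labels")  # only training data carries labels
        if labels is not None and len(labels) != len(doc["tokens"]):
            raise ValueError(
                f"document {doc['document']}: token/label length mismatch"
            )
    return docs


# Hypothetical record for illustration only (not from the dataset).
sample = [{
    "document": 7,
    "full_text": "My name is Jane Doe.",
    "tokens": ["My", "name", "is", "Jane", "Doe", "."],
    "labels": ["O", "O", "O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"],
}]

# Write the sample to a temporary file so the example runs without the real data.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "train.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    docs = load_documents(path)

print(len(docs), len(docs[0]["tokens"]))  # 1 6
```

With the real data, the same function would be pointed at the competition's train.json or test.json instead of the temporary file.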

Acknowledgement

Special thanks to Martin Görner (@martin-gorner) for his kind review.
