datafog / datafog-python

Privacy Engineering for the Generative AI era

Home Page: https://www.datafog.ai

License: MIT License

Python 96.74% Just 3.26%
ai pii rag data-anonymization data-preprocessing data-science llm-privacy machine-learning open-source pii-detection

datafog-python's Introduction

Open-source DevSecOps for Generative AI Systems.

Overview

What is DataFog?

DataFog is an open-source DevSecOps platform that lets you scan and redact Personally Identifiable Information (PII) out of your Generative AI applications.

Core Problem

image

How it works

image

Installation

DataFog can be installed via pip:

pip install datafog

Getting Started

The DataFog library provides functionality for text and image processing, including PII (Personally Identifiable Information) annotation and OCR (Optical Character Recognition) capabilities.
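To make "PII annotation" and redaction concrete, here is a minimal regex-based sketch. It is not DataFog's detection logic: the `PATTERNS` table and the `annotate` and `redact` helpers are hypothetical names chosen for illustration, and the two patterns only cover a long-form date and an 8-digit record number.

```python
import re

# Hypothetical patterns for illustration only; not DataFog's detection logic.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
PATTERNS = {
    "DATE": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
    "MRN": re.compile(r"\b\d{8}\b"),
}


def annotate(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.group(), match.start(), match.end()))
    return sorted(spans, key=lambda span: span[2])


def redact(text):
    """Replace every annotated span with its label in square brackets."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for label, _, start, end in reversed(annotate(text)):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


sample = "Date: April 10, 2024. MRN: 00987654."
print(annotate(sample))
print(redact(sample))
```

Annotation returns the spans it found, while redaction rewrites the text; keeping the two steps separate lets you inspect matches before anything is removed.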

Usage

The Getting Started guide is available as a standalone Colab notebook.

Text PII Annotation

To annotate PII in a given text, let's start with a set of clinical notes:

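import os
from IPython.display import Markdown, display

# Clone the gist into a local clinical_notes/ directory (notebook shell command)
!git clone https://gist.github.com/b43b72693226422bac5f083c941ecfdb.git clinical_notes

# Define the directory path
folder_path = 'clinical_notes/'

# List all .txt files in the directory, sorted by name
file_list = os.listdir(folder_path)
text_files = sorted(file for file in file_list if file.endswith('.txt'))

# Read the first clinical note
with open(os.path.join(folder_path, text_files[0]), 'r') as file:
    clinical_note = file.read()

# Render the note as Markdown in the notebook
display(Markdown(clinical_note))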

which looks like this:


**Date:** April 10, 2024

**Patient:** Emily Johnson, 35 years old

**MRN:** 00987654

**Chief Complaint:** "I've been experiencing severe back pain and numbness in my legs."

**History of Present Illness:** The patient is a 35-year-old who presents with a 2-month history of worsening back pain, numbness in both legs, and occasional tingling sensations. The patient reports working as a freelance writer and has been experiencing increased stress due to tight deadlines and financial struggles.

**Past Medical History:** Hypothyroidism

**Social History:**
The patient shares a small apartment with two roommates and relies on public transportation. They mention feeling overwhelmed with work and personal responsibilities, often sacrificing sleep to meet deadlines. The patient expresses concern over the high cost of healthcare and the need for affordable medication options.

**Review of Systems:** Denies fever, chest pain, or shortness of breath. Reports occasional headaches.

**Physical Examination:**
- General: Appears tired but is alert and oriented.
- Vitals: BP 128/80, HR 72, Temp 98.6°F, Resp 14/min

**Assessment/Plan:**
- Continue to monitor blood pressure and thyroid function.
- Discuss affordable medication options with a pharmacist.
- Refer to a social worker to address housing concerns and access to healthcare services.
- Encourage the patient to engage with community support groups for social support.
- Schedule a follow-up appointment in 4 weeks or sooner if symptoms worsen.

**Comments:** The patient's health concerns are compounded by socioeconomic factors, including employment status, housing stability, and access to healthcare. Addressing these social determinants of health is crucial for improving the patient's overall well-being.

We can then set up our pipeline to accept these files:

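import asyncio

# `datafog` is assumed to be an already-constructed DataFog instance
texts = [clinical_note]

async def run_text_pipeline_demo():
    # Run the text pipeline over every input text
    results = await datafog.run_text_pipeline(texts)
    print("Text Pipeline Results:", results)
    return results

# In a script; in a notebook with a running event loop, use
# `results = await run_text_pipeline_demo()` instead
results = asyncio.run(run_text_pipeline_demo())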

Note: The DataFog library uses asynchronous programming, so make sure to use the async/await syntax when calling the appropriate methods.
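As a minimal sketch of what driving an async call from a plain script looks like, the example below uses `fake_text_pipeline`, a hypothetical stand-in that only counts characters; it is not a DataFog function. In a notebook that already has an event loop running, you would `await main()` directly instead of calling `asyncio.run`.

```python
import asyncio


async def fake_text_pipeline(texts):
    """Hypothetical stand-in for an async pipeline call; it only counts characters."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [len(text) for text in texts]


async def main():
    # Coroutines must be awaited from inside another coroutine.
    return await fake_text_pipeline(["Emily Johnson", "MRN 00987654"])


# asyncio.run creates an event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
print(results)
```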

OCR PII Annotation

Let's use an image (which could easily be a converted or scanned PDF):

Executive Email

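import asyncio
from datafog import DataFog

# Configure DataFog to extract text from images
datafog = DataFog(operations='extract_text')
url_list = ['https://pbs.twimg.com/media/GM3-wpeWkAAP-cX.jpg']

async def run_ocr_pipeline_demo():
    # Run the OCR pipeline on the image URLs
    results = await datafog.run_ocr_pipeline(url_list)
    print("OCR Pipeline Results:", results)

# In a script; in a notebook, use `await run_ocr_pipeline_demo()` instead
asyncio.run(run_ocr_pipeline_demo())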

You'll notice that we use async functions liberally throughout the SDK. Given the nature of the functions we provide and the extension of DataFog into APIs and other formats, async makes these functions easier to adapt for those uses.
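One reason async functions adapt well to API-style use is that several requests can be awaited concurrently. The sketch below illustrates this with `asyncio.gather`; `fake_ocr` and the `example.com` URLs are hypothetical stand-ins, not part of the DataFog SDK.

```python
import asyncio


async def fake_ocr(url):
    """Hypothetical stand-in for one async OCR request."""
    await asyncio.sleep(0.01)  # simulate waiting on network I/O
    return f"text from {url}"


async def process_all(urls):
    # gather runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(fake_ocr(url) for url in urls))


urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
outputs = asyncio.run(process_all(urls))
print(outputs)
```

Because each call yields while it waits, the total time is close to that of the slowest request, not the sum of all of them.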

Contributing

DataFog is a community-driven open-source platform, and we've been fortunate to have a small and growing contributor base. We'd love to hear your ideas, feedback, and suggestions for improvement: anything you think would make DataFog better. Join our Discord to become part of the community.

Dev Notes

  • Justfile commands:
    • just format to apply formatting.
    • just lint to check formatting and style.

Testing

To run the datafog unit tests, check out this repository and run:

tox

License

This software is published under the MIT license.

datafog-python's People

Contributors

sidmohan0


ockhamlabs

datafog-python's Issues

Support for Other doc types

Hi, is there a plan to support other document types, e.g. PDF (with images), JSON, or PPTX? If we need to enable this ourselves, what's the best way?

[Investigate] scope PDF parsing functionality

This is still an active topic within the RAG extraction space.

TODO:

  • Review PDF extraction methods in well-reviewed RAG tutorials (from the past week)
  • Break down into what's scope-able for future milestones
