redhuntlabs / octopii

An AI-powered Personally Identifiable Information (PII) scanner.

Home Page: https://redhuntlabs.com/blog/octopii-an-opensource-pii-scanner-for-images.html

License: Other

Python 100.00%
cybersecurity image-processing machine-learning ocr optical-character-recognition pii pii-detection nlp python blackhat

octopii's Introduction

Octopii

⠀⠀⠀⠀⠀⠀⠀⣤⣤⣄⣀⡀⠀⠀⠀⢀⣠⣤⣤⣄⡀⠀⠀⠀⢀⣀⣠⣤⣤⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠸⣿⣿⡿⠿⢿⣷⡄⢠⣿⣿⣿⣿⣿⣿⡄⢀⣾⡿⠿⢿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠈⠉⠀⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⠀⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣠⣤⡀⠀⠀⠀⠀⠀⠀⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⠀⠀⠀⠀⠀⠀⢀⣤⣄⠀⠀⠀
⠸⣿⣿⣿⣿⣿⣿⣿⣿⣦⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⣴⣿⣿⣿⣿⣿⣿⣿⣿⠇⠀⠀
⠀⠉⠉⠁⠀⠀⠀⠀⣿⣿⠀⢸⣿⡇⠀⠉⣿⣿⣿⣿⠉⠀⢸⣿⡇⠀⣿⣿⠀⠀⠀⠀⠈⠉⠉⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣀⣈⣻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣟⣁⣀⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀       
⠀⠀⠀⠀⠀⠀⠀⠀⠘⠿⠿⠿⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠿⠿⠿⠿⠃⠀⠀⠀⠀⠀⠀⠀ 
⠀⠀⠀⠀⠀⠀⢀⣤⣤⣤⣤⣤⣤⣴⣿⣿⣿⡇⢸⣿⡿⣿⣦⣤⣤⣤⣤⣤⣤⡀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⠋⠉⠉⠉⠉⠉⠉⢸⣿⡇⢸⣿⡇⠈⠉⠉⠉⠉⠉⠙⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢰⣿⣿⣦⠀⢰⣿⣿⣦⠀⢸⣿⡇⢸⣿⡇⠀⣰⣿⣿⡆⠀⣴⣿⣿⡆⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠈⠻⠿⠋⠀⠘⣿⣿⠃⠀⢸⣿⡇⢸⣿⡇⠀⠘⣿⣿⠃⠀⠙⠿⠟⠁⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⣿⣦⣤⣼⣿⠃⠘⣿⣧⣄⣤⣿⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠛⠛⠁⠀⠀⠈⠛⠛⠛⠋⠀⠀⠀

⠀⠀⠀⠀⠀⠀      ⠀O C T O P I I⠀⠀⠀⠀
Copyright © 2023 RedHunt Labs Private Limited

Octopii is a Personally Identifiable Information (PII) scanner that uses Optical Character Recognition (OCR), regular expression lists and Natural Language Processing (NLP) to search public-facing locations for government IDs, addresses, emails, etc. in images, PDFs and documents.

PII leaks are often overlooked in the cybersecurity space. At RedHunt Labs, we always look for different and innovative ways to come up with cybersecurity solutions that organizations and services need. We've encountered a substantial number of organizations that have their servers configured incorrectly. This causes employee and customer PII to leak all the time, giving malicious parties sensitive information about their origins, ID numbers, contact information and their location.

This is why we created Octopii, a tool to demonstrate and detect how easy it is to automate the discovery and extraction of leaked PII and sensitive documents on the Internet.

Usage

Installing dependencies

  1. Install all dependencies via pip install -r requirements.txt.
  2. Install the Tesseract helper locally via sudo apt install tesseract-ocr -y on Ubuntu or sudo pacman -Syu tesseract on Arch Linux.
  3. Install spaCy language definitions locally via python -m spacy download en_core_web_sm.

Once you've installed the above, you're all set.

Running

To run Octopii, type

python3 octopii.py <location to scan>

where <location to scan> is a file, a directory or a supported URL.

Octopii currently supports local scanning via filesystem path, S3 URLs and Apache open directory listings. You can also provide individual image URLs or files as an argument.

Example

We've provided a dummy-pii/ folder containing sample PII for you to test Octopii with. Pass it as an argument and you'll get the following output:

owais@artemis ~ $ python3 octopii.py dummy-pii/

Searching for PII in dummy-pii/dummy-drivers-license-nebraska-us.jpg
{
    "file_path": "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
    "pii_class": "Nebraska Driver's License",
    "country_of_origin": "United States",
    "faces": 1,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [
        "4000002170"
    ],
    "addresses": [
        "Nebraska"
    ]
}

Searching for PII in dummy-pii/dummy-PAN-India.jpg
{
    "file_path": "dummy-pii/dummy-PAN-India.jpg",
    "pii_class": "Permanent Account Number",
    "country_of_origin": "India",
    "faces": 0,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [],
    "addresses": [
        "INDIA"
    ]
}

...

A file named output.txt is created, containing the output from the tool. Results are appended to this file in real time as each file is scanned.

Working

Octopii uses Tesseract for Optical Character Recognition (OCR) and NLTK for Natural Language Processing (NLP) to detect strings of personally identifiable information. This is done via the following steps:

1. Input and importing

Octopii scans for images (jpg and png) and documents (pdf, doc, txt etc). It supports 3 sources:

  1. Amazon Simple Storage Service (S3): traverses the XML listing returned by S3 bucket URLs
  2. Open directory listing: traverses Apache open directory listings and scans for files
  3. Local filesystem: can access files and folders within UNIX-like filesystems (macOS and Linux-based operating systems)

Images are detected via Python Imaging Library (PIL) and are opened with OpenCV. PDFs are converted into a list of images and are scanned via OCR. Text-based file types are read into strings and are scanned without OCR.
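As a rough illustration of the three routes described above, here is a minimal sketch that decides how a file would be processed. It routes by file extension, which is a simplification (the text above says images are detected via PIL), and the extension sets and route names are hypothetical.

```python
from pathlib import Path

# Hypothetical extension groups, chosen for illustration only.
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}
PDF_EXTS = {".pdf"}

def route_file(path: str) -> str:
    """Return which processing route a file would take in this sketch."""
    ext = Path(path).suffix.lower()
    if ext in IMAGE_EXTS:
        return "ocr"            # open as an image, then run OCR
    if ext in PDF_EXTS:
        return "pdf_to_images"  # convert pages to images, then run OCR
    return "plain_text"         # read into a string, no OCR needed

for name in ["id-card.JPG", "statement.pdf", "notes.txt"]:
    print(name, "->", route_file(name))
```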

2. Face detection

A binary classification technique known as a "Haar cascade" is used to detect faces within images. A pre-trained cascade model is supplied in this repo, which contains cascade data for OpenCV to use. Multiple faces can be detected within the same PII image, and the number of faces detected is returned.

3. Cleaning image and reading text

Images are then "cleaned" for text extraction with the following image transformation steps:

  1. Auto-rotation
  2. Grayscaling
  3. Monochrome
  4. Mean threshold
  5. Gaussian threshold
  6. 3x Deskewing

Image filtering illustration

Since these steps strip away image data (including colors in photographs), this image cleaning process occurs after attempting face detection.
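To make the grayscaling and thresholding steps concrete, here is a pure-Python sketch on a tiny 2x2 image. It is not Octopii's implementation: the real pipeline presumably relies on OpenCV, and its mean threshold may be computed per neighbourhood rather than globally as assumed here.

```python
def to_grayscale(pixels):
    """Convert rows of (R, G, B) tuples to rows of 0-255 gray values."""
    # Integer approximation of the common luma weights (0.299, 0.587, 0.114).
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in pixels]

def mean_threshold(gray):
    """Binarize: pixels brighter than the global mean become 255, others 0."""
    flat = [value for row in gray for value in row]
    mean = sum(flat) / len(flat)
    return [[255 if value > mean else 0 for value in row] for row in gray]

image = [
    [(255, 255, 255), (10, 10, 10)],
    [(200, 200, 200), (0, 0, 0)],
]
gray = to_grayscale(image)
binary = mean_threshold(gray)
print(gray)
print(binary)
```

The colour information is gone after the first step, which is why face detection has to run before cleaning.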

4. Optical Character Recognition (OCR)

Tesseract is used to grab all text strings from an image/file. The text is then tokenized into a list of strings, split by newline characters ('\n') and spaces (' '). Garbled text, such as null strings and single characters, is discarded from this list, resulting in an 'intelligible' list of potential words.
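The tokenization described above can be sketched as follows. The function name and sample text are made up for illustration; only the splitting and filtering rules come from the description.

```python
def tokenize(ocr_text: str) -> list[str]:
    """Split OCR output on newlines and spaces, dropping empty and 1-char tokens."""
    tokens = []
    for line in ocr_text.split("\n"):
        for token in line.split(" "):
            if len(token) > 1:  # discards "" and single characters
                tokens.append(token)
    return tokens

sample = "NEBRASKA\nDRIVER  LICENSE\nA b 4000002170"
print(tokenize(sample))
```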

This list of words is then fed into a similarity checker function. This function uses Gestalt pattern matching to compare each word extracted from the PII document with a list of keywords, present in definitions.json. This check happens once per cleaning step. The number of times a word from the keywords list occurs is counted, and this count is used to derive a confidence score. When a particular definition's keywords appear repeatedly in these scans, that definition gets the highest score and is picked as the predicted PII class.
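A minimal sketch of this scoring idea, using Python's difflib, whose SequenceMatcher follows a Ratcliff/Obershelp-style "gestalt" approach. The keyword lists, the 0.8 threshold and the function names are assumptions for illustration; the actual layout of definitions.json and Octopii's scoring formula may differ.

```python
from difflib import SequenceMatcher

# Hypothetical keyword lists; the real definitions.json layout may differ.
DEFINITIONS = {
    "Driver's License": ["driver", "license", "class", "restrictions"],
    "Passport": ["passport", "nationality", "authority"],
}

def is_similar(word: str, keyword: str, threshold: float = 0.8) -> bool:
    """Compare two strings with difflib's gestalt-style similarity ratio."""
    ratio = SequenceMatcher(None, word.lower(), keyword.lower()).ratio()
    return ratio >= threshold

def score_definitions(words, definitions=DEFINITIONS):
    """Count, per definition, how many extracted words resemble a keyword."""
    scores = {}
    for name, keywords in definitions.items():
        scores[name] = sum(
            1 for word in words if any(is_similar(word, k) for k in keywords)
        )
    return scores

# "DRlVER" mimics an OCR misread (lowercase L instead of I).
words = ["DRlVER", "LICENSE", "Nebraska", "CLASS"]
scores = score_definitions(words)
best = max(scores, key=scores.get)
print(scores, best)
```

Fuzzy matching is what lets an OCR misread such as "DRlVER" still count towards the right definition.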

Octopii also checks for sensitive PII substrings such as emails, phone numbers and common government ID unique identifiers using regular expressions. It can also extract geolocation data such as addresses and countries using Natural Language Processing.
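As an illustration of the regular expression step, here is a small sketch that pulls emails and phone numbers out of a string. Both patterns are deliberately simple and are not Octopii's actual rules; the phone pattern assumes bare 10-digit numbers.

```python
import re

# Illustrative patterns only; real-world rules need to be far more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{10}\b")  # assumes bare 10-digit numbers

def find_contacts(text: str) -> dict:
    """Return all email-like and phone-like substrings found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phone_numbers": PHONE_RE.findall(text),
    }

text = "Contact jane.doe@example.com or call 4000002170."
print(find_contacts(text))
```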

5. Output

The output consists of the following:

  • file_path: Where the file containing PII can be found
  • pii_class: The type of PII this file contains
  • country_of_origin: Where this PII originates from.
  • faces: The number of faces detected in the file.
  • identifiers: Unique identifiers, codes or numbers that may be used to target the individual mentioned in the PII.
  • emails and phone_numbers: Contact information in the file.
  • addresses: Any form of geolocation data in the PII. This may be used to triangulate an individual's location.
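The sketch below shows one way such a record could be appended to an output file as JSON. The append_result helper is hypothetical, and the exact format Octopii writes to output.txt is an assumption; the field values are copied from the sample output earlier in this README.

```python
import json
import os
import tempfile

def append_result(result: dict, output_path: str) -> None:
    """Append one scan result to the output file as indented JSON."""
    with open(output_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(result, indent=4) + "\n")

result = {
    "file_path": "dummy-pii/dummy-PAN-India.jpg",
    "pii_class": "Permanent Account Number",
    "country_of_origin": "India",
    "faces": 0,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [],
    "addresses": ["INDIA"],
}

# Write to a temporary directory so the demo leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")
    append_result(result, path)
    append_result(result, path)  # a second scan appends, it does not overwrite
    with open(path, encoding="utf-8") as handle:
        written = handle.read()

print(written)
```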

Contributing

Click here to read about how you can contribute to Octopii.


Credits

...and countless others

Disclaimer

This tool is intended for research and educational purposes only. RedHunt Labs and other contributors to this project take no responsibility for malicious usage of this tool.

License

MIT License

Copyright © 2023 RedHunt Labs Private Limited.

By Owais Shaikh

octopii's People

Contributors

0x4f53, othmanalikhan, owais-redhunt, umair-rhl

octopii's Issues

Questions about "confidence_score"

Hi, I've been reading your source code and I'm curious about the confidence scores.
What do the scores indicate, and what standards do you use for them?

UnboundLocalError: local variable 'contains_faces' referenced before assignment

Describe the bug
When running the tool on a directory without images or PDF files, an UnboundLocalError is raised because the variable contains_faces has not been initialized. I believe that adding contains_faces = 0 at the beginning of the search_pii(file_path) function will solve the issue.

To Reproduce
Steps to reproduce the behavior:

  1. Create a directory dir containing only text files
  2. Run python3 octopii.py dir/

Expected behavior
Octopii runs successfully
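A minimal sketch of the fix proposed in this issue. Only the name search_pii(file_path) and the variable contains_faces come from the report; the function body and the detect_faces helper are hypothetical stand-ins.

```python
def detect_faces(file_path: str) -> int:
    """Hypothetical stand-in for the real face detector."""
    return 1

def search_pii(file_path: str) -> dict:
    contains_faces = 0  # initialised up front, as the issue proposes
    if file_path.lower().endswith((".jpg", ".jpeg", ".png", ".pdf")):
        contains_faces = detect_faces(file_path)
    # Without the initialisation above, this line would raise
    # UnboundLocalError for any file that skips the image/PDF branch.
    return {"file_path": file_path, "faces": contains_faces}

print(search_pii("dir/notes.txt"))
```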

Feature request: portable app

Would there be an easy way to make this portable so I could toss it on a thumb drive and run it on a random workstation?

Windows

Is your feature request related to a problem? Please describe.
I have a use case where I want to scan through backup files on an SMB share with Octopii. This works today, but it takes additional steps: I have to make sure my Linux machine has access to the SMB share or to the backup file in question. If Octopii could also run on Windows, that would help my use case.

Describe the solution you'd like
I am not sure how big a lift this is, and I am more than happy to help where possible. Below I have added the errors I see after confirming that the dependencies for Windows are available.

It is not the end of the world but being able to run this from a Windows box would be better than having a dedicated Linux box for this task.

Additional context
When I run on Windows where I have already installed Tesseract I get the following:

 Octopii  python .\octopii.py .\dummy-pii\
Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\Octopii\octopii.py", line 123, in <module>
    rules=text_utils.get_regexes()
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\Documents\Octopii\text_utils.py", line 52, in get_regexes
    _rules = json.load(json_file)
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3062: character maps to <undefined>
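The traceback shows the file being decoded with cp1252, the locale default on that machine, because no encoding was passed when the file was opened. A possible workaround, assuming the JSON rules file is UTF-8 encoded (not confirmed in this thread), is to pass the encoding explicitly. The load_json_utf8 helper and the demo file below are hypothetical.

```python
import json
import os
import tempfile

def load_json_utf8(path: str):
    """Load JSON with an explicit encoding instead of the locale default."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)

# Demo: U+00C1 encodes to the bytes C3 81 in UTF-8, and 0x81 is
# undefined in cp1252, which reproduces the kind of byte in the error.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rules.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"name": "\u00c1"}, handle, ensure_ascii=False)
    rules = load_json_utf8(path)

print(rules)
```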

"WARNING:tensorFlow:No training configuration found..." When running tool

Greetings,

I became aware of this project via Intigriti's Bug Bytes newsletter. I went through the install using venv, but found that the following error is returned when I run the tool against the 'dummy-pii' local directory and the 'https://pii-carbonconsole.fra1.digitaloceanspaces.com' URL.

image

It seems to be working as expected, as it returns a confidence value for the sample images containing "PII". I am running the tool within Kali 2022.1 using Python 3.9.12 within a virtualenv using venv. A GitHub issue for another project led me to add ", compile=False" to line 214 of the octopii.py script.

image

I don't really understand the implications of the change, but it did result in the error no longer being returned. As I mentioned earlier, the tool seems to be working as expected, so to me it kind of seems like it is just "cosmetic".

This is an exciting project. Thank you for the time and effort put into developing it and sharing it with the world!

ModuleNotFoundError: No module named 'cv2'

Describe the bug
ModuleNotFoundError: No module named 'cv2'

To Reproduce
Steps to reproduce the behavior:

  1. Run python3 octopii.py dummy-pii/ (Windows 11)

Expected behavior
Octopii runs successfully

Octopii crashes on empty files

Describe the bug
Octopii crashes when the scanned folder contains a 0-byte file.

To Reproduce
Steps to reproduce the behavior:
Run octopii against a folder with a 0 byte file in it

Traceback (most recent call last):
  File "/opt/Octopii/octopii.py", line 199, in <module>
    results = search_pii (file_path)
  File "/opt/Octopii/octopii.py", line 80, in search_pii
    addresses = text_utils.regional_pii(text)
  File "/opt/Octopii/text_utils.py", line 80, in regional_pii
    place_entity = locationtagger.find_locations(text = text)
  File "/usr/local/lib/python3.10/dist-packages/locationtagger/__init__.py", line 4, in find_locations
    e = NamedEntityExtractor(url=url, text=text)
  File "/usr/local/lib/python3.10/dist-packages/locationtagger/locationextractor.py", line 25, in __init__
    raise Exception('Please input any text or url')
Exception: Please input any text or url

Expected behavior
Octopii should not crash when a file is 0 bytes.
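One way to avoid this crash would be to skip the location lookup when there is no text to analyse. The sketch below is an assumption, not the project's fix: find_addresses and strict_tagger are hypothetical, with strict_tagger standing in for a tagger that rejects empty input the way the traceback above does.

```python
def find_addresses(text: str, tagger) -> list:
    """Call the location tagger only when there is text to analyse."""
    if not text or not text.strip():
        return []  # empty file: nothing to tag, so skip the call
    return tagger(text)

def strict_tagger(text: str) -> list:
    """Hypothetical stand-in that fails on empty input."""
    if not text:
        raise Exception("Please input any text or url")
    return [word for word in text.split() if word.istitle()]

print(find_addresses("", strict_tagger))
print(find_addresses("Lives in Nebraska", strict_tagger))
```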
