redhuntlabs / octopii

An AI-powered Personally Identifiable Information (PII) scanner.

Home Page: https://redhuntlabs.com/blog/octopii-an-opensource-pii-scanner-for-images.html

License: Other

Python 100.00%

cybersecurity image-processing machine-learning ocr optical-character-recognition pii pii-detection nlp python blackhat

octopii's Introduction

Octopii

⠀⠀⠀⠀⠀⠀⠀⣤⣤⣄⣀⡀⠀⠀⠀⢀⣠⣤⣤⣄⡀⠀⠀⠀⢀⣀⣠⣤⣤⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠸⣿⣿⡿⠿⢿⣷⡄⢠⣿⣿⣿⣿⣿⣿⡄⢀⣾⡿⠿⢿⣿⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠈⠉⠀⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⠀⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⣠⣤⡀⠀⠀⠀⠀⠀⠀⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⠀⠀⠀⠀⠀⠀⢀⣤⣄⠀⠀⠀
⠸⣿⣿⣿⣿⣿⣿⣿⣿⣦⠀⢸⣿⡇⢸⣿⣿⣿⣿⣿⣿⡇⢸⣿⡇⠀⣴⣿⣿⣿⣿⣿⣿⣿⣿⠇⠀⠀
⠀⠉⠉⠁⠀⠀⠀⠀⣿⣿⠀⢸⣿⡇⠀⠉⣿⣿⣿⣿⠉⠀⢸⣿⡇⠀⣿⣿⠀⠀⠀⠀⠈⠉⠉⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⣿⣿⣀⣈⣻⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣟⣁⣀⣿⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀       
⠀⠀⠀⠀⠀⠀⠀⠀⠘⠿⠿⠿⠿⠿⣿⣿⣿⣿⣿⣿⣿⣿⠿⠿⠿⠿⠿⠃⠀⠀⠀⠀⠀⠀⠀ 
⠀⠀⠀⠀⠀⠀⢀⣤⣤⣤⣤⣤⣤⣴⣿⣿⣿⡇⢸⣿⡿⣿⣦⣤⣤⣤⣤⣤⣤⡀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⢸⣿⠋⠉⠉⠉⠉⠉⠉⢸⣿⡇⢸⣿⡇⠈⠉⠉⠉⠉⠉⠙⣿⣧⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⢰⣿⣿⣦⠀⢰⣿⣿⣦⠀⢸⣿⡇⢸⣿⡇⠀⣰⣿⣿⡆⠀⣴⣿⣿⡆⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠈⠻⠿⠋⠀⠘⣿⣿⠃⠀⢸⣿⡇⢸⣿⡇⠀⠘⣿⣿⠃⠀⠙⠿⠟⠁⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⣿⣦⣤⣼⣿⠃⠘⣿⣧⣄⣤⣿⡟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠛⠛⠁⠀⠀⠈⠛⠛⠛⠋⠀⠀⠀

⠀⠀⠀⠀⠀⠀      ⠀O C T O P I I⠀⠀⠀⠀
Copyright © 2023 RedHunt Labs Private Limited

Octopii is a Personally Identifiable Information (PII) scanner that uses Optical Character Recognition (OCR), regular expression lists and Natural Language Processing (NLP) to search public-facing locations for government IDs, addresses, emails and similar data in images, PDFs and documents.

PII leaks are often overlooked in the cybersecurity space. At RedHunt Labs, we always look for different and innovative ways to come up with cybersecurity solutions that organizations and services need. We've encountered a substantial number of organizations that have their servers configured incorrectly. This causes employee and customer PII to leak all the time, giving malicious parties sensitive information about their origins, ID numbers, contact information and their location.

This is why we created Octopii, a tool that demonstrates how easy it is to automate the discovery and extraction of leaked PII and sensitive documents on the Internet.

Usage

Installing dependencies

Install all dependencies via pip install -r requirements.txt.
Install the Tesseract helper locally via sudo apt install tesseract-ocr -y on Ubuntu or sudo pacman -Syu tesseract on Arch Linux.
Install spaCy language definitions locally via python -m spacy download en_core_web_sm.

Once you've installed the above, you're all set.

Running

To run Octopii, type

python3 octopii.py <location to scan>

where <location to scan> is a file, a directory or a URL.

Octopii currently supports local scanning via filesystem path, S3 URLs and Apache open directory listings. You can also provide individual image URLs or files as an argument.

Example

We've provided a dummy-pii/ folder containing sample PII for you to test Octopii with. Pass it as an argument and you'll get the following output

owais@artemis ~ $ python3 octopii.py dummy-pii/

Searching for PII in dummy-pii/dummy-drivers-license-nebraska-us.jpg
{
    "file_path": "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
    "pii_class": "Nebraska Driver's License",
    "country_of_origin": "United States",
    "faces": 1,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [
        "4000002170"
    ],
    "addresses": [
        "Nebraska"
    ]
}

Searching for PII in dummy-pii/dummy-PAN-India.jpg
{
    "file_path": "dummy-pii/dummy-PAN-India.jpg",
    "pii_class": "Permanent Account Number",
    "country_of_origin": "India",
    "faces": 0,
    "identifiers": [],
    "emails": [],
    "phone_numbers": [],
    "addresses": [
        "INDIA"
    ]
}

...

A file named output.txt is created, containing output from the tool. This file is appended to sequentially in real-time.

Working

Octopii uses Tesseract for Optical Character Recognition (OCR) and NLTK for Natural Language Processing (NLP) to detect strings of personally identifiable information. This is done via the following steps:

1. Input and importing

Octopii scans for images (jpg and png) and documents (pdf, doc, txt etc). It supports 3 sources:

Amazon Simple Storage Service (S3): traverses the XML from S3 container URLs
Open directory listing: traverses Apache open directory listings and scans for files
Local filesystem: can access files and folders within UNIX-like filesystems (macOS and Linux-based operating systems)
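As a rough illustration of the S3 source, the sketch below pulls object keys out of a bucket listing document. It assumes the standard S3 ListBucketResult XML layout; the function name list_s3_keys and the sample listing are hypothetical and are not taken from Octopii's code.

```python
import xml.etree.ElementTree as ET

# XML namespace used by S3 bucket listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def list_s3_keys(listing_xml):
    """Return the object keys named in an S3 ListBucketResult document."""
    root = ET.fromstring(listing_xml)
    return [key.text for key in root.findall("s3:Contents/s3:Key", S3_NS) if key.text]


# Hypothetical listing, shaped like the XML a public bucket URL returns.
SAMPLE_LISTING = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Contents><Key>scans/id-card.jpg</Key></Contents>
  <Contents><Key>docs/statement.pdf</Key></Contents>
</ListBucketResult>"""

print(list_s3_keys(SAMPLE_LISTING))
```

Each key could then be appended to the bucket URL and fetched for scanning.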

Images are detected via Python Imaging Library (PIL) and are opened with OpenCV. PDFs are converted into a list of images and are scanned via OCR. Text-based file types are read into strings and are scanned without OCR.
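The routing described above can be sketched as a small dispatcher. This is a simplification that decides by file extension only, whereas Octopii detects images with PIL; the extension sets and the classify_file name are assumptions made for illustration.

```python
from pathlib import Path

# Assumed extension sets; the tool itself may recognise more formats.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
TEXT_EXTENSIONS = {".txt", ".doc", ".docx"}


def classify_file(path):
    """Decide how a file would be processed: OCR on an image, PDF-to-images, or plain text."""
    suffix = Path(path).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return "image"  # opened as an image and scanned via OCR
    if suffix == ".pdf":
        return "pdf"  # converted into a list of images, then scanned via OCR
    if suffix in TEXT_EXTENSIONS:
        return "text"  # read into a string and scanned without OCR
    return "unsupported"


for name in ["dummy-pii/dummy-PAN-India.jpg", "report.PDF", "notes.txt", "archive.zip"]:
    print(name, "->", classify_file(name))
```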

2. Face detection

A binary classification image detection technique - known as a "Haar cascade" - is used to detect faces within images. A pre-trained cascade model is supplied in this repo, which contains cascade data for OpenCV to use. Multiple faces can be detected within the same PII image, and the number of faces detected is returned.

3. Cleaning image and reading text

Images are then "cleaned" for text extraction with the following image transformation steps:

Auto-rotation
Grayscaling
Monochrome
Mean threshold
Gaussian threshold
3x Deskewing

Since these steps strip away image data (including colors in photographs), this image cleaning process occurs after attempting face detection.
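To show why cleaning discards colour data, here is a dependency-free sketch of two of the steps, grayscaling and a mean threshold, applied to a tiny image held as rows of (r, g, b) tuples. It uses a single global mean as the cutoff, which is an assumption: the tool performs these steps with OpenCV and may threshold differently (for example, per neighbourhood).

```python
def to_grayscale(pixels):
    """Convert rows of (r, g, b) tuples into rows of 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in pixels
    ]


def mean_threshold(gray):
    """Turn a grayscale image into pure black (0) and white (255) around its mean."""
    values = [value for row in gray for value in row]
    mean = sum(values) / len(values)
    return [[255 if value > mean else 0 for value in row] for row in gray]


# A hypothetical 2x2 image: dark pixels on the left, light pixels on the right.
image = [
    [(0, 0, 0), (255, 255, 255)],
    [(10, 10, 10), (250, 250, 250)],
]

gray = to_grayscale(image)
binary = mean_threshold(gray)
print(gray)
print(binary)
```

After thresholding only two values remain per pixel, which helps OCR but leaves nothing for a face detector to work with.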

4. Optical Character Recognition (OCR)

Tesseract is used to grab all text strings from an image/file. The text is then tokenized into a list of strings, split by newline characters ('\n') and spaces (' '). Garbled text, such as null strings and single characters, is discarded from this list, resulting in an 'intelligible' list of potential words.
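A minimal sketch of that tokenization step, assuming the OCR output is already available as a single string (the tokenize name is hypothetical):

```python
def tokenize(ocr_text):
    """Split OCR output on newlines and spaces, dropping empty strings and single characters."""
    tokens = []
    for line in ocr_text.split("\n"):
        for token in line.split(" "):
            if len(token) > 1:  # discards "" and one-character noise
                tokens.append(token)
    return tokens


print(tokenize("DRIVER LICENSE\n\nA 4d  Nebraska"))
```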

This list of words is then fed into a similarity checker function. This function uses Gestalt pattern matching to compare each word extracted from the PII document with a list of keywords, present in definitions.json. This check happens once per cleaning. The number of times a word from the keywords list occurs is counted, and this count is used to derive a confidence score. When a particular definition's keywords appear repeatedly in these scans, that definition gets the highest score and is picked as the predicted PII class.
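The scoring idea can be sketched with Python's difflib, whose SequenceMatcher implements Gestalt (Ratcliff/Obershelp) pattern matching. The DEFINITIONS dictionary, the 0.8 similarity threshold and the function names below are assumptions for illustration; they do not reflect the actual layout of definitions.json or the tool's real scoring formula.

```python
from difflib import SequenceMatcher

# Hypothetical keyword lists, standing in for the contents of definitions.json.
DEFINITIONS = {
    "Driver's License": ["driver", "license", "class", "restrictions"],
    "Passport": ["passport", "nationality", "surname"],
}


def is_similar(word, keyword, threshold=0.8):
    """Compare two strings with Gestalt pattern matching, ignoring case."""
    return SequenceMatcher(None, word.lower(), keyword.lower()).ratio() >= threshold


def score_definitions(words, definitions):
    """Count, per PII class, how many extracted words resemble one of its keywords."""
    return {
        pii_class: sum(1 for w in words for k in keywords if is_similar(w, k))
        for pii_class, keywords in definitions.items()
    }


def predict_class(words, definitions):
    """Pick the highest-scoring class, or None when nothing matched."""
    scores = score_definitions(words, definitions)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


words = ["DRIVER", "LICENCE", "Nebraska", "CLASS"]
print(score_definitions(words, DEFINITIONS))
print(predict_class(words, DEFINITIONS))
```

Fuzzy matching is what lets a misread or alternately spelled token such as "LICENCE" still count towards the "license" keyword.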

Octopii also checks for sensitive PII substrings such as emails, phone numbers and common government ID unique identifiers using regular expressions. It can also extract geolocation data such as addresses and countries using Natural Language Processing.
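As an illustration of the regular-expression pass, the sketch below extracts emails and ten-digit phone numbers from a string. Both patterns are simplified assumptions written for this example and are not the expressions Octopii ships.

```python
import re

# Simplified, hypothetical patterns; real-world PII regexes are usually stricter.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\b\d{10}\b")


def extract_contacts(text):
    """Return the emails and ten-digit phone numbers found in a block of text."""
    return {
        "emails": EMAIL_PATTERN.findall(text),
        "phone_numbers": PHONE_PATTERN.findall(text),
    }


print(extract_contacts("Contact jane.doe@example.com or call 4000002170."))
```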

5. Output

The output consists of the following:

file_path: Where the file containing PII can be found
pii_class: The type of PII this file contains
country_of_origin: Where this PII originates from.
faces: The number of faces detected in the file.
identifiers: Unique identifiers, codes or numbers that may be used to target the individual mentioned in the PII.
emails and phone_numbers: Contact information in the file.
addresses: Any form of geolocation data in the PII. This may be used to triangulate an individual's location.
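For reference, a record with these fields can be assembled and serialized as follows. The build_result helper is hypothetical; the field names and sample values are taken from the example output shown earlier.

```python
import json


def build_result(file_path, pii_class, country_of_origin, faces,
                 identifiers, emails, phone_numbers, addresses):
    """Assemble one scan result using the field names from the example output."""
    return {
        "file_path": file_path,
        "pii_class": pii_class,
        "country_of_origin": country_of_origin,
        "faces": faces,
        "identifiers": identifiers,
        "emails": emails,
        "phone_numbers": phone_numbers,
        "addresses": addresses,
    }


record = build_result(
    "dummy-pii/dummy-drivers-license-nebraska-us.jpg",
    "Nebraska Driver's License",
    "United States",
    1,
    [],
    [],
    ["4000002170"],
    ["Nebraska"],
)
serialized = json.dumps(record, indent=4)
print(serialized)
```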

Contributing

Click here to read about how you can contribute to Octopii.

Credits

...and countless others

Disclaimer

This tool is intended for research and educational purposes only. RedHunt Labs and other contributors to this project take no responsibility for malicious usage of this tool.

License

MIT License

By Owais Shaikh

Work: [email protected]
Personal: [email protected]


octopii's Issues

New PII-related regexes

Is your feature request related to a problem? Please describe.
I believe we can have more regexes for PII scanning. This can help expand the coverage of the tool.

Describe the solution you'd like
I discovered a website that has a good amount of regexes that I believe can be useful for Octopii: https://docs.trellix.com/bundle/data-loss-prevention-11.10.x-classification-definitions-reference-guide/page/GUID-66B1F12A-E267-4EEB-A9A5-A4398A6AF8CD.html

Additional context
None

Questions about "confidence_score"

Hi, I've been reading your source code and I'm curious about the confidence scores.
What do the scores indicate, and what criteria are they based on?

UnboundLocalError: local variable 'contains_faces' referenced before assignment

Describe the bug
When running the tool on a directory without images or PDF files, an UnboundLocalError is raised because the variable contains_faces has not been initialized. I believe that adding contains_faces = 0 at the beginning of the search_pii(file_path) function will solve the issue.

To Reproduce
Steps to reproduce the behavior:

Create a directory dir only with text files
Run python3 octopii.py dir/

Expected behavior
Octopii runs successfully
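The fix proposed in this report can be illustrated with a reduced stand-in for search_pii. Only the contains_faces name and the idea of initializing it to 0 come from the report; count_faces, the extension check and the returned dictionary are hypothetical simplifications.

```python
def count_faces(file_path):
    """Hypothetical stand-in for the face detector; pretends every image has one face."""
    return 1


def search_pii(file_path):
    # Initialize up front so the name exists even when no image branch runs.
    contains_faces = 0
    if file_path.lower().endswith((".jpg", ".jpeg", ".png", ".pdf")):
        contains_faces = count_faces(file_path)
    return {"file_path": file_path, "faces": contains_faces}


print(search_pii("dir/notes.txt"))
print(search_pii("dir/photo.jpg"))
```

Without the initial assignment, the text-file path would reach the return statement with contains_faces never bound, which is what raises UnboundLocalError.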

Feature request: portable app

Would there be an easy way to make this portable so I could toss it on a thumb drive and run it on a random workstation?

Fails and crashes when encountering unfamiliar file: zip, .db, etc

The script fails when encountering a file it cannot parse. For instance, a zip file:

Apologies for the screenshot, working inside a limited VM.

ModuleNotFoundError: No module named 'textract.parsers.zip_parser'

Py traceback includes lines 74, 203.

Thank you.

python3 octopii.py
Traceback (most recent call last):
  File "/home/kali/Octopii/octopii.py", line 33, in <module>
    from keras.models import load_model
ModuleNotFoundError: No module named 'keras'

Windows

Is your feature request related to a problem? Please describe.
My use case is scanning backup files on an SMB share with Octopii. This works today, but it takes extra steps: I have to make sure my Linux machine has access to the SMB share or to the backup file in question. If Octopii could also run on Windows, this would help my use case.

Describe the solution you'd like
I am not sure how big this lift is, more than happy to help where possible. I have added the errors below that I see after confirming that the dependencies for windows are available.

It is not the end of the world but being able to run this from a Windows box would be better than having a dedicated Linux box for this task.

Additional context
When I run on Windows where I have already installed Tesseract I get the following:

 Octopii  python .\octopii.py .\dummy-pii\
Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\Octopii\octopii.py", line 123, in <module>
    rules=text_utils.get_regexes()
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\Documents\Octopii\text_utils.py", line 52, in get_regexes
    _rules = json.load(json_file)
             ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
    return loads(fp.read(),
                 ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3062: character maps to <undefined>
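The traceback shows json.load reading the file through the cp1252 codec, which is the default text encoding on many Windows installations. A likely remedy, sketched below, is to open the JSON file with an explicit UTF-8 encoding. The load_json_utf8 helper and the temporary demo file are hypothetical; which file get_regexes actually opens is not shown in the report.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file as UTF-8 regardless of the platform's default encoding."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


# Demo: "Á" is stored in UTF-8 as bytes 0xC3 0x81, and 0x81 has no mapping in cp1252.
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as handle:
    handle.write('{"label": "Á"}'.encode("utf-8"))
    demo_path = handle.name

loaded = load_json_utf8(demo_path)
os.remove(demo_path)
print(loaded)
```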

"WARNING:tensorFlow:No training configuration found..." When running tool

Greetings,

I became aware of this project via Intigriti's Bug Bytes newsletter. I went through the install using venv, but found that the following error is returned when I run the tool against the 'dummy-pii' local directory and the 'https://pii-carbonconsole.fra1.digitaloceanspaces.com' URL.

It seems to be working as expected, as it returns a confidence value for the sample images containing "PII". I am running the tool within Kali 2022.1 using Python 3.9.12 within a virtualenv using venv. A GitHub issue for another project led me to add ", compile=False" to line 214 of the octopii.py script.

I don't really understand the implications of the change, but it did result in the error no longer being returned. As I mentioned earlier, the tool seems to be working as expected, so to me it kind of seems like it is just "cosmetic".

This is an exciting project. Thank you for the time and effort put into developing it and sharing it with the world!

ModuleNotFoundError: No module named 'cv2'

Describe the bug
ModuleNotFoundError: No module named 'cv2'

To Reproduce
Steps to reproduce the behavior:

Run python3 octopii.py dummy-pii/ (Windows 11)

Expected behavior
Octopii runs successfully

Octopii crashes on empty files

Describe the bug
A clear and concise description of what the bug is.

To Reproduce
Steps to reproduce the behavior:
Run octopii against a folder with a 0 byte file in it

Traceback (most recent call last):
  File "/opt/Octopii/octopii.py", line 199, in <module>
    results = search_pii (file_path)
  File "/opt/Octopii/octopii.py", line 80, in search_pii
    addresses = text_utils.regional_pii(text)
  File "/opt/Octopii/text_utils.py", line 80, in regional_pii
    place_entity = locationtagger.find_locations(text = text)
  File "/usr/local/lib/python3.10/dist-packages/locationtagger/__init__.py", line 4, in find_locations
    e = NamedEntityExtractor(url=url, text=text)
  File "/usr/local/lib/python3.10/dist-packages/locationtagger/locationextractor.py", line 25, in __init__
    raise Exception('Please input any text or url')
Exception: Please input any text or url

Expected behavior
Octopii should not crash when a file is 0 bytes.
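One possible guard, assuming the crash comes from passing an empty string to the location tagger as the traceback suggests, is to return early when there is no text. In this sketch find_locations is a hypothetical stand-in that only mimics the exception from the traceback; it is not the locationtagger library.

```python
def find_locations(text):
    """Hypothetical stand-in for the location tagger: rejects empty input like the traceback."""
    if not text:
        raise Exception("Please input any text or url")
    return [word for word in text.split() if word.istitle()]


def regional_pii(text):
    """Return location-like tokens, skipping the tagger entirely for empty input."""
    if not text or not text.strip():
        return []  # a 0-byte file yields no text, so there is nothing to tag
    return find_locations(text)


print(regional_pii(""))
print(regional_pii("issued in Nebraska"))
```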
