redhuntlabs / octopii
An AI-powered Personal Identifiable Information (PII) scanner.
Home Page: https://redhuntlabs.com/blog/octopii-an-opensource-pii-scanner-for-images.html
License: Other
Greetings,
I became aware of this project via Intigriti's Bug Bytes newsletter. I went through the install using venv, but found that the following error is returned when I run the tool against the 'dummy-pii' local directory and the 'https://pii-carbonconsole.fra1.digitaloceanspaces.com' URL.
The tool seems to be working as expected, since it returns a confidence value for the sample images containing "PII". I am running it on Kali 2022.1 with Python 3.9.12 inside a virtualenv created with venv. A GitHub issue for another project led me to add ", compile=False" to line 214 of the octopii.py script.
I don't really understand the implications of the change, but it did result in the error no longer being returned. As I mentioned earlier, the tool seems to be working as expected, so to me the error looks purely "cosmetic".
This is an exciting project. Thank you for the time and effort put into developing it and sharing it with the world!
Is your feature request related to a problem? Please describe.
I believe we can have more regexes for PII scanning. This can help expand the coverage of the tool.
Describe the solution you'd like
I discovered a website that has a good amount of regexes that I believe can be useful for Octopii: https://docs.trellix.com/bundle/data-loss-prevention-11.10.x-classification-definitions-reference-guide/page/GUID-66B1F12A-E267-4EEB-A9A5-A4398A6AF8CD.html
Additional context
None
Describe the bug
When running the tool on a directory without images or PDF files, an UnboundLocalError is raised because the variable contains_faces has not been initialized. I believe that adding contains_faces = 0 at the beginning of the search_pii(file_path) function will solve the issue.
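The failure mode described above can be sketched in isolation. This is a minimal illustration, not Octopii's actual code: the two functions and the extension check are hypothetical stand-ins, and only the name contains_faces and the proposed default of 0 come from the report.

```python
# Hypothetical stand-ins for search_pii(); the real function does much more.
IMAGE_OR_PDF = (".png", ".jpg", ".jpeg", ".pdf")


def search_pii_buggy(file_path):
    # contains_faces is only assigned on the image/PDF branch.
    if file_path.lower().endswith(IMAGE_OR_PDF):
        contains_faces = 1  # placeholder for a real face-detection result
    # For a plain text file the name was never bound, so this line raises
    # UnboundLocalError.
    return {"file_path": file_path, "faces": contains_faces}


def search_pii_fixed(file_path):
    # The proposed fix: give the variable a default before any branch.
    contains_faces = 0
    if file_path.lower().endswith(IMAGE_OR_PDF):
        contains_faces = 1  # placeholder for a real face-detection result
    return {"file_path": file_path, "faces": contains_faces}
```

With the default in place, a text-only directory simply reports zero faces instead of aborting the scan.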
To Reproduce
Steps to reproduce the behavior:
Create a directory dir containing only text files
Run python3 octopii.py dir/
Expected behavior
Octopii runs successfully
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Steps to reproduce the behavior:
Run octopii against a folder with a 0 byte file in it
Traceback (most recent call last):
File "/opt/Octopii/octopii.py", line 199, in <module>
results = search_pii (file_path)
File "/opt/Octopii/octopii.py", line 80, in search_pii
addresses = text_utils.regional_pii(text)
File "/opt/Octopii/text_utils.py", line 80, in regional_pii
place_entity = locationtagger.find_locations(text = text)
File "/usr/local/lib/python3.10/dist-packages/locationtagger/__init__.py", line 4, in find_locations
e = NamedEntityExtractor(url=url, text=text)
File "/usr/local/lib/python3.10/dist-packages/locationtagger/locationextractor.py", line 25, in __init__
raise Exception('Please input any text or url')
Exception: Please input any text or url
Expected behavior
Octopii should not crash when a file is 0 bytes.
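One possible guard, sketched under assumptions: the traceback shows the location lookup raising when it receives empty text, so a caller could skip the lookup for empty or whitespace-only input. The helper name find_locations_safely and the stand-in finder below are hypothetical; in Octopii the callable would wrap the locationtagger.find_locations(text=text) call shown in the traceback.

```python
def find_locations_safely(text, finder):
    """Call finder(text) only when there is text to analyse.

    Returns None for empty or whitespace-only input instead of letting the
    underlying library raise.
    """
    if not text or not text.strip():
        return None
    return finder(text)


def strict_finder(text):
    # Stand-in that mimics the behaviour seen in the traceback: it refuses
    # empty input with the same message.
    if not text:
        raise Exception("Please input any text or url")
    return {"text_length": len(text)}
```

A 0-byte file yields an empty string, so the guarded call returns None and the scan can continue with the next file.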
Describe the bug
ModuleNotFoundError: No module named 'cv2'
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Octopii runs successfully
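The cv2 module is provided by the OpenCV Python bindings (published on PyPI as opencv-python), so this error usually means that package is not installed in the active environment. A small pre-flight check, sketched with a hypothetical helper name, can report missing modules before the tool starts:

```python
import importlib.util


def missing_modules(names):
    """Return the module names from `names` that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Only cv2 is taken from the error above; extend the list as needed.
    missing = missing_modules(["cv2"])
    if missing:
        print("Missing modules:", ", ".join(missing))
```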
Would there be an easy way to make this portable so I could toss it on a thumb drive and run it on a random workstation?
Hi, I have been reading your source code and am curious about the confidence scores.
Could you explain what the confidence scores indicate and what criteria they are based on?
Is your feature request related to a problem? Please describe.
My use case is scanning backup files on an SMB share with Octopii. This works today, but it takes extra steps: I have to make sure my Linux machine has access to the SMB share or to the backup file in question. Being able to run Octopii on Windows as well would help with this.
Describe the solution you'd like
I am not sure how big a lift this is, but I am more than happy to help where possible. Below are the errors I see after confirming that the dependencies for Windows are available.
It is not the end of the world, but being able to run this from a Windows box would be better than keeping a dedicated Linux box for the task.
Additional context
When I run on Windows where I have already installed Tesseract I get the following:
Octopii python .\octopii.py .\dummy-pii\
Traceback (most recent call last):
File "C:\Users\Administrator\Documents\Octopii\octopii.py", line 123, in <module>
rules=text_utils.get_regexes()
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\Documents\Octopii\text_utils.py", line 52, in get_regexes
_rules = json.load(json_file)
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\json\__init__.py", line 293, in load
return loads(fp.read(),
^^^^^^^^^
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 3062: character maps to <undefined>
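The traceback shows the rules file being decoded with cp1252, the Windows locale default that open() uses when no encoding is given; byte 0x81 has no mapping in that code page. A likely remedy, sketched here on the assumption that the rules file is UTF-8, is to pass the encoding explicitly. The function name load_rules and the sample data are hypothetical.

```python
import json
import os
import tempfile


def load_rules(path):
    """Load a JSON rules file as UTF-8, independent of the OS locale."""
    with open(path, encoding="utf-8") as json_file:
        return json.load(json_file)


# Throwaway sample file; "Á" encodes to bytes C3 81 in UTF-8, so it contains
# the same 0x81 byte that cp1252 cannot decode.
sample = {"Example Rule": {"regex": "Á[0-9]{4}"}}
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as tmp:
    json.dump(sample, tmp, ensure_ascii=False)
    tmp_path = tmp.name

rules = load_rules(tmp_path)

# Decoding the same bytes as cp1252 reproduces the reported failure.
with open(tmp_path, "rb") as raw:
    raw_bytes = raw.read()
try:
    raw_bytes.decode("cp1252")
    cp1252_decodes = True
except UnicodeDecodeError:
    cp1252_decodes = False

os.remove(tmp_path)
```

On Linux the default encoding is typically UTF-8 already, which would explain why the same file loads there without the explicit argument.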