rominf / profanity-filter
A Python library for detecting and filtering profanity
License: GNU General Public License v3.0
When trying to identify profane words, sh1t is not being identified as profane. The Levenshtein approach should have identified this variation of the original profane word. Also, I see that sh1t is listed in the profane word dictionary. Could you please see where the problem is?
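A minimal repro sketch (my assumption: deep analysis, which powers the Levenshtein matching, is installed and active):
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
print(pf.censor_word('sh1t'))  # expected a censored Word; reported to come back uncensored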
The profane words dictionary contains both wank and fuck, and also fucker and wanker. However, pf.censor_word() correctly censors fucks and wanks, but not fuckers or wankers. This seems counterintuitive, since the extra s on the shorter words should make a bigger percentage difference between the words?
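A repro sketch of the reported asymmetry:
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
print(pf.censor('fucks'))    # censored, as expected
print(pf.censor('wanks'))    # censored, as expected
print(pf.censor('fuckers'))  # reported to come back uncensored
print(pf.censor('wankers'))  # reported to come back uncensored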
Hi,
Am I correct in assuming that this library cannot censor phrases like "2 girls 1 cup", where the individual words are harmless but the sentence as a whole is suggestive, even if I add them to custom_profane_word_dictionaries?
Thanks
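For concreteness, the attempt would look roughly like this sketch; whether a multi-word entry ever matches is exactly the open question:
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
pf.custom_profane_word_dictionaries = {'en': {'2 girls 1 cup'}}
print(pf.censor('2 girls 1 cup'))  # each individual token is harmless on its own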
Hi @rominf
When I do
from profanity_filter import ProfanityFilter as pf
pf.censor("That's bullshit!")
This error pops up:
File "/usr/local/lib/python3.5/dist-packages/profanity_filter/profanity_filter.py", line 102
    censor_char: str = '*'
SyntaxError: invalid syntax
Is this a Python 3 issue? Does this only support Python 2?
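For reference, censor_char: str = '*' on a line of its own is a PEP 526 variable annotation, which only parses on Python 3.6 and newer, and the traceback path shows Python 3.5. A version guard like the sketch below would make the failure explicit (an assumption based on the syntax, not on the package's declared requirements):
import sys

# PEP 526 variable annotations (x: str = '*') are a SyntaxError before Python 3.6
assert sys.version_info >= (3, 6), 'profanity-filter appears to need Python 3.6+'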
I created a simple service using the instructions from your README, but nothing works.
My service:
import spacy
from django.conf import settings
from functools import cached_property
from profanity_filter import ProfanityFilter
class ProfanityService:
    def __init__(self):
        en_nlp = spacy.load("en_core_web_sm")
        pl_nlp = spacy.load("pl_core_news_sm")
        self.filter = ProfanityFilter(nlps={"en": en_nlp, "pl": pl_nlp})
        self.filter.custom_profane_word_dictionaries = self.dictionaries
        self.filter.censor_char = "*"

    @cached_property
    def dictionaries(self):
        dicts = {}
        with open(settings.EN_PROFANITY_DICT, "r") as f:
            dicts["en"] = f.read().splitlines()
        with open(settings.PL_PROFANITY_DICT, "r") as f:
            dicts["pl"] = f.read().splitlines()
        return dicts

    def censor(self, text):
        return self.filter.censor(text)
And I get errors when I call the censor method.
common/services/profanity.py:28: in censor
    return self.filter.censor(text)
../venv/lib/python3.8/site-packages/profanity_filter/profanity_filter.py:201: in censor
    return self._censor(text=text, return_bool=False)
../venv/lib/python3.8/site-packages/profanity_filter/profanity_filter.py:798: in _censor
    if token._.is_profane:
self = <spacy.tokens.underscore.Underscore object at 0x7f85da213ee0>, name = 'is_profane'
    def __getattr__(self, name):
        if name not in self._extensions:
>           raise AttributeError(Errors.E046.format(name=name))
E AttributeError: [E046] Can't retrieve unregistered extension attribute 'is_profane'. Did you forget to call the `set_extension` method?
../venv/lib/python3.8/site-packages/spacy/tokens/underscore.py:35: AttributeError
P.S. I tried different ways but had no luck.
The Deep learning section contains code to cd into profanity_filter/data; where are these files?
How can this behavior be explained in the current version?
>>> pf.censor("KAME FUKUHARA")
'KAME **KUHARA'
>>> pf.censor_word("KAME FUKUHARA")
Word(uncensored='KAME FUKUHARA', censored='KAME FUKUHARA', original_profane_word=None)
>>> pf = ProfanityFilter(languages=['en', 'ru'])
>>> pf.censor_whole_words=False
>>> pf.censor("goodshiit")
'good*****'
>>> pf.censor("улицабля")
'улицабля'
>>> pf = ProfanityFilter(languages=['ru', 'en'])
>>> pf.censor_whole_words=False
>>> pf.censor("улицабля")
'улица***'
>>> pf.censor("goodshiit")
'goodshiit'
Expected behavior
The profane word is saved in Redis.
Real behavior
An exception is thrown.
How to reproduce
pf = ProfanityFilter(cache_redis_connection_url='redis://redis:6379/1')
pf.censor('fuck')
The _save_censored_word method will throw an exception:
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/code/ratings_parser/utils/censorship/profanity_filter.py", line 15, in censor
    return pf.censor(text)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 201, in censor
    return self._censor(text=text, return_bool=False)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 796, in _censor
    doc = self._parse(language=language, text=text_part)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 523, in _parse
    return spacy_utlis.parse(nlp=nlp, text=text, language=language, use_profanity_filter=use_profanity_filter)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/spacy_utlis.py", line 19, in parse
    return nlp(text, disable=disable, component_cfg=component_cfg)
  File "/usr/local/lib/python3.8/site-packages/spacy/language.py", line 439, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/spacy_component.py", line 37, in __call__
    span = self._censor_spaceless_span(doc[i:j], language=language)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/spacy_component.py", line 78, in _censor_spaceless_span
    censored_word = self._profanity_filter.censor_word(word=token, language=language)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 206, in censor_word
    return self._censor_word(language=language, word=word)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 710, in _censor_word
    censored_censored_part, no_profanity_inside = self._censor_word_part(language=language,
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 659, in _censor_word_part
    self._save_censored_word(censored_word)
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 630, in _save_censored_word
    d = asdict(word)
  File "/usr/local/lib/python3.8/dataclasses.py", line 1072, in asdict
    raise TypeError("asdict() should be called on dataclass instances")
TypeError: asdict() should be called on dataclass instances
After installing, I'm not getting the console command to work.
profanity_filter -h
command not found
It's not in my C:\Python38\Scripts, nor in my C:\Users\abc\AppData\Roaming\Python\Python38\Scripts.
Use the spaCy component for most tests, as it offers more information.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 154, in __init__
    spells=spells,
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 180, in config
    self._set_languages(languages, load_morphs=morphs is None, load_nlps=nlps is None, load_spells=spells is None)
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 418, in _set_languages
    self.morphs = None
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 319, in morphs
    self._morphs[language] = MorphAnalyzer(lang=language)
TypeError: __init__() got an unexpected keyword argument 'lang'
It should be implemented as a settable property. Note that the cache should be cleared after setting the new value.
The bottlenecks are: deepcopy.
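A minimal sketch of the settable-property pattern described above (illustrative class and attribute names, not the library's actual code):
class Example:
    def __init__(self):
        self._option = None

    @property
    def option(self):
        return self._option

    @option.setter
    def option(self, value):
        self._option = value
        self.clear_cache()  # anything computed under the old value is now stale

    def clear_cache(self):
        # stand-in for the library's cache invalidation
        pass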
Hey,
I have been using the library to classify English texts.
The one problem I have been facing is that the tool wrongly classifies words that contain devil, hell, or allah. I was wondering if I can remove those from the library's dictionary.
Thanks,
Vyom
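One possible workaround sketch: replace the built-in word list via custom_profane_word_dictionaries (documented in the README) with a filtered copy of your own; the file name below is a hypothetical stand-in for wherever you keep that copy:
from profanity_filter import ProfanityFilter

ALLOWED = {'devil', 'hell', 'allah'}
with open('my_en_profane_words.txt') as f:  # hypothetical edited copy of the word list
    words = {w for w in f.read().splitlines() if w not in ALLOWED}

pf = ProfanityFilter()
pf.custom_profane_word_dictionaries = {'en': words}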
For every test, a new instance of the profanity filter is created. I think it should be possible to cache fixtures.
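A sketch of a session-scoped pytest fixture that would reuse one instance across all tests:
import pytest
from profanity_filter import ProfanityFilter

@pytest.fixture(scope='session')
def profanity_filter():
    # built once per test session instead of once per test
    return ProfanityFilter()
Tests that mutate the instance (e.g., setting custom dictionaries) would need to restore its state afterwards.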
Hi Roman,
Thank you for sharing the code for your product. I learned a lot from it and find it very powerful and reliable for the number of features it provides. I did not try all of them yet, though. :)
I have a suggestion.
Can we surface the bad_word that was mutated by the user in the result?
E.g., if I have "shiiiit" as an input, I would want to know the real bad_word that Levenshtein "had in mind" ("shit"). This example is easy, but sometimes there are cases where you cannot even guess why a word was censored.
Do you see value in it? Do you think it makes sense to add it? Maybe via an extra parameter, if not always?
Thank you very much for being very responsive and providing excellent support for your great product!
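For what it's worth, the Word result shown in the "KAME FUKUHARA" report above already carries an original_profane_word field, so the lookup might be as simple as the sketch below (whether "shiiiit" is actually matched depends on deep analysis being enabled):
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
word = pf.censor_word('shiiiit')
print(word.original_profane_word)  # ideally 'shit' when the Levenshtein match fires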
When supplying a dict ({lang: {set}}) to extra_profane_word_dictionaries, it raises a TypeError after trying to divide a string by a string:
Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 265, in custom_profane_word_dictionaries
    self.clear_cache()
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 384, in clear_cache
    self._update_profane_word_dictionary_files()
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 429, in _update_profane_word_dictionary_files
    profane_word_file = self._DATA_DIR / f'{language}_profane_words.txt'
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Algorithm: https://fastss.csg.uzh.ch/
Implementation: https://github.com/fujimotos/TinyFastSS
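For context, a toy sketch of the FastSS idea: index every word under all of its single-character deletions, so that words within small edit distance share a deletion variant (candidates still need a verification pass with a real edit-distance check):
from collections import defaultdict

def deletions(word):
    # the word itself plus every string obtained by deleting one character
    return {word} | {word[:i] + word[i + 1:] for i in range(len(word))}

def build_index(words):
    # map each deletion variant to the source words that produce it
    index = defaultdict(set)
    for word in words:
        for variant in deletions(word):
            index[variant].add(word)
    return index

def query(index, word):
    # words within edit distance 1 always share at least one variant
    candidates = set()
    for variant in deletions(word):
        candidates |= index.get(variant, set())
    return candidates

index = build_index({'shit', 'fuck'})
print(query(index, 'sh1t'))  # {'shit'} via the shared variant 'sht'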
So I am trying to make a bot that uses this module, and it can't seem to work after being turned into a .exe. It compiles fine, but if I run the .exe from the command line, this is the output:
C:\Users\ReCor\Documents\Bot>bot.exe
Traceback (most recent call last):
  File "bot.py", line 10, in <module>
    import better_profanity
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "better_profanity\__init__.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "better_profanity\better_profanity.py", line 5, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "better_profanity\constants.py", line 14, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\ReCor\\AppData\\Local\\Temp\\_MEI88242\\better_profanity\\alphabetic_unicode.json'
[22044] Failed to execute script bot
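A common fix for missing package data under PyInstaller (an assumption, not tested against this bot) is to bundle the file explicitly with --add-data; the semicolon is the Windows separator, and the site-packages path below is a stand-in for wherever better_profanity is installed:
pyinstaller --onefile --add-data "C:\Python37\Lib\site-packages\better_profanity\alphabetic_unicode.json;better_profanity" bot.py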
Useful functions: partitions, substrings_indexes.
Blocked by more-itertools/more-itertools#276, more-itertools/more-itertools#278.
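For reference, a quick look at what those two more_itertools helpers yield:
from more_itertools import partitions, substrings_indexes

print(list(partitions('ab')))          # [[['a', 'b']], [['a'], ['b']]]
print(list(substrings_indexes('ab')))  # [('a', 0, 1), ('b', 1, 2), ('ab', 0, 2)]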
For example, these words should not be detected as profane when they come as part of emails and URLs: "deathfrom", "eskimobob".
I think dask is a good solution because it has a nice API and can be used in a cluster.
The easiest and most effective parallelization is to map words after tokenization.
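A sketch of that shape with dask.bag (illustrative only; the token list and the use of censor_word stand in for the library's internals, and whether the filter pickles cleanly across workers is an open question):
import dask.bag as db
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
tokens = 'this is some tokenized input text'.split()
# partition the tokens and censor them in parallel
censored = db.from_sequence(tokens, npartitions=4).map(pf.censor_word).compute()
print(censored)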
Things to do:
Also package it for Docker.
This will make parallelized censoring faster. This should be optional because the user will need to set up MongoDB and install additional dependencies.
For these inputs "deathfrom", "eskimobob", "piazza@gma" with pf.censor_whole_words=False, pf.censor_word throws the exception below.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nsmoli/venv/profanity/profanity-filter/profanity_filter/profanity_filter.py", line 246, in censor_word
    word=censored_part)
  File "/Users/nsmoli/venv/profanity/profanity-filter/profanity_filter/profanity_filter.py", line 746, in _censor_word
    censored = self._censor_word_by_part(word=word, profane_word=bad_word)
  File "/Users/nsmoli/venv/profanity/profanity-filter/profanity_filter/profanity_filter.py", line 603, in _censor_word_by_part
    flags=regex.IGNORECASE)
  File "/Users/nsmoli/venv/profanity/profanity-filter/lib/python3.7/site-packages/regex/regex.py", line 275, in sub
    return _compile(pattern, flags, kwargs).sub(repl, string, count, pos,
  File "/Users/nsmoli/venv/profanity/profanity-filter/lib/python3.7/site-packages/regex/regex.py", line 515, in _compile
    caught_exception.pos)
regex._regex_core.error: multiple repeat at position 4 (or 5)
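One plausible cause: "multiple repeat" means a pattern with stacked quantifiers (such as '**') reached regex.sub unescaped. A defensive sketch of the usual fix, with hypothetical values, not the library's own code:
import regex

word, profane_word = 'deathfrom', 'death**'  # hypothetical values at the failure site
# regex.sub(profane_word, ...) would raise "multiple repeat" because of '**'
pattern = regex.escape(profane_word)         # neutralize metacharacters such as '*'
print(regex.sub(pattern, '*' * len(word), word, flags=regex.IGNORECASE))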