
A Python library for detecting and filtering profanity

License: GNU General Public License v3.0


profanity-filter's Introduction

profanity-filter: A Python library for detecting and filtering profanity


Archived

This library is no longer a priority for me. Feel free to fork it.


Overview

profanity-filter is a universal library for detecting and filtering profanity. Support for English and Russian is included.

Features

  1. Censoring of full texts or individual words.
  2. Multilingual support, including profanity filtering in texts written in mixed languages.
  3. Deep analysis. The library detects not only exact profane word matches but also derivative and distorted profane words, using Levenshtein automata, while ignoring dictionary words that merely contain a profane word as a substring.
  4. spaCy component for using the library as part of a pipeline.
  5. Explanation of decisions (attribute original_profane_word).
  6. Partial word censoring.
  7. Extensibility support. New languages can be added by supplying dictionaries.
  8. RESTful web service.

Caveats

  1. Context-free. The library cannot detect profane phrases composed of individually decent words. Conversely, it cannot recognize appropriate usage of a profane word. See the sketch below.
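
A minimal illustration of the first part of this caveat, assuming the default English dictionary:

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.is_clean("2 girls 1 cup")
# True: the phrase is suggestive, but every individual word is decent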

Usage

Here are basic examples of how to use the library. For more examples, please see the tests folder.

Basics

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.censor("That's bullshit!")
# "That's ********!"

pf.censor_word('fuck')
# Word(uncensored='fuck', censored='****', original_profane_word='fuck')

Deep analysis

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.censor("fuckfuck")
# "********"

pf.censor_word('oofuko')
# Word(uncensored='oofuko', censored='******', original_profane_word='fuck')

pf.censor_whole_words = False
pf.censor_word('h0r1h0r1')
# Word(uncensored='h0r1h0r1', censored='***1***1', original_profane_word='h0r')

Multilingual analysis

from profanity_filter import ProfanityFilter

pf = ProfanityFilter(languages=['ru', 'en'])

pf.censor("Да бля, это просто shit какой-то!")
# "Да ***, это просто **** какой-то!"

Using as part of a spaCy pipeline

import spacy
from profanity_filter import ProfanityFilter

nlp = spacy.load('en')
profanity_filter = ProfanityFilter(nlps={'en': nlp})  # reuse spacy Language (optional)
nlp.add_pipe(profanity_filter.spacy_component, last=True)

doc = nlp('This is shiiit!')

doc._.is_profane
# True

doc[:2]._.is_profane
# False

for token in doc:
    print(f'{token}: '
          f'censored={token._.censored}, '
          f'is_profane={token._.is_profane}, '
          f'original_profane_word={token._.original_profane_word}'
    )
# This: censored=This, is_profane=False, original_profane_word=None
# is: censored=is, is_profane=False, original_profane_word=None
# shiiit: censored=******, is_profane=True, original_profane_word=shit
# !: censored=!, is_profane=False, original_profane_word=None

Customizations

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()

pf.censor_char = '@'
pf.censor("That's bullshit!")
# "That's @@@@@@@@!"

pf.censor_char = '*'
pf.custom_profane_word_dictionaries = {'en': {'love', 'dog'}}
pf.censor("I love dogs and penguins!")
# "I **** **** and penguins"

pf.restore_profane_word_dictionaries()
pf.is_clean("That's awesome!")
# True

pf.is_clean("That's bullshit!")
# False

pf.is_profane("That's bullshit!")
# True

pf.extra_profane_word_dictionaries = {'en': {'chocolate', 'orange'}}
pf.censor("Fuck orange chocolates")
# "**** ****** **********"

Console Executable

$ profanity_filter -h
usage: profanity_filter [-h] [-t TEXT | -f PATH] [-l LANGUAGES] [-o OUTPUT_FILE] [--show]

Profanity filter console utility

optional arguments:
  -h, --help            show this help message and exit
  -t TEXT, --text TEXT  Test the given text for profanity
  -f PATH, --file PATH  Test the given file for profanity
  -l LANGUAGES, --languages LANGUAGES
                        Test for profanity using specified languages (comma
                        separated)
  -o OUTPUT_FILE, --output OUTPUT_FILE
                        Write the censored output to a file
  --show                Print the censored text
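
For example, a run based on the help text above might look like this (the exact output formatting is an assumption, shown for illustration):

$ profanity_filter -t "That's bullshit!" --show
That's ********!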

RESTful web service

Run:

$ uvicorn profanity_filter.web:app --reload
INFO: Uvicorn running on http://127.0.0.1:8000
...

Go to the {BASE_URL}/docs for interactive documentation.

Installation

The first two parts of the installation instructions are aimed at users who want to filter English profanity. If you want to filter profanity in another language, you still need to read them.

Basic installation

For a minimal setup you need to install profanity-filter, which is bundled with spaCy, and download a spaCy model for tokenization and lemmatization:

$ pip install profanity-filter
$ # Skip next line if you want to filter profanity in another language
$ python -m spacy download en

For more info about spaCy models, see https://spacy.io/usage/models/.
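
A quick way to verify the basic installation, reusing the Basics example above (spacy.load raises an error if the model was not downloaded):

import spacy
from profanity_filter import ProfanityFilter

nlp = spacy.load('en')  # fails if the spaCy English model is missing
pf = ProfanityFilter(nlps={'en': nlp})  # reusing the loaded model is optional
pf.censor("That's bullshit!")
# "That's ********!"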

Deep analysis

To get deep analysis functionality, install additional libraries and a dictionary for your language.

First, install the hunspell and hunspell-devel packages with your system package manager.

For Amazon Linux AMI run:

$ sudo yum install hunspell

For openSUSE run:

$ sudo zypper install hunspell hunspell-devel

Then run:

$ pip install -U profanity-filter[deep-analysis] git+https://github.com/rominf/hunspell_serializable@49c00fabf94cacf9e6a23a0cd666aac10cb1d491#egg=hunspell_serializable git+https://github.com/rominf/pyffs@6c805fbfd7771727138b169b32484b53c0b0fad1#egg=pyffs
$ # Skip the next lines if you want deep analysis support for another language (covered in the next section)
$ cd profanity_filter/data
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.aff
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/en/en_US.dic
$ mv en_US.aff en.aff
$ mv en_US.dic en.dic
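
Once the dictionaries are in place, you can confirm that deep analysis was picked up by inspecting AVAILABLE_ANALYSES, as described in the Troubleshooting section below:

from profanity_filter import AVAILABLE_ANALYSES

print('deep' in {analysis.value for analysis in AVAILABLE_ANALYSES})
# True, if the deep analysis dependencies and dictionaries are installed correctly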

Other language support

Let's take Russian as an example of how to add support for a new language.

Russian language support

First, we need to provide the file profanity_filter/data/ru_badwords.txt, which contains a newline-separated list of profane words. For Russian it's already present, so we skip this step.

Next, we need to download the appropriate spaCy model. Unfortunately, a spaCy model for Russian is not yet available, so we will use an English model for tokenization. If you have not installed the spaCy model for English yet, now is the right time to do so. As a consequence, even if you want to filter only Russian profanity, you need to specify English in the ProfanityFilter constructor, as shown in the usage examples.

Next, we download dictionaries in Hunspell format for deep analysis from https://cgit.freedesktop.org/libreoffice/dictionaries/plain/:

$ cd profanity_filter/data
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ru_RU/ru_RU.aff
$ wget https://cgit.freedesktop.org/libreoffice/dictionaries/plain/ru_RU/ru_RU.dic
$ mv ru_RU.aff ru.aff
$ mv ru_RU.dic ru.dic

Pymorphy2

For Russian and Ukrainian, we suggest installing pymorphy2 to achieve better results. To install pymorphy2 with the Russian dictionary, run:

$ pip install -U profanity-filter[pymorphy2-ru] git+https://github.com/kmike/pymorphy2@ca1c13f6998ae2d835bdd5033c17197dcba84cf4#egg=pymorphy2

Multilingual support

You need to install the polyglot package and its requirements for language detection. See https://polyglot.readthedocs.io/en/latest/Installation.html for more detailed instructions.

For Amazon Linux AMI run:

$ sudo yum install libicu-devel

For openSUSE run:

$ sudo zypper install libicu-devel

Then run:

$ pip install -U profanity-filter[multilingual]

RESTful web service

Run:

$ pip install -U profanity-filter[web]

Troubleshooting

You can always check whether deep, morphological, and multilingual analyses will work by inspecting the value of the module variable AVAILABLE_ANALYSES. If you've followed all the steps and installed support for all analyses, you will see the following:

from profanity_filter import AVAILABLE_ANALYSES

print(', '.join(sorted(analysis.value for analysis in AVAILABLE_ANALYSES)))
# deep, morphological, multilingual

If something is not right, you can import dependencies yourself to see the import exceptions:

from profanity_filter.analysis.deep import *
from profanity_filter.analysis.morphological import *
from profanity_filter.analysis.multilingual import *

Credits

English profane word dictionary: https://github.com/areebbeigh/profanityfilter/ (author Areeb Beigh).

Russian profane word dictionary: https://github.com/PixxxeL/djantimat (author Ivan Sergeev).

profanity-filter's People

Contributors

rominf, saroad2, weirdname404


profanity-filter's Issues

Refactor tests

Use the spaCy component for most tests, as it offers more information.

Parallelize censoring

I think dask is a good solution because it has a nice API and can be used in a cluster.

The easiest and most effective parallelization is to map words after tokenization.
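
A rough sketch of that idea, assuming dask is installed; this is not part of the library, just an illustration of mapping censor_word over already-tokenized words:

import dask.bag as db
from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
words = "some long text already split into words".split()
censored = (db.from_sequence(words, npartitions=4)
              .map(lambda word: pf.censor_word(word).censored)
              .compute(scheduler='threads'))  # threads avoid pickling the filter
print(' '.join(censored))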

Invalid syntax in profanity_filter.py Class config

Hi @rominf
When I do

from profanity_filter import ProfanityFilter as pf
pf.censor("That's bullshit!")

This error pops up:

  File "/usr/local/lib/python3.5/dist-packages/profanity_filter/profanity_filter.py", line 102
    censor_char: str = '*'
SyntaxError: invalid syntax

Is this a Python 3 issue? Does this only support Python 2?

Get exception on particular input

For these inputs "deathfrom", "eskimobob", "piazza@gma" with pf.censor_whole_words=False, pf.censor_word throws the exception below.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/nsmoli/venv/profanity/profanity-filter/profanity_filter/profanity_filter.py", line 246, in censor_word
    word=censored_part)
  File "/Users/nsmoli/venv/profanity/profanity-filter/profanity_filter/profanity_filter.py", line 746, in _censor_word
    censored = self._censor_word_by_part(word=word, profane_word=bad_word)
  File "/Users/nsmoli/venv/profanity/profanity-filter/profanity_filter/profanity_filter.py", line 603, in _censor_word_by_part
    flags=regex.IGNORECASE)
  File "/Users/nsmoli/venv/profanity/profanity-filter/lib/python3.7/site-packages/regex/regex.py", line 275, in sub
    return _compile(pattern, flags, kwargs).sub(repl, string, count, pos,
  File "/Users/nsmoli/venv/profanity/profanity-filter/lib/python3.7/site-packages/regex/regex.py", line 515, in _compile
    caught_exception.pos)
regex._regex_core.error: multiple repeat at position 4 (or 5)
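
A minimal reproduction of the report (deep analysis installed):

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
pf.censor_whole_words = False
pf.censor_word('deathfrom')
# reportedly raises: regex._regex_core.error: multiple repeat at position 4 (or 5)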

Improve README.md

Things to do:

  1. Add features and limitations.
  2. Improve examples.
  3. Move all usage examples into one place with links to installation.

Unable to mark words as not profane (Customization / English)

Hey,
I have been using the library to classify English texts.
The one problem I have been facing is that the tool is wrongly classifying words that have devil, hell, or allah in them. I was wondering if I can remove those from the library's dictionary.
Thanks,
Vyom
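
One possible workaround, based on the custom_profane_word_dictionaries example in the Customizations section above: rebuild the English dictionary without the words you consider acceptable. The word-list path below is an assumption and may differ between versions.

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
acceptable = {'devil', 'hell', 'allah'}
# hypothetical path to the bundled English word list; check profanity_filter/data in your installation
with open('profanity_filter/data/en_profane_words.txt') as f:
    profane_words = {line.strip() for line in f if line.strip()}
pf.custom_profane_word_dictionaries = {'en': profane_words - acceptable}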

TypeError when calling extra_profane_word_dictionaries

When supplying a dict ({lang: {set}}) to extra_profane_word_dictionaries, it raises a TypeError after trying to divide a string by a string:

Traceback (most recent call last):
  File "<console>", line 4, in <module>
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 265, in custom_profane_word_dictionaries
    self.clear_cache()
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 384, in clear_cache
    self._update_profane_word_dictionary_files()
  File "/home/hwhite/frac37/lib/python3.7/site-packages/profanity_filter/profanity_filter.py", line 429, in _update_profane_word_dictionary_files
    profane_word_file = self._DATA_DIR / f'{language}_profane_words.txt'
TypeError: unsupported operand type(s) for /: 'str' and 'str'

where to cd?

The Deep analysis section contains commands to cd into profanity_filter/data; where is this directory located?

Optionally store cache in MongoDB

This will make parallelized censoring faster. It should be optional because the user will need to set up MongoDB and install additional dependencies.

Some plurals not considered profane

The profane words dictionary contains both wank and fuck, and also fucker and wanker; however, pf.censor_word() correctly censors fucks and wanks but not fuckers or wankers. This seems counterintuitive, since the extra s on the shorter words is a bigger percentage difference between the words.
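
A reproduction sketch of the reported behaviour (deep analysis installed):

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
pf.censor_word('fucks')    # censored, as expected
pf.censor_word('fuckers')  # reportedly returned uncensored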

What am I doing wrong?

I created a simple service using instructions from your readme but nothing works.

My service:

import spacy

from django.conf import settings
from functools import cached_property

from profanity_filter import ProfanityFilter


class ProfanityService:

    def __init__(self):
        en_nlp = spacy.load("en_core_web_sm")
        pl_nlp = spacy.load("pl_core_news_sm")
        self.filter = ProfanityFilter(nlps={"en": en_nlp, "pl": pl_nlp})
        self.filter.custom_profane_word_dictionaries = self.dictionaries
        self.filter.censor_char = "*"

    @cached_property
    def dictionaries(self):
        dicts = {}
        with open(settings.EN_PROFANITY_DICT, "r") as f:
            dicts["en"] = f.read().splitlines()
        with open(settings.PL_PROFANITY_DICT, "r") as f:
            dicts["pl"] = f.read().splitlines()
        return dicts

    def censor(self, text):
        return self.filter.censor(text)

And I got errors when I call censor method.

common/services/profanity.py:28: in censor
    return self.filter.censor(text)
../venv/lib/python3.8/site-packages/profanity_filter/profanity_filter.py:201: in censor
    return self._censor(text=text, return_bool=False)
../venv/lib/python3.8/site-packages/profanity_filter/profanity_filter.py:798: in _censor
    if token._.is_profane:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <spacy.tokens.underscore.Underscore object at 0x7f85da213ee0>, name = 'is_profane'

    def __getattr__(self, name):
        if name not in self._extensions:
>           raise AttributeError(Errors.E046.format(name=name))
E           AttributeError: [E046] Can't retrieve unregistered extension attribute 'is_profane'. Did you forget to call the `set_extension` method?

../venv/lib/python3.8/site-packages/spacy/tokens/underscore.py:35: AttributeError

P.S. I tried different ways but had no luck.

Only first language in a list of languages is working

>>> pf = ProfanityFilter(languages=['en', 'ru'])
>>> pf.censor_whole_words=False
>>> pf.censor("goodshiit")
'good*****'
>>> pf.censor("улицабля")
'улицабля'

>>> pf = ProfanityFilter(languages=['ru', 'en'])
>>> pf.censor_whole_words=False
>>> pf.censor("улицабля")
'улица***'
>>> pf.censor("goodshiit")
'goodshiit'

TypeError: __init__() got an unexpected keyword argument 'lang'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 154, in __init__
    spells=spells,
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 180, in config
    self._set_languages(languages, load_morphs=morphs is None, load_nlps=nlps is None, load_spells=spells is None)
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 418, in _set_languages
    self.morphs = None
  File "/usr/local/lib/python3.6/dist-packages/profanity_filter/profanity_filter.py", line 319, in morphs
    self._morphs[language] = MorphAnalyzer(lang=language)
TypeError: __init__() got an unexpected keyword argument 'lang'

Windows 10, Python 3.8 can't run console command

After installing, I'm not getting the console command to work.

profanity_filter -h
command not found

It's not in my C:\Python38\Scripts nor my C:\Users\abc\AppData\Roaming\Python\Python38\Scripts

Fails to detect phrases

Hi,
Am I correct in assuming that this library cannot censor phrases like "2 girls 1 cup", where the individual words are harmless but the sentence is suggestive, even if I add them to custom_profane_word_dictionaries?
Thanks

Not working with auto-py-to-exe

So I am trying to make a bot that uses this module and it can't seem to work after being turned into a .exe. It compiles fine but if I run the .exe from the command line, this is the output:
C:\Users\ReCor\Documents\Bot>bot.exe
Traceback (most recent call last):
  File "bot.py", line 10, in <module>
    import better_profanity
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "better_profanity\__init__.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "better_profanity\better_profanity.py", line 5, in <module>
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "PyInstaller\loader\pyimod03_importers.py", line 531, in exec_module
  File "better_profanity\constants.py", line 14, in <module>
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\ReCor\\AppData\\Local\\Temp\\_MEI88242\\better_profanity\\alphabetic_unicode.json'
[22044] Failed to execute script bot

Bug in saving profane word in redis

Expected behavior
Profane word is saved in redis.

Real behavior
Exception is thrown.

How to reproduce

  1. Connect to redis
  2. censor profane word
pf = ProfanityFilter(cache_redis_connection_url='redis://redis:6379/1')
pf.censor('fuck')

The _save_censored_word method will throw an exception:

Traceback (most recent call last):                                                                                                                                                                                 
  File "<console>", line 1, in <module>                                                                                                                                                                            
  File "/code/ratings_parser/utils/censorship/profanity_filter.py", line 15, in censor                                                                                                                             
    return pf.censor(text)                                                                                                                                                                                         
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 201, in censor                                                                                                          
    return self._censor(text=text, return_bool=False)                                                                                                                                                              
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 796, in _censor                                                                                                         
    doc = self._parse(language=language, text=text_part)                                                                                                                                                           
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 523, in _parse                                                                                                          
    return spacy_utlis.parse(nlp=nlp, text=text, language=language, use_profanity_filter=use_profanity_filter)                                                                                                     
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/spacy_utlis.py", line 19, in parse                                                                                                                 
    return nlp(text, disable=disable, component_cfg=component_cfg)                                                                                                                                                 
  File "/usr/local/lib/python3.8/site-packages/spacy/language.py", line 439, in __call__                                                                                                                           
    doc = proc(doc, **component_cfg.get(name, {}))                                                                                                                                                                 
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/spacy_component.py", line 37, in __call__                                                                                                          
    span = self._censor_spaceless_span(doc[i:j], language=language)                                                                                                                                                
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/spacy_component.py", line 78, in _censor_spaceless_span                                                                                            
    censored_word = self._profanity_filter.censor_word(word=token, language=language)                                                                                                                              
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 206, in censor_word                                                                                                     
    return self._censor_word(language=language, word=word)                                                                                                                                                         
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 710, in _censor_word                                                                                                    
    censored_censored_part, no_profanity_inside = self._censor_word_part(language=language,                                                                                                                        
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 659, in _censor_word_part                                                                                               
    self._save_censored_word(censored_word)                                                                                                                                                                        
  File "/usr/local/lib/python3.8/site-packages/profanity_filter/profanity_filter.py", line 630, in _save_censored_word                                                                                             
    d = asdict(word)                                                                                                                                                                                               
  File "/usr/local/lib/python3.8/dataclasses.py", line 1072, in asdict                                                                                                                                             
    raise TypeError("asdict() should be called on dataclass instances")                                                                                                                                            
TypeError: asdict() should be called on dataclass instances

Show real profane word to a user

Hi Roman,
Thank you for sharing the code for your product. I learned a lot from it and
find it very powerful and reliable for the number of features it provides. I have not tried all of them yet, though. :)

I have a suggestion.
Can we surface the bad_word that the user mutated in the result?
For example, if I have "shiiiit" as input, I would want to know the real bad_word that Levenshtein "had in mind" ("shit"). This example is easy, but sometimes you cannot even guess why a word was censored.
Do you see value in this? Do you think it makes sense to add it? Maybe via an extra parameter, if not always?

Thank you very much for being so responsive and providing excellent support for your great product!
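
For reference, censor_word already exposes the matched dictionary entry via original_profane_word when deep analysis is installed; the expected output below is inferred from the Deep analysis example above:

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
pf.censor_word('shiiiit')
# Word(uncensored='shiiiit', censored='*******', original_profane_word='shit')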

Failed to detect number substitutions

When trying to identify profane words, sh1t is not getting identified as profane.
The Levenshtein approach should have identified this variation of the original profane word.
Also, I see that sh1t is listed in the profane word dictionary. Could you please see where the problem is?
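
A minimal reproduction of the report:

from profanity_filter import ProfanityFilter

pf = ProfanityFilter()
pf.censor_word('sh1t')
# reportedly returned uncensored, even though 'sh1t' is in the profane word dictionary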

Make tests faster

For every test, a new instance of the profanity filter is created. I think it should be possible to cache fixtures.
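
A rough sketch of that idea using a session-scoped pytest fixture (illustrative only, not the repo's actual conftest):

import pytest
from profanity_filter import ProfanityFilter

@pytest.fixture(scope='session')
def pf():
    # build the filter once per test session instead of once per test
    return ProfanityFilter()

Tests that mutate the filter (for example, setting censor_char) would then need to reset it to avoid leaking state between tests.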
