
unitaryai / detoxify

Stars: 841 · Watchers: 15 · Forks: 109 · Size: 52.01 MB

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ PyTorch Lightning and 🤗 Transformers. For access to our API, please email us at [email protected].

Home Page: https://www.unitary.ai/

License: Apache License 2.0

Python 100.00%
bert bert-model huggingface-transformers huggingface nlp toxic-comment-classification toxicity toxic-comments sentence-classification kaggle-competition

detoxify's People

Contributors

anitavero · borda · dcferreira · dependabot[bot] · gregpriday · jamt9000 · laurahanu · omidforoqi · pre-commit-ci[bot] · s2t2 · vela-zz


detoxify's Issues

Toxicity scores, same as Perspective API?

Great repo!

I have a question, I hope someone can help?

Are the toxicity scores returned by the Unitary models probability scores, in the same way that the Perspective API returns these values?

"The only score type currently offered is a probability score. It indicates how likely it is that a reader would perceive the comment provided in the request as containing the given attribute. For each attribute, the scores provided represent a probability, with a value between 0 and 1. A higher score indicates a greater likelihood that a reader would perceive the comment as containing the given attribute. For example, a comment like β€œYou are an idiot” may receive a probability score of 0.8 for attribute TOXICITY, indicating that 8 out of 10 people would perceive that comment as toxic. "

Or do they represent the extent of the toxicity?

Thanks so much!

Add Dutch language

Hi!
This is awesome!
Can you maybe add the Dutch language?
Thanks,
Joachim.

The multilingual CSVs are missing from Kaggle

The various CSVs from Jigsaw Multilingual Toxic Comment Classification appear to no longer be available. These are:

jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-es-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-fr-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-it-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-pt-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-ru-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-tr-cleaned.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_es_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_fr_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_it_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_pt_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_ru_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_tr_clean.csv

I've been able to find some of these at https://www.kaggle.com/miklgr500/jigsaw-train-multilingual-coments-google-api but these do not include the bias CSVs.

Do you happen to know where these are located?

Thank you

Checkpoints missing optimizer_states

Thank you for your work on this very useful library!

I have had success training Albert Unbiased from scratch. I'm curious how model performance would compare if training continued from one of your checkpoints (unbiased-albert-c8519128.ckpt in this case). However, if I attempt to resume train.py from this file, I get an error like:

KeyError: 'Trying to restore training state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'

FYI I am using the following command:

python train.py --config configs/Unintended_bias_toxic_comment_classification_Albert_revised_training.json -d 1 --num_workers 0 -e 101 -r model_ckpts/unbiased-albert-c8519128_modified_state_dict.ckpt

Inspecting the checkpoint file, I indeed observe that it is missing some components, the most critical of which (I think) is optimizer_states. Compared to one of my own checkpoints, what is absent includes: ['pytorch-lightning_version', 'callbacks', 'optimizer_states', 'lr_schedulers', 'hparams_name', 'hyper_parameters'].

I'm wondering if I am doing something wrong? Or else, would it be possible for you to share new versions of your checkpoints that include these missing components?

GPU not used during Training

Hi there,
Thank you for the useful repository.

I am trying to use the script for model training. Leaving the "--device" parameter at its default value (default: all) should use the GPU if available, right?

However, even when a GPU is available, the script does not seem to use it. It prints GPU available: True, used: False as output, and training takes 48 hours.

Could you help me make use of the GPU?
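
For what it's worth, the revised-training command quoted in another issue in this tracker passes a -d flag (python train.py ... -d 1). If that flag maps to the number of GPUs handed to the PyTorch Lightning Trainer (an assumption; check python train.py --help), explicitly requesting one GPU might look like:

python train.py --config configs/Toxic_comment_classification_BERT.json -d 1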

Thanks in advance

Memory leak running on lightweight model

First of all, this repository has been a great help for my research effort, and I really appreciate you sharing this to the public.

My issue is that the program does not free its memory after finishing the calculation. Here is code to replicate the issue:

from detoxify import Detoxify
import torch
Detoxify('original-small').predict("Beep, beep, I'm a sheep.")
torch.cuda.empty_cache()

When monitoring GPU memory use, we find that the memory is not freed (about 600 MB more than before). This can be a problem when running the model on large amounts of data or with multithreading. I have been trying to multithread the model reasonably, only to find it always runs out of memory on larger datasets (each thread runs the model once and then joins; 5 threads are used, but after about 500 runs it exceeds 8 GB of GPU memory). Only after the kernel is completely destroyed is the memory freed. I'm using Windows 10, Anaconda 1.10, Python 3.8.3, and PyTorch 1.7 with CUDA 11.0.
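
A minimal sketch of a possible workaround, assuming the model object itself is what needs releasing: run inference under torch.no_grad() so no autograd buffers are retained, then drop every reference to the model before emptying the cache. Note that PyTorch's caching allocator keeps some memory reserved per process, so nvidia-smi may never drop fully back to the baseline.

import gc

import torch
from detoxify import Detoxify

model = Detoxify('original-small', device='cuda')

# no_grad avoids keeping activation buffers alive after the call
with torch.no_grad():
    results = model.predict("Beep, beep, I'm a sheep.")

# free the weights: drop all references, collect, then release cached blocks
del model
gc.collect()
torch.cuda.empty_cache()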

I'm currently looking into ways to solve this, would you be able to help?

PS: The same issue could be present in the original model; maybe GPU and CPU computing use different garbage-collection methods.

Weird behavior of Smaller and Larger Models for same Text

Hey! Thanks for this easy-to-get-started package. I was testing both the original and unbiased models on the following sentences:

doc_1 = "I don't know why people don't support Muslims and call them terrorists often. They are not."
doc_2 = "There is nothing wrong being in a lesbian. Everyone has feelings."

Following are the toxicity scores by them:

[screenshot: toxicity scores for doc_1]

The original model, which is supposed to be biased, predicts doc_1 to be non-toxic, as it should, while the unbiased-smaller model predicts it to be toxic.

Likewise, for doc_2, the prediction should ideally be non-toxic, and the original model (both smaller and larger), being biased, should predict it toxic. This is what it does:

[screenshot: toxicity scores for doc_2]

The original smaller model predicts toxic while the larger one does not. Can you explain what might cause different behavior on the same text between the smaller and larger variants of both the original and unbiased models?

Unbiased model not returning identity labels

Thanks for the great repo!

I'm running the 'Quick prediction' code using the unbiased model, but no identity labels are being returned, even with severe toxicity. I only get the toxicity labels.

Am I missing something?

Thanks again!

RuntimeError

RuntimeError: /Users/qab/.cache/torch/checkpoints/toxic_original-c1212f89.ckpt is a zip archive (did you mean to use torch.jit.load()?)
I get this error trying to run this for the first time. Any help?

How to overcome memory issues when predicting large batches of data?

Hello team,

I have a dataset of about 8000 comments; each comment is around 6 to 8 words (some are shorter, with only 2 words).

The problem is that I am unable to get predictions because I run out of GPU memory during the process. To work around this I am using a custom loop that iterates over the comments in batches and appends the results to a DataFrame.

import pandas as pd
from detoxify import Detoxify

comments_list = comments["text"].to_list()
df = pd.DataFrame()

for i in range(0, len(comments_list), 32):
    comms = comments_list[i : i + 32]
    results = Detoxify("original", device=device).predict(comms)
    results = pd.DataFrame(results)
    df = df.append(results, ignore_index=True)

Is there a more efficient way of doing this than writing a for loop?

Currently I have a 16GB Tesla T4 GPU.
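
One likely inefficiency in the snippet above is that Detoxify("original", device=device) reloads the model weights on every iteration. A sketch of the same loop (reusing the comments_list and device names from above) with the model loaded once, and the per-batch frames concatenated at the end, since df.append is deprecated in recent pandas:

import pandas as pd
from detoxify import Detoxify

model = Detoxify("original", device=device)  # load the weights once, not per batch

frames = []
for i in range(0, len(comments_list), 32):
    batch = comments_list[i : i + 32]
    frames.append(pd.DataFrame(model.predict(batch)))

df = pd.concat(frames, ignore_index=True)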

Thanks!

How do you load a custom checkpoint?

Hello, I want to train the network on my own samples, but I'm finding it quite difficult.

Right now I edited Toxic_comment_classification_BERT.json to point to my own training and test CSVs. Then I have to edit train.py to manually save the model object inside ToxicClassifier at the end of training.

torch.save(model.model, 'custom.pt')

Then I have to load the file manually, instantiate the normal Detoxify instance, and replace its internal model object with the saved version to get it to work.

saved = torch.load('custom.pt')
d = detoxify.Detoxify('original')
d.model = saved

If I try to load a checkpoint generated at "saved\Jigsaw_BERT\lightning_logs\version_x\checkpoints\epoch=3-step=76.ckpt" with detoxify, or try to instantiate Detoxify with the checkpoint parameter or with a file generated by torch.save(model), it always says

Checkpoint needs to contain the config it was trained with as well as the state dict

What's the proper way of saving the checkpoint so it has the config and state dict with it? Or is my workaround the best way to use custom training data?
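
Judging from the tracebacks in other issues in this tracker, load_checkpoint reads loaded["config"]["arch"]["args"] and loaded["state_dict"] from the file, so a hedged sketch of a checkpoint Detoxify might accept bundles both keys. Here config is assumed to be the parsed training config dict and model the ToxicClassifier from train.py; the exact state-dict key layout may still differ:

import torch

import detoxify

# hypothetical: bundle the training config with the weights, matching the
# keys that detoxify's load_checkpoint appears to expect
torch.save(
    {"config": config, "state_dict": model.state_dict()},
    "custom.ckpt",
)

d = detoxify.Detoxify(checkpoint="custom.ckpt")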

false positive

For some reason this message is flagged as toxic:
"who selling lup pots"
Can you fix this? I'm using the original model.

Detoxify Roadmap

Issue to keep track of improvements we'd like to make (in no particular order). Feedback and suggestions welcome!

  • better way to handle emojis (#27)
  • train the unbiased model on the Wikipedia dataset from the first challenge as well
  • add a multilingual light model (#17)
  • train the multilingual model with more languages
  • add more datasets to training
  • add new categories like personal attack (potentially using https://github.com/ewulczyn/wiki-detox)
  • improve bias metrics & test on different benchmarks like HateCheck

How to get the model?

How do you get the model and use it in JavaScript, like the TensorFlow toxicity model?

Multilingual light model

Hi guys! Nice repo!

I'm deploying an app to detect hate tweets on Twitter as part of my data science master's, and it works perfectly locally.

As I live in Spain, the app's main targets are Spanish accounts, so I am building the app on the multilingual model. The problem I am facing now is deployment on a server like Streamlit Sharing or Heroku: I can't finish the deployment due to host size limits.

I've seen that you have developed light models for original and unbiased, but not for multilingual. Do you expect to release a light multilingual model soon? If not, have you come across any workaround for the Streamlit Sharing (800 MB) or Heroku (500 MB) size limits?

Thank you so much!

The prediction is too slow (about 3s/text)

Hi,
Firstly, thank you guys for this repo; it's so helpful for us.
I just used it to predict texts in my dataset, but the speed is too slow, about 3 seconds per text. I wonder if it could be faster?
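
One likely speedup, sketched here on the assumption that texts are currently being scored one call at a time: predict accepts a list, so batching amortizes the per-call overhead, and passing device="cuda" helps if a GPU is available.

from detoxify import Detoxify

model = Detoxify("original", device="cuda")  # or "cpu"
texts = ["first comment", "second comment", "third comment"]
scores = model.predict(texts)  # one batched forward pass instead of three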

Looking forward to hearing from you soon.
Regards,
Luan

Error during training

I tried to start the training for Toxic Comment Classification Challenge with the code provided in the documentation:

# combine test.csv and test_labels.csv
python preprocessing_utils.py --test_csv jigsaw_data/jigsaw-toxic-comment-classification-challenge/test.csv --update_test

python train.py --config configs/Toxic_comment_classification_BERT.json

However, it returns the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'jigsaw_data/jigsaw-toxic-comment-classification-challenge/val.csv'

I saw that only the training and test datasets are present among the data. Should I use the test set by changing the configuration file?
(I have downloaded the datasets from the following link: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip)

Thanks in advance

Question regards training with other models

Hello, I am a relatively new user of NLP, and I am currently working on a project where I need to use the output of my model as input to your model, with the two applied sequentially.

During training, I need to pass the loss through your model without updating any of its weights, updating only the weights of my own model. I have a question regarding the training process: should I follow the steps outlined in the "Training" section of your documentation, or can I use the code provided in the "Prediction" section directly?

I would greatly appreciate your help and guidance on this matter. Thank you in advance.
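
If the goal is only to backpropagate through Detoxify without updating it, one hedged sketch (not an officially documented workflow) is to load it as in the Prediction section, switch the inner transformer to eval mode, and freeze its parameters; you would then call tox.model on tokenized inputs inside your own training loop, since predict may run without gradient tracking.

from detoxify import Detoxify

tox = Detoxify("original")
tox.model.eval()  # fix dropout behaviour for deterministic scoring
for p in tox.model.parameters():
    p.requires_grad_(False)  # gradients flow through, these weights never update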

Detoxify doesn't work well on Emojis

Currently all Detoxify models seem not to recognize emojis that are meant to be toxic/hateful, whether in context or on their own (#26). While the BERT tokenizer returns the same output for different emojis, RoBERTa-based tokenizers do seem to differentiate between emoji inputs.

Some potential solutions:

  • replacement method (fast): use an emoji library (e.g. demoji) and replace current emojis with their text description (i.e. 🖕 -> 'middle finger'); see the sketch after this list. While this would work in some cases (when emojis are used with their literal meaning), there will be some cases where the description wouldn't make the intended meaning clearer, e.g. drugs or sexually-related emojis. We would also need to be careful with how/when we're using emojis as keywords (could check for key emojis first and then replace).
  • training method (slow): train models to recognise various emojis in different contexts; this might also emerge naturally from training on lots of data containing emojis. It might work for common use cases, but less well for rarely used emojis. Would not work with the BERT tokenizer.
  • hybrid method where we train with emoji descriptions directly and replace them at inference time
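
A minimal sketch of the replacement method, using the third-party emoji package (one option among several; demoji would work similarly). The underscore stripping is a rough heuristic and would also touch underscores in the original text:

import emoji
from detoxify import Detoxify

def predict_demojized(model, text):
    # e.g. a middle-finger emoji -> " middle_finger " -> " middle finger "
    described = emoji.demojize(text, delimiters=(" ", " ")).replace("_", " ")
    return model.predict(described)

scores = predict_demojized(Detoxify("original"), "🖕")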

To dos:

  • investigate how well the replacement method works on a dataset like Hatemoji
  • finetune Detoxify with Hatemoji train set and compare

Feature: Add lightweight models

Motivation

Currently this library only uses transformer models >= 418 MB in size. It would be helpful to add support for lighter language models, such as ALBERT or a small RoBERTa, which would be more efficient for practical applications.

Implementation

Add a lightweight version of each toxic model, e.g. original-small, unbiased-small.
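
For reference, the small variants referred to elsewhere in this tracker load through the same interface as the full models:

from detoxify import Detoxify

small = Detoxify('original-small')  # lighter checkpoint, same predict API
print(small.predict('example text'))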

Other labels availability on multilingual models

Hi unitary!

Thank you for this fantastic project.

I was wondering how we could easily get access to the other labels in the datasets, like severe_toxicity or identity_attack, with the multilingual model. Would you have any recommendations on how to achieve that?

Best,
marsouin

Installing on Heroku

Hey, thanks for this great package.

I have detoxify in my "requirements.txt" file, and it works great locally.

But when I push to a Heroku server, it raises the error "Compiled slug size: 1.1G is too large (max is 500M)" when trying to install and compress the packages.

Some research indicates this error is sometimes caused by the large tensorflow dependency, and it seems we can get past it by installing tensorflow-cpu instead of tensorflow. But I'm not sure whether tensorflow is even a dependency.

I was wondering if you have any ideas or suggestions as to how I could get your package to work on a Heroku server.

Thanks.

Detoxify pip .whl installs files in other locations

pip install detoxify should only install detoxify into the Python site-packages/ directory.

Instead, it also creates the folders src and tests, which could break unrelated packages.

Uninstalling detoxify-0.2.0:
  Would remove:
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/detoxify-0.2.0.dist-info/*
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/detoxify/*
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/src/*
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/tests/*
unzip -l detoxify-0.2.0-py3-none-any.whl
Archive:  detoxify-0.2.0-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      225  11-09-2020 11:07   detoxify/__init__.py
     4184  12-15-2020 21:19   detoxify/detoxify.py
        0  11-09-2020 11:07   src/__init__.py
     8041  11-09-2020 11:07   src/data_loaders.py
      960  11-09-2020 11:07   src/utils.py
        0  11-09-2020 11:07   tests/__init__.py
     2031  11-09-2020 11:07   tests/test_trainer.py
    11357  12-16-2020 09:51   detoxify-0.2.0.dist-info/LICENSE
    11824  12-16-2020 09:51   detoxify-0.2.0.dist-info/METADATA
       92  12-16-2020 09:51   detoxify-0.2.0.dist-info/WHEEL
        9  12-16-2020 09:51   detoxify-0.2.0.dist-info/top_level.txt
      907  12-16-2020 09:51   detoxify-0.2.0.dist-info/RECORD
---------                     -------
    39630                     12 files

Question regards use case

Hi,

I am new to developing NLP models and I have in mind using your model as an assistant to fine-tune a chit-chat bot. I have seen in several issues that you don't recommend this use. Is that true? Also, what kind of model would you, as experts, suggest using, including among your current models?

Thank you in advance,

Detoxify on AWS Lambda

Hi Team,
I have been trying to use the Detoxify library in an AWS Lambda function.
For this, I download the library's .whl file and zip it into a Lambda layer to use with the Lambda function, also ensuring that detoxify is installed on my local system.
This process has worked for using other Python libraries with Lambda functions, but it's not working with the Detoxify library.
Kindly let me know the reasons, or any suggestions to get it working.

Regards,
Parth Sharma

index

What is input_text here:
"print(pd.DataFrame(results, index=input_text).round(5))"

Getting got_ver is None error when importing

I re-installed detoxify to the latest version, and now I'm getting the following error when I try to import detoxify. This has something to do with the transformers and torch dependencies; it looks like the latest versions are incompatible.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\detoxify\__init__.py", line 1, in <module>
    from .detoxify import (
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\detoxify\detoxify.py", line 2, in <module>
    import transformers
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\utils\versions.py", line 120, in require_version_core
    return require_version(requirement, hint)
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\utils\versions.py", line 114, in require_version
    _compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\utils\versions.py", line 45, in _compare_versions
    raise ValueError("got_ver is None")
ValueError: got_ver is None

Problem launching the model

Hello,
first of all, thanks for your work. This model would be amazing for helping my company with comment moderation across social networks.
I followed your guide, but when I launch this to test the model:
python run_prediction.py --input 'example' --model_name original
I get this error:
RuntimeError: Only one file(not dir) is allowed in the zipfile

allow full offline execution

There are some applications, such as Kaggle, that require running without an internet connection.
At the moment the package can be downloaded along with the checkpoints, but creating the model still requires pulling details from the HF Hub, so it would be great if we could download those details ahead of time and use them offline instead of the online source...

I guess the solution would be exposing this argument in the module init:

pretrained_model_name_or_path=None,
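
A sketch of what full offline use could look like, assuming the checkpoint and the Hugging Face config/tokenizer files were downloaded ahead of time; huggingface_config_path already appears as an __init__ argument in tracebacks elsewhere in this tracker, and both paths below are hypothetical local directories:

from detoxify import Detoxify

model = Detoxify(
    model_type="original",
    checkpoint="ckpts/toxic_original-c1212f89.ckpt",         # pre-downloaded weights
    huggingface_config_path="hf_configs/bert-base-uncased",  # local config + tokenizer files
)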

- classifier.out_proj.weight: found shape torch.Size([16, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated

I am using your model to fine-tune on a binary classification task (number of classes = 2 instead of 16).

My class labels are just 0 and 1

https://huggingface.co/unitary/unbiased-toxic-roberta/tree/main

I am writing the below code:

Metrics to calculate accuracy on binary labels:

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = np.sum(predictions == labels) / predictions.shape[0]
    return {"accuracy": acc}

model = tr.RobertaForSequenceClassification.from_pretrained("/home/pc/unbiased_toxic_roberta", num_labels=2)
model.to(device)



training_args = tr.TrainingArguments(
    # report_to='wandb',
    output_dir='/home/pc/1_Proj_hate_speech/results_roberta',  # output directory
    overwrite_output_dir=True,
    num_train_epochs=20,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    learning_rate=2e-5,
    warmup_steps=1000,               # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs3',           # directory for storing logs
    logging_steps=1000,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)


trainer = tr.Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_data,         # training dataset
    eval_dataset=val_data,             # evaluation dataset
    compute_metrics=compute_metrics
)

Error:

- classifier.out_proj.weight: found shape torch.Size([16, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([16]) in the checkpoint and torch.Size([2]) in the model instantiated

How can I solve this?
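
One fix that I believe works on recent transformers versions is ignore_mismatched_sizes=True, which skips loading the 16-way classification head from the checkpoint and freshly initializes a 2-way head instead:

import transformers as tr

model = tr.RobertaForSequenceClassification.from_pretrained(
    "/home/pc/unbiased_toxic_roberta",
    num_labels=2,
    ignore_mismatched_sizes=True,  # drop the 16-label head, re-init for 2 labels
)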

RuntimeError: Only one file(not dir) is allowed in the zipfile

This error happens when I try executing the code below. I'm using Anaconda (run as admin) and Python 3.6+

from detoxify import Detoxify

results = Detoxify('original').predict('example text')

RuntimeError: Only one file(not dir) is allowed in the zipfile

Dependency error in CI testing

CI testing fails with this error:
[screenshot: CI error output]

TODO: Check if the --use-feature=2020-resolver flag is still needed or 2020-resolver needs to be replaced with one of fast-deps, truststore, no-binary-enable-wheel-cache

Don't automatically use GPU

At the moment it will automatically use the GPU if CUDA is available, with no way to select CPU mode or another device.

self.device = "cuda" if torch.cuda.is_available() else "cpu"

This can be unexpected and can cause whatever else the user is running on GPU 0 to run out of memory.

Suggested fix: add a device argument to the Detoxify __init__ that accepts any torch device specifier and defaults to cpu.
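
A sketch of the suggested signature (tracebacks in later issues in this tracker suggest something close to this eventually shipped, with device="cpu" as the default); load_checkpoint stands in for the library's internal loader:

import torch

class Detoxify:
    def __init__(self, model_type="original", device="cpu"):
        # accept any torch device specifier and default to CPU so that
        # nothing lands on GPU 0 unless the user asks for it
        self.device = torch.device(device)
        self.model, self.tokenizer, self.class_names = load_checkpoint(model_type)
        self.model.to(self.device)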

Mismatched results between your lib and Hugging Face

Hi team,

First of all, thank you very much for the library. But I need clarification: why are your results different from Hugging Face's results for the same input? Can you please help me with this?
Thanks

Unable to properly load state_dict

Code is very basic: just two lines on Google Colab after pip install detoxify.

It states that state_dict is NoneType and thus the model cannot be loaded. Unsure how to fix.

[screenshot: error traceback]

TypeError: 'NoneType' object is not subscriptable

I am having this error while trying to load the model.

from detoxify import Detoxify

model = Detoxify('original', device="cuda")


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 3
      1 from detoxify import Detoxify
----> 3 results = Detoxify('original').predict('some text')

File ~/.conda/envs/py/lib/python3.9/site-packages/detoxify/detoxify.py:103, in Detoxify.__init__(self, model_type, checkpoint, device, huggingface_config_path)
    101 def __init__(self, model_type="original", checkpoint=PRETRAINED_MODEL, device="cpu", huggingface_config_path=None):
    102     super().__init__()
--> 103     self.model, self.tokenizer, self.class_names = load_checkpoint(
    104         model_type=model_type,
    105         checkpoint=checkpoint,
    106         device=device,
    107         huggingface_config_path=huggingface_config_path,
    108     )
    109     self.device = device
    110     self.model.to(self.device)

File ~/.conda/envs/py/lib/python3.9/site-packages/detoxify/detoxify.py:56, in load_checkpoint(model_type, checkpoint, device, huggingface_config_path)
     50 change_names = {
     51     "toxic": "toxicity",
     52     "identity_hate": "identity_attack",
     53     "severe_toxic": "severe_toxicity",
     54 }
     55 class_names = [change_names.get(cl, cl) for cl in class_names]
---> 56 model, tokenizer = get_model_and_tokenizer(
     57     **loaded["config"]["arch"]["args"],
     58     state_dict=loaded["state_dict"],
     59     huggingface_config_path=huggingface_config_path,
     60 )
     62 return model, tokenizer, class_names

File ~/.conda/envs/py/lib/python3.9/site-packages/detoxify/detoxify.py:20, in get_model_and_tokenizer(model_type, model_name, tokenizer_name, num_classes, state_dict, huggingface_config_path)
     16 def get_model_and_tokenizer(
     17     model_type, model_name, tokenizer_name, num_classes, state_dict, huggingface_config_path=None
     18 ):
     19     model_class = getattr(transformers, model_name)
---> 20     model = model_class.from_pretrained(
     21         pretrained_model_name_or_path=None,
     22         config=huggingface_config_path or model_type,
     23         num_labels=num_classes,
     24         state_dict=state_dict,
     25         local_files_only=huggingface_config_path is not None,
     26     )
     27     tokenizer = getattr(transformers, tokenizer_name).from_pretrained(
     28         huggingface_config_path or model_type,
     29         local_files_only=huggingface_config_path is not None,
     30         # TODO: may be needed to let it work with Kaggle competition
     31         # model_max_length=512,
     32     )
     34     return model, tokenizer

File ~/.conda/envs/py/lib/python3.9/site-packages/transformers/modeling_utils.py:2379, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2369     if dtype_orig is not None:
   2370         torch.set_default_dtype(dtype_orig)
   2372     (
   2373         model,
   2374         missing_keys,
   2375         unexpected_keys,
   2376         mismatched_keys,
   2377         offload_index,
   2378         error_msgs,
-> 2379     ) = cls._load_pretrained_model(
   2380         model,
   2381         state_dict,
   2382         loaded_state_dict_keys,  # XXX: rename?
   2383         resolved_archive_file,
   2384         pretrained_model_name_or_path,
   2385         ignore_mismatched_sizes=ignore_mismatched_sizes,
   2386         sharded_metadata=sharded_metadata,
   2387         _fast_init=_fast_init,
   2388         low_cpu_mem_usage=low_cpu_mem_usage,
   2389         device_map=device_map,
   2390         offload_folder=offload_folder,
   2391         offload_state_dict=offload_state_dict,
   2392         dtype=torch_dtype,
   2393         load_in_8bit=load_in_8bit,
   2394     )
   2396 model.is_loaded_in_8bit = load_in_8bit
   2398 # make sure token embedding weights are still tied if needed

File ~/.conda/envs/py/lib/python3.9/site-packages/transformers/modeling_utils.py:2572, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, load_in_8bit)
   2569                 del state_dict[checkpoint_key]
   2570     return mismatched_keys
-> 2572 folder = os.path.sep.join(resolved_archive_file[0].split(os.path.sep)[:-1])
   2573 if device_map is not None and is_safetensors:
   2574     param_device_map = expand_device_map(device_map, original_loaded_keys)

TypeError: 'NoneType' object is not subscriptable

pip install information:

Collecting detoxify
  Downloading detoxify-0.5.0-py3-none-any.whl (12 kB)
Collecting transformers!=4.18.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.8/5.8 MB 75.2 MB/s eta 0:00:0000:0100:01
Collecting torch>=1.7.0
  Downloading torch-1.13.0-cp39-cp39-manylinux1_x86_64.whl (890.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 890.2/890.2 MB 3.6 MB/s eta 0:00:0000:0100:01
Collecting sentencepiece>=0.1.94
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 107.1 MB/s eta 0:00:00
Collecting typing-extensions
  Downloading typing_extensions-4.4.0-py3-none-any.whl (26 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 21.0/21.0 MB 77.9 MB/s eta 0:00:0000:0100:01
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.1/317.1 MB 8.9 MB/s eta 0:00:0000:0100:01
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 849.3/849.3 kB 112.2 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 557.1/557.1 MB 6.0 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: wheel in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.7.0->detoxify) (0.37.1)
Requirement already satisfied: setuptools in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.7.0->detoxify) (63.4.1)
Collecting regex!=2019.12.17
  Downloading regex-2022.10.31-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 770.0/770.0 kB 116.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (1.23.4)
Requirement already satisfied: tqdm>=4.27 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (4.64.1)
Requirement already satisfied: pyyaml>=5.1 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (6.0)
Collecting filelock
  Downloading filelock-3.8.2-py3-none-any.whl (10 kB)
Requirement already satisfied: packaging>=20.0 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (21.3)
Requirement already satisfied: requests in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (2.28.1)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 182.4/182.4 kB 103.0 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.6/7.6 MB 33.4 MB/s eta 0:00:0000:0100:01m
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from packaging>=20.0->transformers!=4.18.0->detoxify) (3.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (2022.9.24)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (2.1.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (1.26.12)
Requirement already satisfied: idna<4,>=2.5 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (3.4)
Installing collected packages: tokenizers, sentencepiece, typing-extensions, regex, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, filelock, nvidia-cudnn-cu11, huggingface-hub, transformers, torch, detoxify
Successfully installed detoxify-0.5.0 filelock-3.8.2 huggingface-hub-0.11.1 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 regex-2022.10.31 sentencepiece-0.1.97 tokenizers-0.13.2 torch-1.13.0 transformers-4.25.1 typing-extensions-4.4.0

additional information
python 3.9.13 haa1d7c7_2
on linux

Any suggestions to handle longer text?

I'm trying to do predictions with the pre-trained model and I keep running into the following issue:

Token indices sequence length is longer than the specified maximum sequence length for this model (1142 > 512). Running this sequence through the model will result in indexing errors
*** RuntimeError: The size of tensor a (1142) must match the size of tensor b (512) at non-singleton dimension 1

The issue is that this happens whenever I try to predict on a text longer than 512 tokens. I understand this is because the string is long; other than chopping off the string, are there any suggestions on how to deal with this problem within the package?
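
One workaround (a sketch, not a feature of the package): tokenize the text yourself, split it into windows of at most 512 tokens, score each window, and keep the maximum score per label. unitary/toxic-bert is the Hugging Face hub id of the original model's tokenizer.

from detoxify import Detoxify
from transformers import AutoTokenizer

model = Detoxify("original")
tokenizer = AutoTokenizer.from_pretrained("unitary/toxic-bert")

def predict_long(text, window=510):  # leave room for [CLS]/[SEP]
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = [tokenizer.decode(ids[i:i + window]) for i in range(0, len(ids), window)]
    per_chunk = model.predict(chunks)  # label -> list of per-chunk scores
    return {label: max(scores) for label, scores in per_chunk.items()}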

Thank you

Number of epochs to get the best model

Hello,

I wanted to reproduce the results of the models and was wondering how many epochs each model had to be trained for to get the scores shown.

Thank you!

Small models don't load on CPU-only machines

It looks like they were serialised with GPU tensors

model = Detoxify('original-small')
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False.
If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu')
to map your storages to the CPU.

and so the map_location should be added when loading.
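
i.e. roughly this change inside the loader (a sketch of the fix, with checkpoint_path standing in for the actual variable name):

import torch

# map GPU-saved tensors onto the CPU so CPU-only machines can deserialize them
loaded = torch.load(checkpoint_path, map_location=torch.device("cpu"))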

TypeError: expected str, bytes or os.PathLike object, not NoneType

Hey, I'm trying to use detoxify to predict, but I am getting the following error when I try to load the model (model = torch.hub.load('unitaryai/detoxify','toxic_bert')):

Downloading: "https://github.com/unitaryai/detoxify/archive/master.zip" to /root/.cache/torch/hub/master.zip

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

[/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in _check_seekable(f)
    307     try:
--> 308         f.seek(f.tell())
    309         return True

AttributeError: 'NoneType' object has no attribute 'seek'


During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)

14 frames

[/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py](https://localhost:8080/#) in load_state_dict(checkpoint_file)
    348     try:
--> 349         return torch.load(checkpoint_file, map_location="cpu")
    350     except Exception as e:

[/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in load(f, map_location, pickle_module, **pickle_load_args)
    593 
--> 594     with _open_file_like(f, 'rb') as opened_file:
    595         if _is_zipfile(opened_file):

[/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in _open_file_like(name_or_buffer, mode)
    234         elif 'r' in mode:
--> 235             return _open_buffer_reader(name_or_buffer)
    236         else:

[/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in __init__(self, buffer)
    219         super(_open_buffer_reader, self).__init__(buffer)
--> 220         _check_seekable(buffer)
    221 

[/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in _check_seekable(f)
    310     except (io.UnsupportedOperation, AttributeError) as e:
--> 311         raise_err_msg(["seek", "tell"], e)
    312     return False

[/usr/local/lib/python3.7/dist-packages/torch/serialization.py](https://localhost:8080/#) in raise_err_msg(patterns, e)
    303                                 + " try to load from it instead.")
--> 304                 raise type(e)(msg)
    305         raise e

AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.


During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)

[<ipython-input-12-ab26c4c96f7d>](https://localhost:8080/#) in <module>()
----> 1 model = torch.hub.load('unitaryai/detoxify','toxic_bert')

[/usr/local/lib/python3.7/dist-packages/torch/hub.py](https://localhost:8080/#) in load(repo_or_dir, model, source, force_reload, verbose, skip_validation, *args, **kwargs)
    397         repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose, skip_validation)
    398 
--> 399     model = _load_local(repo_or_dir, model, *args, **kwargs)
    400     return model
    401 

[/usr/local/lib/python3.7/dist-packages/torch/hub.py](https://localhost:8080/#) in _load_local(hubconf_dir, model, *args, **kwargs)
    426 
    427     entry = _load_entry_from_hubconf(hub_module, model)
--> 428     model = entry(*args, **kwargs)
    429 
    430     sys.path.remove(hubconf_dir)

[/content/detoxify/detoxify/detoxify.py](https://localhost:8080/#) in toxic_bert()
    125 
    126 def toxic_bert():
--> 127     return load_model("original")
    128 
    129 

[/content/detoxify/detoxify/detoxify.py](https://localhost:8080/#) in load_model(model_type, checkpoint)
     65 def load_model(model_type, checkpoint=None):
     66     if checkpoint is None:
---> 67         model, _, _ = load_checkpoint(model_type=model_type)
     68     else:
     69         model, _, _ = load_checkpoint(checkpoint=checkpoint)

[/content/detoxify/detoxify/detoxify.py](https://localhost:8080/#) in load_checkpoint(model_type, checkpoint, device, huggingface_config_path)
     57         **loaded["config"]["arch"]["args"],
     58         state_dict=loaded["state_dict"],
---> 59         huggingface_config_path=huggingface_config_path,
     60     )
     61 

[/content/detoxify/detoxify/detoxify.py](https://localhost:8080/#) in get_model_and_tokenizer(model_type, model_name, tokenizer_name, num_classes, state_dict, huggingface_config_path)
     23         num_labels=num_classes,
     24         state_dict=state_dict,
---> 25         local_files_only=huggingface_config_path is not None,
     26     )
     27     tokenizer = getattr(transformers, tokenizer_name).from_pretrained(

[/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py](https://localhost:8080/#) in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1795             if not is_sharded:
   1796                 # Time to load the checkpoint
-> 1797                 state_dict = load_state_dict(resolved_archive_file)
   1798             # set dtype to instantiate the model under:
   1799             # 1. If torch_dtype is not None, we use that dtype

[/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py](https://localhost:8080/#) in load_state_dict(checkpoint_file)
    350     except Exception as e:
    351         try:
--> 352             with open(checkpoint_file) as f:
    353                 if f.read().startswith("version"):
    354                     raise OSError(

TypeError: expected str, bytes or os.PathLike object, not NoneType

I'm not sure what to do here.

Batch prediction on a very large text file?

Hey guys, great repo! I played with your model and it works very well on random real-world data. I'd like to run inference on a test file with 2 million lines. How can I do batch prediction with the 'multilingual' model, given that I couldn't fit the data on a 16GB GPU?
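
A streaming sketch, assuming one comment per line, a tunable batch size, and tab-separated scores written out as they are produced, so neither inputs nor outputs accumulate in memory (the file names are placeholders):

from detoxify import Detoxify

model = Detoxify("multilingual", device="cuda")

def batches(path, size=64):  # tune size to what fits on a 16GB card
    batch = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch

with open("scores.tsv", "w", encoding="utf-8") as out:
    for batch in batches("test.txt"):
        scores = model.predict(batch)  # dict: label -> list of floats
        for i in range(len(batch)):
            out.write("\t".join(f"{scores[label][i]:.5f}" for label in scores) + "\n")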

UnicodeDecodeError when installing from git

System: Windows 10
Python version: 3.10.9
Cmd reproduction:

G:\TestProject>python3 -m venv testenv

G:\TestProject>source testenv/bin/activate
'source' is not recognized as an internal or external command,
operable program or batch file.

G:\TestProject>/testenv/scripts/activate.bat
The system cannot find the path specified.

G:\TestProject>G:\TestProject\testenv\Scripts\activate.bat

(testenv) G:\TestProject>git clone https://github.com/unitaryai/detoxify
Cloning into 'detoxify'...
remote: Enumerating objects: 885, done.
remote: Counting objects: 100% (885/885), done.
remote: Compressing objects: 100% (390/390), done.
remote: Total 885 (delta 505), reused 834 (delta 482), pack-reused 0
Receiving objects: 100% (885/885), 52.01 MiB | 17.26 MiB/s, done.
Resolving deltas: 100% (505/505), done.

(testenv) G:\TestProject>pip install -e detoxify
Obtaining file:///G:/TestProject/detoxify
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... error
  error: subprocess-exited-with-error

  × Getting requirements to build editable did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Traceback (most recent call last):
        File "G:\TestProject\testenv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 351, in <module>
          main()
        File "G:\TestProject\testenv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 333, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "G:\TestProject\testenv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 132, in get_requires_for_build_editable
          return hook(config_settings)
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 447, in get_requires_for_build_editable
          return self.get_requires_for_build_wheel(config_settings)
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
          self.run_setup()
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 484, in run_setup
          super(_BuildMetaLegacyBackend,
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
          exec(code, locals())
        File "<string>", line 6, in <module>
        File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\encodings\cp1250.py", line 23, in decode
          return codecs.charmap_decode(input,self.errors,decoding_table)[0]
      UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5960: character maps to <undefined>
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Unable to load any model.

I am unsure whether this is due to being on an M1, but that is my suspicion after having tested with various Python versions satisfying the >=3.6 requirement on PyPI. It works on my personal laptop running an Arch-based Linux distribution, using the same code and Python 3.9.


The following code is being run using Python 3.9.12.

__import__('detoxify').Detoxify('original').predict('this does not work')

Running this simple test prediction will throw the error below.

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 309, in _check_seekable
    f.seek(f.tell())
AttributeError: 'NoneType' object has no attribute 'seek'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/transformers/modeling_utils.py", line 349, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 236, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 221, in __init__
    _check_seekable(buffer)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 312, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 305, in raise_err_msg
    raise type(e)(msg)
AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/detoxify/detoxify.py", line 93, in __init__
    self.model, self.tokenizer, self.class_names = load_checkpoint(
  File "/opt/homebrew/lib/python3.9/site-packages/detoxify/detoxify.py", line 49, in load_checkpoint
    model, tokenizer = get_model_and_tokenizer(
  File "/opt/homebrew/lib/python3.9/site-packages/detoxify/detoxify.py", line 19, in get_model_and_tokenizer
    model = model_class.from_pretrained(
  File "/opt/homebrew/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1797, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/opt/homebrew/lib/python3.9/site-packages/transformers/modeling_utils.py", line 352, in load_state_dict
    with open(checkpoint_file) as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
