alex000kim / nsfw_data_scraper

Collection of scripts to aggregate image data for the purposes of training an NSFW Image Classifier

License: MIT License

Shell 60.03% Dockerfile 3.37% Jupyter Notebook 36.59%
content-moderation deep-learning machine-learning nsfw nsfw-classifier pornography

nsfw_data_scraper's Introduction

NSFW Data Scraper

Note: use with caution - the dataset is noisy

Description

This is a set of scripts that allows for automatic collection of tens of thousands of images for the following (loosely defined) categories, to be used later for training an image classifier:

  • porn - pornography images
  • hentai - hentai images, but also includes pornographic drawings
  • sexy - sexually explicit images, but not pornography. Think nude photos, Playboy, bikinis, etc.
  • neutral - safe for work neutral images of everyday things and people
  • drawings - safe for work drawings (including anime)

Here is what each script (located under the scripts directory) does:

  • 1_get_urls.sh - iterates through the text files under scripts/source_urls, downloading the image URLs for each of the 5 categories above. The RipMe application performs all the heavy lifting. The source URLs are mostly links to various subreddits, but they could be any website that RipMe supports. Note: I have already run this script for you, and its outputs are located in the raw_data directory. There is no need to rerun it unless you edit the files under scripts/source_urls.
  • 2_download_from_urls.sh - downloads the actual images from the URLs listed in the text files in the raw_data directory.
  • 3_optional_download_drawings.sh - (optional) downloads SFW anime images from the Danbooru2018 database.
  • 4_optional_download_neutral.sh - (optional) downloads SFW neutral images from the Caltech256 dataset.
  • 5_create_train.sh - creates the data/train directory and copies all *.jpg and *.jpeg files into it from raw_data. It also removes corrupted images (see the sketch after this list).
  • 6_create_test.sh - creates the data/test directory and moves N=2000 random files per class from data/train to data/test (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each run moves N images per class from data/train to data/test.
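
For reference, the corrupted-image cleanup in 5_create_train.sh can be approximated with ImageMagick's identify and rsync, both of which the Dockerfile installs. This is a simplified sketch of the idea, not the script's exact code:

# Copy *.jpg / *.jpeg from raw_data into data/train, then drop files ImageMagick cannot parse.
mkdir -p data/train
rsync -a --include='*/' --include='*.jpg' --include='*.jpeg' --exclude='*' raw_data/ data/train/
find data/train -type f \( -name '*.jpg' -o -name '*.jpeg' \) | while read -r f; do
    identify "$f" >/dev/null 2>&1 || rm -f "$f"   # identify exits nonzero on corrupted images
done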

Prerequisites

  • Docker

How to collect data

$ docker build . -t docker_nsfw_data_scraper
Sending build context to Docker daemon  426.3MB
Step 1/3 : FROM ubuntu:18.04
 ---> 775349758637
Step 2/3 : RUN apt update  && apt upgrade -y  && apt install wget rsync imagemagick default-jre -y
 ---> Using cache
 ---> b2129908e7e2
Step 3/3 : ENTRYPOINT ["/bin/bash"]
 ---> Using cache
 ---> d32c5ae5235b
Successfully built d32c5ae5235b
Successfully tagged docker_nsfw_data_scraper:latest
$ # Next command might run for several hours. It is recommended to leave it overnight
$ docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
Getting images for class: neutral
...
...
$ ls data
test  train
$ ls data/train/
drawings  hentai  neutral  porn  sexy
$ ls data/test/
drawings  hentai  neutral  porn  sexy

How to train a CNN model

  • Install fastai: conda install -c pytorch -c fastai fastai
  • Run train_model.ipynb top to bottom (or headlessly, as sketched below)
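
If you prefer to run the notebook non-interactively, a minimal sketch using jupyter nbconvert (assuming Jupyter is installed alongside fastai):

$ jupyter nbconvert --to notebook --execute --inplace train_model.ipynb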

Results

I was able to train a CNN classifier to 91% accuracy with the following confusion matrix:

(confusion matrix image)

As expected, drawings and hentai are confused with each other more frequently than with other classes.

The same goes for the porn and sexy categories.

nsfw_data_scraper's People

Contributors

alex000kim, alexkim-gh, ebazarov, gantman, geek-at, parmusingh

nsfw_data_scraper's Issues

Possible Issue with suggested docker run cmd in README

The command docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh will not run if $(pwd) expands to a current-directory path that contains spaces, as it can on macOS.

E.g. when $(pwd) = /Users/username/Downloads/NSFW computer vision, docker run thinks the name of the image is computer.
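
A likely fix (my suggestion, not from the original report) is to quote the command substitution so the path is passed to docker as a single argument:

$ docker run -v "$(pwd)":/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh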

Error

C:\Users\Administrator>docker run -v M:\GitHub\docker_nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
scripts/runall.sh: line 5: syntax error: unexpected end of file
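
This kind of "unexpected end of file" error is often caused by Windows CRLF line endings (e.g. git's autocrlf rewriting the scripts on checkout). A possible fix, assuming that is the cause here:

$ sed -i 's/\r$//' scripts/*.sh   # or: dos2unix scripts/*.sh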

Can you create a torrent of that data?

Downloading all of this takes a lot of time, and your script seems to break sometimes: it says it is downloading, but nothing actually happens. It would be great if you created a torrent of the data.

xargs: wget: No such file or directory

When I first ran bash 2_download_from_urls.sh on my Mac, I got:

Class: neutral. Total # of urls:    20960
Downloading images...
xargs: wget: No such file or directory
Class: drawings. Total # of urls:    25732
Downloading images...
xargs: wget: No such file or directory
Class: sexy. Total # of urls:    19554
Downloading images...
xargs: wget: No such file or directory
Class: porn. Total # of urls:   116521
Downloading images...
xargs: wget: No such file or directory
Class: hentai. Total # of urls:    45228
Downloading images...
xargs: wget: No such file or directory

Thanks in advance for your answer.
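
The error means wget itself is missing; macOS does not ship it by default. A minimal fix, assuming Homebrew is available (alternatively, run the scripts inside the provided Docker image, which installs wget):

$ brew install wget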

Getting rid of stale links

Several imgur links in the dataset are stale, which leads to a lot of placeholder images in the downloaded image set. Is there a way to check for this in the wget script?
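
One possible approach (a sketch, not part of the repo): download one known-stale imgur URL once, hash the placeholder image it returns, and delete every downloaded file with the same hash:

# placeholder.jpg is assumed to be one manually downloaded stale-link placeholder.
placeholder_md5=$(md5sum placeholder.jpg | cut -d' ' -f1)
find raw_data -type f -name '*.jpg' | while read -r f; do
    [ "$(md5sum "$f" | cut -d' ' -f1)" = "$placeholder_md5" ] && rm -f "$f"
done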

The resulting model

Just wondering. Did you want to link to your resulting model?

If not, and you won't: would it be OK if I trained a model independently and sent a PR to link to my resulting model? I love the scripts for researchers; however, some might want the result.

Error

Getting images for class: neutral
https://www.reddit.com/r/mildlypenis/top/?t=all
Loaded /root/.config/ripme/rip.properties
Loaded log4j.properties
Initialized ripme v1.7.74
[+] Creating directory: ./raw_data/neutral/reddit_sub_mildlypenis
Error while loading https://www.reddit.com/r/mildlypenis/top/.json?t=all
javax.net.ssl.SSLException: Read timed out
at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:127)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:320)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:263)
at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:258)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:137)
at java.base/sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1152)
at java.base/sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1063)
at java.base/sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:402)
at java.base/sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:567)
at java.base/sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at java.base/sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:168)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
at com.rarchives.ripme.utils.Http.response(Http.java:130)
at com.rarchives.ripme.ripper.rippers.RedditRipper.getJsonArrayFromURL(RedditRipper.java:141)
at com.rarchives.ripme.ripper.rippers.RedditRipper.getAndParseAndReturnNext(RedditRipper.java:88)
at com.rarchives.ripme.ripper.rippers.RedditRipper.rip(RedditRipper.java:77)
at com.rarchives.ripme.App.rip(App.java:99)
at com.rarchives.ripme.App.ripURL(App.java:265)
at com.rarchives.ripme.App.handleArguments(App.java:248)
at com.rarchives.ripme.App.main(App.java:74)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
at java.base/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:168)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at java.base/sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:448)
at java.base/sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:165)
at java.base/sun.security.ssl.SSLTransport.decode(SSLTransport.java:108)
... 17 more

Set timeout and retry count

I ran the script but it got stuck. I updated the wget command to fix this:

wget -nc -q --timeout=5 --tries=1 -i "$urls_file" -P "$images_dir"
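
For reference, the same flags can be combined with parallel downloads; a sketch of one possible xargs invocation (my variant, not necessarily the script's exact command):

$ xargs -n 1 -P 8 wget -nc -q --timeout=5 --tries=1 -P "$images_dir" < "$urls_file"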

Details on this dataset

Hi,
Thank you very much for your amazing work collecting tons of images and labeling them.
Could you please tell us more about how you got this done? I can't find more details in your README. For example:
Where are these images sourced from?
What is the relationship with other open-source datasets (like NPDI)?
Why were these labels chosen, and how were images labeled (manually or by machine)?
Have any experiments been run on this data?
It would be really helpful to figure out all these questions before we use it.

Script Excessive Runtime

Hello,

I followed the Docker instructions as listed in the README.md, but had to stop the Docker container because it had failed to finish after being left for over 24 hours. When checking the mounted folders, there seemed to be no images, just tons of new directories, each with its own .txt file containing a bunch of image URLs. Perhaps I stopped the program before it could download these image URLs?

Can anyone assist with this?

(Screenshot attached.)

dataset links not working

Hi,
I'm interested in training my own model, but when I try to download images from the links provided in raw_data, most of them do not work. Could you share your downloaded image dataset, so that I can download it directly? I would be very thankful to you.
Thank you so much!

Adding new categories for NSFW images and drawings

Hello, in the past I used Yahoo! OpenNSFW to detect NSFW content. Only a few categories were defined in its training set, while this dataset has a better classification. Thank you for this dataset!
This new classification works better in the real world! I would like to further fine-tune the Keras model I have for other specific categories, like NSFW drawings (e.g. nazi symbols) or new NSFW image categories (e.g. guns, rifles). Do you have any references for these categories?
Thank you.

false porn

(Screenshots of false positives attached, including a camera operator from Oliver Stone's film about Putin and a Dota 2 screenshot.)

Animated Gif images not supported

I found that script 5 only checks for corrupted JPEGs and does not verify actual image types. Moreover, some animated GIFs get saved with a .jpeg extension. TensorFlow's image reader doesn't support animated GIFs, so I'm wondering whether checking for and removing animated GIFs is necessary in script 5? If needed, I can open a PR. Thanks.
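
A possible check (a sketch, not from the repo): ImageMagick's identify reads the actual file format regardless of extension and prints one line per frame, so animated GIFs hiding behind a .jpeg extension can be found and removed:

find data/train -type f \( -name '*.jpg' -o -name '*.jpeg' \) | while read -r f; do
    fmt=$(identify -format '%m\n' "$f" 2>/dev/null | head -n 1)   # actual format, not extension
    frames=$(identify "$f" 2>/dev/null | wc -l)                   # identify prints one line per frame
    [ "$fmt" = "GIF" ] && [ "$frames" -gt 1 ] && rm -f "$f"       # drop animated GIFs only
done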

VolleyballGirls and false positives?

Hello! Thanks for putting this repo together. It's been excellent to see its widespread adoption when facing the daunting task of data collection for an NSFW model.
To that end though, I wanted to bring up the subreddit VolleyballGirls as a potential source of error. While reviewing the existing sexy category, this particular subcategory of imagery seemed to have much more mixed quality than any of the others. While there are definitely many qualifying images in it, the posts on the subreddit tend to be of four types:

  1. Legitimately NSFW content with poses by one or more volleyball players.
  2. Content that may be NSFW, but may be legitimate sportswear in the correct context.
  3. Content that is NSFW, but primarily due to the precise moment of capture or the focus of the image rather than the quality itself.
  4. Content that is not likely to be considered NSFW, but rather that the viewer likes the overall appearance of one or more players.

Types 1 and likely 2 are probably excellent to capture. Type 3 is good to capture, but it should be noted that e.g. a legitimate volleyball picture could be random-cropped to obtain something from type 3.
Unfortunately, type 4 is not a small part of the data. Many of these are not dissimilar to local sports photos of group huddles or individual stars.

This leads to issues like the following photo being classified with high confidence by NSFW JS as sexy. (You might be interested too, @GantMan ?)

I understand that this repo is noisy by its very nature, but this particular category was enough of an outlier that I wanted to report what I was seeing. It's tough, though, because I think an important category of sports-related NSFW content would be lost by removing it entirely.

Thoughts?

The URLs issue

Almost all the URLs are out of date; is there any way to update them?
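
One way to prune dead links before downloading (a sketch, not part of the repo) is a parallel HEAD check with curl; here urls_file is assumed to point to one of the URL text files under raw_data:

# Keep only URLs that still respond with a success status.
xargs -n 1 -P 16 sh -c 'curl -sfIL --max-time 10 "$0" >/dev/null && echo "$0"' \
    < "$urls_file" > "${urls_file}.alive"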

I tried a dozen images or so when I saw this profiled on an AI tool aggregator site; it got 7 wrong, and 5 of those were false negatives. NSFW warning

First of all, I want to reiterate that I think this is a noble project and a potentially good way to help law enforcement bust CSAM or revenge porn, or to help web moderators, or even ordinary social media users, keep material they find objectionable off their feeds. I also understand that, of course, you'd rather see false positives than negatives, especially in these early days. This issue, however, is not good news.
(Six uncensored screenshots of the false negatives attached.)

I'm using MS Edge Canary to access the site on Windows 11, with an AMD Ryzen 7 7840HS, in case you leverage client-side hardware for anything. Most of the pictures here were generated by a self-hosted Stable Diffusion instance, but not all; that might be a good place to start looking for the model's blind spot. I did not censor these pictures, to convey how glaring the false negatives are.

About download picture

Thanks, your project has helped me. I am from China and not good at English, but I have some questions.
  1. About 1_get_urls.sh: the name says "get URLs", but it does not just fetch URL lists; because it uses ripme.jar, it can also fetch a lot of videos and pictures. However, it no longer works.
  2. Since 1_get_urls.sh no longer works, running it leaves all the url_xxx.txt files for 'neutral', 'drawings', 'sexy', etc. empty, which in turn breaks 2_download_from_urls.sh.
  3. About 5_create_train.sh: 1_get_urls.sh (via ripme.jar) downloads a lot of videos, but videos are not handled. I can do this work; I know how to convert videos and GIFs to pictures (see the sketch below).
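
For reference, a minimal frame-extraction sketch with ffmpeg (an assumed extra tool; it is not installed by this repo's Dockerfile):

# Extract one frame per second from a video into numbered JPEGs.
$ ffmpeg -i input.mp4 -vf fps=1 frames_%04d.jpg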

Your project is very good; a lot of companies need it to build image datasets and train their own models.

use proxychains

nsfw_data_scrapper/scripts$
proxychains bash ./1_get_urls.sh
ProxyChains-3.1 (http://proxychains.sf.net)
Getting images for class: neutral
https://www.reddit.com/r/mildlypenis/top/?t=all
Loaded /media/ryan/data/workspace/nsfw_data_scrapper/scripts/rip.properties
Loaded log4j.properties
Initialized ripme v1.7.74
|DNS-request| www.reddit.com
|S-chain|-<>-127.0.0.1:1080-<><>-4.2.2.2:53-<><>-OK
|DNS-response| www.reddit.com is 151.101.41.140
Error while loading https://www.reddit.com/r/mildlypenis/top/.json?t=all
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
at com.rarchives.ripme.utils.Http.response(Http.java:130)
at com.rarchives.ripme.ripper.rippers.RedditRipper.getJsonArrayFromURL(RedditRipper.java:141)
at com.rarchives.ripme.ripper.rippers.RedditRipper.getAndParseAndReturnNext(RedditRipper.java:88)
at com.rarchives.ripme.ripper.rippers.RedditRipper.rip(RedditRipper.java:77)
at com.rarchives.ripme.App.rip(App.java:99)
at com.rarchives.ripme.App.ripURL(App.java:265)
at com.rarchives.ripme.App.handleArguments(App.java:248)
at com.rarchives.ripme.App.main(App.java:74)
[!] Error while ripping URL https://www.reddit.com/r/mildlypenis/top/?t=all
java.io.IOException: Failed to load https://www.reddit.com/r/mildlypenis/top/.json?t=all after 1 attempts
at com.rarchives.ripme.utils.Http.response(Http.java:137)
at com.rarchives.ripme.ripper.rippers.RedditRipper.getJsonArrayFromURL(RedditRipper.java:141)
at com.rarchives.ripme.ripper.rippers.RedditRipper.getAndParseAndReturnNext(RedditRipper.java:88)
at com.rarchives.ripme.ripper.rippers.RedditRipper.rip(RedditRipper.java:77)
at com.rarchives.ripme.App.rip(App.java:99)
at com.rarchives.ripme.App.ripURL(App.java:265)
at com.rarchives.ripme.App.handleArguments(App.java:248)
at com.rarchives.ripme.App.main(App.java:74)
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
at sun.security.ssl.InputRecord.read(InputRecord.java:503)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:434)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:181)
at com.rarchives.ripme.utils.Http.response(Http.java:130)
... 7 more
