gosecure / freshonions-torscraper

This project was forked from dirtyfilthy/freshonions-torscraper.


Fresh Onions is an open source TOR spider / hidden service onion crawler

License: GNU Affero General Public License v3.0

Languages: Shell 36.68%, Python 40.97%, JavaScript 0.42%, CSS 1.75%, HTML 12.40%, Dockerfile 1.89%, TSQL 5.89%

freshonions-torscraper's Introduction

Fresh Onions TOR Hidden Service Crawler

Archival Notice

This project has been archived and is no longer maintained. The rationale behind this choice is described in this blog post. If you are interested in taking over maintenance, we will happily refer any active fork here. Contact us.


This is a copy of the source for the http://zlal32teyptf4tvi.onion hidden service, which implements a tor hidden service crawler/spider and website.

Features

  • Crawls the darknet looking for new hidden services
  • Finds hidden services from a number of clearnet sources
  • Optional full-text Elasticsearch support
  • Marks clone sites of the /r/darknet super list
  • Finds SSH fingerprints across hidden services
  • Finds email addresses across hidden services
  • Finds bitcoin addresses across hidden services
  • Shows incoming / outgoing links to onion domains
  • Up-to-date alive/dead hidden service status
  • Portscanner
  • Search for "interesting" URL paths, useful 404 detection
  • Automatic language detection
  • Fuzzy clone detection (requires Elasticsearch, more advanced than super list clone detection)
  • Doesn't fuck around in general.

Licence

This software is made available under the GNU Affero GPL 3 license. This means that if you deploy this software as part of networked software that is available to the public, you must make the source code (including any modifications) available.

From the GNU site:

The GNU Affero General Public License is a modified version of the ordinary GNU GPL version 3. It has one added requirement: if you run a modified program on a server and let other users communicate with it there, your server must also allow them to download the source code corresponding to the modified version running there.

Docker installation

First, clone the GitHub project and run the create_flask_secret.sh script to generate the secret file used by the web server.

git clone https://github.com/GoSecure/freshonions-torscraper.git
cd freshonions-torscraper/scripts/
./create_flask_secret.sh

Once your Flask secret is created, you should see this confirmation message:

('Directory ', '/your/path/freshonions-torscraper/etc/private/', ' Created ')
Written flask secret to '/your/path/freshonions-torscraper/etc/private/flask.secret'

Now go to the freshonions-torscraper root directory and start the Docker containers:

sudo docker-compose up

The docker-compose command will start nine containers:

  • Web service (1)
  • Crawler (1)
  • Database (1)
  • Kibana (1)
  • Elasticsearch (1)
  • Tor-Privoxy (4)
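To confirm that everything came up, you can list the compose services from the repository root (a quick check; the freshonions-torscraper-* names match the container names used later in this README):

sudo docker-compose ps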

Perform the following steps only once, when all the containers are built for the first time. Once all the containers are started, open another terminal and connect to the crawler container:

sudo docker exec -it freshonions-torscraper-crawler /bin/bash

You should now have a shell inside the container. Run the elasticsearch_migrate.sh script:

cd scripts
./elasticsearch_migrate.sh

This initializes the Elasticsearch database.

The crawler container ships with a script that crawls automatically (docker_haproxy_harvest_scrape.sh). This script restarts the haproxy service (which distributes requests across the Tor proxies), runs the harvester (which searches for onion addresses on the list of websites that we provide), and then scrapes all of them (finding bitcoin addresses, email addresses, and links between onions, and saving the website data to Elasticsearch and the database). Once this script finishes its execution, it starts over.

** Harvesting takes a lot of time, so be patient: it can take up to 45 minutes to collect all the onions from the list of websites that we provide. **
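While the script runs, you can follow the crawler's progress from the host with docker logs (using the container name from the exec command above):

sudo docker logs -f freshonions-torscraper-crawler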

If you prefer doing it the manual way, follow the procedure below.

Manual Installation

Dependencies

  • python
  • tor

Warning

This software requires an Elasticsearch version from the 5.x series. As of this writing, the latest is 5.6.6; the 6.x series is known to be problematic. Also, if you decide to install Kibana or any other functionality linked to Elasticsearch, install it with the same version, otherwise it won't work.

Do not start too many instances of the scraper/crawler: with only 4 Tor proxy instances, connecting to onion sites becomes difficult. With more than 3-4 instances, everything can become really slow, to the point where the crawlers can no longer crawl pages and you make no progress. Let the crawler run and it will gradually build a larger list of valid domains with their associated information.

The Pastebin script works only if your IP is on Pastebin's whitelist. If it is not, read the scraping API documentation to learn how to activate it: https://pastebin.com/api_scraping_faq

After booting, make sure the links between Tor and Privoxy are working. To test them, use these commands (substituting a known-working onion site):

curl --socks5-hostname 127.0.0.1:9050 http://workingOnionWebsite
curl --proxy 127.0.0.1:3129 http://workingOnionWebsite

If they don't work, fix the problem before crawling, otherwise all your onions will be marked with a "dead" status. You can try running scripts/start.sh to reinitialize the links.

Tor service

To use the latest version of Tor, follow these steps: https://www.torproject.org/docs/debian.html.en. With the latest version of Tor, you will be able to crawl the new generation of onions (v3).

If you are upgrading from a version older than 0.3.x, you may run into problems with the update. I was missing two libraries:

  • libssl1.1
  • libzstd1

So I installed them. libzstd1 is available directly:

    sudo apt-get install libzstd1

To install libssl1.1, I used a .deb package from Ubuntu: https://packages.ubuntu.com/bionic/libssl1.1

    lynx  https://packages.ubuntu.com/bionic/libssl1.1

Use the down arrow to reach the bottom of the page and select the package matching your architecture. Press the right arrow to follow the link to the download page, then pick a mirror the same way: scroll down with the down arrow and press the right arrow on the one you want. At the bottom of the interface you will see D) Download or C) Cancel; press D. When "Save to disk" appears, move onto it with the right arrow and press Enter. When the download is done, press q and then y to quit.

    sudo dpkg -i libssl1.1_1.1.0g-2ubuntu2_amd64.deb # substitute the name of the package you downloaded

Finish the Tor installation by checking your version; you should have the latest one (0.3.2 at the time of writing):

    tor --version

Haproxy service

sudo apt-get install haproxy

Privoxy service

sudo apt-get install privoxy

Install Pip:

sudo apt-get install python-pip
sudo pip install --upgrade pip

Install virtualenv:

sudo pip install virtualenv
sudo apt-get install python-virtualenv

Go to your crawler/scraper folder and create a virtual environment:

virtualenv venv

Then activate it:

. venv/bin/activate
# Run the next command from inside the virtual environment; otherwise the packages will be installed into your system environment
pip install -r requirements.txt
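A quick sanity check that the installation really went into the virtualenv (the exact path depends on where you created venv):

which python   # should print .../venv/bin/python, not /usr/bin/python
pip --version  # should report the pip that lives inside venv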

Install MariaDB

*** MySQL has problems with some of the SQL syntax used in the code, so I recommend installing MariaDB instead. ***

sudo apt-get install mariadb-server
sudo apt-get install mariadb-client

Now connect to MariaDB and create the database from schema.sql. Run mysql from the folder containing schema.sql, because the source command below looks for it there:

mysql -u root
CREATE DATABASE databaseName;
use databaseName;
source schema.sql

If everything worked, you should see "Query OK" for each statement, and you should see 20 tables when you run this command:

show tables;
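If you prefer a quick count over eyeballing the list, you can do it from the shell (a sketch; substitute databaseName as above):

mysql -u root -e 'SHOW TABLES;' databaseName | tail -n +2 | wc -l   # should print 20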

One modification is needed to be able to connect Elasticsearch to our database:

use mysql;
update user set plugin='mysql_native_password' where User='root';
flush privileges;
exit
#To secure the installation. The default password is empty, so just press Enter. I recommend setting one.
sudo mysql_secure_installation
#To reconnect
mysql -u root -p

Configure your files

Edit etc/database for your database setup

Edit etc/tor/torrc to uncomment the line SocksPort 9050 (line 18)

Edit etc/uwsgi_only and set BASEDIR to wherever torscraper is installed (e.g. /home/user/torscraper)

Edit etc/proxy for your TOR setup

export TOR_PROXY_PORT=3129
#export TOR_PROXY_PORT=3140
export TOR_PROXY_HOST=localhost
export http_proxy=http://localhost:3129
#export http_proxy=http://localhost:3140
export https_proxy=https://localhost:3129
export SOCKS_PROXY=localhost:9050
HIDDEN_SERVICE_PROXY_HOST=127.0.0.1
HIDDEN_SERVICE_PROXY_PORT=9090

Now set up the Privoxy config:

cd /etc/privoxy/
cp default.action default.action.orig
cp default.filter default.filter.orig
touch default.action  # leave the file empty
touch default.filter  # leave the file empty

Start your services

service tor start
service privoxy start
service haproxy start
service elasticsearch start
service mysql start
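Before moving on, you can check that the services are listening on their expected ports (a quick sketch; 9050 is the Tor SOCKS port, 3129 the proxy frontend, 9200 Elasticsearch, and 3306 MariaDB's default):

sudo ss -ltn | grep -E '9050|3129|9200|3306'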

Go to the scripts folder and run this command

./create_privoxy_confs.sh

Now it's time to try it out. Go to the scripts directory (.../freshonions-torscraper/scripts/, relative to wherever you cloned the repository):

./start.sh

Now you can test whether it works with the new generation of onions (v3). Test all the ports (9051, 9052, ... and 3129, 3130, ...):

curl --socks5-hostname 127.0.0.1:9051 http://jamie3vkiwibfiwucd6vxijskbhpjdyajmzeor4mc4i7yopvpo4p7cyd.onion/
curl --proxy 127.0.0.1:3129 http://jamie3vkiwibfiwucd6vxijskbhpjdyajmzeor4mc4i7yopvpo4p7cyd.onion/
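If you want to exercise every instance in one go, a small loop over the ports can help (a sketch; adjust the ranges to however many instances you started):

for port in 9051 9052 9053 9054; do
  curl -s -o /dev/null -w "socks $port: %{http_code}\n" --socks5-hostname 127.0.0.1:$port http://jamie3vkiwibfiwucd6vxijskbhpjdyajmzeor4mc4i7yopvpo4p7cyd.onion/
done
for port in 3129 3130 3131 3132; do
  curl -s -o /dev/null -w "proxy $port: %{http_code}\n" --proxy 127.0.0.1:$port http://jamie3vkiwibfiwucd6vxijskbhpjdyajmzeor4mc4i7yopvpo4p7cyd.onion/
done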

If you get something like "Privoxy localhost port forwarding", don't continue; it will not work.

You can then seed the crawler with an onion directory:

./push.sh someoniondirectory.onion

To start the Flask server and see the web interface, first create a Flask secret:

mkdir -p etc/private/
python3 -c 'import os; print("FLASK_SECRET=\"" + os.urandom(32).decode("ascii", errors="backslashreplace") + "\"")' > etc/private/flask.secret
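Note that os.urandom can emit bytes that break the quoting of the generated file; if that bites you, a hex-encoded secret is a more robust variant (an alternative sketch, not the project's original command):

python3 -c 'import os; print("FLASK_SECRET=\"" + os.urandom(32).hex() + "\"")' > etc/private/flask.secret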

Then start the Web server with:

./scripts/web.sh

To access the web interface from your browser, set up port forwarding from the server by running this command on your local computer:

ssh -L 5000:localhost:5000 username@IpAddressOfServer

To check that everything works so far, push a couple of onion directories:

scripts/push.sh someoniondirectory.onion
scripts/push.sh anotheroniondirectory.onion

Run:

scripts/harvest.sh  #To get onions (only detects onions; doesn't go deeper to find bitcoin addresses, emails, etc.)
init/scraper_service.sh  #To start crawling (extracts bitcoin addresses, emails, etc. from the onions already found by harvest.sh)
init/isup_service.sh  #To keep site status up to date
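These scripts run in the foreground; if you are working over SSH, one way to keep them alive after you disconnect is nohup (a sketch; tmux or screen work just as well):

nohup init/scraper_service.sh > scraper.log 2>&1 &
nohup init/isup_service.sh > isup.log 2>&1 &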

Optional ElasticSearch Fulltext Search

The torscraper comes with optional Elasticsearch support (enabled by default). Edit etc/elasticsearch to set the variables, or set ELASTICSEARCH_ENABLED=false to disable it.

Run scripts/elasticsearch_migrate.sh to perform the initial setup after configuration.

If Elasticsearch is disabled there will be no full-text search; however, crawling and discovering new sites will still work.

ElasticSearch

You will need to install Elasticsearch itself (not just the pip package). This is the link to the latest 5.x release: https://www.elastic.co/downloads/past-releases/elasticsearch-5-6-6. You can run into version problems (as mentioned in the warning section above). To make sure you are using the right version, run this command:

curl -XGET 'http://localhost:9200'
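The response is JSON; to pull out just the version number, something like this works (a quick grep sketch):

curl -s 'http://localhost:9200' | grep '"number"'   # should show a 5.x version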

To enable Elasticsearch:

service elasticsearch start
./elasticsearch_migrate.sh  #Performs the initial setup (required once at the beginning); also use it whenever you want to reset Elasticsearch

After a restart:

. venv/bin/activate
./scripts/start.sh  #To start the Tor and Privoxy instances

Flask:

./scripts/web.sh  #Launch Flask to get the web interface

Cronjobs

#Harvest onions from various sources
1 18 * * * /home/freshonions-torscraper/scripts/harvest.sh

#Get ssh fingerprints for new sites
1 4,16 * * * /home/freshonions-torscraper/scripts/update_fingerprints.sh

#Mark sites as genuine / fake from the /r/darknetmarkets superlist
1 1 * * 1 /home/freshonions-torscraper/scripts/get_valid.sh

#Scrape pastebin for onions (needs paid account / IP whitelisting)
*/5 * * * * /home/freshonions-torscraper/scripts/pastebin.sh

#Portscan new onions
1 13 * * * /home/freshonions-torscraper/scripts/portscan_up.sh

#Scrape stronghold paste
32 */2 * * * /home/freshonions-torscraper/scripts/stronghold_paste_rip.sh

#Detect clones
20 14 * * * /home/freshonions-torscraper/scripts/detect_clones.sh

#Keep a sql dump of data
1 */1 * * * mysqldump -u username -ppassword --databases tor --result-file=/home/dump.sql
1 */8 * * * mysqldump -u username -ppassword --databases tor --result-file=/home/dump_backup.sql
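Two caveats with these dump entries: the password is visible on the command line to other local users, and % is a special character in crontab lines (it starts a newline), so it must be escaped as \% if you want date-stamped files. A date-stamped variant of the backup job might look like this (a sketch):

1 */8 * * * mysqldump -u username -ppassword --databases tor --result-file=/home/dump_$(date +\%F).sql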

Infrastructure

Fresh Onions runs on two servers: a frontend host running the database and hidden service website, and a backend host running the crawler. Probably most interesting to the reader is the setup for the backend. Tor as a client is COMPLETELY SINGLE-THREADED. I know! It's 2017, and along with a complete lack of flying cars, Tor runs in a single thread. This means that if you try to run a crawler against a single Tor instance, you will quickly find yourself maxing out a CPU core at 100%.

The solution to this problem is to run multiple Tor instances and connect to them through some kind of frontend that round-robins your requests. The Fresh Onions crawler runs eight Tor instances.

Debian (and Ubuntu) come with a useful program, tor-instance-create, for quickly creating multiple Tor instances. I used Squid as my frontend proxy, but unfortunately it can't connect to SOCKS directly, so I used Privoxy as an intermediate proxy. You will need one Privoxy instance for every Tor instance. There is a script in scripts/create_privoxy.sh to help with creating Privoxy instances on Debian systems. It also helps to replace /etc/privoxy/default.filter with an empty file, to reduce CPU load by removing unnecessary regexes.
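For example, creating and starting eight instances might look like this (a sketch; the instance names are my choice, and each instance still needs its own SocksPort configured under /etc/tor/instances/<name>/torrc):

for i in $(seq 1 8); do
  sudo tor-instance-create tor$i    # creates a user, data dir and /etc/tor/instances/tor$i/torrc
  sudo systemctl start tor@tor$i    # instances run under the tor@ systemd template
done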

Additionally, this resource might be useful for setting up Squid: https://www.howtoforge.com/ultimate-security-proxy-with-tor. If all you are doing is crawling and you don't care about anonymity, I also recommend running Tor in tor2web mode (requires recompilation) for increased speed.

freshonions-torscraper's People

Contributors

dirtyfilthy, l3houx, obilodeau, ypcrts


freshonions-torscraper's Issues

Issue while running ./docker_haproxy_harvest_scrape.sh

Hey,
I ran into an issue when running the ./docker_haproxy_harvest_scrape.sh command. It runs for a while and gives the expected results; then, after 25 seconds or so, it says curl: (6) Could not resolve host: freshonions-torscraper-tor-privoxy. I was wondering what the fix may be, or whether I caused a problem during the install. I believe it has something to do with privoxy and privoxy2. I ran sudo docker-compose down to recreate my containers and see if that was the issue, and when I run sudo docker-compose up the output is:

Creating network "freshonions-torscraper_default" with the default driver
Creating freshonions-torscraper-tor-privoxy2 ...
Creating freshonions-torscraper-tor-privoxy   ... error
Creating freshonions-torscraper-tor-privoxy4  ... done
Creating freshonions-torscraper-tor-privoxy2  ... error
Creating freshonions-torscraper-elasticsearch ...
Creating freshonions-torscraper-db            ... done

ERROR: for freshonions-torscraper-tor-privoxy  Cannot start service tor-privoxy: b'driver failed programming external connectivity on endpoint freshonions-torscraper-tor-privoxy (75b22a9cdcf16822faac449ce940a48feb51b5da9656523f7c9e7e771183cd2a): Error starting userland proxy: listen tcp 0.0.0.0:9050: bind: address already in use'
Creating freshonions-torscraper-tor-privoxy3  ... done
Creating freshonions-torscraper-elasticsearch ... done
2316ec3c8be6928e9b1fa9f45e33e195292a1a): Error starting userland proxy: listen tcp 0.0.0.0:9051: bind: address already in use'
Creating freshonions-torscraper-web-interface ... done
Creating freshonions-torscraper-kibana        ... done
Creating freshonions-torscraper-crawler       ... done

Thanks in advance.

Pin requirements to specific versions

Right now, none of the libraries in requirements.txt are pinned to versions. If any library gets updated, this could break the entire application.

The usual suspect I've had trouble with is the pycrypto library, which, if it varies between versions, can cause a ton of problems in all sorts of low-level dependencies.

Not finish scraper_service.sh

Good morning,

I have a question.

I executed the scraper_service.sh command, but the execution has not finished despite having been running for three days.

I have the tool installed on a single machine with 16 GB of RAM.

Should the execution finish, or is it meant to keep running?

I do not understand what happens.

Thank you very much.

build_corpus

Hi there... there is a script, namely build_corpus.sh, which executes the build_corpus.py file, but there is no such file. Do you have any idea?

What to set for a kibana index

Kibana, on boot, asks for an index. The default is the logstash index, which is empty, so it won't accept it. Assuming this is indexing all of the webpages as documents, what is the proper index being sent to Elasticsearch? We can't figure it out, so we can't actually boot up Kibana properly.

Start.sh script

Good morning

I cannot find the start.sh script. Can you help me?

No such file or directory: rotating-tor-proxies.cfg

[ALERT] 212/103513 (25016) : Cannot open configuration file/directory /home/torscraper/init/../etc/rotating-tor-proxies.cfg : No such file or directory

I got this alert when executing ./start.sh.
Can you help me with this?

Installation Issues

Hi - I followed the updated readme, and it would appear that all the tests work.

curl -v --socks5-hostname 127.0.0.1:9051 http://<working_site>.onion
curl -v --socks5-hostname 127.0.0.1:9054 http://<working_site>.onion
curl -v --proxy 127.0.0.1:3129 http://<working_site>.onion
curl -v --proxy 127.0.0.1:3132 http://<working_site>.onion

All of them return HTML.

When harvest.sh starts, it runs fine at first. Then I notice that I get a lot of 403: Forbidden errors.

https://www.deepwebsiteslinks.com/tor-emails-chat-rooms-links/
Resolving www.deepwebsiteslinks.com (www.deepwebsiteslinks.com)...
Connecting to www.deepwebsiteslinks.com (www.deepwebsiteslinks.com)|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2019-09-01 09:51:50 ERROR 403: Forbidden.

I was wondering if this is an issue with running it through Whonix?

I let it continue to run, and it then gets to the section where it writes to STDOUT.

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

This takes around 2 days to complete, and I get a mixture of 100% received totals and:

% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0curl: (7) Failed to receive SOCKS5 connect request ack.

Is this due to the onion link being dead?

Anyone interested in working together drop me a line. This is a project I would like to continue with.

Remove GoSecure's modified logo from master

People have deployed this software and made it publicly available. The fact that the master branch hosts our logo can mislead people into thinking these deployments are endorsed or performed by GoSecure.

Let's get rid of our customized logo in master.

Issue when running docker_haproxy_harvest_scrape.sh

When I run the docker_haproxy_harvest_scrape.sh command, it spits out a few different things, and then I get 2 hours of:

 100  2197  100  2197    0     0   2034      0  0:00:01  0:00:01 --:--:--  2034
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed to receive SOCKS5 connect request ack.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1249  100  1249    0     0   1881      0 --:--:-- --:--:-- --:--:--  1878
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2357  100  2357    0     0   1082      0  0:00:02  0:00:02 --:--:--  1082
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   162  100   162    0     0    156      0  0:00:01  0:00:01 --:--:--   156
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0

I was wondering how long this is supposed to take and whether this seems right. I'm going to leave it running overnight. The web-service container also receives no data, so it isn't saving anything in there.

scrapy not found

When I try to run './push.sh someoniondirectory.onion', a "scrapy not found" message is shown, but I know scrapy is in venv/bin/scrapy.

Could you help me?

torscraper/middlewares.py : raise ProgrammingError(e)

Hi guys,
I followed your guide to install everything, and it actually works great until scripts/harvest.sh.

As soon as I run the scraper, I get this error:
except dbapi_module.ProgrammingError as e: raise ProgrammingError(e) ProgrammingError: (1064, u"You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near 'JSON))\n AND 'description_json_at' = '1970-01-01 01:00:00'' at line 31")

Referred to:
File "/root/freshonion/torscraper/middlewares.py", line 192, in <genexpr> return (_set_range(r) for r in result or ())

I'm using version 0.7.3 of Pony and version 10.1.34 of MariaDB.
The platform is based on Ubuntu Bionic 18.04.

What could be the solution to this problem?
