sbatururimi / nutch-test

Different examples of using Nutch: with Solr, Selenium Hub, and standalone web drivers

License: MIT License

nutch-test's Introduction

Installing Nutch

Option 1: Nutch only

docker build --force-rm -t nutch .

Option 2: selenium hub + nutch + solr

Selenium Hub with 10 Chrome nodes and 10 Firefox nodes, each in headless mode:

docker-compose -f docker-compose_selenium_nutch_solr.yaml up -d --scale chrome=10 --scale firefox=10

Option 3: nutch + solr

docker-compose -f docker-compose_nutch_solr.yaml up -d

Option 4: selenium hub + nutch + solr + tor instances

docker-compose -f docker-compose_selenium_nutch_solr_tor.yaml up -d --scale firefox=40

Installing Chrome Driver

Use this option when not using Selenium Hub.

  1. Install Chrome browser:
  • Edit sources.list
vi /etc/apt/sources.list
# add at the bottom of the file
deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main
  • Download the signing key
wget https://dl.google.com/linux/linux_signing_key.pub
apt-key add linux_signing_key.pub
  • Install the stable version of Google Chrome
apt update
apt install google-chrome-stable

NB: You may need to update the package index and then upgrade your packages:

apt update
apt upgrade
  2. Download ChromeDriver from the download page:
cd ~
wget https://chromedriver.storage.googleapis.com/2.44/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
rm chromedriver_linux64.zip
  3. If necessary, change the ChromeDriver binary path in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary.
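
As an illustration of step 3, the sketch below prints a property element that could be pasted inside the configuration element of conf/nutch-site.xml. The helper name nutch_property is hypothetical, and /root/chromedriver is an assumed path (the archive unpacked in the home directory of root); adjust both to your setup.

```shell
#!/bin/sh
# Print one Nutch <property> element for pasting into conf/nutch-site.xml.
# Usage: nutch_property NAME VALUE   (hypothetical helper, not part of Nutch)
nutch_property() {
  cat <<EOF
<property>
  <name>$1</name>
  <value>$2</value>
</property>
EOF
}

# /root/chromedriver is an assumed location of the unpacked driver binary.
nutch_property selenium.grid.binary /root/chromedriver
```

The same helper can be reused for the geckodriver and Opera driver paths in the sections that follow.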

Installing Firefox Driver

Use this option when not using Selenium Hub.

  1. Install Firefox browser:
apt install firefox
  2. Download geckodriver from the download page:
cd ~
wget https://github.com/mozilla/geckodriver/releases/download/v0.23.0/geckodriver-v0.23.0-linux64.tar.gz
tar -zxvf geckodriver-v0.23.0-linux64.tar.gz
rm geckodriver-v0.23.0-linux64.tar.gz
  3. If necessary, change the geckodriver binary path in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary.

Installing Opera Driver

Use this option when not using Selenium Hub.

  1. Install the Opera browser by downloading the latest version:
wget http://download4.operacdn.com/ftp/pub/opera/desktop/56.0.3051.99/linux/opera-stable_56.0.3051.99_amd64.deb
dpkg -i opera-stable_56.0.3051.99_amd64.deb
apt install -f

NB Update to the appropriate Opera version.

  2. Download the Opera driver from the download page:
cd ~
wget https://github.com/operasoftware/operachromiumdriver/releases/download/v.2.40/operadriver_linux64.zip
unzip operadriver_linux64.zip
rm operadriver_linux64.zip
mv operadriver_linux64/operadriver /root
chmod +x operadriver
  3. If necessary, change the Opera driver binary path in nutch-default.xml or nutch-site.xml by setting the value of selenium.grid.binary.
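
Before pointing selenium.grid.binary at any of the drivers above, it may help to confirm that the binary exists and is executable. This is a minimal sketch: check_driver is a hypothetical helper, and the three paths assume each archive was unpacked in the home directory as in the steps above.

```shell
#!/bin/sh
# Report whether a driver binary is present and executable.
# Usage: check_driver PATH   (hypothetical helper)
check_driver() {
  if [ -x "$1" ] && [ ! -d "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or not executable: $1" >&2
    return 1
  fi
}

# Assumed locations: adjust to wherever the drivers were unpacked.
for driver in "$HOME/chromedriver" "$HOME/geckodriver" "$HOME/operadriver"; do
  check_driver "$driver" || true
done
```

A driver reported as not executable can usually be fixed with chmod +x, as done for the Opera driver above.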

Run a test

  1. Set the value of selenium.driver in conf/nutch-site.xml to the Selenium driver you want to test.
  2. If no screen is attached to the server, set selenium.enable.headless to true.
  3. Run the crawl:
# connect to the nutch container
docker exec -it nutch bash

# execute the crawl
/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/mycore -s urls crawler 1
  4. Check the result:
  • Open Solr in your browser: localhost:8983/
  • Navigate to the created core, mycore.
  • Execute the default query:
*:*
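
The same check can be sketched from the command line through Solr's select handler. The helper name solr_select_url is hypothetical; the host, port, and core name are taken from the steps above, and the rows value is an arbitrary choice.

```shell
#!/bin/sh
# Build the select URL for a Solr core, using the match-all query *:*.
# Usage: solr_select_url BASE_URL CORE [ROWS]   (hypothetical helper)
solr_select_url() {
  printf '%s/solr/%s/select?q=*:*&rows=%s\n' "$1" "$2" "${3:-10}"
}

url="$(solr_select_url http://localhost:8983 mycore 5)"
echo "$url"

# With the stack running, fetch the first indexed documents as JSON:
# curl -s "$url"
```

The curl call is left commented out so that the snippet also runs on a machine where Solr is not up.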

Hints

Redirects: to follow redirects immediately in the fetcher, adjust http.redirect.max (e.g., set it to 3); the Fetcher then follows each redirect as soon as it encounters it. For quick testing, you can also set the required parameters on the command line, e.g.:

% bin/nutch parsechecker -Dplugin.includes='protocol-selenium|parse-tika' \
   -Dselenium.grid.binary=.../geckodriver \
   -Dselenium.enable.headless=true \
   -followRedirects \
   -dumpText https://nutch.apache.org
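
The same idea may carry over to the crawl command from "Run a test". The sketch below is a dry run that only prints the command; it assumes the crawl script accepts more than one -D option, and nutch_defines is a hypothetical helper.

```shell
#!/bin/sh
# Turn key=value pairs into -D options for the crawl script.
# Usage: nutch_defines KEY=VALUE...   (hypothetical helper)
nutch_defines() {
  for pair in "$@"; do
    printf '%s %s ' -D "$pair"
  done
}

# Dry run: print the crawl command with redirects followed in the fetcher.
# The value 3 for http.redirect.max is the example from the hint above.
echo "/root/nutch/bin/crawl -i $(nutch_defines solr.server.url=http://solr:8983/solr/mycore http.redirect.max=3)-s urls crawler 1"
```

Printing the command first makes it easy to review the properties before running a long crawl.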

License

License: MIT
