This project forked from medialab/hyphe

Websites crawler with built-in exploration and control web interface

License: Other

Hyphe: web corpus builder & links crawler

Welcome to Hyphe, developed by SciencesPo's médialab for the DIME-SHS Web project (Equipex).

Hyphe aims to provide a tool for crawling the web and generating networks of links between what we call WebEntities, each of which can be a single page, a whole website, or a combination of these.

Demo

You can try a restricted version of Hyphe at the following URL: http://hyphe.medialab.sciences-po.fr/demo/

Papers and references

Papers about Hyphe

Papers using Hyphe

Easy start

DISCLAIMER: Hyphe changed a lot between versions 0.1 and 0.2. Migrating from an older version by pulling the code from git has been supported as well as possible, but reinstalling from scratch is highly recommended. Older corpora can be rerun by exporting the list of WebEntities from the old version and recrawling from that list of URLs in the new version.

Install a release

For an easy install, the best solution is to download the release version directly. It was built to run on various GNU/Linux distributions (Ubuntu, Debian, CentOS...).

Users of macOS and other distributions can now also run Hyphe locally on their machine using Docker, thanks to @oncletom's work. See the Docker setup section below.

Just uncompress the release archive, go into the directory and run the installation script.

Do not use sudo: the script will do so on its own and will ask for your password only once. It works this way so that it can install all missing dependencies in one go, mainly Java (OpenJDK-6-JRE), Python (python-dev, pip, virtualEnv, virtualEnvWrapper...), Apache2, MongoDB & ScrapyD.

If you are not comfortable with this, you can read the script and run its steps one by one, or follow the Advanced install instructions for more control over what is actually installed.

# WARNING: DO NOT prefix any of these commands with sudo!
tar xzvf hyphe-release-*.tar.gz
cd Hyphe
./bin/install.sh

To install from the git sources, or if you want to contribute to Hyphe's development, please follow the Advanced install documentation.

Configure Hyphe

Before starting Hyphe, you should probably adjust its settings. Everything you need to change is in the global configuration file config/config.json.

Please read the Configuration documentation for details.
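Since a stray comma in config/config.json is an easy mistake to make, it can help to check that the file still parses before starting the daemon. The following is a minimal sketch, not part of Hyphe: the function name load_config is hypothetical, and it assumes only that the file is a JSON object at the top level. It does not know or validate any of Hyphe's actual settings.

```python
import json
from pathlib import Path


def load_config(path="config/config.json"):
    """Parse a JSON configuration file and return it as a dict.

    Hypothetical helper: it checks syntax only, not Hyphe's settings.
    """
    text = Path(path).read_text(encoding="utf-8")
    try:
        config = json.loads(text)
    except json.JSONDecodeError as err:
        # Report where the syntax error is so it can be fixed by hand.
        raise ValueError(
            f"{path}: invalid JSON at line {err.lineno}, "
            f"column {err.colno}: {err.msg}"
        ) from err
    if not isinstance(config, dict):
        # Assumption: the configuration is a JSON object, not a list or scalar.
        raise ValueError(f"{path}: expected a JSON object at the top level")
    return config
```

Calling load_config() from the Hyphe directory either returns the parsed settings or raises a ValueError pointing at the offending line and column.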

Run Hyphe

Hyphe relies on a web interface communicating with a server daemon which must be running at all times. To start, stop or restart the daemon, run (again, no sudo):

bin/hyphe <start|restart|stop> [--nologs]

By default the starter displays Hyphe's log in the console using tail. You can press Ctrl-C at any time without stopping the daemon. Use the --nologs option to disable this display.

You can always check the logs for both the core backend and each corpus' MemoryStructure in the log directory:

tail -f log/hyphe-*.log
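For a one-off snapshot rather than a live tail, the same files can be read from a script. This is a rough sketch under the assumption that the logs are plain text files matching log/hyphe-*.log as in the command above; the function name tail_logs and its parameters are illustrative only.

```python
from collections import deque
from pathlib import Path


def tail_logs(log_dir="log", pattern="hyphe-*.log", lines=20):
    """Return {file name: last `lines` lines} for each matching log file."""
    result = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        with path.open(encoding="utf-8", errors="replace") as handle:
            # A bounded deque keeps only the final `lines` lines in memory.
            last = deque(handle, maxlen=lines)
        result[path.name] = [line.rstrip("\n") for line in last]
    return result
```

Unlike tail -f, this reads each file once and returns, which makes it convenient for quick checks or for attaching recent log lines to a bug report.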

As soon as the daemon is started, you can start playing with the web interface on your local machine at the following URL: http://localhost/hyphe.
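If the page does not load, a small script can tell you whether anything answers at that address at all. The sketch below is an illustration using only Python's standard library; the function name interface_is_up and the timeout value are assumptions, and a True result only means that an HTTP request succeeded, not that Hyphe's backend is healthy.

```python
import urllib.error
import urllib.request


def interface_is_up(url="http://localhost/hyphe", timeout=5.0):
    """Return True if an HTTP GET on `url` succeeds with a status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP 4xx/5xx error.
        return False
```

If this returns False right after starting the daemon, the logs described above are the first place to look.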

Serve on the web

You can already use Hyphe through the website on localhost. However, if you want to let others use it as well (typically if you installed it on a remote server), you need to make a few adjustments to the Apache configuration.

Please read the dedicated WebService documentation to do so.

Docker setup

Docker enables isolated installation and execution of software stacks, which can be an easy way to install Hyphe locally on an individual computer, including on unsupported platforms such as macOS. Follow the Docker install instructions to install Docker on your machine.

Install Docker Compose to set up and orchestrate the Hyphe services with a single command:

docker-compose up

If you use boot2docker, for instance on macOS, you might first need to run the following:

boot2docker up
# and copy paste the 3 lines starting with export to set the environment variables

It will take a couple of minutes to spin everything up the first time. Once the services are ready, you can access the frontend interface by connecting to its IP address:

open http://$(docker inspect -f '{{.NetworkSettings.IPAddress}}' hyphe_frontend_1):8000

Or, if you use boot2docker:

open http://$(boot2docker ip):8000
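The open command is specific to macOS. On other systems, the docker inspect lookup shown above can be scripted instead. The following is a hedged sketch: the function names are hypothetical, the container name hyphe_frontend_1 and port 8000 are taken from the commands above, and it assumes the docker CLI is on your PATH.

```python
import ipaddress
import subprocess


def frontend_url(ip, port=8000):
    """Build the frontend URL from a container IP, rejecting malformed input."""
    address = ipaddress.ip_address(ip.strip())  # raises ValueError if not an IP
    host = f"[{address}]" if address.version == 6 else str(address)
    return f"http://{host}:{port}"


def container_ip(name="hyphe_frontend_1"):
    """Ask Docker for the container's IP address (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "inspect", "-f", "{{.NetworkSettings.IPAddress}}", name],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if docker inspect fails
    )
    return result.stdout.strip()
```

For example, print(frontend_url(container_ip())) prints the address to paste into a browser. An empty result from docker inspect makes frontend_url raise a ValueError instead of producing a broken URL.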

Note: this is not a production setup. Use docker-compose.yml as a starting point for understanding how to distribute the application across one or more machines.

Advanced developer features & contributing

Please read the dedicated Developers documentation and the API description.

What's next?

See our roadmap!

Authors

Mathieu Jacomy & Benjamin Ooghe-Tabanou @ SciencesPo médialab

Discover more of our projects at médialab tools

This work is supported by DIME-WEB part of DIME-SHS research equipment financed by the EQUIPEX program (ANR-10-EQPX-19-01).

Hyphe is free software released under the LGPL and CeCILL-C licenses.
