felipelodur / simplepycrawler

This project was forked from danielgunna/simplepycrawler

License: MIT License

Python 96.96% Shell 3.04%

SimplePyCrawler

A simple web crawler developed as coursework for Graph Algorithms - PUC Minas

What is this supposed to do?

This simple script crawls the hyperlinks contained in an HTML (Hypertext Markup Language) page from a specified domain and then fetches the hyperlinked pages to construct a graph. The graph's nodes represent the hyperlinks, and its edges indicate that there is a hyperlink reference from one page to another (this is a directed graph).
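
The hyperlink-extraction step described above can be sketched with only the Python standard library; this is a minimal illustration, not necessarily the parser the original script uses:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every anchor tag found in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/about">About</a> <a href="https://example.com">Ext</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/about', 'https://example.com']
```

Each extracted link becomes a candidate node, and the (source page, link) pair becomes a directed edge of the graph.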

  • How is the graph built? The graph is built recursively in such a way that newly referenced pages are visited to search for more hyperlinks and further extend the graph. This recursive process continues until the script completes a previously specified number of jumps.

  • How is the graph structured? This approach uses an adjacency list to represent the graph. New hyperlinks are always appended at the end of the adjacency list, and thus the hyperlinks are fetched following a breadth-first search (BFS) strategy.
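
The BFS construction described above can be sketched as follows. This is a toy model: `FAKE_WEB` stands in for real HTTP fetching, and the `visited` set (which avoids infinite loops on cycles) is an assumption of this sketch, since cycle handling is listed as a TODO in the original script:

```python
from collections import deque

# Toy "web": page -> hyperlinks it contains (stands in for real HTTP fetching).
FAKE_WEB = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def crawl(start, max_jumps):
    """Breadth-first crawl building an adjacency list, expanding up to max_jumps levels."""
    adjacency = {}                  # node -> ordered list of out-links (directed edges)
    queue = deque([(start, 0)])     # FIFO queue gives breadth-first order
    visited = {start}
    while queue:
        page, depth = queue.popleft()
        links = FAKE_WEB.get(page, [])
        adjacency[page] = list(links)   # new links always appended at the end
        if depth < max_jumps:
            for link in links:
                if link not in visited:
                    visited.add(link)
                    queue.append((link, depth + 1))
    return adjacency

print(crawl("a", 1))  # {'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}
```

Because the queue is first-in, first-out, pages are fetched level by level from the start URL, which is exactly the BFS order the README describes.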

How does it work?

The Python package manager pip is required to configure the environment for this script, because it is used to install the project's dependencies. The required dependencies are listed below:

Before running the config_env.sh configuration script, you need to make it executable by typing chmod +x config_env.sh in a terminal. Afterwards, type ./config_env.sh to run the configuration script; it asks for your password because some steps may require root privileges. Finally, wait for the process to conclude.

After you have executed all previous steps, simply run the script with a Python interpreter by typing python SimplePyCrawler.py in a terminal. You will then be asked to enter a URL to be processed and to choose how many jumps you want the graph to consider.

Todo

I do not intend to maintain this script anymore. If you want to contribute, or would simply like to use this script in your own project, below are some enhancement suggestions.

  • Implement a more robust method to verify whether a URL is valid;
  • Handle graph cycles;
  • Implement politeness in the HTML document fetching process (such as respecting robots.txt and waiting some time before repeating requests) to avoid having your bot/IP banned by web servers;
  • Handle URL fetching that generates duplicate edges.
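
The first and third suggestions can both be addressed with the Python standard library. The sketch below is one possible approach, not part of the original script: urllib.parse for structural URL validation and urllib.robotparser for robots.txt rules (parsed from inline text here; a real crawler would download the file with RobotFileParser.set_url() and .read()):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_valid_url(url):
    """A stricter check than substring matching: require an HTTP(S) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

# Politeness: consult robots.txt before fetching a page.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

print(is_valid_url("https://example.com/page"))                # True
print(is_valid_url("not a url"))                               # False
print(robots.can_fetch("*", "https://example.com/private/x"))  # False
```

A crawler would call is_valid_url() and robots.can_fetch() before each request, and could additionally sleep between requests to the same host to avoid being banned.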

simplepycrawler's People

Contributors: danielgunna, felipelodur

Watchers: James Cloos