
web-scraper's Introduction

  • Note: This repository is a work in progress. More scripts will be added to this repository in the near future.

What is Web Scraping

Web scraping is a software technique for extracting large amounts of information from websites. It mainly focuses on transforming unstructured data (in the form of HTML) into structured data (a database, CSV). This is particularly useful when we want to organize and analyse data obtained from a website outside of the browser, and then study and draw patterns from it.

Difference between Web Crawler and Web Scraper

Web crawlers browse through a series of links present on webpages and index them in a database. This process of following links across webpages is referred to as crawling. As crawlers are used at a large scale (i.e. run on a large number of websites at once), they yield generic information. Their main purpose is locating information on the web and indexing it.

Web scrapers primarily extract data from webpages. They capture the contents of the pages they have crawled, extract the required information, and store it in an organized manner for further study. Though web scrapers can crawl to different pages, their primary purpose is scraping the data on those pages, not indexing the web.

Is it Legal?

Generally, if you are using scraped data for personal use and do not plan to republish it, it may not cause any problems. Read the Terms of Use, Conditions of Use, and the robots.txt before scraping a website. You must follow the robots.txt rules when scraping; otherwise, the website owner has every right to take legal action against you.

You can check the robots.txt of a website by adding /robots.txt after the URL of the website. The following links give useful insight into robots.txt; a small programmatic check is sketched after the list.

  1. About /robots.txt
  2. How to Read and Respect Robots.txt
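
As a minimal sketch of such a check, Python's standard-library urllib.robotparser can tell you whether a given user agent is allowed to fetch a URL. The site and path below are placeholders, not part of this repository.

```python
from urllib import robotparser

# Point the parser at the site's robots.txt (placeholder URL).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# True if the rules allow a generic user agent ("*") to fetch this path.
print(rp.can_fetch("*", "https://example.com/some/page"))
```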

Python Tools for Web Scraping

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. It helps you pull particular content from a webpage, remove the HTML markup, and save the information. In short, it is a parser that helps you clean up and work with the documents you have pulled down from the web.
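
A minimal sketch of this workflow, assuming Requests is used to fetch the page (the URL is a placeholder):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page (placeholder URL).
response = requests.get("https://example.com")
response.raise_for_status()

# Parse the HTML and pull out the text and target of every link.
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))
```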

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival. It can also be used to extract data using APIs. It is a complete web scraping solution.
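
To illustrate, here is a small spider sketch against quotes.toscrape.com (the practice site used in Scrapy's own tutorial); the CSS selectors are specific to that site:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract structured items from the current page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if any, and parse the next page too.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as quotes_spider.py, this can be run standalone with `scrapy runspider quotes_spider.py -o quotes.json`, which writes the scraped items to a JSON file.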

Selenium is an open-source web testing tool. It is used to automate browser activities through a driver (WebDriver), which makes it particularly useful for scraping websites that use JavaScript to serve content.
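
A minimal sketch with Chrome (Selenium 4+ can locate a matching driver itself; older versions need e.g. chromedriver on the PATH, and the URL below is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch a browser session driven by WebDriver.
driver = webdriver.Chrome()
try:
    driver.get("https://example.com")  # placeholder URL
    # JavaScript-rendered content is available once the page has loaded.
    for element in driver.find_elements(By.TAG_NAME, "a"):
        print(element.text, element.get_attribute("href"))
finally:
    driver.quit()
```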

Other libraries (such as Requests, lxml, and requests-html) also have their own benefits. All of these libraries have a gentle learning curve and good documentation.

Resources

  1. Web Scraping Toolbox
  2. 5 Tasty Python Web Scraping Libraries
