Introduction to Scraping

This repository contains resources related to scraping.

Slides and notes

Tutorials

Examples

I keep examples of stories that have used scraping on Pinboard at https://pinboard.in/u:paulbradshaw/t:scrapingeg - you can drill down to more specific examples by adding + and a particular topic, e.g. https://pinboard.in/u:paulbradshaw/t:scrapingeg+sport will narrow down to examples related to sport.

Here are some of the examples in the slides, with a little explanation:

What makes a website suitable for scraping?

Ultimately, it comes down to structure:

  • An HTML table (using the <table> tag) is the easiest thing to scrape.
  • Alternatively, other HTML structures such as tags or attributes (e.g. class) can be used.
  • Another structure might be textual, e.g. the information is always preceded by a particular string of characters.
  • URL structures are also important. Examples include:
    • Pagination, e.g. p=1, p=2 and so on.
    • Search criteria may be encoded in the URL in key-value pairs, e.g. category=companies
    • ID codes that refer to entities being described, e.g. SchoolID=5323320
  • Website structures: often it is possible to start from an 'index' page that links to all the pages or documents that you want to scrape.
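As a rough illustration of why a <table> is the easiest structure to scrape, here is a minimal sketch that pulls rows out of an HTML table using only Python's standard library. The sample HTML, the school name and the names TableParser and scrape_table are made up for this example; a real page would normally be fetched first and may need more forgiving handling.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table as a list of rows (hypothetical helper)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample standing in for a downloaded page
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>SchoolID</th></tr>
  <tr><td>Example Primary</td><td>5323320</td></tr>
</table>
"""

for row in scrape_table(SAMPLE_HTML):
    print(row)
```

Because each row and cell is marked by its own tag, the scraper never has to guess where one value ends and the next begins, which is what makes tables simpler than tag-based or textual patterns.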

URLs can often be simplified: many sites insert keywords into URLs for SEO or analytics purposes, and these can be removed without preventing the URL from working. For example, in Reed jobs URLs the job title part can be replaced by anything: only the code for the job is essential.
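The URL structures described above can be worked with directly in code. The sketch below is only illustrative: the example.com address is invented, the "utm_" prefix is an assumption based on a common analytics naming convention, and strip_tracking and paginated_urls are hypothetical helper names. It removes analytics parameters from a URL and generates the list of paginated URLs to scrape.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url, prefixes=("utm_",)):
    """Drop query parameters whose names start with an analytics prefix.

    The default "utm_" prefix is an assumption; check which parameters
    a given site can really do without by testing the shortened URL.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith(prefixes)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def paginated_urls(base_url, last_page, param="p"):
    """Build one URL per results page, e.g. p=1, p=2 and so on."""
    parts = urlsplit(base_url)
    # Keep the other key-value pairs, but replace any existing page number
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    urls = []
    for page in range(1, last_page + 1):
        page_query = urlencode(query + [(param, str(page))])
        urls.append(urlunsplit(parts._replace(query=page_query)))
    return urls


# Invented URL combining search criteria, pagination and an analytics tag
messy = "https://example.com/search?category=companies&utm_source=newsletter&p=1"
clean = strip_tracking(messy)
print(clean)
for url in paginated_urls(clean, 3):
    print(url)
```

Generating the list of pages up front like this keeps the "which pages?" question separate from the "what is on each page?" question, which makes a scraper easier to test.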

How easy is it to scrape a website?

Different challenges require different skills and tools. Broadly speaking, the following factors affect difficulty, listed roughly from easiest to hardest:

  • Check first whether there is an API for the data (e.g. MediaWiki for Wikipedia): an API is designed to be queried to collect data, in other words to be 'scraped'
  • HTML tables are the easiest - these can be scraped by most tools
  • HTML tags are somewhat harder
  • Textual patterns come next
  • Generating the list of pages to scrape may add further difficulty
  • Websites which require cookies are problematic, but there are ways to scrape these
  • Document types add further difficulty: XML and JSON are still data formats but will need converting
  • XLS and XLSX files are more difficult still
  • PDFs present the highest level of difficulty - if they are scanned PDFs then you will need to consider an OCR (optical character recognition) stage
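To give a sense of what "converting" a data format involves, here is a minimal sketch that turns a JSON list of records into CSV text using Python's standard library. It assumes the JSON is a flat list of objects; nested data would need flattening first. The sample records and the json_records_to_csv name are invented for illustration.

```python
import csv
import io
import json


def json_records_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    records = json.loads(json_text)

    # Use every key seen in any record, in first-seen order, as the header
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # missing keys are written as empty cells
    return buffer.getvalue()


# Invented sample data standing in for a downloaded JSON document
SAMPLE_JSON = """
[
  {"SchoolID": 5323320, "name": "Example Primary"},
  {"SchoolID": 1, "name": "Other School", "town": "Exampleton"}
]
"""

print(json_records_to_csv(SAMPLE_JSON))
```

The resulting text can be saved with a .csv extension and opened in a spreadsheet.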

Tools

Other scraping tools can be found in my bookmarks

Law and ethics

Where to find coders

  • Hacks/Hackers is a network of meetups where journalists and developers can meet and learn from each other, collaborate etc.
  • Democracy Club is a network of civic coders - people who want to use their coding skills to improve society and, in particular, democratic accountability. They often organise events.
  • Look out for hackdays organised within the news industry, too, where journalists and developers are put together to explore ideas and potential projects.

Resources

  • The book Scraping for Journalists covers scraping using Google Sheets and tools like OutWit Hub, plus programming techniques for scraping databases, spreadsheets and PDFs.
