Introduction to Scraping

This repository contains resources related to scraping.

Slides and notes

A presentation introducing scraping and examples of its use in journalism is on Slideshare here
Another presentation using IFTTT and Google Sheets to explore scraping techniques is on Slideshare here

Tutorials

IMPORTHTML: Here's a tutorial on using scraping functions in Google Sheets
You can also follow this tutorial by Dan Wainwright
How to scrape webpages and ask questions with Google Docs and =importXML
More importXML: Asking questions of a webpage – and finding out when those answers change
And more: Scraping data from a list of webpages using Google Docs
How to: find the data behind an interactive chart or map using the inspector
Here's a tutorial on using Workbench to scrape Twitter - and regex

Examples

I keep examples of stories that have used scraping on Pinboard at https://pinboard.in/u:paulbradshaw/t:scrapingeg - you can drill down to more specific examples by adding + and a particular topic, e.g. https://pinboard.in/u:paulbradshaw/t:scrapingeg+sport will narrow down to examples related to sport.

Here are some of the examples in the slides, with a little explanation:

Channel 4 News: Why is the government website carrying fake jobs? (video) - this was based on a scrape of the Universal Jobmatch website by a non-journalist, with whom the reporter worked.
Daily Mirror: Which singer has the best vocal range in the UK - No, it's not who you think - a great example of spotting a story in a website which isn't explicitly 'data': each page on MusicNotes.com contains the vocal range, among other pieces of information - see this example
BuzzFeed: The Tennis Racket - tennis results and odds needed to be scraped to establish unusual patterns
Private Eye: Selling England (and Wales) by the pound
BBC News: David Cameron's prime questioners - scraped Prime Minister's Questions to see which names appeared the most
O'Reilly Radar: You're A Bigger Deal On Twitter Than You Think - based on a massive scrape of Twitter accounts to get a picture of the distribution of followers
FT: Interactive: explore the statistical identity of every team at the World Cup - based on a scrape of player profiles from whoscored.com
Nature: Scientific publishing: The inside track - based on a scrape of journal papers to identify the most prolific publishers
New York Times: Inside the Evolving Hotel Bathroom - based on a scrape of hotel reviews and a text analysis
Sunday Post: Council sick days cost taxpayers £250m - based on a scrape of Excel spreadsheets published by every council in Scotland.
Independent (also Guardian and others): Seb Coe promised an 'uplifting torch relay to inspire a generation'. So are these really the role models to do it? - one of the earliest stories based on a scrape of the Olympic Torch Relay website. All the stories to come from the investigation can be found in the free ebook 8000 Holes: How the Olympic Torch Relay Lost its Way
BBC News: Libraries lose a quarter of staff as hundreds close - based on a combination of FOI requests and scraping hundreds of reports by CIPFA.
BBC News: Check NHS cancer, A&E and operations targets in your area - based on live scraping of NHS 'sitrep' spreadsheets, but also manual checking and cleaning.
BBC News: Help to Buy Isa scheme 'helps lucky few' - based on a scrape of Zoopla, who were approached to obtain permission. The scraped data was more detailed than Zoopla were able to easily provide.

What makes a website suitable for scraping?

Ultimately, it comes down to structure:

An HTML table (using the <table> tag) is the easiest thing to scrape.
Alternatively, other HTML structures such as tags (e.g. <div>) or attributes (e.g. class) can be used.
Another structure might be textual, e.g. the information is always preceded by a particular string of characters.
URL structures are also important. Examples include:
- Pagination, e.g. p=1, p=2 and so on.
- Search criteria may be encoded in the URL in key-value pairs, e.g. category=companies
- ID codes that refer to entities being described, e.g. SchoolID=5323320
Website structures: often it is possible to start from an 'index' page that links to all the pages or documents that you want to scrape.
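
As a rough illustration of the first and third structures above, here is a minimal Python sketch using only the standard library. It is not taken from the tutorials or the book: the sample HTML, the sample text, and the names `TableParser`, `scrape_table` and `scrape_after_label` are all made up for this example, and real pages usually need more defensive handling (nested tables, missing closing tags and so on).

```python
import re
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def scrape_after_label(text, label):
    """Return whatever follows `label` on each line where it appears."""
    pattern = re.compile(re.escape(label) + r"\s*(.+)")
    return [match.group(1).strip() for match in pattern.finditer(text)]


# Hypothetical sample data, standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Pupils</th></tr>
  <tr><td>Northfield</td><td>412</td></tr>
  <tr><td>Southbank</td><td>380</td></tr>
</table>
"""

SAMPLE_TEXT = "Name: A. Singer\nVocal range: C3 to G5\nGenre: Pop\n"

if __name__ == "__main__":
    print(scrape_table(SAMPLE_HTML))
    print(scrape_after_label(SAMPLE_TEXT, "Vocal range:"))
```

The table parser relies on the tag structure, while `scrape_after_label` relies only on the information always being preceded by the same string, which is why it also works on plain text.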

URLs can often be simplified: many sites insert keywords into URLs for SEO or analytics purposes, and these can be removed without preventing the URL from working. For example, in Reed jobs URLs the job title part can be replaced by anything: only the code for the job is essential.
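
The URL structures described above can be sketched in Python with the standard `urllib.parse` module. Everything site-specific here is an assumption: `example.com`, the `/jobs/<title>/<code>` path layout and the function names are hypothetical, and whether a real site still responds once the keyword part is replaced has to be tested by hand.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def page_urls(base, params, first, last, page_key="p"):
    """Build one URL per page by adding p=1, p=2, ... to the query string."""
    urls = []
    for page in range(first, last + 1):
        query = urlencode({**params, page_key: page})
        urls.append(f"{base}?{query}")
    return urls


def query_params(url):
    """Read the key-value pairs encoded in a URL (first value per key)."""
    pairs = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in pairs.items()}


def replace_slug(url, placeholder="x"):
    """Swap the second-to-last path segment (assumed to be a keyword slug)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if len(segments) >= 3:
        segments[-2] = placeholder
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    # Hypothetical URLs, used only to show the shape of the output.
    print(page_urls("https://example.com/search", {"category": "companies"}, 1, 3))
    print(query_params("https://example.com/school?SchoolID=5323320&p=2"))
    print(replace_slug("https://example.com/jobs/senior-reporter/12345"))
```

Generating the list of URLs first, and checking a few of them in a browser, is usually a sensible step before fetching anything in bulk.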

How easy is it to scrape a website?

Different challenges require different skills and tools. Broadly speaking, these are the things to consider, roughly in order of increasing difficulty:

First, check if there is an API for the data (e.g. MediaWiki for Wikipedia) - this is designed to be queried to collect data, in other words to be 'scraped'
HTML tables are the easiest - these can be scraped by most tools
HTML tags are somewhat harder
Textual patterns come next
Generating the list of pages to scrape may add further difficulty
Websites which require cookies are problematic, but there are ways to scrape these
Document types add further difficulty: XML and JSON are still data formats but will need converting
XLS and XLSX files are more difficult still
PDFs present the highest level of difficulty - if they are scanned PDFs then you will need to consider an OCR (optical character recognition) stage
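
To give a sense of what 'converting' JSON or XML involves, here is a minimal Python sketch that turns both into the same rows and writes them out as CSV. The sample data, the `results` key, the `school` tag and the function names are invented for illustration. Real feeds are often nested more deeply and need more unpacking than this.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample documents describing the same two records.
JSON_TEXT = (
    '{"results": [{"name": "Northfield", "pupils": 412},'
    ' {"name": "Southbank", "pupils": 380}]}'
)
XML_TEXT = (
    "<schools>"
    "<school><name>Northfield</name><pupils>412</pupils></school>"
    "<school><name>Southbank</name><pupils>380</pupils></school>"
    "</schools>"
)


def rows_from_json(text, list_key):
    """Return the list of records stored under `list_key`."""
    return json.loads(text)[list_key]


def rows_from_xml(text, record_tag):
    """Turn each <record_tag> element into a dict of child tag -> text."""
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]


def to_csv(rows, fieldnames):
    """Write a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(rows_from_json(JSON_TEXT, "results"), ["name", "pupils"]))
    print(to_csv(rows_from_xml(XML_TEXT, "school"), ["name", "pupils"]))
```

Once the data is in rows like this it can be opened in a spreadsheet, which is why these formats sit lower on the difficulty scale than XLS files or PDFs.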

Tools

Other scraping tools can be found in my bookmarks

Law and ethics

Where to find coders

Hacks/Hackers is a network of meetups where journalists and developers can meet and learn from each other, collaborate etc.
Democracy Club is a network of civic coders - people who want to use their coding skills to improve society and, in particular, democratic accountability. They often organise events.
Look out for hackdays organised within the news industry, too, where journalists and developers are put together to explore ideas and potential projects.

Resources

The book Scraping for Journalists covers scraping using Google Sheets and tools like OutWit Hub, plus programming techniques for scraping databases, spreadsheets and PDFs.
