wiki-table-scrape

Scrape all the tables from a Wikipedia article into a folder of CSV files.

You can read more about it in the blog post.

Installation

This is a Python 3.5 module that depends on the Beautiful Soup and requests packages.

  1. Clone and cd into this repo.
  2. Install Python 3.5.
  3. Install requirements from pip with pip install -r requirements.txt.
  4. If on Windows, download the .whl for the lxml parser and install it locally.
  5. Test the program by running python test_wikitablescrape.py.
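
If the test run fails on an import, it can help to confirm first that the dependencies are visible to your interpreter. The following is a minimal sketch and not part of this repo: the helper name missing_dependencies is hypothetical, and it assumes the usual import names bs4, requests, and lxml for the packages mentioned above.

```python
import importlib.util


def missing_dependencies(modules=("bs4", "requests", "lxml")):
    """Return the top-level module names from `modules` that cannot be found."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_dependencies()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

An empty list means every listed module can be imported; anything else names the packages still to install.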

Usage

Import the module and call the scrape function. Pass it the full URL of a Wikipedia article and a simple string (no special characters or file extensions) for the output name. All output is written to a folder named after output_name, with one CSV file per table: output_name.csv, output_name_1.csv, and so on.

import wikitablescrape

wikitablescrape.scrape(
    url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    output_name="films"
)
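
The naming scheme described above can be sketched as a small helper. This is an illustration only, not code from the module: expected_output_paths is a hypothetical name, and the number of tables is passed in by hand because it depends on the article.

```python
import os


def expected_output_paths(output_name, table_count):
    """List the CSV paths the naming scheme implies for `table_count` tables."""
    paths = []
    for index in range(table_count):
        # The first table has no suffix; later ones get _1, _2, ...
        suffix = "" if index == 0 else "_{}".format(index)
        paths.append(os.path.join(output_name, output_name + suffix + ".csv"))
    return paths


for path in expected_output_paths("films", 4):
    print(path)
```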

Inspecting the output with Bash gives the following results:

$ ls films/
films.csv  films_1.csv  films_2.csv  films_3.csv

$ cat films/films_1.csv
"Rank","Title","Worldwide gross (2014 $)","Year"
"1","Gone with the Wind","$3,440,000,000","1939"
"2","Avatar","$3,020,000,000","2009"
"3","Star Wars","$2,825,000,000","1977"
"4","Titanic","$2,516,000,000","1997"
"5","The Sound of Music","$2,366,000,000","1965"
"6","E.T. the Extra-Terrestrial","$2,310,000,000","1982"
"7","The Ten Commandments","$2,187,000,000","1956"
"8","Doctor Zhivago","$2,073,000,000","1965"
"9","Jaws","$2,027,000,000","1975"
"10","Snow White and the Seven Dwarfs","$1,819,000,000","1937"
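
Every field in the output is a quoted string, so numeric columns need converting before analysis. Here is a minimal sketch using Python's standard csv module on the first rows shown above. The helper names parse_gross and load_films are hypothetical, and the UTF-8 encoding mentioned in the comment is an assumption about the files.

```python
import csv
import io

# First rows of films_1.csv as shown above. With a real file, use
# open("films/films_1.csv", newline="", encoding="utf-8") instead
# (the encoding is an assumption).
SAMPLE = '''"Rank","Title","Worldwide gross (2014 $)","Year"
"1","Gone with the Wind","$3,440,000,000","1939"
"2","Avatar","$3,020,000,000","2009"
"3","Star Wars","$2,825,000,000","1977"
'''


def parse_gross(text):
    """Convert a string such as "$3,440,000,000" to an int."""
    return int(text.replace("$", "").replace(",", ""))


def load_films(handle):
    """Read rows into dicts and convert the numeric columns."""
    films = []
    for row in csv.DictReader(handle):
        films.append({
            "rank": int(row["Rank"]),
            "title": row["Title"],
            "gross": parse_gross(row["Worldwide gross (2014 $)"]),
            "year": int(row["Year"]),
        })
    return films


films = load_films(io.StringIO(SAMPLE))
print(films[0]["title"], films[0]["gross"])
```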

Disclaimers

The script won't give you 100% clean data for every page on Wikipedia, but it will get you most of the way there. You can see the output from the pages for mountain height, volcano height, NBA scores, and the highest-grossing films in the output folder of this repo.
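
A small post-processing pass can cover part of the remaining distance. The sketch below assumes the leftovers are Wikipedia-style footnote markers such as [1] or [a] plus stray whitespace. That is an assumption for illustration, not something this README specifies, and clean_cell is a hypothetical helper rather than part of the module.

```python
import re

# Matches markers like [1], [23], or [a]; assumed to be footnote residue.
FOOTNOTE_MARKER = re.compile(r"\[(?:\d+|[a-z])\]")


def clean_cell(text):
    """Remove footnote markers and collapse runs of whitespace."""
    without_markers = FOOTNOTE_MARKER.sub("", text)
    return " ".join(without_markers.split())


print(clean_cell("Mount Everest[1] "))
print(clean_cell("8,848  m[a][2]"))
```

Bracketed text that is not a bare number or a single lowercase letter is left alone, so legitimate content in brackets survives.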

I only plan to add features to this module as I need them, but if you would like to contribute, please open an issue or pull request.

If you'd like to read more about this module, please check out my blog post.

Contributors

martinapugliese, rocheio
