
django-data-warehouse's Introduction

A Data Warehouse with Django and Scrapy

A Django app that stores scraped website data so that it can later be used as an import source.

It's a work in progress and not ready for use in a production environment.

Many parts of this project are based on previous work I have done. See the credits section below.

It's highly likely that this project will change significantly over time 💥

How it works so far

  1. Initial command to obtain links to all pages to scrape: scrapy crawl sitemap
  2. Collect the page content for each site map page: scrapy crawl pages
  3. Run the command python manage.py build_blocks to "build the blocks" from the scraped data (page content)

Setup

You'll need a running WordPress site to scrape data from. I used a local install of WordPress with the default theme and sample content.

  1. Clone this repo
  2. Create a virtualenv and install the requirements: poetry install, then poetry shell
  3. Create the database and a user for the project: python manage.py migrate, then python manage.py createsuperuser
  4. Run the initial command to obtain links to all pages to scrape: scrapy crawl sitemap from the warehouse/sitemap/spiders directory
  5. Collect the page content for each site map page: scrapy crawl pages from the warehouse/pages/spiders directory
  6. Run the command python manage.py build_blocks from the root directory of the project to "build the blocks" from the scraped data (page content)
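
The last three steps can also be chained from one script. The following is only a minimal sketch, not part of the project: the commands and directories are taken from the steps above, while the names PIPELINE, plan and run_pipeline are hypothetical helpers introduced here for illustration. It assumes the poetry shell is active so that scrapy and python resolve to the project's environment.

```python
import subprocess
from pathlib import Path

# (command, working directory relative to the project root), in the order
# the setup steps list them. PIPELINE is a hypothetical name for this sketch.
PIPELINE = [
    (["scrapy", "crawl", "sitemap"], "warehouse/sitemap/spiders"),
    (["scrapy", "crawl", "pages"], "warehouse/pages/spiders"),
    (["python", "manage.py", "build_blocks"], "."),
]


def plan(project_root):
    """Return (command, working directory) pairs for each step."""
    root = Path(project_root)
    return [(cmd, root / cwd) for cmd, cwd in PIPELINE]


def run_pipeline(project_root, runner=subprocess.run):
    """Run each step in order.

    With the default runner, check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit code, so later steps do not
    run against incomplete data.
    """
    for cmd, cwd in plan(project_root):
        runner(cmd, cwd=cwd, check=True)
```

Calling run_pipeline(".") from the root directory of the project would run the sitemap crawl, the pages crawl and build_blocks in turn. The runner parameter exists only so the sequence can be inspected without actually launching the commands.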

TODO

  • Add tests
  • Refine the Django admin interface
  • Add a JSON API so the data can be accessed from a Wagtail site for import
  • and more...

Dependencies

Production

Development

License

MIT

Credits

Previous work I have done and where I have pulled ideas from

django-data-warehouse's People

Contributors

nickmoreton
