discourse-reader

Data pipeline to access and store data from Discourse forums

Design decisions

  • Dynamic data (such as counts) is currently stored directly in the topics table, rather than generated on demand through joins with posts and other tables.
  • Each run pulls down the entire raw API scrape. It would be more efficient to fetch only the information that is new since a given date.
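
To make the first trade-off concrete, here is a minimal sketch using Python's built-in `sqlite3` module and an in-memory database. The column names `topic_id`, `posts_count`, and `post_id` come from the schema listed further down; the actual database engine, the sample rows, and the helper names `stored_count` and `derived_count` are assumptions for illustration only.

```python
import sqlite3

# In-memory database with a trimmed-down version of the topics and posts tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, title TEXT, posts_count INTEGER);
    CREATE TABLE posts  (post_id  INTEGER PRIMARY KEY, topic_id INTEGER, cooked TEXT);
    """
)
# Made-up sample rows: one topic whose stored count says it has two posts.
conn.execute("INSERT INTO topics VALUES (1, 'Welcome', 2)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(10, 1, "<p>hi</p>"), (11, 1, "<p>hello</p>")],
)


def stored_count(conn, topic_id):
    """Current approach: read the count saved on the topic row itself."""
    row = conn.execute(
        "SELECT posts_count FROM topics WHERE topic_id = ?", (topic_id,)
    ).fetchone()
    return row[0]


def derived_count(conn, topic_id):
    """Alternative approach: compute the count by joining topics to posts."""
    row = conn.execute(
        """
        SELECT COUNT(p.post_id)
        FROM topics t LEFT JOIN posts p ON p.topic_id = t.topic_id
        WHERE t.topic_id = ?
        """,
        (topic_id,),
    ).fetchone()
    return row[0]
```

The stored count is a single-row lookup but can drift from the posts table between scrapes; the derived count always matches the stored posts at the cost of a join.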

Raw Data Structure

1. Users
  - Single JSON file containing all information the Discourse API returns about users
  - List of dictionaries (no edits from API)
2. Raw Categories
  - Single JSON file containing all general information about each category
  - List of dictionaries (no edits from API)
3. Individual Categories
  - One JSON file for each category returned by the API
  - Single dictionary per file
  - Posts and Topics fields are inserted into the dictionary in addition to the raw information from the API
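
As an illustration of the individual category files, the sketch below builds one such dictionary and writes it to disk. The key names `"topics"` and `"posts"`, the use of the category `slug` as the file name, and both function names are assumptions; the repository may name these differently.

```python
import json
from pathlib import Path


def build_category_record(raw_category, topics, posts):
    """Return a copy of the raw API dictionary with topics and posts added.

    The "topics" and "posts" key names are assumed for illustration.
    """
    record = dict(raw_category)  # shallow copy so the raw data stays untouched
    record["topics"] = list(topics)
    record["posts"] = list(posts)
    return record


def write_category_file(record, out_dir):
    """Write one category dictionary to its own JSON file and return the path.

    Naming the file after the category slug is an assumption.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record['slug']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Copying the raw dictionary before adding fields keeps the "no edits from API" property of the raw categories file intact.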

Contributors

rohanbansal12, clemp

discourse-reader's Issues

Multi-Threading

Implement multi-threading for the raw scrape to speed up data collection.
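
One possible shape for this, sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`. The function name `scrape_categories`, the `fetch_category` callable (whatever function performs the HTTP request for one category), and the default of 8 workers are all hypothetical; none of them come from the repository.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_categories(category_ids, fetch_category, max_workers=8):
    """Fetch several categories concurrently and return {category_id: result}.

    fetch_category is a caller-supplied function taking one category id;
    it stands in for the real API call, which is not shown here.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the category id it was submitted for.
        futures = {pool.submit(fetch_category, cid): cid for cid in category_ids}
        for future in as_completed(futures):
            cid = futures[future]
            # result() re-raises any exception raised inside the worker thread.
            results[cid] = future.result()
    return results
```

Threads suit this job because the scrape is I/O-bound. The worker count should be chosen with the forum's request limits in mind, since too many parallel requests may get throttled.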

Set up scheduled job

Configure the EC2 instance on AWS to run the scrape and ingestion pipeline on a fixed schedule.

Insert/Update Data

Find a way to scrape only new information and update the existing database rows, instead of re-creating the entire table on each run.
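
A minimal sketch of the update half of this idea, using `sqlite3` and SQLite's `INSERT ... ON CONFLICT ... DO UPDATE` upsert (available in SQLite 3.24 and later). The choice of SQLite, the reduced set of columns, the function names, and the `last_seen` cutoff are assumptions. The filter also assumes that `updated_at` values are ISO 8601 strings in one uniform format, so that plain string comparison orders them correctly.

```python
import sqlite3


def create_posts_table(conn):
    """Create a trimmed-down posts table keyed on post_id."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS posts (
            post_id    INTEGER PRIMARY KEY,
            topic_id   INTEGER,
            cooked     TEXT,
            updated_at TEXT
        )
        """
    )


def upsert_posts(conn, posts, last_seen):
    """Insert new posts and update changed ones; return how many were written.

    Only posts whose updated_at is later than last_seen are written, so
    rows that have not changed since the previous run are left alone.
    """
    fresh = [p for p in posts if p["updated_at"] > last_seen]
    conn.executemany(
        """
        INSERT INTO posts (post_id, topic_id, cooked, updated_at)
        VALUES (:post_id, :topic_id, :cooked, :updated_at)
        ON CONFLICT(post_id) DO UPDATE SET
            topic_id   = excluded.topic_id,
            cooked     = excluded.cooked,
            updated_at = excluded.updated_at
        """,
        fresh,
    )
    conn.commit()
    return len(fresh)
```

This only avoids rebuilding the table. Avoiding the full download as well would need an API query that returns just the recent items, which is not covered here.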

Create likes table

Lower priority: eventually create a likes table for Discourse posts.

Initial DB Schema

Category/Guild

  • category_id
  • name
  • slug
  • topic_count
  • post_count
  • position
  • description
  • description_text
  • description_excerpt
  • topic_url
  • read_restricted
  • notification_level
  • has_children
  • num_featured_topics
  • minimum_required_tags
  • topics_day
  • topics_week
  • topics_month
  • topics_year
  • topics_all_time
  • uploaded_logo
  • uploaded_background

Topic

  • topic_id
  • category_id
  • title
  • fancy_title
  • slug
  • posts_count
  • reply_count
  • image_url
  • created_at
  • bumped
  • bumped_at
  • archetype
  • unseen
  • pinned
  • excerpt
  • visible
  • closed
  • archived
  • bookmarked
  • liked
  • tags
  • views
  • like_count
  • has_summary
  • pinned_globally
  • featured_link

Posts

  • post_id
  • topic_id
  • user_id
  • name
  • username
  • created_at
  • cooked
  • post_number
  • post_type
  • updated_at
  • reply_count
  • reply_to_post_number
  • quote_count
  • incoming_link_number
  • reads
  • readers_count
  • score
  • read
  • bookmarked
  • admin
  • staff
  • hidden
  • deleted_at
  • user_deleted
  • edit_reason
  • like_count

User

  • user_id
  • username
  • name
  • time_read
  • likes_received
  • likes_given
  • topics_entered
  • topic_count
  • post_count
  • posts_read
  • days_visited
  • avatar_template
  • title
  • ethUser
