hom3chuk / readit

A tool with no package dependencies that extracts Pushshift.io data archives from The Eye Reddit Archives into HTML files that are lightweight and redistributable

License: MIT License

JavaScript 100.00%

readit

"Read it" /ɹɛd ɪt/ is a tool with no package dependencies that extracts Pushshift.io data archives from The Eye Reddit Archives into HTML files that are

  • redistributable: each one can be copied elsewhere as a single HTML file.
  • lightweight: only the needed data is embedded, namely text and links to OP images.
  • compatible: no JS is needed to render the files, so they can be viewed on pretty much anything capable of rendering HTML.

Requirements

  • Node.js v16 or higher (how to install)
  • any zstd-compatible app that can unpack .zst files:
    • Windows: look for zstd-vx.y.z-win64.zip or zstd-vx.y.z-win32.zip in the latest release's "Assets" section on the official zstd releases page
    • OS X: install zstd using Homebrew, then run unzstd filename.zst
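
To confirm the Node.js requirement before going further, a small check like the one below can help. This is only a sketch and not part of readit; the helper names `nodeMajor` and `meetsRequirement` are made up for illustration.

```javascript
// Minimal check that the running Node.js meets the "v16 or higher" requirement.
// The function names here are illustrative and do not come from readit.
const REQUIRED_MAJOR = 16;

function nodeMajor(version) {
  // A version string looks like "18.17.1"; the major version is the first part.
  return Number.parseInt(version.split('.')[0], 10);
}

function meetsRequirement(version, requiredMajor = REQUIRED_MAJOR) {
  return nodeMajor(version) >= requiredMajor;
}

// process.versions.node holds the running version without the leading "v".
const current = process.versions.node;
console.log(
  meetsRequirement(current)
    ? `Node.js ${current} is new enough`
    : `Node.js ${current} is too old; v${REQUIRED_MAJOR} or higher is required`
);
```

Save it under any name (for example check-node.js) and run it with node check-node.js.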

Usage

1. Clone the repo

Clone the repo via git clone, or just download the ZIP archive and unpack it. Run all commands and keep all data files inside that directory.

No installation step is required: no yarn, no npm install. Just download it and it's ready to go.

2. Unpack zst archives for both Submissions and Comments

We'll use r/Permaculture as an example. After downloading both archives from The Eye, you'll have two files:

$ ls
Permaculture_comments.zst   Permaculture_submissions.zst

Unpack them:

$ unzstd Permaculture_comments.zst 
Permaculture_comments.zst: 425907510 bytes  

$ unzstd Permaculture_submissions.zst 
Permaculture_submissions.zst: 97972912 bytes 

This leaves you with two more files: Permaculture_comments and Permaculture_submissions.
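
If you want to sanity-check an unpacked file before running readit, you can peek at its first record. The sketch below assumes the unpacked dumps are newline-delimited JSON (one JSON object per line), which is how Pushshift dumps are normally laid out; `peekRecords` is a hypothetical helper, not something readit ships.

```javascript
const fs = require('fs');
const readline = require('readline');

// Read up to `limit` records from a newline-delimited JSON file without
// loading the whole file into memory. Blank lines are skipped.
async function peekRecords(path, limit = 1) {
  const records = [];
  if (limit <= 0) return records;
  const input = fs.createReadStream(path, { encoding: 'utf8' });
  const rl = readline.createInterface({ input, crlfDelay: Infinity });
  for await (const line of rl) {
    if (line.trim() === '') continue;
    records.push(JSON.parse(line)); // throws if a line is not valid JSON
    if (records.length >= limit) break;
  }
  rl.close();
  input.destroy(); // stop reading the rest of a potentially huge file
  return records;
}

// Example (uncomment once the unpacked file exists in the current directory):
// peekRecords('Permaculture_submissions').then((r) => console.log(Object.keys(r[0])));
```

A parse error on the first line usually means the file is still compressed or the download was incomplete.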

3. Run readit

Keep in mind: readit creates a cache subfolder where it stores indexed comments data. For a 406 MB comments file, the cache used 453 MB of disk storage (OS X). Make sure you have enough free disk space when processing huge subreddits.
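
As a rough guide, you can extrapolate from the single measurement above. The sketch below assumes the cache grows linearly with the size of the comments file, which is a guess based on one data point and may not hold for other subreddits or file systems; `estimateCacheMB` is an illustrative name.

```javascript
// Single observed data point from this README: 406 MB of comments -> 453 MB of cache.
const OBSERVED_COMMENTS_MB = 406;
const OBSERVED_CACHE_MB = 453;

// Linear extrapolation, rounded up to a whole MB. Treat the result as a
// ballpark figure only and leave extra headroom on disk.
function estimateCacheMB(commentsMB) {
  return Math.ceil((commentsMB * OBSERVED_CACHE_MB) / OBSERVED_COMMENTS_MB);
}

console.log(`A 1000 MB comments file may need about ${estimateCacheMB(1000)} MB of cache`);
```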

Now we can run readit to process these files into readable HTML (don't forget to provide your sub name):

$ node readit.js Permaculture
Prebuilding comments cache (each dot is 1k comments processed)
...............................................................................................................................................................................................................................................................................................................................................
Processing submissions (each dot is 100 posts)
................................................................................................................................................................................................................................................................................................

Once readit starts the second phase (Processing submissions), you can see the out directory being populated with HTML files. Those are ready to be used or distributed right away.

The HTML files are pretty simple, yet some of Reddit's markdown is supported.
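
To illustrate the two-phase idea (index comments first, then render each submission), here is a simplified sketch. It is not readit's actual code: it keeps the index in memory instead of in an on-disk cache, skips markdown and comment threading, and the function names are made up. It relies only on the `link_id`, `author`, `body`, `id`, `title` and `selftext` fields found in Reddit records.

```javascript
// Phase 1: group comments by the submission they belong to.
// Reddit comments carry `link_id` in the form "t3_<submission id>".
function indexComments(comments) {
  const bySubmission = new Map();
  for (const c of comments) {
    const id = String(c.link_id).replace(/^t3_/, '');
    if (!bySubmission.has(id)) bySubmission.set(id, []);
    bySubmission.get(id).push(c);
  }
  return bySubmission;
}

// Escape user-supplied text so it cannot break the generated markup.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Phase 2: render one submission and its comments as a static, JS-free page.
function renderSubmission(submission, index) {
  const comments = index.get(submission.id) || [];
  const items = comments
    .map((c) => `<li><b>${escapeHtml(c.author)}</b>: ${escapeHtml(c.body)}</li>`)
    .join('\n');
  return [
    '<!DOCTYPE html>',
    `<html><head><meta charset="utf-8"><title>${escapeHtml(submission.title)}</title></head>`,
    `<body><h1>${escapeHtml(submission.title)}</h1>`,
    `<p>${escapeHtml(submission.selftext || '')}</p>`,
    `<ul>\n${items}\n</ul></body></html>`,
  ].join('\n');
}
```

Building the comment index before touching submissions means each page can be written in a single pass, which matches the order of the two progress lines printed above.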

4. Cleanup

If you plan on extracting another sub, please manually delete the cache and out directories beforehand.
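
If you prefer to script the cleanup, something like the sketch below could do it. `cleanWorkDirs` is a hypothetical helper rather than part of readit, and it assumes the cache and out directories sit directly inside the directory you pass in. It deletes them permanently, so copy any HTML you want to keep out of out first.

```javascript
const fs = require('fs');
const path = require('path');

// Remove the `cache` and `out` directories under baseDir, if present.
// Returns the names of the directories that were actually removed.
function cleanWorkDirs(baseDir) {
  const removed = [];
  for (const name of ['cache', 'out']) {
    const target = path.join(baseDir, name);
    if (fs.existsSync(target)) {
      // recursive: delete contents too; force: ignore races where it vanished
      fs.rmSync(target, { recursive: true, force: true });
      removed.push(name);
    }
  }
  return removed;
}

// Example (run from the readit directory):
// console.log(cleanWorkDirs(process.cwd()));
```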

Roadmap

  • a lil bit more (selectable) styles in output HTML
  • download from The Eye & unpack
  • search/partial export
  • better UX with cleaning up and cache
  • word cloud index so you can explore topics you never knew about

Example HTML

[Two example images of the generated HTML]

