nschrading / redditdataextractor

The reddit Data Extractor is a cross-platform GUI tool for downloading almost any content posted to reddit. It supports downloading from specific users, specific subreddits, and users by subreddit, with optional filters on the content. Some intelligence is built in to attempt to avoid downloading duplicate external content.

License: GNU General Public License v3.0

Python 100.00%

redditdataextractor's Issues

Not able to download

I am trying to download files and get the error "to attempt to redownload this file, uncheck "restrict retrieved submissions to creation dates after the last downloaded submission" in the settings". I have this setting unchecked, but it still will not download the files.

Allow saving by the poster in sub folders

Is it possible to store files by subreddit/author/ rather than labeling photos with the reddit id?

Also, for some reason some of the images are in the root of the subreddit folder and others are in subfolders.
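
The requested layout could be sketched roughly as follows. This is only an illustration of a subreddit/author/ folder structure, not how the tool currently saves files; the function name `build_save_path` and its parameters are hypothetical.

```python
from pathlib import Path


def build_save_path(root, subreddit, author, submission_id, ext):
    """Return <root>/<subreddit>/<author>/<submission_id><ext>.

    Hypothetical helper: every file lands in an author folder, so nothing
    ends up loose in the root of the subreddit folder.
    """
    return Path(root) / subreddit / author / (submission_id + ext)


# Example with made-up values.
print(build_save_path("downloads", "pics", "some_author", "abc123", ".jpg"))
```

Keeping the reddit id as the file name inside the author folder avoids collisions when one author posts several files.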

Fails to run, when installed on OSX.

Installed all the required bits, and it doesn't seem to work:

natasha:john(130:355)$ pwd
/Users/john/workarea/bin/redditDataExtractor-master
natasha:john(131:356)$ python3.4 main.py
Traceback (most recent call last):
  File "main.py", line 125, in <module>
    main()
  File "main.py", line 98, in main
    rddtDataExtractor._r.http.validate_certs = 'RedditDataExtractor/cacert.pem'
AttributeError: 'Reddit' object has no attribute 'http'
natasha:john(132:357)$ 

If I comment out line 98 of main.py, it runs, but every subreddit in the default lists fails with a message saying it does not exist:

The subreddit movies does not exist. Remove from list?
A Yes|No dialog pops up for each subreddit in the list; I am using the default lists to test.

I also had problems getting it to run without a praw.ini that included the client_id. Once I had that, it ran, but it didn't read the client_id, and I needed to re-enter it in the extractor.
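
To check whether a praw.ini actually exposes a client_id, a small standalone script along these lines may help. It is a sketch that only parses the INI text with the standard library; it does not reproduce how the extractor or PRAW loads its configuration, and the function name `read_client_id` and the sample value are made up for illustration.

```python
import configparser


def read_client_id(ini_text, site="DEFAULT"):
    """Return the client_id found in praw.ini-style text, or None if absent."""
    # Interpolation is disabled so a literal '%' in a value cannot raise.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if site != "DEFAULT" and not parser.has_section(site):
        return None
    value = parser.get(site, "client_id", fallback=None)
    return value or None  # treat an empty value as missing


# Hypothetical praw.ini contents.
sample = "[DEFAULT]\nclient_id = my_hypothetical_id\n"
print(read_client_id(sample))
```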

Any ideas? I am not a Python guy, so I am operating outside my talent envelope here. But I have successfully installed dozens of other Python apps without too many issues (apart from a file locking issue on one, which I was able to fix and commit). This one has me stumped.

natasha:john(138:363)$ uname -a
Darwin natasha.wrongcrowd.net 16.3.0 Darwin Kernel Version 16.3.0: Thu Nov 17 20:23:58 PST 2016; root:xnu-3789.31.2~1/RELEASE_X86_64 x86_64
natasha:john(139:364)$ python3.4 -V
Python 3.4.6
natasha:john(139:364)$ pip3.4 list
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
beautifulsoup4 (4.5.3)
pathlib (1.0.1)
pip (9.0.1)
praw (4.3.0)
prawcore (0.7.0)
requests (2.13.0)
setuptools (32.3.1)
update-checker (0.16)
youtube-dl (2015.7.4)
natasha:john(138:365)$ port installed | grep -E 'sip|pyqt'
  py-sip @4.19_0 (active)
  py27-sip @4.19_0 (active)
  py34-pyqt4 @4.12.0_0 (active)
  py34-sip @4.19_0 (active)
natasha:john(139:366)$ 

Is there anything I can do to gather more relevant info? The only runtime error I get when I comment out line 98 of main.py is:

libpng warning: iCCP: known incorrect sRGB profile

That does not seem relevant to the issues at hand.

Home folder hardcoded in the premade executable

I tried downloading the premade executable for Linux, and I'm getting the following error when trying to run it:
/exe.linux-x86_64-3.4$ ./redditDataExtractor
/mnt/plaintext_data/Downloads/ubuntu/exe.linux-x86_64-3.4/library.zip/imp.py:32: PendingDeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
Traceback (most recent call last):
  File "/home/jschradi/anaconda3/lib/python3.4/site-packages/cx_Freeze/initscripts/Console.py", line 27, in <module>
  File "main.py", line 125, in <module>
  File "main.py", line 89, in main
  File "main.py", line 84, in loadState
  File "/home/jschradi/anaconda3/lib/python3.4/shelve.py", line 141, in close
  File "/home/jschradi/anaconda3/lib/python3.4/shelve.py", line 168, in sync
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 113, in _commit
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 257, in _chmod
PermissionError: [Errno 1] Operation not permitted: 'RedditDataExtractor/saves/settings.db.dir'
Exception ignored in: <bound method DbfilenameShelf.__del__ of <shelve.DbfilenameShelf object at 0x7fc0433c92b0>>
Traceback (most recent call last):
  File "/home/jschradi/anaconda3/lib/python3.4/shelve.py", line 158, in __del__
  File "/home/jschradi/anaconda3/lib/python3.4/shelve.py", line 141, in close
  File "/home/jschradi/anaconda3/lib/python3.4/shelve.py", line 168, in sync
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 113, in _commit
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 257, in _chmod
PermissionError: [Errno 1] Operation not permitted: 'RedditDataExtractor/saves/settings.db.dir'
Exception ignored in: <bound method _Database.close of <dbm.dumb._Database object at 0x7fc0433c92e8>>
Traceback (most recent call last):
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 250, in close
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 113, in _commit
  File "/home/jschradi/anaconda3/lib/python3.4/dbm/dumb.py", line 257, in _chmod
PermissionError: [Errno 1] Operation not permitted: 'RedditDataExtractor/saves/settings.db.dir'
Looks like a bunch of paths are hardcoded, and don't get ported to a different system.
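
For what it's worth, the failing path in the traceback is the relative 'RedditDataExtractor/saves/settings.db.dir'. One possible way to avoid depending on the launch directory would be to keep the shelve file under a per-user directory. The sketch below assumes that approach; the folder name `.redditDataExtractor` and the function `settings_path` are hypothetical and not taken from the tool, and this is not confirmed to fix the error above.

```python
import shelve
import tempfile
from pathlib import Path


def settings_path(base_dir=None):
    """Return a settings file path under a per-user directory.

    base_dir defaults to the user's home folder; the directory is created
    if it does not exist yet.
    """
    base = Path(base_dir) if base_dir is not None else Path.home()
    save_dir = base / ".redditDataExtractor" / "saves"  # hypothetical name
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir / "settings.db"


# Demonstration in a temporary directory so nothing is written to the home folder.
with tempfile.TemporaryDirectory() as tmp:
    path = settings_path(tmp)
    with shelve.open(str(path)) as db:
        db["example_key"] = "example_value"
    with shelve.open(str(path)) as db:
        restored = db["example_key"]

print(restored)
```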

Only allowed to download max 1000 posts?

In settings there is "Max Posts Retrieved in Subreddit Content Download[1-1000]". What if I want to download more than 1000 posts? 1000 posts is far from enough for data analysis.

Which setting will allow me to extract the next posts (1001-2000, 2001-3000, etc.)? Is there an automatic mechanism to download ALL posts for one reddit topic?

Thanks!

Unrelated issue

Hey there mate, the guy who created this beautiful tool and posted it on GitHub.
You're the reason why I signed up to GitHub to learn how to code.
Just yesterday I couldn't sleep; I was thinking about a tool I wanted to create that crawls a website, gathers links, and downloads them. I found yours the perfect one to start learning with. The thing is, I'm kind of a noob, new to this whole programming thing, and I want to learn Python.
I've installed Python and the PyQt4 that you provided, but it freezes on line 24.
So I want to change that to PyQt5, because I had so many problems with PyQt4.
If I do that, will it affect the rest of the code?
Edit: you know, on second thought this shit is fucking nerve-breaking, mate. You fix a problem and another one pops out from the middle of nowhere: install fucking sip, oh wait, sip doesn't work, it needs qt; install qt, oh shit, qt needs cs, cs needs pv, pv needs dc, dc needs ls, and when you do all that shit the program still won't work at the end. Excuse my language, I don't have time for this shit... I need my smoke.

pre-made Linux executable crashes probably because of a hardcoded home folder

@NSchrading, that's probably related to #12. When I try to launch the pre-made Linux executable, it crashes:

exe.linux-x86_64-3.4$ /home/myusername/rde/exe.linux-x86_64-3.4/library.zip/imp.py:32: PendingDeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
Traceback (most recent call last):
  File "/home/jschradi/anaconda3/lib/python3.4/site-packages/cx_Freeze/initscripts/Console.py", line 27, in <module>
  File "main.py", line 24, in <module>
ImportError: cannot import name 'QApplication'

[5]-  Exit 1                  ./redditDataExtractor

Sadly even with Pull Request #15 your RDE tool does not work - downloads empty .txt files (no comment body) and complains about AttributeError: '<class 'praw.objects.Submission'>' has no attribute 'comments'

@NSchrading Sadly, even with Pull Request #15, your RDE tool does not work: it downloads empty .txt files (no comment body) and complains about AttributeError: '<class 'praw.objects.Submission'>' has no attribute 'comments'. Please reply if you still care about your tool; I think I did about 90% of the work but got a bit stuck.

Videos not downloading

Videos do not seem to download; instead, an "Uncheck Restrict retrieved submissions to creation dates after the last downloaded submission" error is logged, even though this setting is verified to be unchecked. Also, does this support redgifs.com-hosted gifs and gifv as well?

name of file

Is it possible to change the name of the file to the Title of the post?
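
If titles were used as file names, they would need cleaning first, since post titles can contain characters that are not allowed in file names on some platforms. A minimal sketch of that idea follows; the function `filename_from_title`, the length cap, and the choice to append the reddit id are all assumptions for illustration, not existing behaviour of the tool.

```python
import re


def filename_from_title(title, submission_id, ext, max_len=100):
    """Build a file name from a post title.

    Characters that are invalid on common filesystems are replaced with '_',
    whitespace runs are collapsed, and the reddit id is appended so two
    posts with the same title do not overwrite each other.
    """
    cleaned = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    cleaned = cleaned[:max_len].rstrip(" .")
    if not cleaned:
        cleaned = "untitled"
    return "{} [{}]{}".format(cleaned, submission_id, ext)


# Example with made-up values.
print(filename_from_title("My cat: before/after", "abc123", ".jpg"))
```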

extractor not validating any subreddits

Hi -

I'm hoping to download a specific subreddit's data (r/PoliticalDiscussion) but receive an error message each time I try to use the extractor. Even the default subreddits produce a message "The subreddit does not exist." Am I doing something wrong?

Thanks!

Top does not get top of all time

When choosing top as the sorting option, it only gets the top posts from less than a year ago. Is there a way to expand that to top from "all time"? I would like to archive a few now-dead subreddits, and I don't want to get all the spam and waste the 1000 post limit with "new".
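
For reference, reddit's listing endpoint for top posts takes a time-range parameter `t`, and "all" is one of its accepted values. The sketch below only builds such a URL with the standard library; it is an illustration of the parameter, not how the extractor issues its requests, and the helper name `top_listing_url` is hypothetical. Whether the tool exposes this option is not confirmed here.

```python
from urllib.parse import urlencode

# Time ranges accepted by the "t" parameter of the top listing.
VALID_TIME_FILTERS = ("hour", "day", "week", "month", "year", "all")


def top_listing_url(subreddit, time_filter="all", limit=100, after=None):
    """Build the URL of a subreddit's top listing for a given time range."""
    if time_filter not in VALID_TIME_FILTERS:
        raise ValueError("unknown time filter: {!r}".format(time_filter))
    params = {"t": time_filter, "limit": limit}
    if after is not None:
        # "after" is the fullname of the last item on the previous page.
        params["after"] = after
    return "https://www.reddit.com/r/{}/top.json?{}".format(
        subreddit, urlencode(params)
    )


print(top_listing_url("movies"))
```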
