
mattpodolak / pmaw


A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches.

License: MIT License

Python 100.00%
reddit api wrapper multithreaded data-science big-data reddit-api

pmaw's Introduction

PMAW: Pushshift Multithread API Wrapper

Description

PMAW is a wrapper for the Pushshift API that uses multithreading to retrieve Reddit comments and submissions. General usage is through the PushshiftAPI class, which provides methods for interacting with the different Pushshift endpoints; please see the Pushshift Docs for more details on the endpoints and accepted parameters. Parameters are provided as keyword arguments when calling a method, and some methods have required parameters. When a method is called, PMAW completes all the API calls needed for the query before returning a Response generator object.

The following three methods are currently supported:

  • Search Comments: search_comments
  • Search Submissions: search_submissions
  • Search Submission Comment IDs: search_submission_comment_ids

Getting Started

Why Multithread?

Building large datasets from Reddit submission and comment data can require thousands of calls to the Pushshift API. The time it takes to pull all of this data is limited by both your network latency and the response time of the Pushshift server, which can vary throughout the day.

Existing libraries such as PRAW and PSAW run requests sequentially, so thousands of API calls can take many hours to complete. Since API requests are I/O-bound, they benefit from being run concurrently across multiple threads. Intelligent rate limiting helps minimize both the number of rejected requests and the total time to completion.

Installation

PMAW currently supports Python 3.5 or later. To install it via pip, run:

$ pip install pmaw

General Usage

from pmaw import PushshiftAPI
api = PushshiftAPI()

View the optional parameters for PushshiftAPI in the Parameters section below.

Features

Multithreading

The number of threads used for multithreading is set with the num_workers parameter. It is optional and defaults to 10; however, you should provide a value, as the default may not be appropriate for your machine. Increasing the number of threads allows you to make more concurrent requests to Pushshift, but the returns are diminishing because requests are constrained by the rate limit. The optimal number of threads is typically between 10 and 20, depending on the current response time of the Pushshift server.

When selecting the number of threads, you can follow one of two methodologies:

  • Number of processors on the machine, multiplied by 5
  • The minimum of 32 and (the number of processors plus 4)

If you are unsure how many processors you have, use os.cpu_count().
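
As a rough sketch, the second methodology could be applied like this (the exact value should still be tuned to your machine and the current Pushshift response times):

import os

from pmaw import PushshiftAPI

# min(32, processors + 4); fall back to 1 worker if the CPU count is unknown
num_workers = min(32, (os.cpu_count() or 1) + 4)
api = PushshiftAPI(num_workers=num_workers)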

Rate Limiting

Several options are available for rate-limiting your Pushshift API requests, grouped into two types: rate-averaging and exponential backoff. If you're unsure which to use, refer to the benchmark comparison.

Rate-Averaging

By default, PMAW rate-limits using rate-averaging so that concurrent API requests to the Pushshift server are limited to your provided rate.

Providing a rate_limit value is optional; it defaults to 60 requests per minute, which is the recommended value for interacting with the Pushshift API. Increasing this value above 60 will increase the number of rejected requests and the burden on the Pushshift server. The maximum recommended value is 100 requests per minute.

Additionally, the rate-limiting behaviour can be constrained by the max_sleep parameter, which allows you to select a maximum period of time to sleep between requests.
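
For example, a minimal rate-averaging configuration might look like the following (the values shown are illustrative, not required):

from pmaw import PushshiftAPI

# rate-averaging is the default limit_type; target 60 requests per minute
# and never sleep longer than 30 seconds between requests
api = PushshiftAPI(rate_limit=60, max_sleep=30)
comments = api.search_comments(subreddit="science", limit=1000)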

Exponential Backoff

Exponential backoff can be used by setting limit_type to backoff. Four flavours of backoff are available based on the type of jitter used: None, full, equal, and decorr (decorrelated).

Exponential backoff is calculated by multiplying the base_backoff by 2 to the power of the number of failed batches. This allows batches to be spaced out, reducing the resulting rate-limit when requests start to be rejected. However, the threads will still be requesting at nearly the same time, increasing the overall number of required API requests. The exponential backoff sleep values are capped by the max_sleep parameter.

Introducing an element of randomness called jitter allows us to reduce competition between threads and distribute the API requests across the window, reducing the number of rejected requests (see the configuration sketch after the list below).

  • full jitter selects the length of sleep for a request by randomly sampling from a normal distribution for values between 0 and the capped exponential backoff value.
  • equal jitter selects the length of sleep for a request by adding half the capped exponential backoff value to a random sample from a normal distribution between 0 and half the capped exponential backoff value.
  • decorr - decorrelated jitter is similar to full jitter but increases the maximum jitter based on the last random value, selecting the length of sleep by the minimum value between max_sleep and a random sample between the base_backoff and the last sleep value multiplied by 3.
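
A minimal sketch of an exponential backoff configuration, assuming the default base_backoff of 0.5s and capping sleeps with max_sleep (the values shown are illustrative):

from pmaw import PushshiftAPI

# exponential backoff with full jitter: the backoff grows as
# base_backoff * 2 ** failed_batches, capped at max_sleep seconds
api = PushshiftAPI(limit_type='backoff', jitter='full',
                   base_backoff=0.5, max_sleep=60)
submissions = api.search_submissions(subreddit="science", limit=1000)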

Caching

Memory Safety

Memory safety allows us to reduce the amount of RAM used when requesting data, and can be enabled by setting mem_safe=True on a search method. This feature should be used if a large amount of data is being requested or if the machine in use has a limited amount of RAM.

When enabled, PMAW caches retrieved responses every 20 batches by default (approximately 20,000 responses with 10 workers); this can be changed by passing a different value for file_checkpoint when instantiating the PushshiftAPI object.

When the search is complete, a Response generator object is returned. When iterating through the responses with this generator, cached responses are loaded one cache file at a time.

Safe Exiting

Safe exiting ensures that if a search method is interrupted, any unfinished requests and current responses are cached before exiting. If the search method completes successfully, all of the responses are also cached. This can be enabled by setting safe_exit=True on a search method.

Re-running a search method with the exact same parameters as a previous run will load previous responses and any unfinished requests from the cache, allowing it to resume if all the required responses have not yet been retrieved. If there are no unfinished requests, the responses from the cache are returned.

A before value is required to load previous responses / requests when using a non-id based search, as before is otherwise set to the current time when the search method is called, which would result in a different set of parameters than when you last ran the search despite all other parameters being the same.

As with the memory safety feature, a Response generator object is returned, and cached responses are loaded one cache file at a time when iterating through the generator.

PRAW Enrichment

Enrich results with the most recent metadata from Reddit by passing a PRAW Reddit instance when instantiating the PushshiftAPI. Results not found on Reddit will not be enriched or returned.

If you don’t already have a client ID and client secret, follow Reddit’s First Steps Guide to create them. A user agent is a unique identifier that helps Reddit determine the source of network requests. To use Reddit’s API, you need a unique and descriptive user agent.

Custom Filtering

A user-defined function can be provided with the filter_fn parameter for either the search_submissions or search_comments method. Each item is passed to this function before it is saved: the item is filtered out if the function returns False and saved if it returns True. The limit parameter does not take into account any results that are filtered out.

Unsupported Parameters

  • order='asc' is unsupported as it can have unexpected results
  • until and since only support epoch time (float or int)
  • aggs are unsupported, as PMAW is intended to be used for collecting large numbers of submissions or comments. Use PSAW for aggregation requests.

Feature Requests

  • For feature requests, please open an issue with the feature request label; this allows features to be better prioritized for future releases.

Parameters

Objects

PushshiftAPI

  • num_workers (int, optional): Number of workers to use for multithreading, defaults to 10.
  • max_sleep (int, optional): Maximum rate-limit sleep time (in seconds) between requests, defaults to 60s.
  • rate_limit (int, optional): Target number of requests per minute for rate-averaging, defaults to 60 requests per minute.
  • base_backoff (float, optional): Base delay in seconds for exponential backoff, defaults to 0.5s
  • batch_size (int, optional): Size of batches for multithreading, defaults to number of workers.
  • shards_down_behavior (str, optional): Specifies how PMAW will respond if some shards are down during a query. Options are 'warn' to only emit a warning, 'stop' to throw a RuntimeError, or None to take no action. Defaults to 'warn'.
  • limit_type (str, optional): Type of rate limiting to use, options are 'average' for rate averaging, 'backoff' for exponential backoff. Defaults to 'average'.
  • jitter (str, optional): Jitter to use with backoff, options are None, 'full', 'equal', 'decorr'. Defaults to None.
  • checkpoint (int, optional): Size of interval in batches to print a checkpoint with stats, defaults to 10
  • file_checkpoint (int, optional): Size of interval in batches to cache responses when using mem_safe, defaults to 20
  • praw (praw.Reddit, optional): Used to enrich the Pushshift items retrieved with metadata directly from Reddit

Response

Response is a generator object which will return the responses once when iterated over.

  • len(Response) will return the number of responses that were retrieved from Pushshift
  • load_cache(key, cache_dir=None) returns an instance of Response with the responses loaded with the provided key

search_submissions and search_comments

  • max_ids_per_request (int, optional): Maximum number of ids to use in a single request, defaults to 500, maximum 500.
  • max_results_per_request (int, optional): Maximum number of items to return in a single non-id based request, defaults to 100, maximum 100.
  • mem_safe (boolean, optional): If True, stores responses in cache during operation, defaults to False
  • search_window (int, optional): Size in days for search window for submissions / comments in non-id based search, defaults to 365
  • safe_exit (boolean, optional): If True, will safely exit if interrupted by storing current responses and requests in the cache. Will also load previous requests / responses if found in cache, defaults to False
  • cache_dir (str, optional): An absolute or relative folder path for caching responses when mem_safe or safe_exit is enabled (see the sketch after this list).
  • filter_fn (function, optional): A function used to custom filter results before saving them. Accepts a single comment or submission and returns False to filter out the item, otherwise returns True.
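
A brief sketch combining some of these options; the cache folder path is illustrative:

from pmaw import PushshiftAPI

api = PushshiftAPI()
# cache responses to a local folder while using memory safety and safe exiting
posts = api.search_submissions(subreddit="science", limit=100000,
                               mem_safe=True, safe_exit=True, cache_dir='./cache')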

Keyword Arguments

  • Unlike the Pushshift API, the until and since keyword arguments must be in epoch time (see the conversion sketch after this list)
  • limit is the number of submissions/comments to return. If set to None, or if the set limit is higher than the number of submissions/comments available for the provided parameters, limit will be set to the amount available.
  • Other accepted parameters are covered in the Pushshift documentation for submissions and comments.
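
If you are starting from datetime objects, one way to produce the required epoch values is shown below (a sketch; the dates are illustrative):

import datetime as dt

from pmaw import PushshiftAPI

api = PushshiftAPI()
# convert datetimes to integer epoch seconds for since/until
since = int(dt.datetime(2020, 12, 1).timestamp())
until = int(dt.datetime(2021, 2, 1).timestamp())
comments = api.search_comments(subreddit="science", limit=1000, since=since, until=until)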

search_submission_comment_ids

  • ids is a required parameter and should be an array of submission ids; a single id can be passed as a string
  • max_ids_per_request (int, optional): Maximum number of ids to use in a single request, defaults to 500, maximum 500.
  • mem_safe (boolean, optional): If True, stores responses in cache during operation, defaults to False
  • safe_exit (boolean, optional): If True, will safely exit if interrupted by storing current responses and requests in the cache. Will also load previous requests / responses if found in cache, defaults to False
  • cache_dir (str, optional) - An absolute or relative folder path to cache responses in when mem_safe or safe_exit is enabled

Keyword Arguments

  • Other accepted parameters are covered in the Pushshift documentation

Examples

The following examples are for pmaw version >= 1.0.0.

Comments

Search Comments

from pmaw import PushshiftAPI

api = PushshiftAPI()
comments = api.search_comments(subreddit="science", limit=1000)
comment_list = [comment for comment in comments]

Search Comments by IDs

from pmaw import PushshiftAPI

api = PushshiftAPI()
comment_ids = ['gjacwx5','gjad2l6','gjadatw','gjadc7w','gjadcwh',
  'gjadgd7','gjadlbc','gjadnoc','gjadog1','gjadphb']
comments = api.search_comments(ids=comment_ids)
comment_list = [comment for comment in comments]

You can supply a single comment id to ids either as a string or as an array of length 1

Detailed Example

Search Comment IDs by Submission ID

from pmaw import PushshiftAPI

api = PushshiftAPI()
post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
comment_ids = api.search_submission_comment_ids(ids=post_ids)
comment_id_list = [c_id for c_id in comment_ids]

You can supply a single submission id to ids either as a string or as an array of length 1

Detailed Example

Submissions

Search Submissions

from pmaw import PushshiftAPI

api = PushshiftAPI()
posts = api.search_submissions(subreddit="science", limit=1000)
post_list = [post for post in posts]

Search Submissions by IDs

from pmaw import PushshiftAPI

api = PushshiftAPI()
post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
posts = api.search_submissions(ids=post_ids)
post_list = [post for post in posts]

You can supply a single submission id to ids either as a string or as an array of length 1

Detailed Example

Advanced Examples

PRAW

import praw
from pmaw import PushshiftAPI

reddit = praw.Reddit(
 client_id='YOUR_CLIENT_ID',
 client_secret='YOUR_CLIENT_SECRET',
 user_agent=f'python: PMAW request enrichment (by u/YOUR_USERNAME)'
)

api_praw = PushshiftAPI(praw=reddit)
comments = api_praw.search_comments(q="quantum", subreddit="science", limit=100, until=1629990795)

Custom Filter

The user-defined function must accept a single item (comment / submission) and return either True or False; returning False will filter out the item passed to it.

from pmaw import PushshiftAPI

api = PushshiftAPI()
post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']

# keep only submissions with a score greater than 2
def fxn(item):
  return item['score'] > 2

posts = api.search_submissions(ids=post_ids, filter_fn=fxn)

Caching Examples

Memory Safety

If you are pulling large amounts of data or have a limited amount of RAM, using the memory safety feature will help you avoid an out-of-memory error during data retrieval.

from pmaw import PushshiftAPI

api = PushshiftAPI()
posts = api.search_submissions(subreddit="science", limit=700000, mem_safe=True)
print(f'{len(posts)} posts retrieved from Pushshift')

A Response generator object will be returned, and you can load all the responses, including those that have been cached, by iterating over the entire generator.

# get all responses
post_list = [post for post in posts]

With default settings, responses are cached every 20 batches (approx 20,000 responses with 10 workers); with limited memory, however, you can decrease this further.

# cache responses every 10 batches
api = PushshiftAPI(file_checkpoint=10)

Safe Exiting

If you expect that your query may be interrupted while it's running, setting safe_exit=True will cache responses and unfinished requests before exiting when an interrupt signal is received. Re-running a search method with the exact same parameters as a previous run will load previous responses and any unfinished requests from the cache, allowing it to resume if all the required responses have not yet been retrieved.

from pmaw import PushshiftAPI

api = PushshiftAPI()
posts = api.search_submissions(subreddit="science", limit=700000, until=1613234822, safe_exit=True)
print(f'{len(posts)} posts retrieved from Pushshift')

An until value is required to load previous responses / requests when using a non-id search, as until is otherwise set to the current time when the search method is called, which would result in a different set of parameters than when you last ran the search despite all other parameters being the same.

Loading Cache with Key

The Response class has a staticmethod load_cache that allows you to pass a key value and cache_dir to load responses from the cache into an instance of Response.

from pmaw import Response

cache_key = 'a904e22ea572a8c0109b7bc5330528e4'
cache_dir = './cache'
resp = Response.load_cache(cache_key, cache_dir)

Benchmarks

Benchmark Notebook

PMAW and PSAW Comparison

Completion Time

A benchmark comparison was performed to determine the completion time for different request sizes, ranging from 1 to 390,000 requested posts. This allows us to determine which Pushshift wrappers and rate-limiting methods are best for different request sizes.

Default parameters were used for each PMAW rate-limit configuration as well as the default PSAW configuration, which does not provide multiple rate-limit implementations.

01 benchmark

For the first benchmark test we compare the completion times for all possible PMAW rate-limiting configurations with PSAW for up to 16,000 requested posts. We can see that the three most performant rate-limiting settings for PMAW are rate-averaging, and exponential backoff with full or equal jitter.

02 benchmark

We ran a second benchmark, increasing the number of requested posts up to 390,000 and excluding the least performant PMAW rate-limiting configurations. From this benchmark, we can see that PMAW was on average 1.79x faster than PSAW at 390,625 posts retrieved. The total completion time for 390,625 posts with PSAW was 2h38m, while the average completion time was 1h28m for PMAW.

Number of Requests

02 requests benchmark

We also compare the number of required requests for each of the three PMAW rate-limit configurations. From this comparison, we can see that for 390,625 requested posts, rate-averaging made 33.60% fewer API requests than exponential backoff.

Memory Safety (Cache)

A benchmark test was performed for the memory safety feature (mem_safe=True) to see the impact that caching responses has on completion time, memory use, and maximum memory use while running requests for different limits.

03 cache time benchmark

We can see that when memory safety was enabled, the completion time for 390,000 posts was 17.11% slower than when this feature was disabled and responses were not being cached, finishing in 1h30m instead of 1h17m.

03 cache memory benchmark

When memory safety is enabled, responses start being cached after 20 checkpoints (default file_checkpoint=20), equivalent to approximately 20,000 responses, causing memory use to level out at around 170MB. Enabling memory safety used 90.97% less memory than when it was disabled, with the non-cached responses using 1.9GB of memory when 390,000 posts were retrieved. It's clear that we could easily trigger an out-of-memory error if we were to retrieve millions of submissions with memory safety disabled.

03 cache max memory benchmark

We compare the maximum memory use during data retrieval as well. Once again, around the 20,000 response mark, the two methods diverge as responses begin to be added to the cache. For 390,000 posts, the maximum memory use when memory safety was enabled was 58.2% less than when it was disabled (797MB vs 1.9GB).

Deprecated Examples

These examples are for pmaw version <=0.1.3.

Comments

Search Comments

comments = api.search_comments(subreddit="science", limit=1000)

Search Comments by IDs

comment_ids = ['gjacwx5','gjad2l6','gjadatw','gjadc7w','gjadcwh',
  'gjadgd7','gjadlbc','gjadnoc','gjadog1','gjadphb']
comments_arr = api.search_comments(ids=comment_ids)

Search Comment IDs by Submission ID

post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
comment_id_dict = api.search_submission_comment_ids(ids=post_ids)

Submissions

Search Submissions

submissions = api.search_submissions(subreddit="science", limit=1000)

Search Submissions by IDs

post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
  'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
posts_arr = api.search_submissions(ids=post_ids)

License

PMAW is released under the MIT License. See the LICENSE file for more details.

pmaw's People

Contributors

eddvrs, mattpodolak, turtlesoupy


pmaw's Issues

fields parameter

Hi. I'm drawing blank datasets when I try to use the fields param to filter for specific fields (in order to reduce bandwidth).

This is what I want to do:
comments = api.search_comments(subreddit=subreddit, limit=100, before=before, after=after, sort_type="score", fields="body,score,author,created_utc")

This feature is supposed to be supported by Pushshift (Pushshift: Using the fields parameter). However, when I use the PMAW wrapper to do this, it returns me a blank object.

The PMAW wrapper appears to work fine for filtering exactly one field (e.g. fields="body"). Is it possible to filter for more fields with this wrapper?

Changed format in parent_id

Hello Matt, thank you for your hard work.
I just found out that the data format of the parent_id column changed from a post id, such as t3_zyw8x9, to an int, like 41497293111.

Is it still the correct parent_id? If yes, do we have a way to convert it back to the traditional post_id format?

slower than psaw for me

Hello,

Thanks for v0.1.0!

I just tried it and for me it was slower than psaw. Below is a standalone script that compares the completion times for both. Result: pmaw took 5m16s and psaw took 2m42s.

Am I doing something wrong?

from psaw import PushshiftAPI as psawAPI
from pmaw import PushshiftAPI as pmawAPI
import os
import time


pmaw_api = pmawAPI(num_workers=os.cpu_count()*5)
psaw_api = psawAPI()


start = time.time()
test = pmaw_api.search_submissions(after=1612372114,
                              before=1612501714,
                              subreddit='wallstreetbets',
                              filter=['title', 'link_flair_text', 'selftext', 'score'])
end = time.time()
print("pmaw took " + str(end - start) + " seconds.")

start = time.time()
test_gen = psaw_api.search_submissions(after=1612372114,
                              before=1612501714,
                              subreddit='wallstreetbets',
                              filter=['title', 'link_flair_text', 'selftext', 'score'])
test = list(test_gen)
end = time.time()
print("psaw took " + str(end - start) + " seconds.")

signal only works in main thread

Hello, I'm currently developing a very simple Flask app, running only locally and I wanted to scrape some Reddit posts, using your API. I followed the example, as it's specified in the documentation, however whenever I run my script, I get the following error:

ValueError: signal only works in main thread

I read that Flask-SocketIO package causes this, but I saw that this project uses Websocket-client, which is a different package.

Would really appreciate your input.

'api.search_submission_comment_ids' function only returns one comment id for a post with multiple comments

Hi!

I have been trying pmaw to request pushshift reddit data. I am trying to get all the comment ids given a post/submission id using api.search_submission_comment_ids. An example of my code is:

list_id = ['6uey5x'] 
comment = api.search_submission_comment_ids(ids=list_id)

However, it only returns one comment id from this requested submission id: ['dls663o']. The request should have returned multiple comment ids according to the pushshift api: https://api.pushshift.io/reddit/submission/comment_ids/6uey5x

I also tried configuring the available options like mem_safe, which didn't help. Interestingly, if I try to repeat the post id in the input, such as:

list_id = ['6uey5x', '6uey5x', '6uey5x', '6uey5x'] 
comment = api.search_submission_comment_ids(ids=list_id)

It gives more comment ids from the corresponding post: ['dls663o', 'dls67q6', 'dls6fgv', 'dls6fqj'].

Could you help look into it? I feel like this workaround might have not fully utilized the pipeline.

I really appreciate your effort and this great tool!

Increase specificity of try catch blocks

There are several try/except blocks that can be improved by specifying the exceptions that are expected to occur. This will help avoid unexpected behaviour caused by the wide scope of the current implementation.

Score parameters

Hello
When i try api.search_submissions(subreddit="science", limit=None, score>=1000
I get an error SyntaxError: positional argument follows keyword argument
Do you know how to fix it ?
Thanks

Confusion about multithreading

Hi

When I call :

from psaw import PushshiftAPI as psawAPI
from pmaw import PushshiftAPI as pmawAPI
api = psawAPI()
api_request_generator = list(api.search_submissions(...)

And

api = pmawAPI()
api_request_generator = list(api.search_submissions(...)

How does this make search_submissions multithreaded? Aren't both psaw and pmaw making a single REST call to Pushshift?

Can not get post_ids

I tried the examples on README, but I can not get any result. The code and results are as shown below. I used other post_ids and can not get comment_ids either.

from pmaw import PushshiftAPI
api = PushshiftAPI()
post_ids = ['kxi2w8','kxi2g1','kxhzrl','kxhyh6','kxhwh0',
'kxhv53','kxhm7b','kxhm3s','kxhg37','kxhak9']
comment_ids = api.search_submission_comment_ids(ids=post_ids)
comment_id_list = [c_id for c_id in comment_ids]
print(len(comment_id_list))

UserWarning: 10 items were not found in Pushshift
warnings.warn(f"{self.limit} items were not found in Pushshift")

Cannot pass https_proxy parameter to PushShift function

In psaw and in Pushshift we can pass a proxy for each request. I want to use pmaw, but we use proxies heavily and hence cannot use pmaw, as there is no way to pass proxies. Or is there a way to do it?

PushshiftAPI(https_proxy=proxy)

Sort = "created_utc" didnt sort results properly

My code looks like this:

gen = list(api.search_submissions(q="fire",
           sort="created_utc",
           order="desc",
           subreddit=subreddit,
           filter=['id'],
           limit=None))

for pmaw_submission in gen:
    praw_submission = reddit.submission(id=pmaw_submission['id'])
    print(datetime.utcfromtimestamp(praw_submission.created_utc))

Returns the following: https://pastebin.com/Y91fYYDy

You can see the years are not in order; otherwise, it kinda looks ordered if we're talking days/time only.

Any idea why is this happening / how to fix this? Thanks

comment.parent()

With psaw hooked into praw I can get a comment's parent as an object by call comment.parent(). Is there a way to do this in pmaw? If not, would you consider adding it?

api call seems to return nothing

Hi, I copied the code below from the example page. It used to work but has stopped working recently, returning an empty list.

from pmaw import PushshiftAPI

api = PushshiftAPI()
comments = api.search_comments(subreddit="science", limit=1000)
comment_list = [comment for comment in comments]
comment_list

returns

[]

search_submissions also returns nothing....

Thank you so much.

Adding flair search

I know this is possible with praw by simply saying in search() like this:

reddit.search("flair:cats")

Although, I can't find a solution when using PMAW, since the parameter "q" doesn't seem to recognize the "flair:" string.

The main reasoning behind flair search is that it returns much more relevant posts. For example, "flair:fire" is much better than "fire", etc.

Strategy for missed posts?

Hi Matt,

Thanks for pmaw - it's a very nice library you've created!

This is not a problem with pmaw, but with my understanding of how to use it.

I have executed a search_comments with the same before and after parameters in both psaw and pmaw. It was so much faster in psaw, I was amazed! But, success rate varied from 93% to 83%, so at the end I had 40,630 comments using pmaw compared to 40,762 using psaw.

What is the best strategy for retrieving the comments that were missed? Should I assemble a list of submission ids from a search_submissions with the same parameters, based on their having num_comments greater than in the retrieved comments (or not even in the comments), then use search_submission_comment_ids with them? Or, can I utilize safe_exit, and re-run the process to see if I can get more? Or, something else?

Perhaps it would be best to use search_submission_comment_ids from the get-go? I have found that searching by id with psaw is much slower than just using a date range, and as I have a range I didn't bother with it in this case. Is it slower to use than search_comments?

Cheers
Chris

Add unit testing framework

Testing is currently performed through various Jupyter notebooks on my local machine which is quite inefficient and prone to error. This could be improved by slowly adding unit tests for various components to ensure they work as expected.

PullPush - PushShift replacement

Hello!

I created a replacement service for PushShift functionality that's now restricted. See https://pullpush.io/ for details. Overall it will aim to be compatible with the Dec-2022 version of the PushShift API.

In case you feel burned out by the whole 3rd party tool affair; I would like to ask for permission to fork and rebrand your project :)

PullPush Actual

only id supported?

Looks great!

From the readme, am I to understand that the search_submissions method only supports the ids parameter?

IndexError: list index out of range

Hi there. I'm running into this error when scraping submissions. Could you let me know why, and how I can get past it?

subs=['depression','anxiety','suicidewatch']
for sub in subs:
    posts = api.search_submissions(subreddit=sub, mem_safe=True, after=1585713600, before=1637207011, safe_exit=True)
    print(f'{len(posts)} posts retrieved from Pushshift')
    post_list = [post for post in posts]
    pd.DataFrame(post_list).to_pickle(f"{sub}_submissions.pkl")
IndexError: list index out of range
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-61805bcbf485> in <module>
      1 for sub in subs:
----> 2     posts = api.search_submissions(subreddit=sub, mem_safe=True, after=1585713600, before=1637207011, safe_exit=True)
      3     print(f'{len(posts)} posts retrieved from Pushshift')
      4     post_list = [post for post in posts]
      5     pd.DataFrame(post_list).to_pickle(f"{sub}_submissions.pkl")

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPI.py in search_submissions(self, **kwargs)
     72             Response generator object
     73         """
---> 74         return self._search(kind='submission', **kwargs)

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPIBase.py in _search(self, kind, max_ids_per_request, max_results_per_request, mem_safe, search_window, dataset, safe_exit, cache_dir, filter_fn, **kwargs)
    261             self.req.gen_url_payloads(
    262                 url, self.batch_size, search_window)
--> 263 
    264             # check for exit signals
    265             self.req.check_sigs()

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPIBase.py in _multithread(self, check_total)
     98 
     99                 futures = {executor.submit(
--> 100                     self._get, url_pay[0], url_pay[1]): url_pay for url_pay in reqs}
    101 
    102                 self._futures_handler(futures, check_total)

~\AppData\Roaming\Python\Python37\site-packages\pmaw\PushshiftAPIBase.py in _futures_handler(self, futures, check_total)
    166                             num = 0
    167 
--> 168                         if num > 0:
    169                             # find minimum `created_utc` to set as the `before` parameter in next timeslices
    170                             if len(data) > 0:

IndexError: list index out of range

Returned 0 result

Using pmaw I got 0 submissions returned today. It worked before. Don't know why. I exactly followed the medium post Matt wrote. I checked the status of the pushshift server and it says it is fine. Any idea what has happened?

list index out of range

Hello,

After running the script for some time, I get list index out of range. Here is a screenshot of the error:
Screenshot 2021-10-07 150928

Do you know the cause of the error?

Thank you!

Why can't I query all comments in the last 7 days?

Why does this return 0 result(s) available in Pushshift

from pmaw import PushshiftAPI
from datetime import datetime, timedelta

api = PushshiftAPI()
seven_days_ago = (datetime.now() - timedelta(days=7)).timestamp()
comments = api.search_comments(after=seven_days_ago)

>>> 0 result(s) available in Pushshift

while if I directly query pushshift using the same parameters, I get results.

>>> seven_days_ago
>>> 1624497069.970137

https://api.pushshift.io/reddit/comment/search?after=1624497069

In addition: apparently, an integer value is needed, but only works if a subreddit or query term is given

>>> comments = api.search_comments(subreddit="science", after=seven_days_ago)
0 result(s) available in Pushshift
>>> comments = api.search_comments(subreddit="science", after=int(seven_days_ago))
15132 result(s) available in Pushshift
>>> comments = api.search_comments(q="science", after=int(seven_days_ago))
36019 result(s) available in Pushshift

ChunkedEncodingError while scraping subreddit submissions

Hi Matt, I'm trying to scrape subreddit posts within a time period of six months, with a limit set to none. After irregular periods of time however, the connection gets broken apparently. Following is the code snippet and the error. I tried restarting it multiple times, but the same issue comes up. Is there any way to ensure that all data is scraped in one go? Thanks

Code :

from pmaw import PushshiftAPI
api = PushshiftAPI()

import datetime as dt
before = int(dt.datetime(2020,9,1,0,0).timestamp())
after = int(dt.datetime(2020,3,1,0,0).timestamp())

subreddit="teenagers"
posts = api.search_submissions(subreddit=subreddit, limit=None, before=before, after=after,mem_safe=True)
print(f'Retrieved {len(posts)} posts from Pushshift')

Error :


ValueError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/urllib3/response.py in _update_chunk_length(self)
696 try:
--> 697 self.chunk_left = int(line, 16)
698 except ValueError:

ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

InvalidChunkLength Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/urllib3/response.py in _error_catcher(self)
437 try:
--> 438 yield
439

/opt/conda/lib/python3.7/site-packages/urllib3/response.py in read_chunked(self, amt, decode_content)
763 while True:
--> 764 self._update_chunk_length()
765 if self.chunk_left == 0:

/opt/conda/lib/python3.7/site-packages/urllib3/response.py in _update_chunk_length(self)
700 self.close()
--> 701 raise InvalidChunkLength(self, line)
702

InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)

During handling of the above exception, another exception occurred:

ProtocolError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/requests/models.py in generate()
757 try:
--> 758 for chunk in self.raw.stream(chunk_size, decode_content=True):
759 yield chunk

/opt/conda/lib/python3.7/site-packages/urllib3/response.py in stream(self, amt, decode_content)
571 if self.chunked and self.supports_chunked_reads():
--> 572 for line in self.read_chunked(amt, decode_content=decode_content):
573 yield line

/opt/conda/lib/python3.7/site-packages/urllib3/response.py in read_chunked(self, amt, decode_content)
792 if self._original_response:
--> 793 self._original_response.close()
794

/opt/conda/lib/python3.7/contextlib.py in exit(self, type, value, traceback)
129 try:
--> 130 self.gen.throw(type, value, traceback)
131 except StopIteration as exc:

/opt/conda/lib/python3.7/site-packages/urllib3/response.py in _error_catcher(self)
454 # This includes IncompleteRead.
--> 455 raise ProtocolError("Connection broken: %r" % e, e)
456

ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

During handling of the above exception, another exception occurred:

ChunkedEncodingError Traceback (most recent call last)
/tmp/ipykernel_19/3224342533.py in
1 subreddit="teenagers"
----> 2 posts = api.search_submissions(subreddit=subreddit, limit=None, before=before, after=after,mem_safe=True,safe_exit=True)
3 print(f'Retrieved {len(posts)} posts from Pushshift')

/opt/conda/lib/python3.7/site-packages/pmaw/PushshiftAPI.py in search_submissions(self, **kwargs)
72 Response generator object
73 """
---> 74 return self._search(kind='submission', **kwargs)

/opt/conda/lib/python3.7/site-packages/pmaw/PushshiftAPIBase.py in _search(self, kind, max_ids_per_request, max_results_per_request, mem_safe, search_window, dataset, safe_exit, cache_dir, filter_fn, **kwargs)
266
267 if self.req.limit > 0 and len(self.req.req_list) > 0:
--> 268 self._multithread()
269
270 self.req.save_cache()

/opt/conda/lib/python3.7/site-packages/pmaw/PushshiftAPIBase.py in _multithread(self, check_total)
100 self._get, url_pay[0], url_pay[1]): url_pay for url_pay in reqs}
101
--> 102 self._futures_handler(futures, check_total)
103
104 # reset attempts if no failures

/opt/conda/lib/python3.7/site-packages/pmaw/PushshiftAPIBase.py in _futures_handler(self, futures, check_total)
131 self.num_req += int(not check_total)
132 try:
--> 133 data = future.result()
134 self.num_suc += int(not check_total)
135 url = url_pay[0]

/opt/conda/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
426 raise CancelledError()
427 elif self._state == FINISHED:
--> 428 return self.__get_result()
429
430 self._condition.wait(timeout)

/opt/conda/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

/opt/conda/lib/python3.7/concurrent/futures/thread.py in run(self)
55
56 try:
---> 57 result = self.fn(*self.args, **self.kwargs)
58 except BaseException as exc:
59 self.future.set_exception(exc)

/opt/conda/lib/python3.7/site-packages/pmaw/PushshiftAPIBase.py in _get(self, url, payload)
50 def _get(self, url, payload={}):
51 self._impose_rate_limit()
---> 52 r = requests.get(url, params=payload)
53 status = r.status_code
54 reason = r.reason

/opt/conda/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
73 """
74
---> 75 return request('get', url, params=params, **kwargs)
76
77

/opt/conda/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
59 # cases, and look like a memory leak in others.
60 with sessions.Session() as session:
---> 61 return session.request(method=method, url=url, **kwargs)
62
63

/opt/conda/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
540 }
541 send_kwargs.update(settings)
--> 542 resp = self.send(prep, **send_kwargs)
543
544 return resp

/opt/conda/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
695
696 if not stream:
--> 697 r.content
698
699 return r

/opt/conda/lib/python3.7/site-packages/requests/models.py in content(self)
834 self._content = None
835 else:
--> 836 self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
837
838 self._content_consumed = True

/opt/conda/lib/python3.7/site-packages/requests/models.py in generate()
759 yield chunk
760 except ProtocolError as e:
--> 761 raise ChunkedEncodingError(e)
762 except DecodeError as e:
763 raise ContentDecodingError(e)

ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

How to not enrich certain values using PRAW?

The fact that results can be enriched with current data is really nice, but is there a way to select what values to enrich or at least exclude certain ones? For example, I want the current amount of upvotes on a post at any given time, but I don't want the updated subscriber count as I'd like to be able to see changes in it over time. Is this already a feature, or am I just doing something wrong? Also please let me know if this should be asked somewhere else.

If pushshift fails to return data, can crash trying to warn about it

in gen_url_payloads

all_ids = self.payload['ids']
if len(all_ids) == 0 and self.limit > 0:
    warnings.warn(
        f'{self.limit} items were not found in Pushshift')
self.limit = len(all_ids)

TypeError: '>' not supported between instances of 'NoneType' and 'int'

So seemingly limit is NoneType. Or I've made a mistake somehow. But I think my code is working, because this only crashes intermittently (meaning it is consistent with what I would expect if Pushshift was occasionally failing).

Comment and submission search snags

Hello, I've been noticing in the last couple of days that pmaw will frequently snag on batches of either submission or comment searches. It always seems to eventually pull the data, but extracting a small set of ~500 posts can take anywhere between 10 seconds and 5 minutes. Any thoughts on why?

I modified pmaw to print out the URL endpoints to see if a particular search is causing the problem, but clicking on the pushshift links shows the data in the browser incredibly fast.

I'm attaching a recent example of a pull of 500 posts with the query "test." It took a little over two minutes to pull the data. I also highlight the link with a red arrow that caused pmaw to snag for most of this request. Any thoughts on what might be causing this?

Thanks for maintaining pmaw -- it's such a great tool!

pmaw_example

Length of response object can be incorrect sometimes

With mem_safe=True, if you don't iterate through the generator fully, responses are loaded in and added to the total length, without accounting for how many items have already been iterated through.

Code to recreate:

comments = api.search_comments(subreddit="science", mem_safe=True, safe_exit=True, limit=2000)

len(comments) # 2000

i = 0
comment_list = []
for c in comments:
    comment_list.append(c)
    i += 1
    if i > 500:
        break
len(comments) # 3000 at this point
len(comment_list) # 501

comment_list_2 = []
for c in comments:
    comment_list_2.append(c)

len(comment_list_2) # 1499
len(comments) # 2000

Successful Rate 0% when batch loading comments by ids

I found a strange bug: basically, when calling search_comments, a large max_ids_per_request will result in a 0% success rate and retrying forever. Maybe the result is too long?

To reproduce:

comments = api.search_submission_comment_ids(ids="haucpf")    
comment_list1 = [comment for comment in comments]    
comments = api.search_comments(ids=comment_list1, max_ids_per_request=1000)    
comment_list_full = [comment for comment in comments]    

The above code works if we change to max_ids_per_request=500.

Successful Rate 0% no matter what

Hi,

I found a strange issue, similar to #23. Basically, no matter what I do I get a 0% Success Rate when trying to fetch comments by their ids. I tried to pass the ids as a list and as a joined string with ids delimited by commas.

Below is a reproducible example with a sample of the ids that I want to fetch:

import time
from pmaw import PushshiftAPI

api = PushshiftAPI()

if __name__ == "__main__":
    # list of comments
    comment_ids = ['czyja5h', 'czyjfsw', 'czyjhfs', 'czyjhqz', 'czyjufm', 'czyk71o', 'czykgjk', 'czykuef', 'czyl1me', 'czylher']

    # fetch the body of comments
    t = time.time()

    comments = api.search_comments(ids=comment_ids)
    comment_list = [comment for comment in comments]

    print(f"Fetching {len(comment_list)} comments complete, it took {time.time() - t:.3f}s")
    print('...')

Usage restrictions

Do I have to be a Reddit moderator to use pmaw?
Do I have to sign up with PushShift to use pmaw?

Unable to fetch comments by ID

# comment_ids is a list of strings
comments_arr = pushshift_api.search_comments(ids = comment_ids)

output Error: INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 87.50% - Requests: 16 - Batches: 4 - Items Remaining: 6720
/usr/local/lib/python3.7/dist-packages/pmaw/Request.py:230: UserWarning: 6720 items were not found in Pushshift
f'{self.limit} items were not found in Pushshift')

Question: Is it true that pushshift has a delay in fetching data? Can we fetch data for the current date?

Incorrect Number for Items Remaining when Recovering Cache

I've posted this issue on reddit; transferring it here as a formal issue report.

My initial requests:
posts = api.search_comments(subreddit='de', limit=None, mem_safe=True, safe_exit=True)

The metadata I obtained from this initial response.
Here is the timestamp I have:

Setting before to 1631651549

Response cache key: 186e2bb94155846df2c5d321b768b8cb

10380641 result(s) available in Pushshift

The last file checkpoint before this scraping was interrupted was:

File Checkpoint 167:: Caching 15802 Responses

Checkpoint:: Success Rate: 75.43% - Requests: 33400 - Batches: 3340 - Items Remaining: 8009761

Checkpoint:: Success Rate: 75.48% - Requests: 33500 - Batches: 3350 - Items Remaining: 8001173

Now I want to resume the process, so I plugged in the timestamp I got from the initial requests as the "before" value.

api = pmaw.PushshiftAPI()
posts = api.search_comments(subreddit="de", limit=None, mem_safe=True, safe_exit=True, before=1631651549)

However, it seems to start a new process of scraping the same query; the item remaining is what's left from the previously interrupted requests.

Loaded Cache:: Responses: 2399576 - Pending Requests: 140 - Items Remaining: 10367680

Not all PushShift shards are active. Query results may be incomplete.

parquet for caching responses

Use parquet instead of pickle for caching responses.

Need to assess if this is a reasonable improvement:

  • Benchmark parquet cache speed for storing and retrieval against pickle
  • Compare resulting cache sizes

Queries that specify `before` and `after` can return a different number of results than reported as available by Pushshift

Test Query:

comments = api.search_comments(
                    after=1606262347,
                    before=1618581599,         
                    subreddit="CovidVaccinated",
                    fields=["id","subreddit","link_id","parent_id","is_submitter","author",
                                "author_fullname","body","score","created_utc","permalink"],
                    limit=None
                    )

Results:

40730 result(s) available in Pushshift
Checkpoint:: Success Rate: 71.00% - Requests: 100 - Batches: 10 - Items Remaining: 33898
Checkpoint:: Success Rate: 79.00% - Requests: 200 - Batches: 20 - Items Remaining: 25661
Checkpoint:: Success Rate: 81.67% - Requests: 300 - Batches: 30 - Items Remaining: 18163
Checkpoint:: Success Rate: 81.75% - Requests: 400 - Batches: 40 - Items Remaining: 11467
Checkpoint:: Success Rate: 82.80% - Requests: 500 - Batches: 50 - Items Remaining: 4262
Checkpoint:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 60 - Items Remaining: 1
1 result(s) not found in Pushshift

Discovered in #12

Issue with limit?

Hi Matthew, thank you so much for your great work on PMAW!

I tried to use your example with a limit = 100000. It seems 0 comments will be retrieved if the limit is greater than 1000.

import datetime as dt
before = int(dt.datetime(2021,2,1,0,0).timestamp())
after = int(dt.datetime(2020,12,1,0,0).timestamp())

subreddit="wallstreetbets"
limit=100000
comments = api.search_comments(subreddit=subreddit, limit=limit, before=before, after=after)
print(f'Retrieved {len(comments)} comments from Pushshift')

The log:

WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
Retrieved 0 comments from Pushshift

I have tried with limit = 100, 1000, 1001. It seems 0 comments will be retrieved if the limit is greater than 1000.

Can you please let me know if I missed anything? Thanks!

Is there a way to turn off log response?

Hi, is there a way to turn off the log response?

I0317 17:36:13.663935 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2852
I0317 17:36:14.613399 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2771
I0317 17:36:16.100443 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2700
I0317 17:36:16.821416 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2626
I0317 17:36:17.709720 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2555
I0317 17:36:19.522857 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2495
I0317 17:36:20.668120 139656604510016 PushshiftAPIBase.py:141] Remaining limit 2439

Fetching comments never completes

Hi there!

I have a very simple PMAW script:

import sys
from pmaw import PushshiftAPI
import json
import os

api = PushshiftAPI(num_workers=os.cpu_count(), rate_limit=100)

id_f = sys.argv[1]
comment_ids = [l.strip() for l in open(id_f, 'r')]
comments_arr = api.search_comments(ids=comment_ids)
for c in comments_arr:
    print(json.dumps(c))

The script takes in IDs from a file, puts them into a list, and passes them into search_comments.

This script has worked fine once (en masse for lots of IDs), but for some reason, it now never completes on even a test set of 10 IDs. I'm certain I must be doing something silly, or the API has potentially changed. Could someone point me in the right direction?

Thanks!

always 100 unique ids despite the size of returned comments

Hi! I am getting comments from the subreddit using before and after dates, but I found out that the number of unique items per day is always 100. The total number of results varies and seems right, but there are a lot of duplicates. The number of unique items is always 100, which is also the limit from the Reddit API, so I wonder if there's any connection here. Do I need to specify something additionally in the query? I tried adding size or limit, but that didn't seem to solve this problem (other than returning zero results when the limit is too big, as others pointed out). Below is how I am sending the query now:

from pmaw import PushshiftAPI
api = PushshiftAPI()
api_request_generator = list(api.search_comments(subreddit='The_Donald',
                                                            before=calendar.timegm(until_date.timetuple()),
                                                            after=calendar.timegm(since_date.timetuple()),
                                                            safe_exit=True,
                                                            size=500,
                                                            mem_safe=True,
                                                            until=calendar.timegm(until_date.timetuple())
                                                         ))

How to skip a request if taking too long?

Sometimes a request within a loop of requests ends up going through hundreds (thousands?) of batches trying to locate 1 single comment. I'd rather just skip that comment and move on to the next request. Any way to implement this in the code?

example output:
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 19 - Batches: 10 - Items Remaining: 1
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 29 - Batches: 20 - Items Remaining: 1
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 39 - Batches: 30 - Items Remaining: 1
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 49 - Batches: 40 - Items Remaining: 1
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 59 - Batches: 50 - Items Remaining: 1
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 69 - Batches: 60 - Items Remaining: 1
etc
etc
etc
etc

Get comments given list of submission ids

I know you can get the list of comment IDs with comment_id_dict = api.search_submission_comment_ids(ids=post_ids) and you can get a list of actual comments from a single submission ID with comments = list(api.search_comments(subreddit, link_id=post_id)).

Any way to get a list of actual comments from a list of submission IDs?

I know I can probably for-loop through the submission IDs and call api.search_comments(subreddit, link_id=post_id) on each of them, but I'm assuming that pmaw's multithreading magic wouldn't be used to the best of its abilities if I did it this way.

Thanks! :)

Help encoding a search query for multiple keywords

Hi there, I would like to use your wonderful wrapper PMAW to conduct the following data search for all comments with the following keywords "don't hate" AND "Thomas" (i.e., any comments that contain "don't hate" and "Thomas" somewhere).

The call to the pushshift api directly works fine:
https://api.pushshift.io/reddit/comment/search?q=%22don%27t%20hate%22+Thomas

However, I cannot figure out how to translate this q="..." search query into a format that works in PMAW (or PSAW for that matter)...

For instance, the following returns 0:
comments = api.search_comments(q="%22don%27t%20hate%22+Thomas", limit=limit, before=before, after=after)

Been googling all day without any success.

Issues with search_comments and search_submission_comment_ids

I've been working on a project where I am gathering large amounts of comment data, and I have run into two issues with each of these functions. The first issue I ran into when using the search_submission_comment_ids function is that when searching post comment ids, it returns zero comments for any post following November 26th 2021, as well as periodically prior to this date (although I haven't done extensive testing for prior dates).

Following the discovery of this issue, I attempted to remedy it by checking if the use of the prior function resulted in comment data being available, and if not, then switching to the use of the search_comments function. While this did work and I was able to find comment data following the November 26th 2021 date, every API request made using the search_comments function gave a warning of "Not all Pushshift shards are active. Query Results may be incomplete.". Upon investigation using the api.metadata_.get('Shards') command, I was getting results such as:

{'failed': 0, 'skipped': 0, 'successful': 67, 'total': 74}

If anybody has any idea why either of these issues is occurring, or why the shard metadata shows the missing shards as neither failed nor skipped, and is willing to share, I'd greatly appreciate it.

Slow running time for scraping all subs for an author

Hi. I'm trying to use pmaw to iterate a list of usernames and scrape all their submissions and comments.

It turns out to be very slow.

Using psaw:

gen = api2.search_submissions(author='seeellayewhy')
list(gen)

It took 7s to scrape 300+ submissions
image

Using pmaw, it took several minutes to finish:

from pmaw import PushshiftAPI

api = PushshiftAPI(num_workers=40)

submissions = api.search_submissions(author='seeellayewhy', limit=None)
submission_list = [sub for sub in submissions]


WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:325 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 40 - Batches: 1 - Items Remaining: 325
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 80 - Batches: 2 - Items Remaining: 323
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 120 - Batches: 3 - Items Remaining: 318
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 161 - Batches: 5 - Items Remaining: 300
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:1 result(s) not found in Pushshift
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 201 - Batches: 6 - Items Remaining: 271
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 241 - Batches: 7 - Items Remaining: 267
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 281 - Batches: 8 - Items Remaining: 252
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 321 - Batches: 9 - Items Remaining: 238
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 361 - Batches: 10 - Items Remaining: 221
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 361 - Batches: 10 - Items Remaining: 221
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
WARNING:pmaw.PushshiftAPIBase:Not all PushShift shards are active. Query results may be incomplete.
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 400 - Batches: 11 - Items Remaining: 0

Is there something I'm not doing right, or is this kind of task not recommended with pmaw? From the logging, it seems like each request returns only a handful of records. Does it have to do with search windows? If so, how should I configure them for this kind of task?
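
For comparison, a minimal sketch of the same query with far fewer workers, on the assumption that the overhead above comes from slicing a small result set across 40 workers:

from pmaw import PushshiftAPI

# Hypothetical retest with a small worker pool: with only ~325 results
# available, a large pool appears to split the range into many near-empty
# request windows (see the ~400 requests logged above).
api = PushshiftAPI(num_workers=2)
submissions = list(api.search_submissions(author='seeellayewhy', limit=None))
print(len(submissions))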

Subreddit restriction is not an exact name match and includes subreddits with a superset of that name

Hi all! I was trying to restrict a search to a specific subreddit, "web10", and I noticed that content from other subreddits whose names include the text "web10" is also coming up! I assume this is not the intended behavior, because it could cause a lot of unrelated results to appear, especially if the subreddit name contains a common word like "science", which appears in the names of many subreddits besides r/science.

In this example (https://api.pushshift.io/reddit/search/submission?subreddit=web10), content from r/web10, r/u_Psychological-Web10, and r/u_ronaldo-web10 appears. r/u_Psychological-Web10 is a subreddit, and r/u_ronaldo-web10 is handled differently by Reddit (Reddit displays a page indicating the user has been banned, rather than a page indicating the subreddit doesn't exist) so perhaps it was previously a subreddit.
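
Until the upstream matching behavior is clarified, a sketch of a client-side workaround (assuming api is a PushshiftAPI instance and results come back as dictionaries with a subreddit field, as pmaw normally yields):

# Hypothetical exact-match filter applied after the search
posts = api.search_submissions(subreddit="web10", limit=None)
exact = [p for p in posts if p.get("subreddit", "").lower() == "web10"]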

How do I sort by date?

Can I pass a parameter to the wrapper to sort responses by date created? It's not clear in the documentation.
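
A sketch of what is usually meant here (assuming pmaw forwards Pushshift's sort and sort_type parameters as keyword arguments; the subreddit name and limit are just examples):

# Hypothetical: newest comments first by creation time
comments = api.search_comments(
    subreddit="science",
    sort="desc",
    sort_type="created_utc",
    limit=100,
)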

Random number of Enhanced Items

With 1873 results available in Pushshift, I get a variable number of enriched items, somewhere between 30 and 200.
Does that mean the other items were the same as in Pushshift, or that they were not checked?

Code below

# Imports
import praw
import pandas as pd
from datetime import datetime, timezone
from pmaw import PushshiftAPI

# Authenticate with Reddit so pmaw can enrich Pushshift results via praw
reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    password="PASSWORD",
    user_agent="USER AGENT",
    username="USERNAME")
api = PushshiftAPI(num_workers=20, praw=reddit)

# Set query parameters
datewindow_start = datetime(2021, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
datewindow_end = datetime(2021, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
query = "bitcoin"

# Run query (running only on 01/01/21 as a test)
all_dates = pd.date_range(datewindow_start, datewindow_end, freq="d")
for date in all_dates:
    start_date = int(date.timestamp() - 1)
    end_date = int(date.timestamp() + 86400)
    posts = api.search_submissions(q=query, after=start_date, before=end_date)

# Dataframing (these are submissions, despite the column selection below)
posts_df = pd.DataFrame(posts)
df_2 = posts_df[["author", "subreddit", "num_comments", "upvote_ratio", "score", "title", "likes", "id", "url"]]
df_2.head(30)
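
A small sketch for counting how many returned items were actually enriched (assuming, as the dataframe above implies, that results are dictionaries and that fields like upvote_ratio only appear on praw-enriched items):

# Hypothetical enrichment check: upvote_ratio comes from Reddit via praw rather
# than from Pushshift, so its presence suggests an item was actually enriched.
results = list(api.search_submissions(q=query, after=start_date, before=end_date))
enriched = sum(1 for p in results if p.get("upvote_ratio") is not None)
print(f"{enriched} of {len(results)} items appear to be praw-enriched")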
