Comments (17)

Dominyk4s avatar Dominyk4s commented on June 14, 2024 2

Same here.
The Pushshift API itself is working; only the pmaw wrapper is not. Here is an example that gets results from Pushshift without any wrapper (and empty results from pmaw):

import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt
import requests
import time

start_date = dt.date(2022, 12, 1)
end_date = dt.date(2022, 12, 15)

start_date = dt.datetime.fromordinal(start_date.toordinal())
end_date = dt.datetime.fromordinal(end_date.toordinal())

api = PushshiftAPI()

start_epoch = int(start_date.timestamp())
end_epoch = int(end_date.timestamp())

submissions = api.search_submissions(subreddit='politics', q='biden', after=start_epoch,
                                     before=end_epoch, num_workers=20)

sub_df = pd.DataFrame(submissions)
print('---------------------------------------------------------------')
print(f'pmaw df size: {sub_df.shape}')
print(sub_df.head())

time.sleep(10)
# Pushshift API directly
api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
            + '&after=' + str(start_epoch) \
            + '&before=' + str(end_epoch) \
            + '&subreddit=' + 'politics' \
            + '&limit=' + str(100)

r = requests.get(api_query)
json = r.json()
df_pushshift = pd.DataFrame(json['data'])

print('---------------------------------------------------------------')
print(f'Pushshift direct df size: {df_pushshift.shape}')
print(df_pushshift.head())
print('---------------------------------------------------------------')

Results:

---------------------------------------------------------------
pmaw df size: (0, 0)
Empty DataFrame
Columns: []
Index: []
---------------------------------------------------------------
Pushshift direct df size: (100, 89)
  subreddit selftext  ... updated_utc     utc_datetime_str
0  politics           ...  1671053635  2022-12-14 21:33:38
1  politics           ...  1671053019  2022-12-14 21:23:28
2  politics           ...  1671048929  2022-12-14 20:15:13
3  politics           ...  1671048448  2022-12-14 20:07:17
4  politics           ...  1671038417  2022-12-14 17:20:01

[5 rows x 89 columns]
---------------------------------------------------------------
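A side note on the epoch conversion in the script above: `dt.datetime.fromordinal(...)` produces a naive datetime, and `.timestamp()` on a naive datetime is interpreted in the machine's local time zone, so `start_epoch` and `end_epoch` shift with wherever the script runs. If UTC day boundaries are what you want, something like the sketch below may be more predictable. The helper name `to_utc_epoch` is made up for illustration and is not part of pmaw or Pushshift.

```python
import datetime as dt


def to_utc_epoch(day: dt.date) -> int:
    """Return the Unix timestamp of midnight UTC at the start of `day`.

    Hypothetical helper for illustration; not part of pmaw.
    """
    midnight = dt.datetime(day.year, day.month, day.day, tzinfo=dt.timezone.utc)
    return int(midnight.timestamp())


# Same date range as in the script above, but pinned to UTC
start_epoch = to_utc_epoch(dt.date(2022, 12, 1))
end_epoch = to_utc_epoch(dt.date(2022, 12, 15))
print(start_epoch, end_epoch)
```

The resulting integers can be passed wherever the original script used `start_epoch` and `end_epoch`.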

from pmaw.

qosmio avatar qosmio commented on June 14, 2024 2

@Dominyk4s

> # Pushshift API directly
> api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
>             + '&after=' + str(start_epoch) \
>             + '&before=' + str(end_epoch) \
>             + '&subreddit=' + 'politics' \
>             + '&limit=' + str(100)
> 
> 

Just a heads up, before and after are deprecated. Not sure how long that will work.

after | string (After) Search after this epoch time (inclusive) -- deprecated (use since)

before | string (Before) Search before this epoch time (exclusive) -- deprecated (use until)
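For anyone building the query by hand, here is a rough sketch of the same request using the replacement parameter names from the docs quoted above (`since` instead of `after`, `until` instead of `before`). The endpoint is the one from the earlier comment; the function name `build_search_url` and the epoch values are placeholders for illustration, and whether the endpoint answers at any given time is not something this sketch can promise. It only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

# Endpoint taken from the direct query earlier in this thread
BASE_URL = "https://api.pushshift.io/reddit/submission/search/"


def build_search_url(query, subreddit, since, until, limit=100):
    """Build a submission search URL using since/until instead of after/before.

    Hypothetical helper for illustration; not part of pmaw.
    """
    params = {
        "q": query,
        "since": since,   # replaces the deprecated `after` (inclusive)
        "until": until,   # replaces the deprecated `before` (exclusive)
        "subreddit": subreddit,
        "limit": limit,
    }
    # urlencode takes care of escaping, unlike manual string concatenation
    return BASE_URL + "?" + urlencode(params)


# Illustrative epoch values (1 Dec 2022 and 15 Dec 2022, midnight UTC)
url = build_search_url("biden", "politics", 1669852800, 1671062400)
print(url)
```

The returned string can be handed to `requests.get(url)` in place of the concatenated `api_query`.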

mattpodolak avatar mattpodolak commented on June 14, 2024 2

closing this as the COLO switch over fixes have been merged + released in version 3.0.0!

eddvrs avatar eddvrs commented on June 14, 2024 1

Before/after have been renamed to until/since. Some of the parameter names for sorting have changed as well.

I've addressed the before/after => since/until change, as well as the sort/sort_type, and some other changes in my PR for PMAW, because at the moment it's giving 0 results!

jaspark-ea avatar jaspark-ea commented on June 14, 2024

I am seeing the same thing. Nothing returned.

MatchaOnMuffins avatar MatchaOnMuffins commented on June 14, 2024

Same here, nothing is being returned.
However, the Pushshift API itself is working

Sellitus avatar Sellitus commented on June 14, 2024

Same here, returning nothing despite the parameters being used

Dominyk4s avatar Dominyk4s commented on June 14, 2024

Just a quick temp fix:

I've added some changes to the code (see it here). With them, two new parameters are available when calling Pushshift: sort_var='order' and check_totals=False.

It now works with recent data only (older data cannot yet be queried even through Pushshift directly).

Code example (with sort_var='order' and check_totals=False on calling it):

import pandas as pd
from pmaw import PushshiftAPI
import datetime as dt
import requests
import time


start_date = dt.date(2022, 12, 1)
end_date = dt.date(2022, 12, 15)

start_date = dt.datetime.fromordinal(start_date.toordinal())
end_date = dt.datetime.fromordinal(end_date.toordinal())

api = PushshiftAPI()

start_epoch = int(start_date.timestamp())
end_epoch = int(end_date.timestamp())

submissions = api.search_submissions(subreddit='politics', q='biden', after=start_epoch,
                                     before=end_epoch, limit=100, sort_var='order', check_totals=False, praw=True,
                                     num_workers=1)

sub_df = pd.DataFrame(submissions)
print('---------------------------------------------------------------')
print(f'pmaw df size: {sub_df.shape}')
print(sub_df.head())


time.sleep(10)
# Pushshift API directly
api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
            + '&after=' + str(start_epoch) \
            + '&before=' + str(end_epoch) \
            + '&subreddit=' + 'politics' \
            + '&limit=' + str(100)

r = requests.get(api_query)
json = r.json()
df_pushshift = pd.DataFrame(json['data'])

print('---------------------------------------------------------------')
print(f'Pushshift direct df size: {df_pushshift.shape}')
print(df_pushshift.head())
print('---------------------------------------------------------------')

Results:

---------------------------------------------------------------
pmaw df size: (100, 89)
  subreddit selftext  ... updated_utc     utc_datetime_str
0  politics           ...  1671053635  2022-12-14 21:33:38
1  politics           ...  1671053019  2022-12-14 21:23:28
2  politics           ...  1671048929  2022-12-14 20:15:13
3  politics           ...  1671048448  2022-12-14 20:07:17
4  politics           ...  1671038417  2022-12-14 17:20:01

[5 rows x 89 columns]
---------------------------------------------------------------
Pushshift direct df size: (100, 89)
  subreddit selftext  ... updated_utc     utc_datetime_str
0  politics           ...  1671053635  2022-12-14 21:33:38
1  politics           ...  1671053019  2022-12-14 21:23:28
2  politics           ...  1671048929  2022-12-14 20:15:13
3  politics           ...  1671048448  2022-12-14 20:07:17
4  politics           ...  1671038417  2022-12-14 17:20:01

[5 rows x 89 columns]
---------------------------------------------------------------

Process finished with exit code 0

Security-Chief-Odo avatar Security-Chief-Odo commented on June 14, 2024

Thanks for showing those changes. For some reason, implementing them makes PMAW very slow to respond. What used to finish in ~30 seconds now takes 5+ minutes.

Start 2022-12-18 15:33:13

resPosts = api.search_submissions(since=start_epoch, subreddit=<sub>, author=user, limit=10, check_totals=False)
resComments = api.search_comments(since=start_epoch, subreddit=<sub>, author=user, limit=25, check_totals=False)

End 2022-12-18 15:38:45

Sellitus avatar Sellitus commented on June 14, 2024

@Dominyk4s

> # Pushshift API directly
> api_query = 'https://api.pushshift.io/reddit/submission/search/?q=' + 'biden' \
>             + '&after=' + str(start_epoch) \
>             + '&before=' + str(end_epoch) \
>             + '&subreddit=' + 'politics' \
>             + '&limit=' + str(100)
> 
> 

Just a heads up, before and after are deprecated. Not sure how long that will work.

after | string (After) Search after this epoch time (inclusive) -- deprecated (use since)

before | string (Before) Search before this epoch time (exclusive) -- deprecated (use until)

Wow, no one is going to use the Pushshift API after those go away lol

mattpodolak avatar mattpodolak commented on June 14, 2024

this will be fixed in v3.0.0 after #52 is merged + released

Arobnett avatar Arobnett commented on June 14, 2024

There's no working alternative version in the meantime?

YS-SHI-93 avatar YS-SHI-93 commented on June 14, 2024

I also encountered this issue.

Querying "api.pushshift.io/reddit/submission/search/" directly works, but it seems to cover only a very limited period of time (i.e., one month or so).

If I want to retrieve something older than a month, say from 12 months ago, opening this link (see below) in a browser only returns an empty list:

link: https://api.pushshift.io/reddit/submission/search?q=&after=1633010400&before=1640236250&subreddit=science&limit=999

Specific return:
{"data":[],"error":null,"metadata":{"es":{"took":8,"timed_out":false,"_shards":{"total":4,"successful":4,"skipped":3,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null}},"es_query":{"size":999,"query":{"bool":{"must":[{"bool":{"must":[{"range":{"created_utc":{"gte":1633010400000}}},{"range":{"created_utc":{"lt":1640236250000}}}]}},{"bool":{"should":[{"match":{"subreddit":"science"}}],"minimum_should_match":1}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{"size":999,"query":{"bool":{"must":[{"bool":{"must":[{"range":{"created_utc":{"gte":1633010400000}}},{"range":{"created_utc":{"lt":1640236250000}}}]}},{"bool":{"should":[{"match":{"subreddit":"science"}}],"minimum_should_match":1}}]}},"aggs":{},"sort":{"created_utc":"desc"}}"}}
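To make a response like the one above easier to read, a small helper can pull out the few fields that matter: how many rows came back, how many hits Elasticsearch reported, how many shards were skipped, and whether an error was set. This is only a sketch; the field names are the ones visible in the response pasted above, the function name `summarize_response` is made up, and the sample dict is a trimmed copy of that response rather than a fresh API call.

```python
def summarize_response(payload: dict) -> dict:
    """Extract a few diagnostic fields from a Pushshift search response.

    Hypothetical helper; the keys follow the response shown in this thread.
    """
    es = payload.get("metadata", {}).get("es", {})
    shards = es.get("_shards", {})
    return {
        "rows": len(payload.get("data", [])),
        "total_hits": es.get("hits", {}).get("total", {}).get("value"),
        "shards_total": shards.get("total"),
        "shards_skipped": shards.get("skipped"),
        "error": payload.get("error"),
    }


# Trimmed copy of the response above (query echo fields omitted)
sample = {
    "data": [],
    "error": None,
    "metadata": {
        "es": {
            "took": 8,
            "timed_out": False,
            "_shards": {"total": 4, "successful": 4, "skipped": 3, "failed": 0},
            "hits": {"total": {"value": 0, "relation": "eq"}, "max_score": None},
        }
    },
}

print(summarize_response(sample))
```

With `error` unset and zero hits, the request itself went through and simply matched nothing in the requested time range.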

mattpodolak avatar mattpodolak commented on June 14, 2024

> There's no working alternative version in the meantime?

Not right now. I'm hesitant to release a version that I haven't fully tested; however, if everything goes well I will be able to release today 🙏🏾

mattpodolak avatar mattpodolak commented on June 14, 2024

> I also encountered this issue.
>
> Querying "api.pushshift.io/reddit/submission/search/" directly works, but it seems to cover only a very limited period of time (i.e., one month or so).
>
> If I want to retrieve something older than a month, say from 12 months ago, opening this link (see below) in a browser only returns an empty list:
>
> link: https://api.pushshift.io/reddit/submission/search?q=&after=1633010400&before=1640236250&subreddit=science&limit=999
>
> Specific return: {"data":[],"error":null,"metadata":{"es":{"took":8,"timed_out":false,"_shards":{"total":4,"successful":4,"skipped":3,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null}},"es_query":{"size":999,"query":{"bool":{"must":[{"bool":{"must":[{"range":{"created_utc":{"gte":1633010400000}}},{"range":{"created_utc":{"lt":1640236250000}}}]}},{"bool":{"should":[{"match":{"subreddit":"science"}}],"minimum_should_match":1}}]}},"aggs":{},"sort":{"created_utc":"desc"}},"es_query2":"{"size":999,"query":{"bool":{"must":[{"bool":{"must":[{"range":{"created_utc":{"gte":1633010400000}}},{"range":{"created_utc":{"lt":1640236250000}}}]}},{"bool":{"should":[{"match":{"subreddit":"science"}}],"minimum_should_match":1}}]}},"aggs":{},"sort":{"created_utc":"desc"}}"}}

There have been some parameter changes: until is the new before and since is the new after.

looking at the COLO switchover bug thread, it looks like some other people have had the 1 month of data issue

eddvrs avatar eddvrs commented on June 14, 2024

Yes, they've not loaded the old historical data yet. I think it's due soon, but I've not been following too closely the last couple of days. I'd be happy to revisit my changes once more testing is possible.

RadoslavL avatar RadoslavL commented on June 14, 2024

I am still getting the same issue in version 3.0.0 with this code:

api = PushshiftAPI(praw=reddit)
posts = api.search_submissions(subreddit="science", limit=10000)
post_list = [p for p in posts]
print(len(post_list))

The print call returns 0.

Edit: Scratch that, it's a completely different problem. The API is locked for unregistered users.
