Comments (10)
Hi @ChrisPalmerNZ, the printed success rate reflects how many requests were rejected by Pushshift due to rate limiting; any failed request is retried automatically.
It's interesting that the number of comments differed. Can you share the query that you ran? Were any shards down while you ran the query?
The safe_exit feature is intended for long-running queries which may be interrupted during execution, as it allows you to resume using cached requests and responses. I may be able to recommend a strategy for missed posts based on the parameters you are using.
from pmaw.
Hi @mattpodolak
Thanks for replying, and sorry it's taken me a while to get back to you.
I used the same subreddit, time range, and query for both libraries. The parameters were subreddit='CovidVaccinated', before=1618581599, and after=1606262347.
And the query (<api> signifies that I used either psaw or pmaw):
<api>.search_comments(
    after=after,
    before=before,
    subreddit=subreddit,
    fields=["id", "subreddit", "link_id", "parent_id", "is_submitter", "author",
            "author_fullname", "body", "score", "created_utc", "permalink"],
    limit=None
)
I didn't check shards. Should I run api.metadata_.get('shards') to check them?
I got this from psaw - but eventually I got 40,762 comments:
D:\anaconda3\envs\pytorch_1_4\lib\site-packages\psaw\PushshiftAPI.py:192: UserWarning: Got non 200 code 502
warnings.warn("Got non 200 code %s" % response.status_code)
D:\anaconda3\envs\pytorch_1_4\lib\site-packages\psaw\PushshiftAPI.py:192: UserWarning: Got non 200 code 522
warnings.warn("Got non 200 code %s" % response.status_code)
I got this from pmaw, and got 40,630 comments:
40730 results available in Pushshift
Checkpoint:: Success Rate: 93.00% - Requests: 100 - Batches: 10 - Items Remaining: 31889
Checkpoint:: Success Rate: 90.50% - Requests: 200 - Batches: 20 - Items Remaining: 23382
Checkpoint:: Success Rate: 88.00% - Requests: 300 - Batches: 30 - Items Remaining: 16397
Checkpoint:: Success Rate: 84.50% - Requests: 400 - Batches: 40 - Items Remaining: 10293
Checkpoint:: Success Rate: 83.20% - Requests: 500 - Batches: 50 - Items Remaining: 4528
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0
I'm currently working on troubleshooting what happened to those 100 missing comments.
The number of comments returned by psaw is likely incorrect. I ran your query directly against Pushshift, and the metadata indicates that there are only 40730 total results available.
"metadata": {
  "after": 1606262347,
  "agg_size": 100,
  "api_version": "3.0",
  "before": 1618581599,
  "es_query": {
    "query": {
      "bool": {
        "filter": {
          "bool": {
            "must": [
              {
                "terms": {
                  "subreddit": [
                    "covidvaccinated"
                  ]
                }
              },
              {
                "range": {
                  "created_utc": {
                    "gt": 1606262347
                  }
                }
              },
              {
                "range": {
                  "created_utc": {
                    "lt": 1618581599
                  }
                }
              }
            ],
            "should": []
          }
        },
        "must_not": []
      }
    },
    "size": 25,
    "sort": {
      "created_utc": "asc"
    }
  },
  "execution_time_milliseconds": 38.79,
  "index": "rc_delta3",
  "metadata": "true",
  "ranges": [
    {
      "range": {
        "created_utc": {
          "gt": 1606262347
        }
      }
    },
    {
      "range": {
        "created_utc": {
          "lt": 1618581599
        }
      }
    }
  ],
  "results_returned": 25,
  "shards": {
    "failed": 0,
    "skipped": 0,
    "successful": 4,
    "total": 4
  },
  "size": 25,
  "sort": "asc",
  "sort_type": "created_utc",
  "subreddit": [
    "CovidVaccinated"
  ],
  "timed_out": false,
  "total_results": 40730
}
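For anyone who wants to check shard health programmatically, the "shards" block in the metadata above can be inspected with a small helper. This is only a sketch: `shards_ok` is a hypothetical helper name (not part of psaw or pmaw), and the dict literal simply copies the shard counts from the response shown above.

```python
def shards_ok(metadata):
    """Return True when every shard reported in the metadata answered successfully.

    `metadata` is assumed to be a dict shaped like the Pushshift "metadata"
    object above, with a "shards" entry holding "total", "successful",
    and "failed" counts.
    """
    shards = metadata.get("shards", {})
    total = shards.get("total", 0)
    return (
        total > 0
        and shards.get("successful", 0) == total
        and shards.get("failed", 0) == 0
    )


# Shard counts copied from the metadata response above.
metadata = {
    "shards": {"failed": 0, "skipped": 0, "successful": 4, "total": 4},
    "total_results": 40730,
}

print(shards_ok(metadata))
```

If this returns False, some shards did not answer, so the result count for that run is probably incomplete and the query is worth re-running.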
Additional update: I ran your query with both psaw and pmaw.
psaw returned 40726 comments with unique ids, while pmaw returned 40729 comments with unique ids.
I am currently investigating why 1 comment was missed. I'll release an update sometime this week once I discover the root cause.
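A comparison like the one above (unique ids per library, plus what each side is missing) can be sketched in plain Python. The ids below are made-up placeholders, and `compare_ids` is a hypothetical helper, not a function from psaw or pmaw. It assumes you have already collected the "id" field of every returned comment into two lists.

```python
from collections import Counter


def compare_ids(ids_a, ids_b):
    """Summarise unique ids, duplicates, and differences between two id lists."""
    counts_a, counts_b = Counter(ids_a), Counter(ids_b)
    set_a, set_b = set(counts_a), set(counts_b)
    return {
        "unique_a": len(set_a),
        "unique_b": len(set_b),
        # ids that appear more than once within one result set
        "duplicates_a": sorted(i for i, n in counts_a.items() if n > 1),
        "duplicates_b": sorted(i for i, n in counts_b.items() if n > 1),
        # ids returned by one library but not the other
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
    }


# Placeholder ids for illustration only.
ids_a = ["a", "b", "c", "c"]
ids_b = ["a", "b", "d"]

report = compare_ids(ids_a, ids_b)
print(report)
```

Counting unique ids rather than the raw length of the data keeps duplicates from inflating one library's total.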
Thanks for doing this Matthew. I ran the pmaw query a day after the psaw one, and I noticed at that time that it said that fewer (40,730) posts were available than what psaw returned. I measured the number of posts from both libraries by the length of the data, rather than any reporting by the library. I am not currently in front of my PC, but I have saved the data so when I get home tonight I will look at it to see if there were any duplicates returned that might explain the higher psaw number.
Hi Matthew
I looked at my data and realized that I had transposed the last 2 digits of the psaw count: it was 40,726 rather than 40,762, which agrees with what you reported for psaw. However, my pmaw count was lower, at 40,630. Perhaps I was unlucky and there were shards down. Can you advise me how I should have tested for that? I am happy to email the IDs to you if that helps your inquiry.
Usually, if shards are down, a warning is printed by both pmaw and psaw.
I'm not sure why 100 results were missing, as I was unable to reproduce this, so it could be a data inconsistency in Pushshift. In the past, I have partially lost pmaw results when exploring the data directly using the generator before storing it in a CSV.
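The generator pitfall mentioned above is general Python behaviour and can be shown without any library. In this sketch, `fake_results` is a hypothetical stand-in for a lazily evaluated result set with made-up rows. It is not pmaw's actual response object.

```python
import csv
import io


def fake_results():
    """Stand-in for a lazily evaluated result generator (hypothetical data)."""
    for i in range(5):
        yield {"id": f"c{i}", "body": f"comment {i}"}


# Peeking at a generator consumes the items you look at.
gen = fake_results()
peek = [next(gen) for _ in range(2)]
remaining = list(gen)  # only the items that were not peeked at are left

# Safer: materialise everything once, then explore and store the same list.
rows = list(fake_results())
buffer = io.StringIO()  # in-memory file, used here instead of a real CSV path
writer = csv.DictWriter(buffer, fieldnames=["id", "body"])
writer.writeheader()
writer.writerows(rows)

print(len(peek), len(remaining), len(rows))
```

Anything read from the generator while exploring is no longer there when the remainder is written out, which is one way results can go missing before they reach the CSV.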
I would treat the number of available items reported by pmaw ("40730 results available in Pushshift") as a baseline for the number of results you should expect a query to return, and re-run accordingly.
Based on the logs you provided, it appears that pmaw re-ran the query after 1 result was not found. This has been fixed for the next release.
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1 # finished the query
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99 # re-tries the query to get the missing item
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0
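To read log lines like these in bulk, a small parser can recover the approximate number of rejected requests. This sketch assumes the printed success rate is successful requests divided by total requests, which the thread suggests but does not confirm. `parse_line` and `LINE_RE` are hypothetical names, not part of pmaw.

```python
import re

# Matches the "Checkpoint::" and "Total::" lines shown in the logs above.
LINE_RE = re.compile(
    r"(?P<kind>Checkpoint|Total):: Success Rate: (?P<rate>[\d.]+)% - "
    r"Requests: (?P<requests>\d+) - Batches: (?P<batches>\d+) - "
    r"Items Remaining: (?P<remaining>-?\d+)"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    requests = int(match["requests"])
    rate = float(match["rate"])
    # Assumption: success rate = successful requests / total requests.
    successful = round(requests * rate / 100)
    return {
        "kind": match["kind"],
        "requests": requests,
        "successful": successful,
        "rejected": requests - successful,
        "remaining": int(match["remaining"]),
    }


line = "Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0"
parsed = parse_line(line)
print(parsed)
```

A negative "Items Remaining" value, as in the checkpoint line above, parses to a negative integer, so it can be flagged as a sign that the query was retried.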
Two problems discovered thanks to this issue:
- Duplicate results can be included if a query with before and after misses result(s) - fixed in v1.0.5
- Missing results can occur for queries that specify before and after - will be tracked in #13
Thanks for all of that, Matt - I'm glad, and very impressed, that my issue got your devoted attention and that it led to an improvement - it's a great product! BTW, last night I re-ran the query and got all 40,730 results. Also, I am familiar with how generators work; I unpacked it straight to CSV, so that wasn't the issue here...
No problem, thanks for reporting the issue. It's worth noting that the 40,730 results that pmaw returned to you likely include a single duplicate.
I'm still working on figuring out the root cause, but v1.0.5 will not add duplicate results.