Strategy for missed posts? about pmaw HOT 10 CLOSED

mattpodolak commented on June 14, 2024
Strategy for missed posts?

from pmaw.

Comments (10)

mattpodolak commented on June 14, 2024

Hi @ChrisPalmerNZ, the printed success rate reflects how many requests were rejected due to rate limiting by Pushshift; any failed request is retried automatically.

That's interesting that there was a different number of comments. Can you share the query that you ran? Were there any shards down while you ran the query?

The safe_exit feature is intended for long-running queries which may be interrupted during execution, as it allows you to resume using cached requests and responses. I may be able to recommend a strategy for missed posts based on the parameters you are using.
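
For readers following along, here is a minimal sketch of how a success rate like this can fall below 100% while every request still ends up answered. It is not pmaw's actual implementation; `run_with_retries` and `make_flaky_sender` are hypothetical names, and the sender only simulates rate-limit rejections.

```python
from collections import deque


def run_with_retries(request_ids, send, max_attempts=10):
    """Send each request, re-queue rejected ones, and track the success rate."""
    pending = deque((rid, 1) for rid in request_ids)
    responses = {}
    attempts = 0
    successes = 0
    while pending:
        rid, tries = pending.popleft()
        attempts += 1
        result = send(rid)
        if result is not None:
            successes += 1
            responses[rid] = result
        elif tries < max_attempts:
            # Rejected (for example, rate limited): put it back in the queue.
            pending.append((rid, tries + 1))
    success_rate = 100.0 * successes / attempts if attempts else 100.0
    return responses, success_rate


def make_flaky_sender(reject_first_n):
    """Hypothetical stand-in for an HTTP call that rejects its first n calls."""
    state = {"calls": 0}

    def send(rid):
        state["calls"] += 1
        if state["calls"] <= reject_first_n:
            return None
        return {"id": rid}

    return send


responses, rate = run_with_retries(range(8), make_flaky_sender(2))
print(len(responses), f"{rate:.2f}%")
```

In this toy run all 8 requests are eventually answered even though 2 of the 10 attempts were rejected, so a low success rate on its own does not imply missing data.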

ChrisPalmerNZ commented on June 14, 2024

Hi @mattpodolak

Thanks for replying, and sorry it's taken me a while to get back to you.

I used the same subreddit, start and end times, and query for both libraries: subreddit='CovidVaccinated', before=1618581599, and after=1606262347.

And the query (<api> signifies that I used either psaw or pmaw):

    <api>.search_comments(
        after=after,
        before=before,
        subreddit=subreddit,
        fields=["id", "subreddit", "link_id", "parent_id", "is_submitter", "author",
                "author_fullname", "body", "score", "created_utc", "permalink"],
        limit=None,
    )

I didn't check the shards. Should I run api.metadata_.get('shards') to check them?

I got this from psaw - but eventually I got 40,762 comments:

D:\anaconda3\envs\pytorch_1_4\lib\site-packages\psaw\PushshiftAPI.py:192: UserWarning: Got non 200 code 502
  warnings.warn("Got non 200 code %s" % response.status_code)
D:\anaconda3\envs\pytorch_1_4\lib\site-packages\psaw\PushshiftAPI.py:192: UserWarning: Got non 200 code 522
  warnings.warn("Got non 200 code %s" % response.status_code)

I got this from pmaw, and got 40,630 comments:

40730 results available in Pushshift
Checkpoint:: Success Rate: 93.00% - Requests: 100 - Batches: 10 - Items Remaining: 31889
Checkpoint:: Success Rate: 90.50% - Requests: 200 - Batches: 20 - Items Remaining: 23382
Checkpoint:: Success Rate: 88.00% - Requests: 300 - Batches: 30 - Items Remaining: 16397
Checkpoint:: Success Rate: 84.50% - Requests: 400 - Batches: 40 - Items Remaining: 10293
Checkpoint:: Success Rate: 83.20% - Requests: 500 - Batches: 50 - Items Remaining: 4528
Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0

mattpodolak commented on June 14, 2024

I'm currently working on troubleshooting what happened to those 100 missing comments.

The number of comments returned by psaw is likely incorrect. I ran your query directly against Pushshift, and the metadata indicates that there are only 40,730 total results available.

https://api.pushshift.io/reddit/search/comment/?after=1606262347&before=1618581599&subreddit=CovidVaccinated&metadata=true

"metadata": {
        "after": 1606262347,
        "agg_size": 100,
        "api_version": "3.0",
        "before": 1618581599,
        "es_query": {
            "query": {
                "bool": {
                    "filter": {
                        "bool": {
                            "must": [
                                {
                                    "terms": {
                                        "subreddit": [
                                            "covidvaccinated"
                                        ]
                                    }
                                },
                                {
                                    "range": {
                                        "created_utc": {
                                            "gt": 1606262347
                                        }
                                    }
                                },
                                {
                                    "range": {
                                        "created_utc": {
                                            "lt": 1618581599
                                        }
                                    }
                                }
                            ],
                            "should": []
                        }
                    },
                    "must_not": []
                }
            },
            "size": 25,
            "sort": {
                "created_utc": "asc"
            }
        },
        "execution_time_milliseconds": 38.79,
        "index": "rc_delta3",
        "metadata": "true",
        "ranges": [
            {
                "range": {
                    "created_utc": {
                        "gt": 1606262347
                    }
                }
            },
            {
                "range": {
                    "created_utc": {
                        "lt": 1618581599
                    }
                }
            }
        ],
        "results_returned": 25,
        "shards": {
            "failed": 0,
            "skipped": 0,
            "successful": 4,
            "total": 4
        },
        "size": 25,
        "sort": "asc",
        "sort_type": "created_utc",
        "subreddit": [
            "CovidVaccinated"
        ],
        "timed_out": false,
        "total_results": 40730
    }
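
As a rough sketch, the metadata query above can be assembled from the same parameters, and the baseline count read from the parsed response. The helper names `build_metadata_url` and `total_available` are made up for illustration, and no request is sent here; the sample dictionary only mirrors the fields shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_metadata_url(subreddit, after, before):
    """Build a comment search URL that asks Pushshift to include metadata."""
    params = {
        "after": after,
        "before": before,
        "subreddit": subreddit,
        "metadata": "true",
    }
    return BASE_URL + "?" + urlencode(params)


def total_available(response_json):
    """Read the total result count from an already-parsed JSON response."""
    return response_json["metadata"]["total_results"]


url = build_metadata_url("CovidVaccinated", 1606262347, 1618581599)
print(url)

# Trimmed-down copy of the metadata shown in this comment
sample_response = {"metadata": {"total_results": 40730, "timed_out": False}}
print(total_available(sample_response))
```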

mattpodolak commented on June 14, 2024

Additional update: I ran your query with both psaw and pmaw.

psaw returned 40,726 comments with unique IDs, while pmaw returned 40,729 comments with unique IDs.

I am currently investigating why 1 comment was missed. I'll release an update sometime this week once I discover the root cause.
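
A comparison like this can be sketched as follows, assuming each library's results have been collected into a list of dictionaries with an "id" key. The function names and the toy data are illustrative only, not real Pushshift IDs.

```python
def summarize_ids(comments):
    """Return (total rows, number of distinct ids) for a list of comment dicts."""
    ids = [c["id"] for c in comments]
    return len(ids), len(set(ids))


def compare_results(results_a, results_b):
    """Return the ids found only in A and the ids found only in B."""
    ids_a = {c["id"] for c in results_a}
    ids_b = {c["id"] for c in results_b}
    return sorted(ids_a - ids_b), sorted(ids_b - ids_a)


# Toy data standing in for the two result sets
psaw_like = [{"id": "a1"}, {"id": "a2"}, {"id": "a3"}]
pmaw_like = [{"id": "a1"}, {"id": "a2"}, {"id": "a2"}, {"id": "a4"}]

print(summarize_ids(pmaw_like))               # (4, 3)
print(compare_results(psaw_like, pmaw_like))  # (['a3'], ['a4'])
```

Counting distinct IDs rather than rows keeps duplicates from hiding a missing comment.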

ChrisPalmerNZ commented on June 14, 2024

Thanks for doing this Matthew. I ran the pmaw query a day after the psaw one, and I noticed at that time that it said that fewer (40,730) posts were available than what psaw returned. I measured the number of posts from both libraries by the length of the data, rather than any reporting by the library. I am not currently in front of my PC, but I have saved the data so when I get home tonight I will look at it to see if there were any duplicates returned that might explain the higher psaw number.

ChrisPalmerNZ commented on June 14, 2024

Hi Matthew,
I looked at my data and realized that I had transposed the last 2 digits of the psaw count: it was 40,726 rather than 40,762, which agrees with what you reported for psaw. However, my count from pmaw was lower, at 40,630. Perhaps I was unlucky and there were shards down. Can you advise me how I should have tested for that? I am happy to email the IDs to you if that helps your inquiry.

mattpodolak commented on June 14, 2024

Usually, if shards are down a warning should be printed in both pmaw and psaw.
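
For a manual check, something along these lines should work against the "shards" section of the metadata shown earlier in this thread. This is only a sketch: `shards_down` and `warn_if_shards_down` are hypothetical helpers, not functions from pmaw or psaw, and they assume the metadata has already been parsed into a dictionary.

```python
import warnings


def shards_down(metadata):
    """Return how many shards did not respond successfully."""
    shards = metadata.get("shards", {})
    return shards.get("total", 0) - shards.get("successful", 0)


def warn_if_shards_down(metadata):
    """Emit a warning and return True when any shard is down."""
    down = shards_down(metadata)
    if down > 0:
        total = metadata["shards"]["total"]
        warnings.warn(f"{down} of {total} shards are down; results may be incomplete")
    return down > 0


# Shard counts copied from the metadata earlier in this thread
healthy = {"shards": {"failed": 0, "skipped": 0, "successful": 4, "total": 4}}
# Made-up example of a degraded response
degraded = {"shards": {"failed": 1, "skipped": 0, "successful": 3, "total": 4}}

print(shards_down(healthy), shards_down(degraded))
```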

I'm not sure why there were 100 missing results, as I was unable to reproduce this, so it could be a data inconsistency in Pushshift. In the past I have partially lost pmaw results when exploring the data directly using the generator before storing it in a CSV.
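
To illustrate the generator pitfall in general terms (this uses a stand-in generator, not pmaw's response object): any item pulled out while exploring is gone by the time the rest is written out. Materializing the results into a list first avoids that.

```python
import csv
import io


def fake_results():
    """Hypothetical stand-in for a generator of comment dicts."""
    for i in range(5):
        yield {"id": f"c{i}", "body": f"comment {i}"}


# Peeking at a generator consumes the items that were looked at.
gen = fake_results()
preview = [next(gen) for _ in range(2)]
remaining = list(gen)  # only the three items that were not previewed

# Safer: materialize once, then explore and store from the list.
comments = list(fake_results())
buffer = io.StringIO()  # an in-memory file; a real file opened with newline="" works too
writer = csv.DictWriter(buffer, fieldnames=["id", "body"])
writer.writeheader()
writer.writerows(comments)

print(len(preview), len(remaining), len(comments))
```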

I would treat the number of available items reported by pmaw ("40730 results available in Pushshift") as a baseline for the number you should expect a query to return, and re-run accordingly.

Based on the logs you provided, it appears that pmaw re-ran the query after 1 result was not found. This has been fixed for the next release.

Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1 # finished the query
Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99 # re-tries the query to get the missing item
Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0 
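
If you want to spot this pattern in saved logs, a small parser can flag any line where "Items Remaining" goes negative. This is a sketch that assumes the log format shown above; the format may differ in other versions, and `parse_progress` is a hypothetical helper.

```python
import re

LOG_PATTERN = re.compile(
    r"(?P<kind>Checkpoint|Total):: Success Rate: (?P<rate>[\d.]+)% - "
    r"Requests: (?P<requests>\d+) - Batches: (?P<batches>\d+) - "
    r"Items Remaining: (?P<remaining>-?\d+)"
)


def parse_progress(line):
    """Parse one progress line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "kind": match["kind"],
        "rate": float(match["rate"]),
        "requests": int(match["requests"]),
        "batches": int(match["batches"]),
        "remaining": int(match["remaining"]),
    }


log_lines = [
    "Total:: Success Rate: 83.02% - Requests: 583 - Batches: 59 - Items Remaining: 1",
    "Checkpoint:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: -99",
    "Total:: Success Rate: 82.48% - Requests: 588 - Batches: 60 - Items Remaining: 0",
]
parsed = [parse_progress(line) for line in log_lines]

# A negative remainder means more items came back than were still expected.
overshoots = [p for p in parsed if p["remaining"] < 0]
print(len(overshoots))
```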

mattpodolak commented on June 14, 2024

Two problems discovered thanks to this issue:

  1. Duplicate results can be included if a query with before and after misses result(s) - fixed in v1.0.5
  2. Missing results can occur for queries that specify before and after - will be tracked in #13
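
Until both are resolved, a post-processing step along these lines can guard against each problem: drop repeated IDs, then compare the distinct count with the total that Pushshift reports. This is a sketch with hypothetical helper names, assuming the results are a list of dictionaries with an "id" key.

```python
def drop_duplicate_ids(comments):
    """Keep the first occurrence of each id, preserving the original order."""
    seen = set()
    unique = []
    for comment in comments:
        cid = comment["id"]
        if cid in seen:
            continue
        seen.add(cid)
        unique.append(comment)
    return unique


def missing_count(comments, expected_total):
    """How many distinct results are still missing relative to the baseline."""
    return max(expected_total - len(drop_duplicate_ids(comments)), 0)


# Toy data: five rows, one of them a duplicate, against a baseline of five
rows = [{"id": "x1"}, {"id": "x2"}, {"id": "x2"}, {"id": "x3"}, {"id": "x4"}]
unique_rows = drop_duplicate_ids(rows)
print(len(unique_rows), missing_count(rows, expected_total=5))
```

In the toy data the row count equals the baseline, yet one result is still missing once the duplicate is removed.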

ChrisPalmerNZ commented on June 14, 2024

Thanks for all of that, Matt. I'm glad, and very impressed, that my issue got your devoted attention and that it led to an improvement. It's a great product! BTW, last night I re-ran the query and got all 40,730 results. Also, I am familiar with how generators work; I unpacked it straight to CSV, so that wasn't the issue here...

mattpodolak commented on June 14, 2024

No problem, thanks for reporting the issue. It's worth noting that the 40,730 results that pmaw returned to you likely include a single duplicate.

I'm still working on figuring out the root cause, but v1.0.5 will not add duplicate results.
