list index out of range · pmaw · 10 comments · closed

mattpodolak commented on June 14, 2024
list index out of range

Comments (10)

pavkriz commented on June 14, 2024

My quick and dirty "hotfix" (could probably be done better) in PushshiftAPIBase._futures_handler:

                        if num > 0:
                            # find minimum `created_utc` to set as the `before` parameter in next timeslices
                            if remaining == 1 and len(data) == 0:
                                log.warning('1 record remaining to fetch but 0 data returned, ignoring')
                            else:
                                before = data[-1]['created_utc']

                                # generate payloads
                                self.req.gen_slices(
                                    url, payload, after, before, num)

mattpodolak commented on June 14, 2024

Thanks @pavkriz, I implemented a modified version of your hotfix!

pavkriz commented on June 14, 2024

The #32 fix does not work well. In some situations, it gets stuck in an infinite loop, repeating the last request (which returns no data):

INFO:pmaw.PushshiftAPIBase:51028 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 100 - Batches: 10 - Items Remaining: 41030
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 200 - Batches: 20 - Items Remaining: 31039
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 300 - Batches: 30 - Items Remaining: 21094
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 400 - Batches: 40 - Items Remaining: 13231
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 500 - Batches: 50 - Items Remaining: 4676
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 600 - Batches: 60 - Items Remaining: 276
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 615 - Batches: 70 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 625 - Batches: 80 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 635 - Batches: 90 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 645 - Batches: 100 - Items Remaining: 266
INFO:pmaw.PushshiftAPIBase:Checkpoint:: Success Rate: 100.00% - Requests: 655 - Batches: 110 - Items Remaining: 266
...
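
My guess at the failure mode (a hypothetical sketch, not pmaw's actual internals): if an empty slice re-queues the same time window unchanged, the identical request repeats forever:

    # Hypothetical illustration only, not pmaw's code: an empty result
    # re-queues the identical (after, before) window, so the loop never ends.
    def fetch(after, before):
        return []  # this slice always comes back empty

    queue = [(1637995680, 1637995779)]
    while queue:
        after, before = queue.pop()
        data = fetch(after, before)
        if not data:
            queue.append((after, before))  # unchanged window -> infinite loop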

This hotfix works for me:

                        if num > 0:
                            # find minimum `created_utc` to set as the `before` parameter in next timeslices
                            if len(data) > 0:
                                # make sure that an index error won't occur
                                # we want to slice using payload['before'] if we don't get any results
                                before = data[-1]['created_utc']

                                # generate payloads
                                self.req.gen_slices(
                                    url, payload, after, before, num)
                            else:
                                log.warning('Records remaining to fetch but 0 data returned, ignoring')

mattpodolak commented on June 14, 2024

@pavkriz just added to v2.1.2 :D

mattpodolak commented on June 14, 2024

Thanks for opening an issue @XavierRigoulet, can you include the minimum amount of code required to recreate this bug?

XavierRigoulet commented on June 14, 2024

Thank you for getting back to me. Sorry, here is the code:

from pmaw import PushshiftAPI


api = PushshiftAPI()


submissions = api.search_submissions(subreddit=['wallstreetbets'], limit=None, num_workers=10, mem_safe=True, safe_exit=True)

submission_list = [submission for submission in submissions]

The error only seems to happen when calling the search_submissions() method. It seems fine when calling the search_comments() method...

sean-doody commented on June 14, 2024

I am also now getting this error, seemingly out of nowhere. Unlike @XavierRigoulet, though, it occurs when searching comments as well.

Here's my code:

import time
import sqlite3
import pandas as pd
from pmaw import PushshiftAPI

# Setup logging:
import sys
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(sys.stdout))


# Main scraping function:
def main():
    # Timer:
    start = time.time()

    before = int(time.time())

    DATA = "SQL_DATABASE.db"
    tables = ["comments", "posts"]
    subreddit = "SUBREDDIT"

    #Initialize API:
    api = PushshiftAPI()
    
    for table in tables:
        if table == "comments":
            # Desired data fields for comments:
            fields = ['id', 'permalink', 'body', 'author', 'distinguished', 
                    'author_flair_text', 'created_utc', 'subreddit', 'subreddit_id', 
                    'link_id', 'parent_id', 'score', 'retrieved_on', 'stickied']
            
            threads = 4

            # Get start date:
            conn = sqlite3.connect(DATA)
            after = pd.read_sql("SELECT max(created_utc) AS max_date FROM comments", conn)
            after = int(after["max_date"][0])
            conn.close()

            results = api.search_comments(subreddit=subreddit, 
                                            fields=fields,
                                            before=before,
                                            after=after, 
                                            safe_exit=True, 
                                            workers=threads)

            print('Getting all JSON')
            comments = [c for c in results]

            print('Creating dataframe')
            df = pd.DataFrame(comments)

            print(f"Saving {table} data")
            conn = sqlite3.connect(DATA)
            df.to_sql(table, conn, if_exists="append")
            conn.close()

            print(f'Finished {table}')

        elif table == "posts":

            fields = ["id", "author", "title", "selftext", "score", "upvote_ratio", "total_awards_received", "stickied", "pinned", "num_comments", "num_crossposts",
            "subreddit", "subreddit_id", "author_flair_text", "author_fullname", "author_premium", "created_utc", "retrieved_on", "domain", "permalink", "full_link", 
            "url", "is_meta", "is_original_content", "is_reddit_media_domain", "is_self", "locked", "media_only", "over_18", "removed_by_category"]

            threads = 1

            # Get start date:
            conn = sqlite3.connect(DATA)
            after = int(pd.read_sql("SELECT MAX(created_utc) AS after_date FROM posts", conn)["after_date"][0])
            conn.close()

            results = api.search_submissions(subreddit=subreddit,
                                            fields=fields,
                                            before=before,
                                            after=after, 
                                            safe_exit=True, 
                                            workers=threads)
            
            print('Getting all JSON')
            posts = [c for c in results]

            print('Creating dataframe')
            df = pd.DataFrame(posts)

            print(f"Saving {table} data")
            conn = sqlite3.connect(DATA)
            df.to_sql("posts", conn, if_exists="append")
            conn.close()


            print(f'Finished {table}')

    # End timer:
    end = time.time()
    print(f'Finished program in {round((end-start)/60, 3)} minutes')

if __name__ == '__main__':
    main()

pavkriz commented on June 14, 2024

Same error here. I make one api.search_comments call for every day (shifting before and after by 24 hours between calls); it works fine for some days (before/after time ranges) and always fails for one particular day (time range). So it's probably not random but systematic.
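
For reference, a minimal sketch of that daily loop (hypothetical subreddit and start date, not my exact script):

    import datetime as dt
    from pmaw import PushshiftAPI

    api = PushshiftAPI()
    day = dt.datetime(2021, 11, 27, tzinfo=dt.timezone.utc)  # assumed start date
    for _ in range(7):  # one search_comments call per 24-hour window
        after = int(day.timestamp())
        before = int((day + dt.timedelta(hours=24)).timestamp())
        comments = list(api.search_comments(subreddit='wallstreetbets', after=after, before=before))
        print(day.date(), len(comments))
        day += dt.timedelta(hours=24)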

mattpodolak commented on June 14, 2024

Thanks for the updates! I'll be pushing out a fix for this later this week

pavkriz commented on June 14, 2024

Adding a DEBUG log:

2021-11-29 11:59:15,379 - pmaw.PushshiftAPIBase - INFO - 47 result(s) available in Pushshift
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 42
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637992800 - 1637993160 returned 5 results
2021-11-29 11:59:16,661 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:17,513 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 38
2021-11-29 11:59:17,514 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993160 - 1637993520 returned 4 results
2021-11-29 11:59:17,514 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 33
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993520 - 1637993880 returned 5 results
2021-11-29 11:59:18,341 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 29
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637993880 - 1637994240 returned 4 results
2021-11-29 11:59:19,315 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 22
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994600 - 1637994960 returned 7 results
2021-11-29 11:59:21,416 - pmaw.PushshiftAPIBase - DEBUG - 7 total results for this time slice
2021-11-29 11:59:21,801 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 17
2021-11-29 11:59:21,801 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994240 - 1637994600 returned 5 results
2021-11-29 11:59:21,802 - pmaw.PushshiftAPIBase - DEBUG - 5 total results for this time slice
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 10
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637994960 - 1637995320 returned 7 results
2021-11-29 11:59:22,377 - pmaw.PushshiftAPIBase - DEBUG - 7 total results for this time slice
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 6
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995320 - 1637995680 returned 4 results
2021-11-29 11:59:23,440 - pmaw.PushshiftAPIBase - DEBUG - 4 total results for this time slice
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 3
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637996040 - 1637996400 returned 3 results
2021-11-29 11:59:25,337 - pmaw.PushshiftAPIBase - DEBUG - 3 total results for this time slice
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 2
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995680 - 1637996040 returned 1 results
2021-11-29 11:59:27,597 - pmaw.PushshiftAPIBase - DEBUG - 2 total results for this time slice
2021-11-29 11:59:27,810 - pmaw.PushshiftAPIBase - DEBUG - Remaining limit 2
2021-11-29 11:59:27,810 - pmaw.PushshiftAPIBase - DEBUG - Time slice from 1637995680 - 1637995779 returned 0 results
2021-11-29 11:59:27,811 - pmaw.PushshiftAPIBase - DEBUG - 1 total results for this time slice

Then the error is raised because remaining is > 0 but len(data) is 0, so there is no record to get created_utc from.
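
The crash itself is just empty-list indexing, e.g.:

    data = []                         # the slice returned no results
    before = data[-1]['created_utc']  # IndexError: list index out of range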
