Comments (4)
Tested several times.
If I initiate the scrape with no `before` parameter and then use the printed timestamp as `before` to resume, the first resume attempt does NOT load the initial requests: it reports that no cached requests were loaded and starts all over. When I re-run the query again with the same `before` value, it loads the cache from the first resume attempt, but since the count of items remaining has reset, the scraping also restarts from the beginning.
So basically, I cannot find a way to recover an interrupted scrape.
Hi @YukunYangNPF, I investigated this issue, and there appears to be a logging problem when the cache is loaded. When I ran your example code, I found that the cache was being loaded but not being reported as loaded.
I have added a static method in v2.1.0 that allows you to load responses directly from the cache to work with, and I have also modified the logging so that it is clear when the cache is loaded.
@mattpodolak
import datetime as dt
from pmaw import PushshiftAPI

def submissions(subreddit, after=[2020, 1, 1], before=[2021, 1, 1],
                limit=None, num_workers=20, file_checkpoint=8):
    # convert [year, month, day] lists to epoch timestamps
    after = int(dt.datetime(after[0], after[1], after[2]).timestamp())
    before = int(dt.datetime(before[0], before[1], before[2]).timestamp())
    api = PushshiftAPI(num_workers=num_workers, file_checkpoint=file_checkpoint)
    cache_dir = 'drive/My Drive/scraped data/cache/' + subreddit + '/'
    api.search_submissions(subreddit=subreddit, after=after, before=before,
                           safe_exit=True, limit=limit, sort='desc',
                           cache_dir=cache_dir, mem_safe=True)

def func():
    try:
        submissions('science')
    except:
        print('\n FUCK ERRORS \n')

func()
func()
Can you tell me how to resolve this?
After a few "FUCK ERRORS", it just starts again from the beginning.
(I dump the cached data later using separate code. I just finish collecting the pickle(gzip) files first.)
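The bare `except` in the code above swallows the exception, which makes it impossible to see why the scrape keeps restarting. A generic retry helper that logs the exception type before retrying would surface it (this is a sketch, not pmaw API; `run_with_retries` is a hypothetical name):

```python
import time
import traceback

def run_with_retries(task, max_retries=3, delay=0.0):
    """Call task() until it succeeds, logging each failure.

    Printing the exception type (rather than a fixed message) shows
    what actually interrupts the scrape, e.g. a network timeout.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            print(f'attempt {attempt} failed: {type(exc).__name__}: {exc}')
            traceback.print_exc()
            if attempt == max_retries:
                raise
            time.sleep(delay)
```

Used as `run_with_retries(lambda: submissions('science'))`, the log then shows which exception is being raised each time.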
Hi @gauravkhadgi, thanks for reporting this. What exception is being thrown when you run this code? Also, can you open a new issue, since this doesn't seem related to the parent issue?