Download papers and supplemental materials from open-access paper websites, such as AAAI, AISTATS, COLT, CORL, CVPR, ECCV, ICCV, ICLR, ICML, IJCAI, JMLR, NIPS, RSS, and WACV.
License: MIT License
paper_downloader's Issues
If the author has time, could you add this?
from multiprocessing import Pool
import os
import re

import requests
from bs4 import BeautifulSoup


def validateTitle(title):
    # Replace characters that are not allowed in file names with "_".
    rstr = r'[/\\:*?"<>|]'
    new_title = re.sub(rstr, "_", title)
    return new_title


def fix_file_path(file_path):
    # Turn line breaks into spaces, then collapse repeated whitespace.
    file_path = re.sub(r"[\r]+", " ", file_path)
    file_path = re.sub(r"[\n]+", " ", file_path)
    file_path = re.sub(r"\s{2,}", " ", file_path)
    return file_path


def download_pdf(args):
    # args is a (pdf_url, pdf_file) pair so the function works with Pool.map.
    pdf_url, pdf_file = args
    with open(pdf_file, "wb") as f:
        f.write(requests.get(pdf_url).content)
    print(f"Downloaded {pdf_file} from {pdf_url}")


if __name__ == "__main__":
    year = 2020
    base_url = "https://www.ifaamas.org/Proceedings/aamas{}/".format(year)
    url = base_url + "forms/contents.htm"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    out_dir = "aamas{}".format(year)
    os.makedirs(out_dir, exist_ok=True)

    pdf_urls = []
    pdf_files = []
    for a in soup.find_all("a", href=True):
        if a["href"].endswith(".pdf"):
            title = validateTitle(a.text.strip())
            pdf_url = base_url + a["href"].split("../")[-1]
            pdf_file = fix_file_path(out_dir + "/" + title + ".pdf")
            # Keep only paper PDFs, whose URLs start with ".../pdfs/p".
            if base_url + "pdfs/p" in pdf_url:
                pdf_urls.append(pdf_url)
                pdf_files.append(pdf_file)

    # Download the papers with multiple processes.
    with Pool(8) as p:
        p.map(download_pdf, zip(pdf_urls, pdf_files))
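The script above builds each PDF URL by splitting the href on "../" and appending the remainder to the proceedings base URL. A possible alternative, sketched here as a suggestion rather than as part of the original script, is to let the standard library resolve the link against the page it came from. The helper name resolve_pdf_url and the example href are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin


def resolve_pdf_url(page_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(page_url, href)


page_url = "https://www.ifaamas.org/Proceedings/aamas2020/forms/contents.htm"
example_href = "../pdfs/p1.pdf"  # hypothetical href, for illustration only
resolved = resolve_pdf_url(page_url, example_href)
print(resolved)
```

With this approach, absolute hrefs pass through unchanged and relative ones are resolved by the usual URL rules, so the code does not depend on how many "../" segments a link contains.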
Thanks for kindly sharing this.
I want to run the code to download ICLR 2021 papers, but I am confused about the html_path in:
html_path = r'F:\oral.html',
Any suggestions? Thanks in advance.
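As the title says. I am on macOS, and installing the pywin32 library keeps failing; I have tried several times without solving the problem. The corresponding code is: from win32com.client import Dispatch. Could you help me?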
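For context on the question above: pywin32 wraps Windows-only APIs, so it is not expected to install on macOS. One possible workaround, sketched below under the assumption that the Dispatch-based code path is optional and can be skipped on other platforms, is to guard the import. The helper name load_dispatch is hypothetical and does not come from the repository.

```python
import sys


def load_dispatch():
    """Return win32com.client.Dispatch on Windows, or None elsewhere."""
    if not sys.platform.startswith("win"):
        return None
    try:
        from win32com.client import Dispatch  # needs pywin32 (Windows only)
    except ImportError:
        return None
    return Dispatch


Dispatch = load_dispatch()
if Dispatch is None:
    print("pywin32 is unavailable; skipping the COM-based code path")
```

Code that relies on Dispatch would then need to check for None and fall back to another download method, such as plain requests calls.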