ppwwyyxx / sopaper

Automatically Search and Download Papers

Home Page: https://pypi.python.org/pypi/sopaper/

License: Other

Python 54.11% Shell 1.98% Makefile 0.38% TeX 7.86% CSS 3.68% HTML 13.56% JavaScript 18.42%

sopaper's Introduction

SoPaper, So Easy

This is a project designed for researchers to conveniently access papers they need.

The command line tool sopaper can automatically search for and download a paper from the Internet, given its title. The downloaded paper gets a readable file name (I wrote the tool in the first place because I was tired of file names that are random strings). It mainly supports searching for papers in computer science.

How to Use

Install command line dependencies:

pdftk command line executable.
- Using pdftk on OSX10.11 might lead to hangs. See here for more info.
poppler-utils (optional)

Install python package: pip install --user sopaper

Usage:

$ sopaper --help
$ sopaper "Distinctive image features from scale-invariant keypoints"
$ sopaper "https://arxiv.org/abs/1606.06160"

NOTE: If you are not on a school network, you may need to configure a proxy through the environment variables http_proxy and https_proxy to be able to download from certain sites (such as 'dl.acm.org').
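A minimal sketch, using only the Python standard library, of how to confirm that the two proxy variables are visible to a Python process. The proxy address is a placeholder chosen for illustration, and this snippet does not show how sopaper itself consumes the variables.

```python
import os
import urllib.request

# Placeholder proxy address (an assumption for this example only).
os.environ["http_proxy"] = "http://proxy.example.org:8080"
os.environ["https_proxy"] = "http://proxy.example.org:8080"

# getproxies_environment() maps each URL scheme to the value of <scheme>_proxy.
proxies = urllib.request.getproxies_environment()
print(proxies.get("http"), proxies.get("https"))
```

In a shell, the same variables would normally be exported before running sopaper.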

Features

The searcher module will fuzzy search and analyse results in

Google Scholar
Google

and the fetcher module will further analyse the results and download papers from the following possible sources:

Searcher and Fetcher are extensible to support more websites.

The command line tool will directly download the paper with a clean filename. All downloaded papers will be compressed using ps2pdf from poppler-utils, if available.
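As an illustration of what a "clean filename" could mean, here is a minimal sketch that derives a file name from a paper title. The function name clean_filename and the exact set of stripped characters are assumptions, not sopaper's actual implementation.

```python
import re


def clean_filename(title, ext=".pdf"):
    """Turn a paper title into a readable, filesystem-safe file name."""
    # Replace characters that are invalid on common filesystems.
    name = re.sub(r'[\\/:*?"<>|]', " ", title)
    # Collapse the whitespace runs left behind by the substitution.
    name = re.sub(r"\s+", " ", name).strip()
    return name + ext


print(clean_filename("Distinctive image features from scale-invariant keypoints"))
```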

TODO

Fetcher dedup: when both the arxiv abs and pdf pages appear in the search results, the page is downloaded twice (maybe add a cache for requests)
Don't trust arxiv links from Google Scholar
Is the title correctly updated for dlacm?
Extract the title from bibtex -- more accurate?
Fetcher for other sites
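One possible approach to the first TODO item, sketched under assumptions: map the abs and pdf URLs of the same arXiv paper to a single key before fetching, so each paper is handled once. The helper names dedup_key and dedup are hypothetical, and only new-style arXiv identifiers (such as 1606.06160) are handled.

```python
import re

# Matches new-style arXiv identifiers, with an optional version suffix.
_ARXIV_RE = re.compile(r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.I)


def dedup_key(url):
    """Return a key that is identical for the abs and pdf URL of one paper."""
    m = _ARXIV_RE.search(url)
    return "arxiv:" + m.group(1) if m else url


def dedup(urls):
    """Keep the first URL seen for each key, preserving order."""
    seen, unique = set(), []
    for url in urls:
        key = dedup_key(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


results = dedup([
    "https://arxiv.org/abs/1606.06160",
    "https://arxiv.org/pdf/1606.06160.pdf",
    "http://dl.acm.org/citation.cfm?id=18500",
])
print(results)
```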


sopaper's Issues

Oops!

$ paper-downloader -t "Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and"
Searching with Google Scholar
Searching with Google
Found item on google: Efficient algorithms for finding minimum spanning trees in ... - Springer at link.springer.com
Found item on google: Efficient algorithms for finding minimum spanning trees in undirected ... at link.springer.com
Found item on google: Efficient algorithms for finding minimum spanning trees in undirected ... at dl.acm.org
Found item on google: Efficient Algorithms for Finding Minimum Spanning Tree in ... at www.researchgate.net
Found item on google: Efficient algorithms for finding minimum spanning trees in undirected ... at citeseer.uark.edu:8080
Found item on google: Efficient Algorithms for Finding Minimum Spanning Trees in ... at citeseer.uark.edu:8080
Found item on google: Efficient algorithms for finding minimum spanning trees in undirected ... at www.bibsonomy.org
Directly Download to ./Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and.pdf...
URL is http://link.springer.com/content/pdf/10.1007%2FBF02579168.pdf
--2014-03-23 11:51:05-- http://link.springer.com/content/pdf/10.1007%2FBF02579168.pdf
Resolving link.springer.com (link.springer.com)... 211.155.87.20, 211.155.87.26
Connecting to link.springer.com (link.springer.com)|211.155.87.20|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘./Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and.pdf’

[ <=>                                   ] 52,589      --.-K/s   in 0.01s

2014-03-23 11:51:07 (4.65 MB/s) - ‘./Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and.pdf’ saved [52589]

./Efficient Algorithms for Finding Minimum Spanning Trees in Undirected and.pdf: HTML document, UTF-8 Unicode text, with very long lines

Format is not PDF!
Analyzing http://dl.acm.org/citation.cfm?id=18500
Download error: list index out of range
Traceback (most recent call last):
  File "/home/tim/software/Paper-Downloader/resources/resource.py", line 38, in download
    self.do_download(filename)
  File "/home/tim/software/Paper-Downloader/resources/dlacm.py", line 25, in do_download
    url = pdf[0].get('href')
IndexError: list index out of range
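The IndexError comes from indexing an empty result list. A minimal sketch of a guard, assuming pdf is a list-like collection of link objects that expose .get('href'); plain dicts stand in for those objects here, and first_href is a hypothetical helper name.

```python
def first_href(links):
    """Return the href of the first link, or None when nothing matched."""
    if not links:
        return None
    return links[0].get("href")


print(first_href([]))  # no IndexError on an empty result
print(first_href([{"href": "http://example.org/paper.pdf"}]))
```

A caller could treat None as "this source has no PDF link" and move on to the next source instead of aborting.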

Test for file type on windows

The code

    s = Popen('file "{0}"'.format(f.name),
              stdout=PIPE, shell=True).stdout.read()

is platform-specific. I suggest adding a switch in the config and using

    if ukconfig.USE_PYPDF2:
        try:
            # Let PyPDF2 parse the file; a parse failure means it is not a valid PDF.
            with open(f.name, "rb") as fo:
                PyPDF2.PdfFileReader(fo)
            s = "PDF document"
        except PyPDF2.utils.PdfReadError:
            s = "invalid PDF file"
    else:
        s = Popen('file "{0}"'.format(f.name),
                  stdout=PIPE, shell=True).stdout.read()
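As a dependency-free alternative to both options, the file signature can be inspected directly. This is a sketch rather than what sopaper does: looks_like_pdf is a hypothetical name, and it relies on the "%PDF-" marker appearing within the first 1024 bytes of the file.

```python
import os
import tempfile


def looks_like_pdf(path):
    """Check the file signature instead of calling the external `file` tool."""
    with open(path, "rb") as f:
        head = f.read(1024)
    # PDF files carry the marker "%PDF-" at, or very near, the start.
    return b"%PDF-" in head


# Small demonstration with two throwaway files.
with tempfile.TemporaryDirectory() as d:
    pdf_path = os.path.join(d, "a.pdf")
    html_path = os.path.join(d, "b.pdf")
    with open(pdf_path, "wb") as f:
        f.write(b"%PDF-1.4\n...")
    with open(html_path, "wb") as f:
        f.write(b"<!DOCTYPE html><html></html>")
    pdf_ok = looks_like_pdf(pdf_path)
    html_ok = looks_like_pdf(html_path)

print(pdf_ok, html_ok)
```

This only checks the header; unlike a full parse, it will not detect a truncated or corrupted PDF.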

Syntax Error

Hi,
I get the following error whenever I try to run sopaper:

Traceback (most recent call last):
  File "/home/clw/.local/bin/sopaper", line 11, in <module>
    load_entry_point('sopaper==0.8', 'console_scripts', 'sopaper')()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 489, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2843, in load_entry_point
    return ep.load()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2434, in load
    return self.resolve()
  File "/usr/lib/python3.7/site-packages/pkg_resources/__init__.py", line 2440, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "/home/clw/.local/lib/python3.7/site-packages/sopaper/__main__.py", line 23, in <module>
    from sopaper import searcher
  File "/home/clw/.local/lib/python3.7/site-packages/sopaper/searcher/__init__.py", line 8, in <module>
    from ..lib.ukutil import import_all_modules
  File "/home/clw/.local/lib/python3.7/site-packages/sopaper/lib/ukutil.py", line 76
    print check_filetype(open("./ukconfig.py").read(), 'PDF')
                       ^
SyntaxError: invalid syntax

I think I have all packages installed (see below), and have had this error now on two independent systems (Ubuntu 16.04 and ArchLinux). Any help would be appreciated.

Some more info on packages:

Package        Version 
-------------- --------
beautifulsoup4 4.7.1   
certifi        2019.3.9
chardet        3.0.4   
idna           2.8     
requests       2.21.0  
sopaper        0.8     
soupsieve      1.9.1   
termcolor      1.1.0   
urllib3        1.24.3  

extra/poppler 0.76.0-1 [installed]
    PDF rendering library based on xpdf 3.0
extra/poppler-data 0.4.9-1 [installed]
    Encoding data for the poppler PDF rendering library
extra/poppler-glib 0.76.0-1 [installed]
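The caret in the traceback points at a print statement, which is Python 2 syntax; the Python 3.7 interpreter shown in the paths rejects it at parse time, regardless of which packages are installed. A small sketch that demonstrates the difference, to be run under Python 3 (the parses helper is hypothetical):

```python
import ast

py2_line = 'print check_filetype(open("./ukconfig.py").read(), "PDF")'
py3_line = 'print(check_filetype(open("./ukconfig.py").read(), "PDF"))'


def parses(source):
    """Return True if `source` is syntactically valid for this interpreter."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


print(parses(py2_line), parses(py3_line))
```

Running sopaper 0.8 under a Python 2 interpreter would avoid this particular error.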

Feature Request

title auto-completion: e.g.
- given title: 'Object count area graphs for the evaluation'
- should complete to: 'Object count area graphs for the evaluation of object detection and segmentation algorithms'
meta-info extraction: such as conference, year, etc.
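A minimal sketch of how the requested title auto-completion could work, assuming a list of candidate titles is already available (for example from search results). complete_title is a hypothetical helper, and prefix matching is only one possible strategy.

```python
def complete_title(partial, candidates):
    """Return the shortest candidate that starts with `partial`, ignoring case."""
    prefix = partial.strip().lower()
    matches = [c for c in candidates if c.lower().startswith(prefix)]
    return min(matches, key=len) if matches else None


candidates = [
    "Object count area graphs for the evaluation of object detection and segmentation algorithms",
    "Distinctive image features from scale-invariant keypoints",
]
completed = complete_title("Object count area graphs for the evaluation", candidates)
print(completed)
```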

paper-downloader.py is not stand-alone

$ ./paper-downloader.py
Traceback (most recent call last):
  File "./paper-downloader.py", line 21, in <module>
    import fetcher
  File "/home/zxytim/software/SoPaper/common/fetcher/__init__.py", line 16, in <module>
    from dbsearch import search_exact
  File "/home/zxytim/software/SoPaper/common/dbsearch.py", line 89, in <module>
    init_title_for_similar_search()
  File "/home/zxytim/software/SoPaper/common/dbsearch.py", line 84, in init_title_for_similar_search
    db = get_mongo('paper')
  File "/home/zxytim/software/SoPaper/common/ukdbconn.py", line 25, in get_mongo
    _db = MongoClient(*ukconfig.mongo_conn)[ukconfig.mongo_db]
  File "/home/zxytim/.local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 352, in __init__
    raise ConnectionFailure(str(e))
pymongo.errors.ConnectionFailure: could not connect to 127.0.0.1:27018: [Errno 111] Connection refused
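The traceback shows the database connection being opened while dbsearch is imported, so the script fails before doing anything when MongoDB is not running. One way to avoid this, sketched with a stand-in connection function instead of pymongo, is to connect lazily on first use. make_lazy and fake_connect are hypothetical names.

```python
import functools


def make_lazy(connect):
    """Wrap `connect` so it runs on first use instead of at import time."""
    @functools.lru_cache(maxsize=None)
    def get():
        return connect()
    return get


calls = []


def fake_connect():
    # Stand-in for something like MongoClient(...)[db_name].
    calls.append(1)
    return {"db": "paper"}


get_db = make_lazy(fake_connect)
before = len(calls)   # defining get_db has not connected yet
db1 = get_db()        # first use triggers the connection
db2 = get_db()        # later uses reuse the cached object
after = len(calls)
print(before, after, db1 is db2)
```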

Very slow download on some server and relatively big files

The following line in requests_download seems to be responsible:

    for data in resp.iter_content():

Replacing it with

    for data in resp.iter_content(1024*1024):

seems to fix the slowdown.
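In requests, iter_content defaults to chunk_size=1, so the original loop runs once per byte. A minimal sketch of a chunked writer follows; save_stream and FakeResponse are hypothetical names, and the fake response only mimics the iter_content method so the example runs without network access.

```python
import os
import tempfile


def save_stream(resp, path, chunk_size=1024 * 1024):
    """Write a streamed response body to `path` in chunk_size-byte pieces."""
    written = 0
    with open(path, "wb") as f:
        for data in resp.iter_content(chunk_size):
            if data:  # skip empty keep-alive chunks
                f.write(data)
                written += len(data)
    return written


class FakeResponse:
    """Stand-in for a requests.Response obtained with stream=True."""

    def __init__(self, body):
        self._body = body

    def iter_content(self, chunk_size=1):
        for i in range(0, len(self._body), chunk_size):
            yield self._body[i:i + chunk_size]


with tempfile.TemporaryDirectory() as d:
    written = save_stream(FakeResponse(b"x" * 3000), os.path.join(d, "out.bin"), 1024)

print(written)
```

With a real download, resp would come from requests.get(url, stream=True).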

Integration

Any thoughts on integrating this with a GUI like paperhero?

SyntaxError: invalid syntax

sopaper "Distinctive image features from scale-invariant keypoints"
Traceback (most recent call last):
  File "/Users/jpope/miniconda3/envs/tf10/bin/sopaper", line 7, in <module>
    from sopaper.__main__ import main
  File "/Users/jpope/miniconda3/envs/tf10/lib/python3.7/site-packages/sopaper/__main__.py", line 23, in <module>
    from sopaper import searcher
  File "/Users/jpope/miniconda3/envs/tf10/lib/python3.7/site-packages/sopaper/searcher/__init__.py", line 8, in <module>
    from ..lib.ukutil import import_all_modules
  File "/Users/jpope/miniconda3/envs/tf10/lib/python3.7/site-packages/sopaper/lib/ukutil.py", line 76
    print check_filetype(open("./ukconfig.py").read(), 'PDF')
                       ^
SyntaxError: invalid syntax

I'm using python2

brew install poppler
Error: poppler 0.56.0 is already installed
To upgrade to 0.71.0, run brew upgrade poppler

seems to fix things.

failed to download paper

hello,
as the title says, sopaper fails to download papers

unfortunately, the error message does not give further specifics.

it apparently fails to find it:

(sopaper) nfg@NI-CA-107962:~$ sopaper -u "Distinctive image features from scale-invariant keypoints"
INFO Searching 'Distinctive Image Features from Scale-invariant Keypoints' with searcher: 'Google Scholar' ...
INFO Searching 'Distinctive Image Features from Scale-invariant Keypoints' with searcher: 'Google' ...
Results for Distinctive Image Features from Scale-invariant Keypoints:

my env specs are:

(sopaper) nfg@NI-CA-107962:~$ conda list
# packages in environment at /home/nfg/anaconda3/envs/sopaper:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
ca-certificates           2023.08.22           h06a4308_0
certifi                   2020.6.20          pyhd3eb1b0_3
libffi                    3.4.4                h6a678d5_0
libgcc-ng                 13.2.0               h807b86a_2    conda-forge
libgomp                   13.2.0               h807b86a_2    conda-forge
libsqlite                 3.43.0               h2797004_0    conda-forge
libstdcxx-ng              13.2.0               h7e041cc_2    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
ncurses                   6.4                  hcb278e6_0    conda-forge
pip                       20.1.1             pyh9f0ad1d_0    conda-forge
python                    2.7.18               h42bf7aa_3
readline                  8.2                  h8228510_1    conda-forge
setuptools                44.0.0                   py27_0    conda-forge
sqlite                    3.43.0               h2c6b66d_0    conda-forge
tk                        8.6.13               h2797004_0    conda-forge
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
zlib                      1.2.13               hd590300_5    conda-forge

any hope would be appreciated!

best,
stas

When I try to download the paper "Linear Time Maximally Stable Extremal Regions", sopaper ends up downloading "Efficient Maximally Stable Extremal Region (MSER) Tracking" instead, because the link to the original paper, "http://glorfindel.mavrinac.com/~aaron/school/pdf/nister08_ltmser.pdf", is unavailable.

A stricter rule for matching the found title against the requested one might help.
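A sketch of what such a stricter check could look like: compare the requested and found titles by word overlap (Jaccard similarity) and reject results below a threshold. The function names and the 0.8 threshold are assumptions for illustration, not sopaper's current behavior.

```python
import re


def title_tokens(title):
    """Lowercase the title and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def title_similarity(a, b):
    """Jaccard similarity between the word sets of two titles."""
    ta, tb = title_tokens(a), title_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_same_paper(requested, found, threshold=0.8):
    return title_similarity(requested, found) >= threshold


requested = "Linear Time Maximally Stable Extremal Regions"
found = "Efficient Maximally Stable Extremal Region (MSER) Tracking"
print(title_similarity(requested, found), is_same_paper(requested, found))
```

For the two titles from this report, only three of ten distinct words are shared, so the mismatched paper would be rejected.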