michelecotrufo / pdf2doi

A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.

Python 98.01% Batchfile 1.99%
doi python pdf bibtex arxiv identifiers arxiv-identifiers bibtex-entry extract-doi extract

pdf2doi's Introduction

pdf2doi

pdf2doi is a Python library/command-line tool that automatically extracts the DOI or other identifiers (e.g. arXiv ID) from the .pdf file of a publication (or from a folder containing several .pdf files) and retrieves bibliographic information. It uses several methods (described in detail below) to find a valid identifier for a pdf file, and it validates any result via web queries to public archives (e.g. http://dx.doi.org). The validation process also returns raw BibTeX data, which can be used for further processing, such as generating BibTeX entries (pdf2bib) or automatically renaming pdf files (pdf-renamer).

pdf2doi can be used either from command line, or inside your python script or, only for Windows, directly from the right-click context menu of a pdf file or a folder.

Latest stable version

The latest stable version of pdf2doi is 1.5.1. See here for the full change log.

[v1.5.1] - 2022-12-31

Main changes

  • The library textract has been removed from the required dependencies because it often creates problems during installation (due to conflicts between library versions), and because it generally requires installing many other dependencies which are not needed by pdf2doi. The user can still decide to install textract==1.6.4 if desired. pdf2doi will use textract only if it is installed.
  • pdf2doi now stores any found identifier into a tag called /pdf2doi_identifier (previously /identifier).

Added

  • The library pdfminer is now directly used by pdf2doi to extract the text from a pdf file (instead of doing it indirectly via textract)
  • An additional method to find the title of a pdf file, based on the library pymupdf, has been added.
  • [Issue #21]: When an arXiv ID is found, a corresponding DOI is also returned when available. This could be either the standard arXiv DOI (see also here), or the DOI of the corresponding journal publication. This behavior can be disabled by adding the optional command -no_arxiv2doi to the pdf2doi invocation.
  • [Issue #22]: The function get_pdf_text (finders.py) has been modified to allow the library PyPDF2 to extract also the text of any annotation/comment present in the pdf file.

Fixed

  • Potential titles of the papers were often not correctly found, because the function find_possible_titles() (finders.py) would mistakenly disregard all the results if one of the three methods (pdftitle, PyPDF2, filename) generated an error.
  • Fixed bug in the function add_metadata() (finders.py). In previous versions, some of the pre-existing metadata were not preserved when a new one was added (Commit).

Installation

Use the package manager pip to install pdf2doi.

pip install pdf2doi==1.5.1

The library textract provides additional ways to analyze pdf files, and it is sometimes more powerful than PyPDF2, but it comes with a large overhead of additional required dependencies, and sometimes it generates version conflicts. The user can decide whether to install it or not. pdf2doi will only try to use this library if it detects that it is installed. To install it,

pip install textract==1.6.4
pip install pdfminer.six==20191110

Under Windows, after installation of pdf2doi it is also possible to add shortcuts to the right-click context menu.

Used by

Here is a list of applications/repositories that make use of pdf2doi. If you use pdf2doi in your application and you wish to add it to this list, send me a message.

Description

Automatically associating a DOI or other identifiers (e.g. arXiv ID) to a pdf file can be either a very easy or a very difficult (sometimes nearly impossible) task, depending on how much care was placed in crafting the file. In the simplest case (which typically works with most recent publications) it is enough to look into the file metadata. For older publications, the identifier is often found within the pdf text and it can be extracted with the help of regular expressions. In the unluckiest cases, the only method left is to google some details of the publication (e.g. the title or parts of the text) and hope that a valid identifier is contained in one of the first results.

pdf2doi applies sequentially all these methods (starting from the simplest ones) until a valid identifier is found and validated. Specifically, for a given .pdf file it will, in order,

  1. Look into the metadata of the .pdf file (extracted via the library PyPDF2) and check if any of them contains a string that matches the pattern of a DOI or an arXiv ID. Priority is given to metadata which contain the word 'doi' in their label.

  2. Check if the name of the pdf file contains any sub-string that matches the pattern of a DOI or an arXiv ID.

  3. Scan the text inside the .pdf file, and check for any string that matches the pattern of a DOI or an arXiv ID. The text is extracted with the libraries PyPDF2 and pdfminer. If the library textract is installed, pdf2doi will try to use that too.

  4. Try to find possible titles of the publication. In the current version, possible titles are identified via the libraries pdftitle and PyMuPDF, and by the file name. For each possible title a google search is performed and the plain text of the first results is scanned for valid identifiers.

  5. As a last desperate attempt, the first N=1000 characters of the pdf text are used as a query for a google search. The plain text of the first results is scanned for valid identifiers.

Any time that a potential identifier is found, it is also validated by performing a query to a relevant website (e.g., http://dx.doi.org for DOIs and http://export.arxiv.org for arxiv IDs). This validation process also returns raw BibTeX info when the identifier is valid.
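The scan-and-validate step can be sketched as follows. This is a minimal illustration, not pdf2doi's actual implementation: the regular expression `DOI_PATTERN`, the function `find_first_valid_doi`, and the stand-in validator `fake_validate` are assumptions made for this example (a real validator would query dx.doi.org).

```python
import re

# Illustrative pattern only: "10.", a 4-9 digit registrant code, "/", then a suffix.
# The patterns actually used by pdf2doi may differ.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')


def find_first_valid_doi(text, validate):
    """Return the first DOI-like string in `text` accepted by `validate`, else None."""
    for match in DOI_PATTERN.finditer(text):
        # Drop sentence punctuation that the greedy suffix may have swallowed.
        candidate = match.group(0).rstrip('.,;')
        if validate(candidate):
            return candidate
    return None


def fake_validate(doi):
    # Stand-in for a web query: pretend that only this DOI resolves.
    return doi == '10.1063/1.2409490'


sample = 'See 10.9999/not-real. Published as doi:10.1063/1.2409490.'
print(find_first_valid_doi(sample, fake_validate))  # prints 10.1063/1.2409490
```

Because candidates are tried in the order in which they appear, the first one that validates wins, even if it belongs to a different paper cited earlier in the text.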

When a valid identifier is found with any method other than the first one, the identifier is also stored inside the metadata of the pdf file. In this way, future lookups of this same file will be able to extract the identifier with the first method, speeding up the search. This feature can be disabled by the user, in case edits to the pdf file are not desired.

The library is far from perfect. Often, especially for old publications, none of the currently implemented methods will work. Other times the wrong DOI might be extracted: this can happen, for example, if the DOI of another paper is present in the pdf text and it appears before the correct DOI. A quick and dirty solution to this problem is to look up the identifier manually and then add it to the metadata of the file, with the methods shown here (from python console) or here (from command line). In this way, pdf2doi will always retrieve the correct DOI when analyzing this same file in the future, which can be useful when pdf2doi is used to automate bibliographic procedures for a large number of files (e.g. via pdf2bib or pdf-renamer).

Currently, only the format of arXiv identifiers in use after 1 April 2007 is supported.
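For reference, the identifier scheme in use since 1 April 2007 has the form YYMM.NNNN (five digits after the dot from 2015 onward), optionally followed by a version suffix such as v2. The sketch below shows one way to match it; the pattern `ARXIV_NEW_STYLE` and the helper `extract_arxiv_ids` are assumptions made for illustration and are not taken from pdf2doi's source.

```python
import re

# New-style arXiv identifiers: four digits (YYMM), a dot, four or five digits,
# and an optional version suffix. Old-style IDs such as hep-th/9901001 do not match.
ARXIV_NEW_STYLE = re.compile(r'\b(\d{4}\.\d{4,5})(v\d+)?\b')


def extract_arxiv_ids(text):
    """Return all new-style arXiv IDs found in `text`, keeping any version suffix."""
    return [m.group(1) + (m.group(2) or '') for m in ARXIV_NEW_STYLE.finditer(text)]


print(extract_arxiv_ids('arXiv:2202.01037v2 and hep-th/9901001'))  # prints ['2202.01037v2']
```

A pattern this loose can also match unrelated numbers of the same shape, which is one reason why every candidate still needs to be validated online.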

Usage

pdf2doi can be used either as a stand-alone application invoked from the command line, or by importing it in your python project or, only for Windows, directly from the right-click context menu of a pdf file or a folder.

Command line usage

pdf2doi can be invoked directly from the command line, without having to open a python console. The simplest command-line invocation is

$ pdf2doi 'path/to/target'

where target is either a valid pdf file or a directory containing pdf files. Adding the optional command '-v' increases the output verbosity, documenting all steps. For example, when targeting the folder examples we get the following output

$ pdf2doi ".\examples" -v
[pdf2doi]: Looking for pdf files in the folder ....
[pdf2doi]: Found 4 pdf files.
[pdf2doi]: ................
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: .\examples\chaumet_JAP_07.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Validating the possible DOI 10.1063/1.2409490 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1063/1.2409490 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document text.
[pdf2doi]: Trying to add the tag '/pdf2doi_identifier'-> '10.1063/1.2409490' into the metadata of the file '.\chaumet_JAP_07.pdf'...
[pdf2doi]: The tag '/pdf2doi_identifier'-> '10.1063/1.2409490' was added succesfully to the metadata of the file '.\chaumet_JAP_07.pdf'...
[pdf2doi]: 10.1063/1.2409490
[pdf2doi]: ................
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: .\examples\paper12.2009_unknown_040916_440842.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by PyPdf.
[pdf2doi]: Extracting text with the library pdfminer...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Validating the possible DOI 10.1037/a0015278 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1037/a0015278 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document text.
[pdf2doi]: Trying to add the tag '/pdf2doi_identifier'-> '10.1037/a0015278' into the metadata of the file '.\paper12.2009_unknown_040916_440842.pdf'...
[pdf2doi]: The tag '/pdf2doi_identifier'-> '10.1037/a0015278' was added succesfully to the metadata of the file '.\paper12.2009_unknown_040916_440842.pdf'...
[pdf2doi]: 10.1037/a0015278
[pdf2doi]: ................
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: .\examples\PhysRevLett.116.061102.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Standardised DOI: 10.1103/PhysRevLett.116.061102 -> 10.1103/physrevlett.116.061102
[pdf2doi]: Validating the possible DOI 10.1103/physrevlett.116.061102 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1103/physrevlett.116.061102 is validated by dx.doi.org.
[pdf2doi]: Standardised DOI: 10.1103/PhysRevLett.116.061102 -> 10.1103/physrevlett.116.061102
[pdf2doi]: A valid DOI was found in the document text.
[pdf2doi]: Trying to add the tag '/pdf2doi_identifier'-> '10.1103/physrevlett.116.061102' into the metadata of the file '.\PhysRevLett.116.061102.pdf'...
[pdf2doi]: The tag '/pdf2doi_identifier'-> '10.1103/physrevlett.116.061102' was added succesfully to the metadata of the file '.\PhysRevLett.116.061102.pdf'...
[pdf2doi]: 10.1103/physrevlett.116.061102
[pdf2doi]: ................
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: .\examples\s41586-019-1666-5.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Validating the possible DOI 10.1038/s41586-019-1666-5 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1038/s41586-019-1666-5 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document text.
[pdf2doi]: Trying to add the tag '/pdf2doi_identifier'-> '10.1038/s41586-019-1666-5' into the metadata of the file '.\s41586-019-1666-5.pdf'...
[pdf2doi]: The tag '/pdf2doi_identifier'-> '10.1038/s41586-019-1666-5' was added succesfully to the metadata of the file '.\s41586-019-1666-5.pdf'...
[pdf2doi]: 10.1038/s41586-019-1666-5
[pdf2doi]: ................
DOI             10.1063/1.2409490                        .\chaumet_JAP_07.pdf

DOI             10.1037/a0015278                         .\paper12.2009_unknown_040916_440842.pdf

DOI             10.1103/physrevlett.116.061102           .\PhysRevLett.116.061102.pdf

DOI             10.1038/s41586-019-1666-5                .\s41586-019-1666-5.pdf

Every line which begins with [pdf2doi] is omitted when the optional command '-v' is absent. In the final output, the first column specifies the kind of identifier (currently either 'DOI' or 'arxiv'), the second column contains the found DOI/identifier, and the third column contains the file path.
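If the final table needs to be consumed by another script, it can be parsed by splitting each line on whitespace at most twice, so that file paths containing spaces stay intact. This is only a sketch: the helper `parse_table` is hypothetical and not part of pdf2doi, and it assumes the three-column layout described above.

```python
def parse_table(output):
    """Parse the three-column table into (identifier_type, identifier, path) tuples."""
    rows = []
    for line in output.splitlines():
        # Skip blank lines and the verbose log lines.
        if not line.strip() or line.startswith('[pdf2doi]'):
            continue
        parts = line.split(None, 2)  # at most 3 fields; the path may contain spaces
        if len(parts) == 3:
            rows.append((parts[0], parts[1], parts[2].strip()))
    return rows


SAMPLE_OUTPUT = r"""[pdf2doi]: ................
DOI             10.1063/1.2409490                        .\chaumet_JAP_07.pdf

DOI             10.1037/a0015278                         .\paper12.2009_unknown_040916_440842.pdf
"""

for row in parse_table(SAMPLE_OUTPUT):
    print(row)
```

When working from Python, using the library interface directly is usually simpler than parsing console output.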

A list of all optional arguments can be generated by pdf2doi --h

$ pdf2doi --h
usage: pdf2doi [-h] [-v] [-nws] [-nwv] [-nostore] [-no_arxiv2doi] [-id IDENTIFIER] [-google GOOGLE_RESULTS] [-s FILENAME_IDENTIFIERS] [-clip] [-install--right--click] [-uninstall--right--click] [path ...]

Retrieves the DOI or other identifiers (e.g. arXiv) from pdf files of a publications.

positional arguments:
  path                  Relative path of the target pdf file or of the targe folder.

options:
  -h, --help            show this help message and exit
  -v, --verbose         Increase verbosity. By default (i.e. when not using -v), only a table with the found identifiers will be printed as output.
  -nws, --no_web_search
                        Disable any method to find identifiers which requires internet searches (e.g. queries to google).
  -nwv, --no_web_validation
                        Disable the online validation of identifiers (e.g., via queries to http://dx.doi.org/).
  -nostore, --no_store_identifier_metadata
                        By default, anytime an identifier is found it is added to the metadata of the pdf file (if not present yet). By using this additional option, the identifier is not stored in the file
                        metadata.
  -no_arxiv2doi         If a valid arXiv ID is found for a given .pdf file, by default pdf2doi will try to also look for a DOI (either because the paper has been published in a journal or because arXiv has
                        assigned to it a DOI of the form "10.48550/arXivID"). By adding this command, the arXiv ID is instead always returned.
  -id IDENTIFIER        Stores the string IDENTIFIER in the metadata of the target pdf file, with key '/pdf2doi_identifier'. Note: when this argument is passed, all other arguments (except for the path to the
                        pdf file) are ignored.
  -google GOOGLE_RESULTS
                        Set how many results should be considered when doing a google search for the DOI (default=6).
  -s FILENAME_IDENTIFIERS, --save_identifiers_file FILENAME_IDENTIFIERS
                        Save all the identifiers found in the target folder in a text file inside the same folder with name specified by FILENAME_IDENTIFIERS. This option is only available when a folder is
                        targeted.
  -clip, --save_doi_clipboard
                        Store all found DOI/identifiers into the clipboard.
  -install--right--click
                        Add a shortcut to pdf2doi in the right-click context menu of Windows. You can copy the identifier and/or bibtex entry of a pdf file (or all pdf files in a folder) into the clipboard by
                        just right clicking on it! NOTE: this feature is only available on Windows.
  -uninstall--right--click
                        Uninstall the right-click context menu functionalities. NOTE: this feature is only available on Windows.

Manually associate the correct identifier to a file from command line

Sometimes it is not possible to retrieve a DOI/identifier automatically, or maybe the one that is retrieved is not the correct one. In these (hopefully rare) occasions it is possible to manually add the correct DOI/identifier to the pdf metadata, by using the -id argument,

$ pdf2doi "path\to\pdf" -id "doi1234"

This creates a new metadata in the pdf file with label '/pdf2doi_identifier' and containing the string doi1234. Future lookups of this same file via pdf2doi (in particular when used by other tools such as pdf2bib, https://github.com/MicheleCotrufo/pdf2bib, or pdf-renamer, https://github.com/MicheleCotrufo/pdf-renamer) will then return the correct identifier and BibTeX infos.

Usage inside a python script

pdf2doi can also be used as a library within a python script. The function pdf2doi.pdf2doi is the main point of entry. It looks for the identifier of a pdf file by applying all the available methods. The first input argument must be a valid path (either absolute or relative) to a pdf file or to a folder containing pdf files. The same optional arguments available in the command line operation are available via the methods set and get of the object pdf2doi.config. For example, we can scan the folder examples while suppressing output verbosity by

>>> import pdf2doi
>>> pdf2doi.config.set('verbose',False)
>>> results = pdf2doi.pdf2doi(r'.\examples')

A full list of the library settings can be printed by the method pdf2doi.config.print()

>>> import pdf2doi
>>> pdf2doi.config.print()
verbose : True (bool)
separator : \ (str)
method_dxdoiorg : application/citeproc+json (str)
webvalidation : True (bool)
websearch : True (bool)
numb_results_google_search : 6 (int)
N_characters_in_pdf : 1000 (int)
save_identifier_metadata : True (bool)
replace_arxivID_by_DOI_when_available : True (bool)

The output of the function pdf2doi is a list of dictionaries (or just a single dictionary if a single file was targeted). Each dictionary has the following keys

result['identifier'] = DOI or other identifier (or None if nothing is found)
result['identifier_type'] = string specifying the type of identifier (e.g. 'doi' or 'arxiv')
result['validation_info'] = Additional info on the paper. If config.get('webvalidation') = True, then result['validation_info']
                            will typically contain raw bibtex data for this paper. Otherwise it will just contain True 
result['path'] = path of the pdf file
result['method'] = method used to find the identifier

For example, the DOIs/identifiers of each file can be printed by

>>> for result in results:
...     print(result['identifier'])
...
10.1016/0021-9991(86)90093-8
10.1063/1.2409490
10.1103/PhysRevLett.116.061102
10.1038/s41586-019-1666-5
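Since the return value is either a list of dictionaries or a single dictionary, downstream code may want to normalize it first. The sketch below does that and then separates files with and without an identifier. The helper `split_results` and the sample data are made up for illustration; only the dictionary keys come from the description above.

```python
def split_results(results):
    """Return ({path: identifier}, [paths without identifier]) from pdf2doi-style results."""
    if isinstance(results, dict):  # a single file was targeted
        results = [results]
    found = {r['path']: r['identifier'] for r in results if r['identifier']}
    missing = [r['path'] for r in results if not r['identifier']]
    return found, missing


# Sample data mimicking the documented keys (the second entry is hypothetical).
sample_results = [
    {'identifier': '10.1063/1.2409490', 'identifier_type': 'DOI',
     'path': 'examples/chaumet_JAP_07.pdf'},
    {'identifier': None, 'identifier_type': None,
     'path': 'examples/scanned_notes.pdf'},
]

found, missing = split_results(sample_results)
print(found)
print(missing)
```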

By default, every time a valid DOI/identifier is found, it is stored in the metadata of the pdf file. In this way, subsequent lookups of the same folder/file will be much faster. This behaviour can be disabled (e.g. if the user does not want to or cannot edit the files) by setting save_identifier_metadata to False, via

>>> pdf2doi.config.set('save_identifier_metadata',False)

Manually associate the correct identifier to a file

As described above for the command line, it is also possible to associate a (manually found) identifier to a pdf file from within python, by using the function pdf2doi.add_found_identifier_to_metadata:

>>> import pdf2doi
>>> pdf2doi.add_found_identifier_to_metadata(path_to_pdf_file, identifier)

This creates a new metadata entry in the pdf file with label '/pdf2doi_identifier', containing the string identifier.

Installing the shortcuts in the right-click context menu of Windows

This functionality is only available on Windows (and so far it has been tested only on Windows 10). It adds additional commands to the context menu of Windows which appears when right-clicking on a pdf file or on a folder.

The different menu commands allow copying the identifier(s) of the paper(s) into the system clipboard, or manually setting the identifier of a pdf file (see also here).

To install this functionality, first install pdf2doi via pip (as described above), then open a command prompt with administrator rights and execute

$ pdf2doi  -install--right--click

To remove it, simply run (again from a terminal with administrator rights)

$ pdf2doi  -uninstall--right--click

If it is not possible to run this command from a terminal with administrator rights, the batch files here can be used instead (see the readme.MD file in the same folder for instructions), although administrator rights are still required.

NOTE: when multiple pdf files are selected, and the right-click context menu commands are used, pdf2doi will be called separately for each file, and thus only the info of the last file will be stored in the clipboard. In order to copy the info of multiple files it is necessary to save them in a folder and right-click on the folder.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Acknowledgment

I am thankful to my friend and colleague Yarden Mazor for leading the beta-testing efforts for this project.

Donating

If you find this library useful (or amazing!), please consider making donations on my behalf to organizations that advocate for and promote free dissemination of science, such as

arXiv

Sci-Hub

Wikipedia

License

MIT

pdf2doi's People

Contributors

djrhails, duzabf, michelecotrufo

pdf2doi's Issues

Program returns error on encrypted files, would prefer if it skipped them.

A file in the list wasn't decrypted, and so it returned this error. Ideally, it should log a warning that it's encrypted, and then skip over it.

[pdf2doi]: Trying to retrieve a DOI/identifier for the file: ...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
Traceback (most recent call last):
  File "/usr/local/bin/pdf2doi", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 410, in main
    results = pdf2doi(target=target,
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 112, in pdf2doi
    result = pdf2doi( target=file, verbose=verbose, websearch=websearch, webvalidation=webvalidation,
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 147, in pdf2doi
    result = pdf2doi_singlefile(filename)
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 190, in pdf2doi_singlefile
    result = finders.find_identifier(filename,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 487, in find_identifier
    identifier, desc, info = finder_methodsmethod
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 587, in find_identifier_in_pdf_info
    pdfinfo = get_pdf_info(path)
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/finders.py", line 275, in get_pdf_info
    info = pdf.getDocumentInfo()
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1101, in getDocumentInfo
    obj = self.trailer['/Info']
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 516, in __getitem__
    return dict.__getitem__(self, key).getObject()
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/generic.py", line 178, in getObject
    return self.pdf.getObject(self).getObject()
  File "/usr/local/lib/python3.9/site-packages/PyPDF2/pdf.py", line 1617, in getObject
    raise utils.PdfReadError("file has not been decrypted")
PyPDF2.utils.PdfReadError: file has not been decrypted
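The requested behaviour (warn and move on) amounts to catching the error per file inside the loop over files. The following is a minimal sketch of that pattern, not a patch for pdf2doi: `EncryptedPdfError`, `process_all`, and `fake_process` are hypothetical names, and a real fix would catch the PyPDF2 error shown in the traceback instead.

```python
import logging

logger = logging.getLogger('pdf2doi-sketch')


class EncryptedPdfError(Exception):
    """Hypothetical stand-in for the reader's 'file has not been decrypted' error."""


def process_all(paths, process_one):
    """Apply `process_one` to every path, logging and skipping encrypted files."""
    results = []
    for path in paths:
        try:
            results.append(process_one(path))
        except EncryptedPdfError:
            logger.warning('Skipping encrypted file: %s', path)
    return results


def fake_process(path):
    # Stand-in for the per-file lookup: files named "locked" are treated as encrypted.
    if 'locked' in path:
        raise EncryptedPdfError(path)
    return {'path': path, 'identifier': None}


results = process_all(['a.pdf', 'locked.pdf', 'b.pdf'], fake_process)
print([r['path'] for r in results])  # prints ['a.pdf', 'b.pdf']
```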

All arXiv articles now have DOIs

Apparently the arXiv blog has announced that as of Feb 2022 all arxiv articles have DOIs.

Furthermore, the DOIs share a unified prefix and use the arXiv ID as a suffix:

"An author can determine their article’s DOI by using the DOI prefix https://doi.org/10.48550/ followed by the arXiv ID (replacing the colon with a period). For example, the arXiv ID arXiv:2202.01037 will translate to the DOI link https://doi.org/10.48550/arXiv.2202.01037"

Perhaps this could be grounds for a 2.0 release that returns DOIs for arXiv articles.
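The rule quoted above is simple enough to express in a few lines. The function name `arxiv_id_to_doi` is an assumption made for this sketch; it only builds the DOI string and does not check that the DOI actually resolves, and it does not treat version suffixes specially.

```python
def arxiv_id_to_doi(arxiv_id):
    """Build the arXiv-issued DOI for an ID given as 'arXiv:2202.01037' or '2202.01037'."""
    bare = arxiv_id.strip()
    if bare.lower().startswith('arxiv:'):
        bare = bare[len('arxiv:'):]
    # The prefix 10.48550 is followed by "arXiv." and the bare identifier.
    return f'10.48550/arXiv.{bare}'


print(arxiv_id_to_doi('arXiv:2202.01037'))  # prints 10.48550/arXiv.2202.01037
```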

Possible optimization for main.py

Currently, lines 425 to 428 in main read:

    for result in results:
        if result['identifier']:
            print('{:<15s} {:<40s} {:<10s}\n'.format(result['identifier_type'], result['identifier'],result['path']) ) 

    return

I suspect this could be improved/updated with an f-string:

    for result in results:
        if result['identifier']:
            print(f"{result['identifier_type']}, {result['identifier']}, {result['path']} \n")

    return
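Two details matter when converting this call to an f-string. Before Python 3.12, the quotes used inside the braces must differ from the quotes delimiting the f-string, and the column widths are only preserved if the format specifications are carried over. A small sketch with a made-up `result` dictionary:

```python
result = {
    'identifier_type': 'DOI',
    'identifier': '10.1063/1.2409490',
    'path': 'chaumet_JAP_07.pdf',
}

# Original call, using str.format with left-aligned fixed-width columns.
old_style = '{:<15s} {:<40s} {:<10s}\n'.format(
    result['identifier_type'], result['identifier'], result['path'])

# Equivalent f-string: double quotes outside, single quotes inside,
# and the same format specifications after each colon.
new_style = f"{result['identifier_type']:<15s} {result['identifier']:<40s} {result['path']:<10s}\n"

print(old_style == new_style)  # prints True
```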

Not a bug, just results have 'false-positives'? (Test scenario)

Hello,
this is a very nifty tool, but I get the following results on a test case I set up containing a bundle of random PDFs from my library.

[pdf2doi]: ................
DOI             10.1109/MS.2018.2141038                  ./[email protected]

DOI             10.1145/3341227                          ./10.1145@3341227 MUST and MUST NOT.pdf

DOI             10.1145/38807.38824                      ./120158- Use Case Template-20160821_0954877.pdf

DOI             10.1016/j.jss.2016.02.047                ./120216- Software Requirements Specification Template-20160821_0951179.pdf

DOI             10.1007/978-3-319-09816-6                ./2014_Book_Autonomy Requirements Engineering for Space Missions NASA Springer.pdf

DOI             10.1007/978-1-4614-5377-2                ./293233main_62651main_1_pmchallenge_hraster.pdf

The first result is pretty cool: it was extracted from the file name. The 1st, 2nd and 5th are correct; the rest are wrong. The last one in particular is close to the target, but I have yet to understand how. The file is a presentation that does not mention the extracted DOI but has similar contents.

Sincerely

Proxy?

Is it possible to add Proxy functionality? I am blocked from using the web requests without it. The code itself looks great.

TypeError: 'NoneType' object is not iterable

There appears to be a type error in "finders.py" that only emerges on certain PDF files. This one, for example:
paper12.2009_unknown_040916_440842.pdf

A minimal code snippet for reproducing this error:

from pathlib import Path
import pdf2doi

pdf2doi.config.set("verbose", False)
PDF_name = "paper12.2009_unknown_040916_440842.pdf"
results = pdf2doi.pdf2doi(str(Path("examples", PDF_name)))

Where the PDF is placed in the example folder.

Here is the error message:

Traceback (most recent call last):
  File "/Users/donyin/Desktop/pdf2doi-master/main.py", line 15, in <module>
    results = pdf2doi.pdf2doi(str(Path("examples", i)))
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 90, in pdf2doi
    result = pdf2doi_singlefile(filename)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 134, in pdf2doi_singlefile
    result = finders.find_identifier(file,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 548, in find_identifier
    identifier, desc, info = finder_methods[method](file,func_validate,**kwargs)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 586, in find_identifier_in_pdf_info
    identifier,desc,info = find_identifier_in_text(pdfinfo[key],func_validate)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 286, in find_identifier_in_text
    for identifier in identifiers:
TypeError: 'NoneType' object is not iterable

I thought I fixed this error by adding:

if identifiers is None:
     identifiers = []

at line 286 of your "finders.py", so that it becomes:

        #First we look for DOI
        for v in range(len(doi_regexp)):
            identifiers = extract_doi_from_text(text,version=v)
            if identifiers is None: # <- here
                identifiers = [] # <- here
            for identifier in identifiers:
                validation = func_validate(identifier,'doi')
                if validation: 
                    return identifier, 'DOI', validation

But this was a bit hacky and not the proper solution. You'd undoubtedly know more about what's going on, so I thought I'd let you know about this.
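A slightly more compact form of the same guard is to iterate over `identifiers or []`. The sketch below shows the idea in isolation; `extract_doi_stub` and `first_valid` are hypothetical stand-ins, not functions from pdf2doi. A cleaner long-term option would be for the extractor itself to always return a list.

```python
def extract_doi_stub(text):
    # Hypothetical stand-in for an extractor that returns None when nothing matches.
    return ['10.1037/a0015278'] if 'doi' in text else None


def first_valid(text, validate):
    """Return the first extracted identifier accepted by `validate`, else None."""
    # `or []` turns a None result into an empty list, so the loop simply does not run.
    for identifier in extract_doi_stub(text) or []:
        if validate(identifier):
            return identifier
    return None


print(first_valid('nothing to see', lambda d: True))  # prints None
print(first_valid('doi inside', lambda d: True))      # prints 10.1037/a0015278
```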

And by the way, there is some deprecated syntax that you might want to address:

UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]

cheers,
Don

doi2bib function

Hi, Michele,

Nice work! I am trying to extract the BibTeX strings into a .txt file, but I noticed that you have removed the bibTex_makers module from v1.1. Could you suggest how I can achieve this with the current version?

Thanks!

File not closed

Hi,

The function pdf2doi_singlefile does not close the opened pdf file. The close file statement is not executed due to return statements on successful identifier finding.

pdf2doi/pdf2doi/main.py

Lines 161 to 165 in d8e7117

if result['identifier']:
    return result
if flag_closefile:
    file.close()

This causes the issue with pdf-renamer on Windows: the renaming attempt results in an access error because the file is still open in the script itself.

I'll make a PR shortly to fix this.
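One common way to guarantee the close regardless of early returns is a try/finally block (or a with statement). The sketch below is an assumption about how such a fix could look, not the actual patch; `find_identifier` and the finder passed to it are placeholders.

```python
import io


def find_identifier(file, finders, close_file=True):
    """Run finders in order on an open file and close the file on every exit path."""
    try:
        for finder in finders:
            result = finder(file)
            if result['identifier']:
                return result  # early return: the finally block still runs
        return {'identifier': None}
    finally:
        if close_file:
            file.close()


# In-memory stand-in for an opened pdf file.
stream = io.BytesIO(b'%PDF-1.4 placeholder content')
result = find_identifier(stream, [lambda f: {'identifier': '10.1037/a0015278'}])
print(result['identifier'], stream.closed)  # prints 10.1037/a0015278 True
```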

Add file DOI check to URL paths

Often a search can surface DOI descriptors in the URL path alone, for instance:

[pdf2doi]: Performing google search with key "The Experimental Generation of Interpersonal Closeness: A Procedure and Some Preliminary Findings"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://journals.sagepub.com/doi/abs/10.1177/0146167297234003

Supporting this would give quicker identifications, but also allow for occasions, such as this, where the DOI can't be extracted from the actual page.

https://doi.org/10.1177/0146167297234003
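Checking the URL itself before downloading the page could look roughly like this. The pattern `DOI_IN_PATH` and the function `doi_from_url` are assumptions made for this sketch; any candidate found this way would still need to be validated, since some URLs append extra path segments after the DOI.

```python
import re
from urllib.parse import unquote, urlparse

# Illustrative DOI pattern applied to the (percent-decoded) URL path only.
DOI_IN_PATH = re.compile(r'10\.\d{4,9}/[^\s?#]+')


def doi_from_url(url):
    """Return a DOI-like substring of the URL path, or None if there is none."""
    path = unquote(urlparse(url).path)
    match = DOI_IN_PATH.search(path)
    return match.group(0).rstrip('/') if match else None


print(doi_from_url('https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003'))
# prints 10.1177/0146167297234003
```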

Not importing for Python 3.10

This is likely known/to be expected, but upon upgrading python to 3.10, pdf2doi no longer imports for VSCode on Mac.

Export/Save to CSV? Import From CSV?

This may be a bit involved, but my hunch is that a factory pattern could be used to allow importing info either from a directory or from a row in a CSV file.

Similarly, it would be great if the results could be exported to a CSV file rather than only to the console. I have done this in my own program. As I've mentioned to you already, however, it has had some problems (namely, Pandas isn't happy with some of the keys I've given it).
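As a possible starting point for the export half, here is a minimal sketch that avoids Pandas and uses the standard csv module instead. It assumes the results are available as a list of plain dictionaries; the function name results_to_csv is hypothetical, and rows with different keys are handled by taking the union of all keys.

```python
import csv


def results_to_csv(results, csv_path):
    """Write a list of dicts to a CSV file; rows may have different keys."""
    # Build the header from the union of keys, in first-seen order.
    fieldnames = []
    for row in results:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        for row in results:
            # Stringify values so nested or None entries do not break the writer.
            writer.writerow({k: "" if v is None else str(v) for k, v in row.items()})
    return fieldnames
```

Missing keys are written as empty cells, so the file stays rectangular even when some papers have no identifier or no validation info.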

Option for disabling document text method

First, thanks for this very helpful library!

For many of the papers I read, your algorithm works fine and finds the correct DOI.
However, as you already mention in the README, for some papers the document_text method returns a wrong DOI, because the DOIs of other papers appear first in the text.
Unfortunately, this is very often the case for papers from certain conferences I read regularly, since they contain arXiv IDs in the references and do not contain their own DOI anywhere else in the text. At the same time, when I comment out the document_text method, I get pretty good results with the fourth method.
I am wondering whether one of the following features might help to reduce this type of error:

  • only using the first pages to look for doi in text
  • having an option to disable certain steps in the search process
  • being able to customize the order of the search methods

Do you think one of these options (or something else) would benefit the library and could be implemented with reasonable effort? If so, I can see whether I can find the time to turn my current "comment-out workaround" into a mergeable feature.
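To make the second and third options concrete, here is a small sketch of a search loop whose steps can be disabled or reordered. Everything in it is hypothetical: run_finders, the finders dictionary and the method names do not mirror the library's internals, they only illustrate the idea.

```python
def run_finders(source, finders, order=None, disabled=()):
    """Run identifier finders in a configurable order, skipping disabled ones.

    `finders` maps a method name to a callable that takes `source` and
    returns an identifier string or None (all names here are hypothetical).
    """
    names = list(order) if order is not None else list(finders)
    for name in names:
        if name in disabled:
            continue
        identifier = finders[name](source)
        if identifier:
            # Stop at the first method that yields something.
            return {"identifier": identifier, "method": name}
    return {"identifier": None, "method": None}
```

A call such as run_finders(pdf, finders, disabled={"document_text"}) would then replace the comment-out workaround, and passing order=[...] would cover the custom ordering. Restricting the text search to the first pages would be a change inside the text finder itself.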

Running this on Mac Big Sur in VSC, v 0.5 returns this error. v 0.4 does not.

Traceback (most recent call last):
  File "/Users/johnfallot/venv/210706_PDN_ScienceAssistant_v16.py", line 3, in <module>
    from pdf2doi.finders import validate
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/__init__.py", line 13, in <module>
    from .main import pdf2doi
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/main.py", line 6, in <module>
    import pdf2doi.utils_registry as utils_registry
  File "/usr/local/lib/python3.9/site-packages/pdf2doi/utils_registry.py", line 5, in <module>
    import winreg
ModuleNotFoundError: No module named 'winreg'
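The traceback shows winreg being imported unconditionally, and that standard-library module exists only on Windows. A typical guard looks like the sketch below; the helper name context_menu_supported is made up for illustration, and this is not necessarily how the library resolved the problem.

```python
import sys

if sys.platform == "win32":
    import winreg  # standard-library module, available only on Windows
else:
    winreg = None  # registry features are simply unavailable elsewhere


def context_menu_supported():
    """Return True only where the Windows registry can be used."""
    return winreg is not None
```

Code that touches the registry can then check context_menu_supported() first, so that importing the package on macOS or Linux no longer fails.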

Clash with other pdf extractions libraries

I use several other pdf extraction tools, such as tabula, camelot and layout parser, and it seems that pdf2doi depends on an older version of pdfminer-six, which causes problems when it coexists with these libraries. When installing with pip in the same environment in which I use layoutparser and camelot, I get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pdfplumber 0.6.0 requires pdfminer.six==20211012, but you have pdfminer-six 20181108 which is incompatible.
google-api-core 1.31.5 requires six>=1.13.0, but you have six 1.12.0 which is incompatible.
camelot-py 0.10.1 requires pdfminer.six>=20200726, but you have pdfminer-six 20181108 which is incompatible.

Is there a workaround to this problem?

Pdf reading from file object rather than from path

Hello,

Amazing tool, I love it! Is there a way to feed pdf2doi a file object rather than an absolute path? I am asking because I am trying to modify an app deployed on Google Cloud services to incorporate pdf2doi, but I can't find a way that doesn't involve downloading the files to the local machine, which would be mildly inconvenient. The pdf files are stored on Google Cloud, and it would be more elegant to open them as file objects and manipulate them than to download them locally, run pdf2doi and re-upload the info.

Thank you very much for your work!
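Until file objects are supported, one possible workaround is to spill the bytes into a short-lived temporary file and hand its path to any path-based function. The sketch below is generic: run_on_pdf_bytes and process_path are hypothetical names, and in practice process_path would presumably be the library's path-based entry point.

```python
import os
import tempfile


def run_on_pdf_bytes(data, process_path):
    """Write `data` to a temporary .pdf file, call process_path(path), then clean up."""
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # The file is closed before processing, which matters on Windows.
        return process_path(tmp_path)
    finally:
        os.remove(tmp_path)
```

This still touches the local disk, but only for the duration of the call, so nothing has to be downloaded and managed by hand.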

[Suggestion] Look into PDF text-annotations for valid DOIs

First of all, thanks for the awesome tool! It saved me lots of time during my bibliography/SOTA runs, or by batch-renaming 100s of PDF files for easier indexing.

Now, to the point:

a) Some background: I disabled Google-searching (Methods #4 and #5) as they rarely worked on old/no-DOI papers in my field (I am an electromagnetics engineer, working with journals from IEEE, OSA/Optica, AIP, APS, etc.). It's faster for me to open the PDF file with Chrome, select the title, right-click it to google-search it and get the DOI. Now, to pass this DOI to PDF2DOI, I presently rename the file using the DOI as a name-string (replacing slashes with dashes), and then right-click it with PDF_renamer, done. So, it works with Method #2.

b) The Suggestion: I sometimes also copy the DOI (as URL or plain DOI, with slashes etc) into the top of the first page, for easier reference, as a text-annotation ("typewriter tool") or inside a bubble/note/comment annotation. Could PDF2DOI be made to look into these first-page annotations for the DOI, e.g., during Method#3? It would be really handy (for me)...

Thanks for your time!
