science-for-nature-and-people / bibscan
R package to batch download PDFs from a Web of Science search
Look into why the package downloads all papers in the exported Colandr sheet instead of restricting downloads to the papers selected through the Colandr screening process.
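A possible fix is to filter the Colandr sheet on its screening-status column before downloading. A minimal base-R sketch, assuming the column name `citation_screening_status` (it appears in the parsed column specification later in this thread) and assuming the value `"included"` marks selected records:

```r
# Hypothetical post-filter: keep only records Colandr marked as
# "included", so the downloader skips papers screened out earlier.
filter_screened <- function(colandr_sheet) {
  colandr_sheet[colandr_sheet$citation_screening_status == "included", ,
                drop = FALSE]
}

# Toy example:
sheet <- data.frame(
  citation_title = c("Paper A", "Paper B"),
  citation_screening_status = c("included", "excluded"),
  stringsAsFactors = FALSE
)
filter_screened(sheet)  # keeps only "Paper A"
```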
From @kanedan29: I get an error message that says `Error in dirname(outfilepath) : object 'outfilepath' not found`, even though my output folder definitely exists.
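The error means the variable `outfilepath` was never assigned inside the function, not that the folder is missing. A defensive sketch (the argument name and behavior here are assumptions, not BibScan's actual API):

```r
# Hypothetical guard: fail early with a clear message instead of the
# opaque "object 'outfilepath' not found" error, and create the
# output directory if it does not exist yet.
resolve_outfile <- function(outfilepath = NULL) {
  if (is.null(outfilepath)) {
    stop("`outfilepath` must be supplied; got NULL", call. = FALSE)
  }
  dir <- dirname(outfilepath)
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  normalizePath(outfilepath, mustWork = FALSE)
}
```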
Dillon was trying to use BibScan to access a bunch of papers and couldn't get a few that seemed like they should work. Attached is a .bib of the files that didn't download. I tried them on my machine and BibScan says they don't have DOIs, but when you look at the .bib file some of them clearly do. Does this work for you?
This function does a lot of things that should be delegated to subfunctions.
There are currently many different styles in the package. We should try to make it more homogeneous to help contributors.
From @kanedan29
I’m running into two problems with bibscan. 1.) when I load the library it’s not loading the dependencies, so I get error messages about specific functions.
Steve and I did not get the same number of downloads on Leslie's data.
Check what the discrepancies are and whether they are due to university subscriptions or something else.
See if we can rename the files with more explanatory names.
If the output directory is the same as the input directory, and the .bib file is in that directory, running the package will remove the .bib file
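A sketch of a cleanup guard for the issue above, assuming the cleanup step deletes .bib files from the output directory (the function name and arguments here are hypothetical):

```r
# Hypothetical cleanup guard: refuse to delete anything when the
# input and output directories resolve to the same path, so the
# user's source .bib file survives.
safe_cleanup <- function(indir, outdir) {
  if (normalizePath(indir) == normalizePath(outdir)) {
    warning("input and output directories are identical; skipping cleanup")
    return(invisible(FALSE))
  }
  bibs <- list.files(outdir, pattern = "\\.bib$", full.names = TRUE)
  if (length(bibs)) file.remove(bibs)
  invisible(TRUE)
}
```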
I got the below error when trying to run the main function. Attached are the files I used.
article_pdf_download(infilepath='~/Documents/Temporary/Lesley', colandr=screened_abstracts)
Converting your isi collection into a bibliographic dataframe
Articles extracted 100
Articles extracted 200
Articles extracted 300
Articles extracted 326
Done!
Genereting affiliation field tag AU_UN from C1: Done!
Converting your isi collection into a bibliographic dataframe
Articles extracted 43
Done!
Genereting affiliation field tag AU_UN from C1: Done!
Warning: 1 parsing failure.
# A tibble: 1 x 5
    row col   expected   actual    file
  <int> <chr> <chr>      <chr>     <chr>
1     5 NA    73 columns 9 columns literal data
Error in filter_impl(.data, quo) :
Evaluation error: object 'citation_screening_status' not found.
In addition: Warning messages:
1: In if (grepl("\n", x)) { :
the condition has length > 1 and only the first element will be used
2: In if (grepl("\n", path)) return(path) :
the condition has length > 1 and only the first element will be used
3: In if (grepl("\n", file)) { :
the condition has length > 1 and only the first element will be used
4: Missing column names filled in: 'X73' [73]
5: In if (grepl("\n", file)) { :
the condition has length > 1 and only the first element will be used
6: In if (grepl("\n", file)) { :
the condition has length > 1 and only the first element will be used
7: In rbind(names(probs), probs_f) :
number of columns of result is not a multiple of vector length (arg 2)
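The repeated "condition has length > 1" warnings above come from passing a vector to `if()`: `grepl()` returns one logical per element, while `if()` expects a single TRUE/FALSE and silently uses only the first. A sketch of the usual fix:

```r
# grepl() on a vector of paths returns a logical vector;
# wrapping it in any() restores the scalar condition if() needs.
paths <- c("a/b.pdf", "c\nd.pdf")
has_newline <- any(grepl("\n", paths, fixed = TRUE))
if (has_newline) {
  message("at least one path contains a newline")
}
```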
After manually loading the dependency packages, the following error appeared:
Error in select(., citation_title, citation_authors, citation_journal_name) :
could not find function "select"
Output prior to error:
Converting your isi collection into a bibliographic dataframe
Articles extracted 47
Done!
Genereting affiliation field tag AU_UN from C1: Done!
Parsed with column specification:
cols(
study_id = col_integer(),
deduplication_status = col_character(),
citation_screening_status = col_character(),
fulltext_screening_status = col_character(),
data_extraction_screening_status = col_character(),
data_source_type = col_character(),
data_source_name = col_character(),
data_source_url = col_character(),
citation_title = col_character(),
citation_abstract = col_character(),
citation_authors = col_character(),
citation_journal_name = col_character(),
citation_journal_volume = col_integer(),
citation_pub_year = col_integer(),
citation_keywords = col_character(),
fulltext_filename = col_character(),
fulltext_exclude_reasons = col_character()
)
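The `could not find function "select"` error suggests the package calls dplyr functions without importing them, so it only works when the user happens to have dplyr attached. A sketch of the fix using explicit namespacing (the wrapper function name is hypothetical; the column names follow the Colandr spec above):

```r
# Explicit namespacing (dplyr::select) avoids relying on the user
# having called library(dplyr) before running BibScan.
pick_citation_fields <- function(colandr_sheet) {
  dplyr::select(
    colandr_sheet,
    citation_title, citation_authors, citation_journal_name
  )
}
```

The same effect can be achieved package-wide by adding `importFrom(dplyr, select, filter)` to the NAMESPACE file.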
We need to add several checks to the package to make it more robust to user entries:
It looks like there are three more package dependencies when loading 'BibScan' - rvest, jsonlite, and xml2. I was getting error messages of 'function not found' for a few functions that I think are from those packages - "html_nodes" and "fromJSON" most notably.
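One way to address this is to declare these packages in the DESCRIPTION file so they are installed along with BibScan. A sketch of the relevant fragment (the exact list of required packages is an assumption based on this report):

```
Imports:
    dplyr,
    jsonlite,
    rvest,
    xml2
```

Packages listed under `Imports:` are installed automatically with the package, which avoids the "function not found" errors for `html_nodes` (rvest) and `fromJSON` (jsonlite).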
Add Travis CI to the package.
Seems due to how I set up the crminer cache. Find a better way to do this, or use a post-processing step as plan B. It seems not to disturb the download and file manipulation (at least on OSX/Unix).
With the crminer version of the code, it seems that articles from PLOS ONE are not accessed as PDFs but as HTML documents. Need to investigate why.
Not sure what all of the other dependencies are. But any idea why it takes so long to install?
One user of the BibScan library is asking for tips on how to improve the retrieval rate, so my task for today was to figure out why the retrieval rate was so low. First, I ran the code given to me and got the same number of successful PDF retrievals. Based on the error messages, it appears that the links don't work (I don't know if this is obvious or not, due to my lack of knowledge about this package).
To look into it more, I examined the first ten documents. One problem I noticed was that the documents from Elsevier and Wiley were not working. While trying to figure out why, I landed on this page: CrossRef/rest-api-doc#96. Also, the crminer package documentation says "At least Elsevier and I think Wiley also check your IP address in addition to requiring the authentication token", so maybe that's why these publishers aren't working. For SpringerLink, it says "Page Not Found". For the Cambridge website, it gives me the warning pop-up "Unfortunately you do not have access to this content, please use the Get access link below for information on how to access this content."
These are the websites/links from the first ten rows. Other than these error messages, I'm not really sure what else to look at, since I'm pretty new to how this code (especially crminer) works.