** fetching of reprint 28341702 failed from error Couldn't find a tree builder with th

Errors downloading articles about pubmed-batch-download HOT 18 CLOSED

sayak1711 commented on July 20, 2024

Errors downloading articles

from pubmed-batch-download.

Comments (18)

billgreenwald commented on July 20, 2024

I just ran

python fetch_pdfs.py -pmids 30374447

and it ran fine and downloaded. Could you let me know the following so I can help troubleshoot

The exact command you ran
If you installed the conda environment included in the repo
If you did not install the conda environment, your currently installed version of lxml.

Thanks! Also, if you have other errors, please throw them in this thread instead of making new threads for each one.


sayak1711 commented on July 20, 2024

I did not install conda.
lxml version is 4.2.5
The command I ran is the same as the one you have written above, except that after "-pmids" I gave many pmids at once, comma separated.
It doesn't fail for all. It succeeds in many. However for many it throws this error I mentioned.
For example for 28645740 it gives me that error.
When I was facing that error I changed the parser from lxml to html.parser in line 99 and still the error would be there for some pmids.
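
The "Couldn't find a tree builder" message is typically raised by BeautifulSoup when the requested parser (here lxml) is not installed in the active environment. A minimal sketch, assuming the script passes a parser name string to BeautifulSoup; `pick_parser` is a hypothetical helper and not part of pubmed-batch-download:

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that package is importable, otherwise `fallback`.

    Hypothetical helper: the returned string is meant to be passed as the
    parser argument of BeautifulSoup, e.g. BeautifulSoup(html, pick_parser()).
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    # html.parser ships with Python, so it is always available.
    return fallback


parser_name = pick_parser()
print(parser_name)
```

Checking for the package up front only covers the missing-parser case; it would not explain errors that persist after switching to html.parser.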


sayak1711 commented on July 20, 2024

Here are some other kinds of errors I am facing:

** fetching of reprint 28589772 failed from error Invalid URL '119162c4-9551-4055-99fe-48538e2570bc': No schema supplied. Perhaps you meant http://119162c4-9551-4055-99fe-48538e2570bc?

Trying to fetch pmid 28543980
** fetching of reprint 28543980 failed from error ('Connection aborted.', BadStatusLine("''",))
It's not as if they don't have the pdf for the above. This is the link

** Reprint 28514316 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 28510797

Again this has a link
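
The "No schema supplied" error above means the string handed to requests was not an absolute URL (here a bare token, elsewhere in the thread an empty string). A minimal sketch of one way to guard against that, assuming the link was scraped from a page whose own URL is known; `resolve_pdf_url` and the example.org addresses are hypothetical:

```python
from urllib.parse import urljoin, urlparse


def resolve_pdf_url(page_url, href):
    """Resolve a scraped href against the page it came from.

    Returns an absolute http(s) URL, or None when the href is empty or
    cannot be turned into one, so the caller can skip the request.
    """
    if not href or not href.strip():
        return None
    absolute = urljoin(page_url, href.strip())
    parts = urlparse(absolute)
    if parts.scheme in ("http", "https") and parts.netloc:
        return absolute
    return None


print(resolve_pdf_url("https://example.org/article/1", "/doi/pdf/10.1/x"))
print(resolve_pdf_url("https://example.org/article/1", ""))
```

A bare token such as the one in the error message would be resolved relative to the page URL, which yields a syntactically valid address but not necessarily the PDF, so the finder that produced the token would still need fixing.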


billgreenwald commented on July 20, 2024

To the first point, just added a method to get articles from future_medicine

working on the others


billgreenwald commented on July 20, 2024

Fixed the downloader for pubmed central.

To do this, removed the check to see if a PDF is linked on pubmed. The example you provided yesterday now has a PDF listed, so I can't use it to test behavior. If you come across a PMID w/ no article in the future, let me know.

Need to address 2 more errors.


billgreenwald commented on July 20, 2024

1 more error. 28510797 fetched fine for me with the generic_citation_labelled finder


billgreenwald commented on July 20, 2024

The site which hosts the link to 28514316 (Wolters Kluwer) does not host the pdf itself; instead it provides a link to the Wolters Kluwer journal, which it loads dynamically (for some reason) via javascript.

Without porting the app to Selenium (which would require actual web browsers to be installed and set up on the device running the script, so I'm not too fond of it), python can't run the javascript that loads the link. The requests library certainly cannot: it only returns the HTML the server sends.
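
To illustrate the limitation, here is a minimal sketch using only the standard library. It scans static HTML for anchors that point at a PDF; a page that builds its link in javascript contains no such anchor in the markup the server returns. `PdfLinkFinder`, `find_pdf_links`, and both sample pages are made up for illustration and are not the finders used by this tool:

```python
from html.parser import HTMLParser


class PdfLinkFinder(HTMLParser):
    """Collect href values of <a> tags that mention '.pdf'."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if ".pdf" in href.lower():
            self.pdf_links.append(href)


def find_pdf_links(html_text):
    finder = PdfLinkFinder()
    finder.feed(html_text)
    return finder.pdf_links


# A link present in the static markup is found.
static_page = '<a href="/content/article.pdf">PDF</a>'
# A link that only exists after a script runs is invisible here.
script_page = '<div id="viewer"></div><script>loadJournalLink();</script>'

print(find_pdf_links(static_page))
print(find_pdf_links(script_page))
```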

That should be all the errors you hit right now. Let me know if others show up. If you still get the libxml error, try installing the conda environment and let me know if they persist.

New version is 2.4.1


sayak1711 commented on July 20, 2024

1 more error. 28510797 fetched fine for me with the generic_citation_labelled finder

There has been a slight mistake in me pasting the error. See the line above "Trying to fetch pmid 28510797"
28514316 is the one it failed for


sayak1711 commented on July 20, 2024

Here is a complete list of pmids for which I couldn't download so far. Total 6668.


billgreenwald commented on July 20, 2024

For 28514316, it is hosted on Wolters Kluwer, which as I mentioned above, can't be gathered with this program.

I am happy to try to help some with the files you can't get, but I don't have time to manually start checking 6668 pmids.

If you wanted to figure out the journals they belong to (ie what is the box for the hyperlink on pubmed), and then give me a list of journals to figure out to fetch, I'd be happy to try to go through that. The most helpful thing for me would be a list as follows:

Journal    Number of Articles     Example PMID
---------------------------------------------------
Precision Medicine       50           XXXXXXXX
Future Medicine          12           YYYYYYYY
.
.
.

Ranking this by number of articles could help prioritize what to work on. I don't, for example, have time to write 100 different scrapers if these articles belong to some hundreds of journals that each require their own function. Also, if the majority of those articles are Wolters Kluwer articles (for example), it's pretty easy to just rule those out.
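
A table like the one above could be produced with a short script. This is a minimal sketch that assumes you already have (pmid, journal) pairs for the failed downloads; how the journal for each PMID is looked up is not shown, and `summarize_journals` and the placeholder records are hypothetical:

```python
from collections import Counter


def summarize_journals(records):
    """Turn (pmid, journal) pairs into (journal, count, example_pmid) rows,
    sorted so the journal with the most failed articles comes first."""
    counts = Counter()
    example = {}
    for pmid, journal in records:
        counts[journal] += 1
        # Remember the first PMID seen for each journal as the example.
        example.setdefault(journal, pmid)
    return [(journal, n, example[journal]) for journal, n in counts.most_common()]


# Placeholder data, not real PMIDs.
records = [
    ("PMID_1", "Future Medicine"),
    ("PMID_2", "Precision Medicine"),
    ("PMID_3", "Precision Medicine"),
]

for journal, n, pmid in summarize_journals(records):
    print(f"{journal}\t{n}\t{pmid}")
```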

Alternatively, if you do want to write those functions, I'd be happy to check out a pull request if you edit the ipython notebook and write some new functions for fetchers I have written.


Ramloc commented on July 20, 2024

Hi Bill,
I just started using this today and it's of immense value to the kind of work I normally do.
I really love it, but I have a few issues with it.
Firstly, wouldn't it be easy to just create a list of failed pubmed IDs as we go? I could attempt to do that myself, but my skills are really rudimentary. This list could then be searched and downloaded by hand.
Also, I major in genetics, so most of the papers that I need end up at Elsevier. I've run a few small searches and they make up roughly 30% of the results. Could you please add a tracker for it? It's a pretty large publishing group.

Journal                                      Number of Articles     Example PMID
---------------------------------------------------------------------------------
Elsevier                                     Approximately 30%      24985776
The World Journal of Biological Psychiatry   -                      27782767

Thanks a lot!


billgreenwald commented on July 20, 2024

Hey @Ramloc , good suggestion. I just added it; the program as of 2.5.0 now stores the non-downloaded pmids in a file. This is written as a PMF format file (ie both the PMID and the article names are given), so this file can be directly passed to the program in the future with -pmf to try to redownload them directly.
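
For readers curious what such a file might look like, here is a minimal sketch. It assumes one tab-separated row per failed article (PMID, then the name the PDF would have been saved under) and falls back to the PMID when no name was supplied. The tab delimiter and the helper name `write_unfetched` are assumptions, not confirmed details of the tool's PMF writer:

```python
import csv
import io


def write_unfetched(handle, failures):
    """Write (pmid, name) rows, tab-separated, to an open text handle.

    When no article name is known, the PMID is repeated as the name.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for pmid, name in failures:
        writer.writerow([pmid, name or pmid])


# Demonstrate with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_unfetched(buffer, [("27782767", None), ("28112043", "my_paper")])
print(buffer.getvalue(), end="")
```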

I will look into the Elsevier and Bio psych next


billgreenwald commented on July 20, 2024

Elsevier is the same site as Science Direct, but I inadvertently broke this earlier and didn't have a way to test it. I recorded the PMID you gave as a test case for the future, and have it working again. Will push the version after Bio psych.


billgreenwald commented on July 20, 2024

The Biological psychiatry paper you listed downloaded via the future medicine finder for me. Let me know if it doesn't for you.


billgreenwald commented on July 20, 2024

Version 2.5.1 has the fixes included. If both of these are fixed, let me know and close the topic :) Better to open another one to add more finders than create one super long thread to hold all finder requests forever.


Ramloc commented on July 20, 2024

Thanks for putting in the time and effort Bill. Yes, we should start a new thread for adding journals that don't work. Elsevier works now, but all the papers downloaded from it are 6 KB in size, so something is still wrong with it.
The biological psychiatry paper does not download. This is the error that I get:

** fetching of reprint 27782767 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
** fetching of reprint 28112043 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?

Also the unfetched pmids.tsv shows
pmid pmid
instead of pmid "paper title"
27782767 27782767

Thanks again!


billgreenwald commented on July 20, 2024

If there was no paper title provided, the pmid is listed twice, as the paper name (ie, what it is saved as) is pmid.pdf. I don't scrape the names from the pdf, nor can I if the pdf is not obtainable.

Those two articles work for me, I just retried; what versions of the python packages are you using? And just to confirm, are you on the newest version of this tool?

For Elsevier/Science Direct, it looks like they added a new redirect to their pdf hosting service; I updated the code and it now works for me. Let me know if you still have issues.
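
A file of only a few KB saved under a .pdf name may be an HTML redirect or landing page rather than the article. A minimal sketch of a sanity check on the downloaded bytes, relying on the fact that PDF files begin with the marker `%PDF-`; `looks_like_pdf` is a hypothetical helper and not part of this tool:

```python
def looks_like_pdf(data):
    """Return True if the downloaded bytes start with the PDF header marker."""
    return data.startswith(b"%PDF-")


# Example byte strings standing in for downloaded responses.
real_pdf_start = b"%PDF-1.4\n%..."
redirect_page = b"<html><head><title>Redirecting</title></head></html>"

print(looks_like_pdf(real_pdf_start))
print(looks_like_pdf(redirect_page))
```

Responses that fail the check could be recorded as unfetched instead of being written to disk as a broken PDF.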


billgreenwald commented on July 20, 2024

Closing due to no response for two months.
