Comments (18)
I just ran
python fetch_pdfs.py -pmids 30374447
and it ran fine and downloaded the PDF. Could you let me know the following so I can help troubleshoot:
- The exact command you ran
- If you installed the conda environment included in the repo
- If you did not install the conda environment, your currently installed version of lxml.
Thanks! Also, if you have other errors, please throw them in this thread instead of making new threads for each one.
from pubmed-batch-download.
I did not install conda.
lxml version is 4.2.5
The command I ran is the same as the one you wrote above, except that after "-pmids" I passed many PMIDs at once, comma-separated.
It doesn't fail for all of them; it succeeds for many. However, for many others it throws the error I mentioned.
For example, for 28645740 it gives me that error.
When I hit that error I changed the parser from lxml to html.parser on line 99, and the error still occurred for some PMIDs.
from pubmed-batch-download.
Here are some other kinds of errors I am facing:
** fetching of reprint 28589772 failed from error Invalid URL '119162c4-9551-4055-99fe-48538e2570bc': No schema supplied. Perhaps you meant http://119162c4-9551-4055-99fe-48538e2570bc?
Trying to fetch pmid 28543980
** fetching of reprint 28543980 failed from error ('Connection aborted.', BadStatusLine("''",))
It's not as if they don't have the PDF for the above. This is the link
** Reprint 28514316 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 28510797
Again this has a link
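The "No schema supplied" message is what requests raises when it is handed a string with no http:// or https:// prefix, such as a bare identifier or an empty href. A minimal sketch of a guard that resolves relative links against the page they came from and rejects anything that is not an http(s) URL follows. The function name resolve_pdf_link and the example.org URLs are hypothetical and are not part of fetch_pdfs.py.

```python
from urllib.parse import urljoin, urlparse


def resolve_pdf_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it cannot be one.

    Hypothetical helper for illustration; not taken from fetch_pdfs.py.
    """
    if not href or not href.strip():
        # An empty href is what produces "Invalid URL ''" downstream.
        return None
    # Relative links ("/pdf/1.pdf", "file.pdf") are resolved against the page URL.
    absolute = urljoin(page_url, href.strip())
    parts = urlparse(absolute)
    if parts.scheme in ("http", "https") and parts.netloc:
        return absolute
    # Anything else (javascript:, mailto:, ...) is not fetchable over HTTP.
    return None


if __name__ == "__main__":
    page = "https://example.org/articles/123"
    for candidate in ["/pdf/123.pdf", "", "javascript:void(0)"]:
        print(repr(candidate), "->", resolve_pdf_link(page, candidate))
```

Note that a bare token such as the identifier in the first error above would be treated as a relative path and resolved against the page URL, so a check like this only removes the exception; whether the resulting address actually serves a PDF still has to be verified.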
from pubmed-batch-download.
On the first point, I just added a method to get articles from future_medicine.
Working on the others.
from pubmed-batch-download.
Fixed the downloader for pubmed central.
To do this, I removed the check for whether a PDF is linked on PubMed. The example you provided yesterday now has a PDF listed, so I can't use it to test the behavior. If you come across a PMID with no article in the future, let me know.
I still need to address 2 more errors.
from pubmed-batch-download.
1 more error. 28510797 fetched fine for me with the generic_citation_labelled finder
from pubmed-batch-download.
The site which hosts the link to 28514316 (Wolters Kluwer) does not host the pdf itself; instead it provides a link to the Wolters Kluwer journal, which it loads dynamically (for some reason) via javascript.
Without porting the app to Selenium (which would require an actual web browser to be installed and set up on the machine running the script, so I'm not too fond of it), I don't see a way to run that JavaScript from Python. The requests library certainly cannot execute it.
That should cover all the errors you have hit so far. Let me know if others show up. If you still get the libxml error, try installing the conda environment and let me know if it persists.
New version is 2.4.1
from pubmed-batch-download.
"1 more error. 28510797 fetched fine for me with the generic_citation_labelled finder"
I made a slight mistake when pasting the error. See the line above "Trying to fetch pmid 28510797".
28514316 is the one it failed for.
from pubmed-batch-download.
Here is a complete list of the PMIDs I couldn't download so far. Total: 6668.
from pubmed-batch-download.
28514316 is hosted on Wolters Kluwer, which, as I mentioned above, can't be gathered with this program.
I am happy to try to help some with the files you can't get, but I don't have time to manually start checking 6668 pmids.
If you want to work out which journals they belong to (i.e. what the box for the hyperlink on PubMed says) and then give me a list of journals that need fetchers, I'd be happy to go through it. The most helpful thing for me would be a list like this:
Journal              Number of Articles   Example PMID
----------------------------------------------------
Precision Medicine   50                   XXXXXXXX
Future Medicine      12                   YYYYYYYY
.
.
.
Ranking this by number of articles would help prioritize what to work on. I don't, for example, have time to write 100 different scrapers if these articles belong to hundreds of journals that each require their own function. Also, if the majority of those articles are Wolters Kluwer articles (for example), it's pretty easy to just rule those out.
Alternatively, if you do want to write those functions yourself, I'd be happy to check out a pull request if you edit the ipython notebook and add new functions alongside the fetchers I have written.
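The ranked journal list requested above could be produced with a short script along these lines. This is only a sketch: it assumes you already have a journal name for each failed PMID (how you look that up is left open), and the function names and the placeholder records are hypothetical.

```python
from collections import defaultdict


def journal_summary(records):
    """Group (pmid, journal) pairs by journal, most frequent journal first.

    Returns rows of (journal, number_of_articles, example_pmid).
    """
    by_journal = defaultdict(list)
    for pmid, journal in records:
        by_journal[journal].append(pmid)
    rows = [(journal, len(pmids), pmids[0]) for journal, pmids in by_journal.items()]
    # Sort by descending count, then alphabetically so ties are stable.
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


def format_summary(rows):
    """Render the rows as a fixed-width plain-text table."""
    lines = [f"{'Journal':<30}{'Number of Articles':<20}{'Example PMID'}"]
    for journal, count, example in rows:
        lines.append(f"{journal:<30}{count:<20}{example}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder data only; replace with your own (pmid, journal) pairs.
    failed = [
        ("00000001", "Journal A"),
        ("00000002", "Journal B"),
        ("00000003", "Journal B"),
    ]
    print(format_summary(journal_summary(failed)))
```

Sorting by count first puts the publishers that would unblock the most articles at the top of the list.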
from pubmed-batch-download.
Hi Bill,
I just started using this today and it's of immense value for the kind of work I normally do.
I really love it, but I have a few issues with it.
First, wouldn't it be easy to build a list of failed PubMed IDs as we go? I could attempt that myself, but my skills are really rudimentary. That list could then be searched and downloaded by hand.
Also, I major in genetics, so most of the papers I need end up at Elsevier. I've run a few small searches and Elsevier makes up roughly 30% of the results. Could you please add a tracker for it? It's a pretty large publishing group.
Journal                                      Number of Articles   Example PMID
Elsevier                                     Approximately 30%    24985776
The World Journal of Biological Psychiatry                        27782767
Thanks a lot!
from pubmed-batch-download.
Hey @Ramloc, good suggestion. I just added it; as of 2.5.0 the program stores the non-downloaded PMIDs in a file. The file is written in PMF format (i.e. both the PMID and the article name are given), so it can be passed straight back to the program with -pmf to retry the downloads.
I will look into the Elsevier and Bio psych next
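As an illustration of the idea, recording failures so they can be retried later might look like the sketch below. It assumes a tab-separated layout with one PMID and one name per line; the exact layout the tool uses is not shown in this thread, so treat the format, the function names, and the fallback of reusing the PMID as the name as assumptions.

```python
import csv


def write_unfetched(path, failures):
    """Write (pmid, name) rows as tab-separated text.

    Assumption: when no name is known, the PMID is reused as the name.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for pmid, name in failures:
            writer.writerow([pmid, name or pmid])


def read_unfetched(path):
    """Read the file back as a list of (pmid, name) tuples, skipping blank lines."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [tuple(row) for row in csv.reader(handle, delimiter="\t") if row]


if __name__ == "__main__":
    # Hypothetical file name and placeholder article name.
    write_unfetched("unfetched_example.tsv", [("27782767", None), ("24985776", "example_name")])
    print(read_unfetched("unfetched_example.tsv"))
```

Keeping the retry list in the same shape as the input list means a failed batch can be fed back in without any conversion step.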
from pubmed-batch-download.
Elsevier is the same site as Science Direct, but I inadvertently broke this finder earlier and didn't have a way to test it. I recorded the PMID you gave as a test case for the future and have it working again. Will push the version after Bio psych.
from pubmed-batch-download.
The Biological Psychiatry paper you listed downloaded via the future medicine finder for me. Let me know if it doesn't for you.
from pubmed-batch-download.
Version 2.5.1 has the fixes included. If both of these are fixed, let me know and close the topic :) Better to open another one to add more finders than create one super long thread to hold all finder requests forever.
from pubmed-batch-download.
Thanks for putting in the time and effort, Bill. Yes, we should start a new thread for adding journals that don't work. Elsevier works now, but all the papers downloaded from it are 6 KB in size, so something is still wrong with it.
The Biological Psychiatry paper does not download for me. This is the error I get:
** fetching of reprint 27782767 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
** fetching of reprint 28112043 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Also, the unfetched pmids.tsv shows
pmid pmid
instead of pmid "paper title", for example:
27782767 27782767
Thanks again!
from pubmed-batch-download.
If no paper title was provided, the PMID is listed twice, because the paper name (i.e. what the file is saved as) defaults to pmid.pdf. I don't scrape the names from the PDF, nor could I if the PDF is not obtainable.
Those two articles work for me; I just retried. What versions of the Python packages are you using? And just to confirm, are you on the newest version of this tool?
For Elsevier/Science Direct, it looks like they added a new redirect to their pdf hosting service; I updated the code and it now works for me. Let me know if you still have issues.
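One way to catch downloads like the 6 KB files reported above is to check the first bytes of the payload before saving it. A plausible explanation for such small files is that an HTML redirect or landing page was saved with a .pdf extension, though that is a guess rather than something confirmed in this thread. The helper below is a hypothetical sketch and is not part of fetch_pdfs.py.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature.

    Leading whitespace is skipped; real PDF files begin with b"%PDF-".
    """
    return data.lstrip()[:5] == b"%PDF-"


if __name__ == "__main__":
    samples = {
        "pdf": b"%PDF-1.4\n% placeholder body",
        "html": b"<!DOCTYPE html><html><body>Redirecting</body></html>",
        "empty": b"",
    }
    for label, payload in samples.items():
        print(label, looks_like_pdf(payload))
```

A failed check could then be treated like any other fetch failure, so the PMID ends up in the list of articles to retry instead of leaving an unreadable file on disk.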
from pubmed-batch-download.
Closing due to no response for two months.