Comments (7)
Why are PDF documents converted to HTML files? Is it a bug?
from goscrape.
Thanks for the report. Do you have an example website to test?
https://eetaa722.fr/
The PDF links are at the end of the main page. Here is one of them:
<p style="padding-left: 30px;"><a href="https://eetaa722.fr/wp-content/uploads/2023/02/4-STI2D-term.pdf" rel="attachment wp-att-6320">4-STI2D Term</a></p>
This page has the same problem:
http://eetaa722.fr/index.php/mentions-legales/
This has now been fixed in the latest version of the main branch. Let me know if you still see any issues.
Sorry, but I still see issues. Here is a log demonstrating that PDF files are downloaded as HTML:
`2023-04-01T19:53:10.986+0200 DEBUG HTML Element relinked {"URL": "https://eetaa722.fr/wp-content/uploads/2023/01/Aide-pour-linscription-dans-demarches-simplifiees.pdf", "Fixed": "wp-content/uploads/2023/01/Aide-pour-linscription-dans-demarches-simplifiees.html"}
2023-04-01T19:53:10.989+0200 DEBUG HTML Element relinked {"URL": "https://eetaa722.fr/wp-content/uploads/2023/01/Declaration-du-representant-legal.pdf", "Fixed": "wp-content/uploads/2023/01/Declaration-du-representant-legal.html"}
2023-04-01T19:53:10.989+0200 DEBUG HTML Element relinked {"URL": "https://eetaa722.fr/wp-content/uploads/2023/01/Attestation-sur-lhonneur.pdf", "Fixed": "wp-content/uploads/2023/01/Attestation-sur-lhonneur.html"}`
@lotphi make sure to install the latest dev version:
go install github.com/cornelk/goscrape@main
goscrape -v https://eetaa722.fr/
...
2023-04-04 12:21:57 DEBUG HTML Element relinked {"url":"https://eetaa722.fr/wp-content/uploads/2023/01/Declaration-du-representant-legal.pdf","fixed_url":"wp-content/uploads/2023/01/Declaration-du-representant-legal.pdf"}
Perfect!
Thanks a lot.
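The two log excerpts above differ only in the extension of the relinked target: `.html` before the fix, `.pdf` after it. As a rough illustration of that behavior, here is a minimal sketch in Go of one way a scraper could pick a local file name so that URLs with a file extension keep it. `localPath` is a hypothetical helper written for this example; it is not goscrape's actual code, and the real fix may work differently.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// localPath is a hypothetical helper (an assumption for illustration, not
// part of goscrape). It maps a URL to a relative file path and appends
// ".html" only when the last path element has no extension of its own.
func localPath(rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}

	p := strings.TrimPrefix(u.Path, "/")
	switch {
	case p == "" || strings.HasSuffix(p, "/"):
		// Directory-style URL: store the page as index.html in that directory.
		return p + "index.html", nil
	case path.Ext(p) == "":
		// No extension on the last element: treat it as an HTML page.
		return p + ".html", nil
	default:
		// An extension such as .pdf is already present: keep it unchanged.
		return p, nil
	}
}

func main() {
	urls := []string{
		"https://eetaa722.fr/",
		"https://eetaa722.fr/wp-content/uploads/2023/02/4-STI2D-term.pdf",
		"http://eetaa722.fr/index.php/mentions-legales/",
	}
	for _, raw := range urls {
		p, err := localPath(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(raw, "->", p)
	}
}
```

Deciding by extension alone is a simplification. A more robust approach would probably also check the `Content-Type` header of the response, since a URL without an extension can still serve a PDF.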