cornelk / goscrape
Web scraper that can create an offline readable version of a website
License: MIT License
Example site: origami.guide
This tool is unable to download all of the images on this site and similar sites.
While trying to scrape a website that uses basic auth, it throws a 401:
auth := base64.StdEncoding.EncodeToString([]byte(s.Config.Username + ":" + s.Config.Password))
s.browser.AddRequestHeader("Authorization", "Basic "+auth)
The same website returns a 200 response when requested using Postman.
Any ideas why this might not be working?
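For comparison, here is a minimal sketch (assuming the stock net/http client, independent of goscrape's internal browser type) that sends the same Basic Auth header; if this returns 200 where goscrape gets 401, the header is probably not being attached to every request, e.g. after a redirect:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical URL and credentials, for illustration only.
	req, err := http.NewRequest(http.MethodGet, "https://example.com/protected", nil)
	if err != nil {
		log.Fatal(err)
	}
	// SetBasicAuth base64-encodes "user:pass" and sets the
	// Authorization header, equivalent to the manual encoding above.
	req.SetBasicAuth("user", "pass")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect "200 OK" if the credentials are accepted
}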
2024-06-18 10:41:17 ERROR Writing asset file failed {"url":"https://m.media-amazon.com/images/I/11EIQ5IGqaL._RC%7C01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css?AUIClients/AmazonUI#us.not-trident","file":"www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css","error":"creating file 'www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css': open www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css: file name too long"}
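The underlying limit is the filesystem's 255-byte cap on a single path component, which Amazon's concatenated CSS file names exceed. One possible workaround (a sketch, not goscrape's actual behavior) is to hash over-long base names while keeping the extension:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// safeFileName is a hypothetical helper: if the base name exceeds
// maxLen bytes (255 on most filesystems), replace it with its
// SHA-256 hash, preserving the extension so type detection by
// extension still works.
func safeFileName(name string, maxLen int) string {
	base := filepath.Base(name)
	if len(base) <= maxLen {
		return name
	}
	sum := sha256.Sum256([]byte(base))
	return filepath.Join(filepath.Dir(name), hex.EncodeToString(sum[:])+filepath.Ext(base))
}

func main() {
	fmt.Println(safeFileName("www.amazon.com/_m.media-amazon.com/images/I/very-long-concatenated-name.css", 255))
}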
I got:
# github.com/cornelk/goscrape/scraper
../../../go/packages/src/github.com/cornelk/goscrape/scraper/images.go:23:24: invalid operation: kind == "gopkg.in/h2non/filetype.v1/types".Unknown (mismatched types "github.com/h2non/filetype/types".Type and "gopkg.in/h2non/filetype.v1/types".Type)
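This error usually means two copies of the same package are mixed on the import path: scraper/images.go imports the gopkg.in/h2non/filetype.v1 path while the value it compares against comes from github.com/h2non/filetype (or vice versa), and Go treats the two types.Type definitions as distinct. A sketch of the comparison with a single canonical import path (the exact fix in the repo may differ):

package main

import (
	"fmt"

	// Use one canonical import path everywhere; mixing this with
	// "gopkg.in/h2non/filetype.v1" yields two distinct types.Type
	// definitions and the "mismatched types" compile error above.
	"github.com/h2non/filetype"
	"github.com/h2non/filetype/types"
)

func main() {
	kind, _ := filetype.Match([]byte{0xFF, 0xD8, 0xFF}) // JPEG magic bytes
	if kind == types.Unknown {
		fmt.Println("unknown file type")
		return
	}
	fmt.Println(kind.MIME.Value) // image/jpeg
}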
This will fix some problems, like images referenced in CSS.
Images referenced from inline CSS are not downloaded.
<!doctype html>
<html lang="en">
  <head>
    <style>
      h1 {
        background-image: url('/background.jpg');
      }
    </style>
  </head>
  <body>
    <h1>Example</h1>
  </body>
</html>
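A sketch of how those references could be picked up: scan the text of each <style> block for url(...) tokens and queue the targets for download. This is a simplified regex; a real CSS parser would also handle escapes, data: URIs, and @import:

package main

import (
	"fmt"
	"regexp"
)

// cssURLRe matches url('...'), url("...") and bare url(...) tokens.
var cssURLRe = regexp.MustCompile(`url\(\s*['"]?([^'")]+)['"]?\s*\)`)

func main() {
	css := `h1 { background-image: url('/background.jpg'); }`
	for _, m := range cssURLRe.FindAllStringSubmatch(css, -1) {
		fmt.Println("asset to download:", m[1]) // /background.jpg
	}
}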
Hello, is there an option to download attachments (zip, pdf, ... files) and save them locally too? Some webpages contain such files, and it would be a great feature to add.
Thank you.
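A minimal sketch of what such an option could check, assuming attachment detection by file extension (the extension list and helper name are hypothetical; Content-Type sniffing on the response would be more robust):

package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// isAttachment reports whether a link looks like a downloadable
// document rather than a page. The extension list is an assumption.
func isAttachment(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	switch strings.ToLower(path.Ext(u.Path)) {
	case ".zip", ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".tar", ".gz":
		return true
	}
	return false
}

func main() {
	fmt.Println(isAttachment("https://example.com/files/report.pdf")) // true
	fmt.Println(isAttachment("https://example.com/blog/post.html"))   // false
}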
For example:
https://www.example.com/category/blog-post/
https://www.example.com/category/blog-post
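If the point of the example is that both forms name the same page, here is a sketch of trailing-slash normalization so both map to one queue entry (an assumption; some servers legitimately serve different content for the two forms):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize is a hypothetical canonicalizer: strip a trailing slash
// so both variants of a URL map to the same queue entry.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Path != "/" {
		u.Path = strings.TrimSuffix(u.Path, "/")
	}
	return u.String(), nil
}

func main() {
	a, _ := normalize("https://www.example.com/category/blog-post/")
	b, _ := normalize("https://www.example.com/category/blog-post")
	fmt.Println(a == b) // true
}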
Hi there,
it looks interesting, but it seems a lot of things have been deprecated and are out of date. Do you have a plan to keep it updated?
URL: https://ssd.eff.org/en/module/privacy-students
Log:
2021-06-29T06:10:25.020Z INFO External URL {"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.025Z INFO Downloading {"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.031Z ERROR Scraping failed {"error": "Get \"data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D\": unsupported protocol scheme \"data\""}
<body style="background-image: url(#);">
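data: URLs embed their payload inline, so there is nothing to fetch over the network, and url(#) is a fragment-only reference; both should arguably be skipped rather than reported as failures. A sketch of such a filter (the helper name is hypothetical):

package main

import (
	"fmt"
	"net/url"
)

// shouldDownload reports whether a reference is fetchable at all.
func shouldDownload(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	if u.Scheme == "data" { // payload is inline, nothing to fetch
		return false
	}
	if u.Scheme == "" && u.Host == "" && u.Path == "" && u.RawQuery == "" {
		return false // fragment-only reference such as url(#)
	}
	return true
}

func main() {
	fmt.Println(shouldDownload("data:image/gif;base64,R0lGODlh...")) // false
	fmt.Println(shouldDownload("#"))                                 // false
	fmt.Println(shouldDownload("https://ssd.eff.org/style.css"))     // true
}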
Is your feature request related to a problem? Please describe.
I wish I could download the full online Calculus booklet from my university as a single HTML file, but I would need to log in first, so I may need to pass my cookies to goscrape.
Describe the solution you'd like
Well, could we have a way to use Chrome-style cookies in this program?
I'm using Vivaldi currently, so I'm willing to test the feature and even help with something if needed; I'm currently programming in C, but I think I can move to Go (no pun intended) with little to no effort.
Describe alternatives you've considered
I've tried to download it with the WebScrapBook extension for Chrome, but it kind of breaks the page and still references the original URLs instead of rewriting them to the local, relative paths.
Additional context
Well, there are also videos on the page that WebScrapBook is incapable of downloading, so I use the hls downloader extension. I wouldn't mind downloading videos outside of goscrape if needed, anyway.
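For what it's worth, a sketch of the minimal plumbing such a feature needs: attach a session cookie exported from the browser to each request (the cookie name and value here are hypothetical placeholders):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical URL; the session cookie would be copied from the
	// browser's developer tools or an exported cookies file.
	req, err := http.NewRequest(http.MethodGet, "https://university.example/calculus/", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.AddCookie(&http.Cookie{Name: "sessionid", Value: "exported-from-browser"})

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}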
Example:
<html>
  <body background=images/bg.gif>
  </body>
</html>
A simple scrape started; after about 55 minutes, it crashed with an OOM kill from the Linux kernel, using 3.5GB on a 4GB machine.
My guess is that this happens because the entire "queue" of what to download, plus the info about which HTML files still need processing to discover more links, is stored in memory.
This is also problematic because if the process is interrupted, it needs to rescrape everything from scratch. Some on-disk queue / DB could solve this. It does not need to be complex: a simple text file with a log of processed files, and a separate file pointing to the offset of already-processed entries, should work well. Adding to the queue then becomes: check if the file already exists on disk, otherwise append to the log file. When everything is appended, select from the head of the queue and update the head pointer.
Statistics of what had finished downloading before the crash:
1.5GB total size
24721 files
26331 directories
19769 HTML and JavaScript files (99% is HTML)
at least 79198 unique outgoing links via href and src attributes (a very crude approximation using some sed, sort, uniq, and wc).
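A sketch of the proposed on-disk queue: an append-only log of pending URLs plus a small head-offset file, so dedup can rely on files already saved to disk and an interrupted run can resume where it stopped (file names and layout here are assumptions, not goscrape's):

package queue

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
)

// Enqueue appends a URL to the append-only log unless its output
// file already exists on disk (the saved file doubles as the dedup
// marker, as proposed above).
func Enqueue(logPath, url string, alreadyOnDisk func(string) bool) error {
	if alreadyOnDisk(url) {
		return nil
	}
	f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintln(f, url)
	return err
}

// Dequeue reads the URL at the saved head offset and advances the
// offset, so a restarted scrape resumes where the last run stopped.
func Dequeue(logPath, headPath string) (string, error) {
	raw, _ := os.ReadFile(headPath) // a missing file means offset 0
	offset, _ := strconv.ParseInt(string(raw), 10, 64)

	f, err := os.Open(logPath)
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := f.Seek(offset, 0); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(f).ReadString('\n')
	if err != nil {
		return "", err // io.EOF here means the queue is drained
	}
	next := strconv.FormatInt(offset+int64(len(line)), 10)
	if err := os.WriteFile(headPath, []byte(next), 0o644); err != nil {
		return "", err
	}
	return line[:len(line)-1], nil
}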
example.org/cdn-cgi/styles/fonts/opensans-600.svg#open_sanssemibold
Is your feature request related to a problem? Please describe.
As a complement to #41, it would be useful if, after downloading the webpage, one could also download the videos/audio with another program (such as puemos' hls downloader) and then replace them in the local copy using the filenames specified in a text file generated alongside the website scrape.
Describe the solution you'd like
Just having a text file called embed.txt (or media.txt) along with the downloaded HTML file would help a lot; or, if there are multiple HTML files, html-filename_media.txt.
I think, if I'm not underestimating the effort needed, that this could be achieved by fmt.Printf()'ing the src="" of every iframe found, along with a suitable name to replace it with. I don't know about Windows, but on UNIX-compatible systems any filename would work, such as filepath.Base(iframe.src) + ".strm". It seems like we can't guess the file type, eh? But I think it could be done with even less effort for video and audio tags.
By the way, not so related to this issue, but does goscrape unwrap pages that use an iframe to reference another page as a form of "shielding" against piracy?
Describe alternatives you've considered
Currently I use the WebScrapBook extension for downloading, as I said in #41.
Additional context
Nothing, I think that's pretty much enough.
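A sketch of how the proposed listing could be generated, using golang.org/x/net/html to collect the src of every iframe, video, and audio element (writing to stdout here; redirecting the output to media.txt is left to the caller, and a fuller version would also look at nested <source> children):

package main

import (
	"fmt"
	"os"

	"golang.org/x/net/html"
)

// Reads an HTML document from stdin and prints the src of every
// iframe, video and audio element, one per line.
func main() {
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			switch n.Data {
			case "iframe", "video", "audio":
				for _, a := range n.Attr {
					if a.Key == "src" && a.Val != "" {
						fmt.Println(a.Val)
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
}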