cornelk / goscrape
Web scraper that can create an offline readable version of a website
License: MIT License
Example site: origami.guide
This tool is unable to download all of the images on this site and similar sites.
While trying to scrape a website that uses basic auth, it throws a 401:
auth := base64.StdEncoding.EncodeToString([]byte(s.Config.Username + ":" + s.Config.Password))
s.browser.AddRequestHeader("Authorization", "Basic "+auth)
The same website returns a 200 response when requested using Postman.
Any ideas why this might not be working?
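For comparison, here is a minimal sketch (assuming the stock net/http client, independent of goscrape's internal browser type) that sends the same Basic Auth header; if this returns 200 where goscrape gets 401, the header is probably not being attached to every request, e.g. after a redirect:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical URL and credentials, for illustration only.
	req, err := http.NewRequest(http.MethodGet, "https://example.com/protected", nil)
	if err != nil {
		log.Fatal(err)
	}
	// SetBasicAuth base64-encodes "user:pass" and sets the
	// Authorization header, equivalent to the manual encoding above.
	req.SetBasicAuth("user", "pass")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // expect "200 OK" if the credentials are accepted
}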
2024-06-18 10:41:17 ERROR Writing asset file failed {"url":"https://m.media-amazon.com/images/I/11EIQ5IGqaL._RC%7C01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css?AUIClients/AmazonUI#us.not-trident","file":"www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css","error":"creating file 'www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css': open www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css: file name too long"}
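The underlying limit is the filesystem's 255-byte cap on a single path component, which Amazon's concatenated CSS file names exceed. One possible workaround (a sketch, not goscrape's actual behavior) is to hash over-long base names while keeping the extension:

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// safeFileName is a hypothetical helper: if the base name exceeds
// maxLen bytes (255 on most filesystems), replace it with its
// SHA-256 hash, preserving the extension so type detection by
// extension still works.
func safeFileName(name string, maxLen int) string {
	base := filepath.Base(name)
	if len(base) <= maxLen {
		return name
	}
	sum := sha256.Sum256([]byte(base))
	return filepath.Join(filepath.Dir(name), hex.EncodeToString(sum[:])+filepath.Ext(base))
}

func main() {
	fmt.Println(safeFileName("www.amazon.com/_m.media-amazon.com/images/I/very-long-concatenated-name.css", 255))
}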
I got:
# github.com/cornelk/goscrape/scraper
../../../go/packages/src/github.com/cornelk/goscrape/scraper/images.go:23:24: invalid operation: kind == "gopkg.in/h2non/filetype.v1/types".Unknown (mismatched types "github.com/h2non/filetype/types".Type and "gopkg.in/h2non/filetype.v1/types".Type)
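This error usually means two copies of the same package are mixed on the import path: scraper/images.go imports the gopkg.in/h2non/filetype.v1 path while the value it compares against comes from github.com/h2non/filetype (or vice versa), and Go treats the two types.Type definitions as distinct. A sketch of the comparison with a single canonical import path (the exact fix in the repo may differ):

package main

import (
	"fmt"

	// Use one canonical import path everywhere; mixing this with
	// "gopkg.in/h2non/filetype.v1" yields two distinct types.Type
	// definitions and the "mismatched types" compile error above.
	"github.com/h2non/filetype"
	"github.com/h2non/filetype/types"
)

func main() {
	kind, _ := filetype.Match([]byte{0xFF, 0xD8, 0xFF}) // JPEG magic bytes
	if kind == types.Unknown {
		fmt.Println("unknown file type")
		return
	}
	fmt.Println(kind.MIME.Value) // image/jpeg
}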
This will fix some problems, like images referenced in CSS.
Images referenced from inline CSS are not downloaded.
<!doctype html>
<html lang="en">
  <head>
    <style>
      h1 {
        background-image: url('/background.jpg');
      }
    </style>
  </head>
  <body>
    <h1>Example</h1>
  </body>
</html>
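A sketch of how those references could be picked up: scan the text of each <style> block for url(...) tokens and queue the targets for download. This is a simplified regex; a real CSS parser would also handle escapes, data: URIs, and @import:

package main

import (
	"fmt"
	"regexp"
)

// cssURLRe matches url('...'), url("...") and bare url(...) tokens.
var cssURLRe = regexp.MustCompile(`url\(\s*['"]?([^'")]+)['"]?\s*\)`)

func main() {
	css := `h1 { background-image: url('/background.jpg'); }`
	for _, m := range cssURLRe.FindAllStringSubmatch(css, -1) {
		fmt.Println("asset to download:", m[1]) // /background.jpg
	}
}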
Hello, is there an option to download attachments (zip, pdf, ... files) and save them locally too? Some webpages contain such files, and it would be a great feature to add.
Thank you.
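A minimal sketch of what such an option could check, assuming attachment detection by file extension (the extension list and helper name are hypothetical; Content-Type sniffing on the response would be more robust):

package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// isAttachment reports whether a link looks like a downloadable
// document rather than a page. The extension list is an assumption.
func isAttachment(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	switch strings.ToLower(path.Ext(u.Path)) {
	case ".zip", ".pdf", ".doc", ".docx", ".xls", ".xlsx", ".tar", ".gz":
		return true
	}
	return false
}

func main() {
	fmt.Println(isAttachment("https://example.com/files/report.pdf")) // true
	fmt.Println(isAttachment("https://example.com/blog/post.html"))   // false
}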
For example:
https://www.example.com/category/blog-post/
https://www.example.com/category/blog-post
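If the point of the example is that both forms name the same page, here is a sketch of trailing-slash normalization so both map to one queue entry (an assumption; some servers legitimately serve different content for the two forms):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalize is a hypothetical canonicalizer: strip a trailing slash
// so both variants of a URL map to the same queue entry.
func normalize(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Path != "/" {
		u.Path = strings.TrimSuffix(u.Path, "/")
	}
	return u.String(), nil
}

func main() {
	a, _ := normalize("https://www.example.com/category/blog-post/")
	b, _ := normalize("https://www.example.com/category/blog-post")
	fmt.Println(a == b) // true
}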
Hi there,
it looks interesting, but it seems a lot of things have been deprecated and are out of date. Do you have a plan to keep it updated?
URL: https://ssd.eff.org/en/module/privacy-students
Log:
2021-06-29T06:10:25.020Z INFO External URL {"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.025Z INFO Downloading {"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.031Z ERROR Scraping failed {"error": "Get \"data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D\": unsupported protocol scheme \"data\""}
<body style="background-image: url(#);">
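data: URLs embed their payload inline, so there is nothing to fetch over the network, and url(#) is a fragment-only reference; both should arguably be skipped rather than reported as failures. A sketch of such a filter (the helper name is hypothetical):

package main

import (
	"fmt"
	"net/url"
)

// shouldDownload reports whether a reference is fetchable at all.
func shouldDownload(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	if u.Scheme == "data" { // payload is inline, nothing to fetch
		return false
	}
	if u.Scheme == "" && u.Host == "" && u.Path == "" && u.RawQuery == "" {
		return false // fragment-only reference such as url(#)
	}
	return true
}

func main() {
	fmt.Println(shouldDownload("data:image/gif;base64,R0lGODlh...")) // false
	fmt.Println(shouldDownload("#"))                                 // false
	fmt.Println(shouldDownload("https://ssd.eff.org/style.css"))     // true
}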
Is your feature request related to a problem? Please describe.
I wish I could download the full online Calculus booklet from my university as a single HTML file, but I would need to log in first, so I may need to pass my cookies to goscrape.
Describe the solution you'd like
Well, could we have a way to use Chrome-style cookies in this program?
I'm using Vivaldi currently, so I'm willing to test the feature and even help with something if needed; I'm currently programming in C, but I think I can move to Go (no pun intended) with little to no effort.
Describe alternatives you've considered
I've tried to download it with the WebScrapBook extension for Chrome, but it kind of breaks the page and still references the original URLs instead of rewriting them to the local, relative paths.
Additional context
Well, there are also videos on the page that WebScrapBook is incapable of downloading, so I use the hls downloader extension. I wouldn't mind downloading videos outside of goscrape if needed, anyway.
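For what it's worth, a sketch of the minimal plumbing such a feature needs: attach a session cookie exported from the browser to each request (the cookie name and value here are hypothetical placeholders):

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Hypothetical URL; the session cookie would be copied from the
	// browser's developer tools or an exported cookies file.
	req, err := http.NewRequest(http.MethodGet, "https://university.example/calculus/", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.AddCookie(&http.Cookie{Name: "sessionid", Value: "exported-from-browser"})

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}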
Example:
<html>
  <body background=images/bg.gif>
  </body>
</html>
A simple scrape started; after about 55 minutes, it crashed with an OOM kill from the Linux kernel, using 3.5GB on a 4GB machine.
My guess is that this happens because the entire "queue" of what to download, plus the info about which HTML files still need processing to discover more links, is stored in memory.
This is also problematic because if the process is interrupted, it needs to rescrape everything from scratch. Some on-disk queue / DB could solve this. It does not need to be complex: a simple text file with a log of processed files, and a separate file pointing to the offset of already-processed entries, should work well. Adding to the queue then becomes: check if the file already exists on disk, otherwise append to the log file. When everything is appended, select from the head of the queue and update the head pointer.
Statistics of what had finished downloading before the crash:
1.5GB total size
24721 files
26331 directories
19769 HTML and JavaScript files (99% is HTML)
at least 79198 unique outgoing links via href and src attributes (a very crude approximation using some sed, sort, uniq, and wc).
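A sketch of the proposed on-disk queue: an append-only log of pending URLs plus a small head-offset file, so dedup can rely on files already saved to disk and an interrupted run can resume where it stopped (file names and layout here are assumptions, not goscrape's):

package queue

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
)

// Enqueue appends a URL to the append-only log unless its output
// file already exists on disk (the saved file doubles as the dedup
// marker, as proposed above).
func Enqueue(logPath, url string, alreadyOnDisk func(string) bool) error {
	if alreadyOnDisk(url) {
		return nil
	}
	f, err := os.OpenFile(logPath, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintln(f, url)
	return err
}

// Dequeue reads the URL at the saved head offset and advances the
// offset, so a restarted scrape resumes where the last run stopped.
func Dequeue(logPath, headPath string) (string, error) {
	raw, _ := os.ReadFile(headPath) // a missing file means offset 0
	offset, _ := strconv.ParseInt(string(raw), 10, 64)

	f, err := os.Open(logPath)
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := f.Seek(offset, 0); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(f).ReadString('\n')
	if err != nil {
		return "", err // io.EOF here means the queue is drained
	}
	next := strconv.FormatInt(offset+int64(len(line)), 10)
	if err := os.WriteFile(headPath, []byte(next), 0o644); err != nil {
		return "", err
	}
	return line[:len(line)-1], nil
}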
example.org/cdn-cgi/styles/fonts/opensans-600.svg#open_sanssemibold
Is your feature request related to a problem? Please describe.
As a complement to #41, it would be useful if, after downloading the webpage, one could also download the videos/audio with another program (such as puemos' hls downloader) and then replace them in the local copy using the filenames specified in a text file generated alongside the website scrape.
Describe the solution you'd like
Just having a text file called embed.txt (or media.txt) along with the downloaded HTML file would help a lot; or, if there are multiple HTML files, html-filename_media.txt.
I think, if I'm not underestimating the effort needed, that this could be achieved by fmt.Printf()'ing the src="" of every iframe found, along with a suitable name to replace it with. I don't know about Windows, but on UNIX-compatible systems any filename would work, such as filepath.Base(iframe.src) + ".strm". It seems like we can't guess the file type, eh? But I think it could be done with even less effort for video and audio tags.
By the way, not so related to this issue, but does goscrape unwrap pages that use an iframe to reference another page as a form of "shielding" against piracy?
Describe alternatives you've considered
Currently I use the WebScrapBook extension for downloading, as I said in #41.
Additional context
Nothing, I think that's pretty much enough.
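A sketch of how the proposed listing could be generated, using golang.org/x/net/html to collect the src of every iframe, video, and audio element (writing to stdout here; redirecting the output to media.txt is left to the caller, and a fuller version would also look at nested <source> children):

package main

import (
	"fmt"
	"os"

	"golang.org/x/net/html"
)

// Reads an HTML document from stdin and prints the src of every
// iframe, video and audio element, one per line.
func main() {
	doc, err := html.Parse(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode {
			switch n.Data {
			case "iframe", "video", "audio":
				for _, a := range n.Attr {
					if a.Key == "src" && a.Val != "" {
						fmt.Println(a.Val)
					}
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
}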