cornelk / goscrape Goto Github PK

View Code? Open in Web Editor NEW

184.0 6.0 33.0 3.71 MB

Web scraper that can create an offline readable version of a website

License: MIT License

Go 97.70% Makefile 2.12% Dockerfile 0.18%

scraper golang go webscraping

goscrape's Introduction

goscrape - create offline browsable copies of websites

A web scraper built with Golang. It downloads the content of a website and allows it to be archived and read offline.

Features

Features and advantages over existing tools like wget, httrack, Teleport Pro:

Free and open source
Available for all platforms that Golang supports
JPEG and PNG images can be converted down in quality to save disk space
Excluded URLS will not be fetched (unlike wget)
No incomplete temp files are left on disk
Downloaded asset files are skipped in a new scraper run
Assets from external domains are downloaded automatically
Sane default values

Limitations

No GUI version, console only

Installation

There are 2 options to install goscrape:

Download and unpack a binary release from Releases

Compile the latest release from source:

go install github.com/cornelk/goscrape@latest

Compiling the tool from source code needs to have a recent version of Golang installed.

Usage

Scrape a website by running

goscrape http://website.com

To serve the downloaded website directory in a local run webserver use

goscrape --serve website.com

Options

Scrape a website and create an offline browsable version on the disk.

Usage: goscrape [--include INCLUDE] [--exclude EXCLUDE] [--output OUTPUT] [--depth DEPTH] [--imagequality IMAGEQUALITY] [--timeout TIMEOUT] [--serve SERVE] [--serverport SERVERPORT] [--cookiefile COOKIEFILE] [--savecookiefile SAVECOOKIEFILE] [--header HEADER] [--proxy PROXY] [--user USER] [--useragent USERAGENT] [--verbose] [URLS [URLS ...]]

Positional arguments:
  URLS

Options:
  --include INCLUDE, -n INCLUDE
                         only include URLs with PERL Regular Expressions support
  --exclude EXCLUDE, -x EXCLUDE
                         exclude URLs with PERL Regular Expressions support
  --output OUTPUT, -o OUTPUT
                         output directory to write files to
  --depth DEPTH, -d DEPTH
                         download depth, 0 for unlimited [default: 10]
  --imagequality IMAGEQUALITY, -i IMAGEQUALITY
                         image quality, 0 to disable reencoding
  --timeout TIMEOUT, -t TIMEOUT
                         time limit in seconds for each HTTP request to connect and read the request body
  --serve SERVE, -s SERVE
                         serve the website using a webserver
  --serverport SERVERPORT, -r SERVERPORT
                         port to use for the webserver [default: 8080]
  --cookiefile COOKIEFILE, -c COOKIEFILE
                         file containing the cookie content
  --savecookiefile SAVECOOKIEFILE
                         file to save the cookie content
  --header HEADER, -h HEADER
                         HTTP header to use for scraping
  --proxy PROXY, -p PROXY
                         HTTP proxy to use for scraping
  --user USER, -u USER   user[:password] to use for HTTP authentication
  --useragent USERAGENT, -a USERAGENT
                         user agent to use for scraping
  --verbose, -v          verbose output
  --help, -h             display this help and exit
  --version              display version and exit

Cookies

Cookies can be passed in a file using the --cookiefile parameter and a file containing cookies in the following format:

[{"name":"user","value":"123"},{"name":"sessioe","value":"sid"}]

goscrape's People

Contributors

Stargazers

Watchers

goscrape's Issues

Add serving of downloaded website by a http server

this will fix some problems like images referenced in css

Add option to include external domains

Handle trailing slash at end as duplicate by default

For example:

https://www.example.com/category/blog-post/
https://www.example.com/category/blog-post

Filter fragments at the end of URLs

example.org/cdn-cgi/styles/fonts/opensans-600.svg#open_sanssemibold

Continue scraping after a 404 error

Not grabbing all images

Example site :- origami.guide
This tool is unable to get all images of this site and similar sites.

Basic Auth Not Working

While trying to scrape a website that uses basic auth, it is throwing 401.

auth := base64.StdEncoding.EncodeToString([]byte(s.Config.Username + ":" + s.Config.Password))
s.browser.AddRequestHeader("Authorization", "Basic "+auth)

The same website throws 200 response when requested using Postman.

Any ideas why this might not be working?

Failure to scraping data URI

URL: https://ssd.eff.org/en/module/privacy-students

Log:

2021-06-29T06:10:25.020Z	INFO	External URL	{"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.025Z	INFO	Downloading	{"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.031Z	ERROR	Scraping failed	{"error": "Get \"data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D\": unsupported protocol scheme \"data\""}

background attribute on body tag is not handled

Example:

<html>
<body background=images/bg.gif>
</body>
</html>

error on install

i got :

# github.com/cornelk/goscrape/scraper
../../../go/packages/src/github.com/cornelk/goscrape/scraper/images.go:23:24: invalid operation: kind == "gopkg.in/h2non/filetype.v1/types".Unknown (mismatched types "github.com/h2non/filetype/types".Type and "gopkg.in/h2non/filetype.v1/types".Type)

Add concurrent downloading of URLs

Can this be used with Chrome/Chromium/Vivaldi/Brave cookies?

Is your feature request related to a problem? Please describe.
I wish I could download the full online Calculus booklet from my university as a single HTML file, but I would need to log in before, so may I need to pass my cookies to goscrape.

Describe the solution you'd like
Well, could we have a way to get Chrome-esque cookies on this program?
I'm using Vivaldi currently, so I'm willing to test the feature and even help on something if needed --- I'm currently programming in C, but I think I can go to Go (non-intended pun) with little to no effort.

Describe alternatives you've considered
I've tried to download it with the WebScrapBook extension for Chrome, but it kind of breaks the page and still referencing the original URL instead of editing it to match the local, relative path.

Additional context
Well, there are also videos on the page that WebScrapBook is incapable of downloading, so I use the hls downloader extension. I think I wouldn't bother downloading videos outside goscrape if needed anyway.

Avoid too long filenames

2024-06-18 10:41:17  ERROR   Writing asset file failed {"url":"https://m.media-amazon.com/images/I/11EIQ5IGqaL._RC%7C01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css?AUIClients/AmazonUI#us.not-trident","file":"www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css","error":"creating file 'www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css': open www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css: file name too long"}

Inline CSS not parsed

Images referenced from inline CSS are not downloaded.

<!doctype html>
<html lang="en">
<head>

<style>
h1 {
  background-image: url('/background.jpg');
}
</style>

</head>
<body>

<h1>Example</h1>

</body>
</html>

HTML style attributes are not parsed

<body style="background-image: url(#);">

Runs out of memory on big scrapes

Simple scrape started, after about 55 minutes, crash with OOM from Linux kernel. Using 3.5GB, on a 4GB machine.

My guess, this is because entire "queue" of what to download, and info which html files to process to discover more links, is stored in memory.

This is also problematic, because if process is interrupted, it need to rescrape everything from scratch. Some on-disk queue / db, could solve this. It does not need to be too complex, as simple text file with log of processed files, and separate file to point to offset of already processed files should work well. Then adding to queue is check if file already exists on disk, otherwise append to the log file. When everything appended, select from the head of queue, and update head pointer.

Statistic of what was finished downloading before the crash:

1.5GB total size
24721 files
26331 directories
19769 HTML and JavaScript files (99% is HTML)
at least 79198 unique outgoing links via href and src attributes. (very crude approximation with some sed, sort, uniq and wc

Attachment(s) download

Hello , any option(s) to download attachments and save it/them locally too ; some webpages contain ( zip,pdf ... files ) ; it's a great feature to be added .

Thank you.

how to set cookie

Create a text file containing references of media in the scrapped website and the filename that one can use to replace it locally

Is your feature request related to a problem? Please describe.
As a complement to #41, it would be useful that, after downloading the webpage, one could have the ability to also download videos/audios using another program (such as puemos' hls downloader) and then replacing it in the local copy using the filename specified on a text file generated along with the website scrape.

Describe the solution you'd like
Just having a text file called embed.txt (or media.txt) along with the downloaded HTML file would help a lot --- or, if on multiple HTMLs, html-filename_media.txt.
I think, if I'm not underestimating the effort needed, that this could be achieved by fmt.Printf()'ing the src="" of every iframe found along with a suitable name for replacing it --- I do not know about Windows, but on UNIX-compatible systems any filename would work, such as filepath.Base(iframe.src) + ".strm". It seems like we can't guess the file type, eh? But I think it could be used with less effort on video and audio tags.
By the way, not so related with this issue, but does goscrape unwrap pages that use an iframe to reference another as a way of "blindage" against piracy?

Describe alternatives you've considered
Currently I use the WebScrapBook extension for downloading, as I said in #41.

Additional context
Nothing, I think that's pretty much enough.