
goscrape's Introduction

goscrape - create offline browsable copies of websites


A web scraper built with Golang. It downloads the content of a website and allows it to be archived and read offline.

Features

Features and advantages over existing tools like wget, httrack, Teleport Pro:

  • Free and open source
  • Available for all platforms that Golang supports
  • JPEG and PNG images can be converted down in quality to save disk space
  • Excluded URLs will not be fetched (unlike wget)
  • No incomplete temp files are left on disk
  • Downloaded asset files are skipped in a new scraper run
  • Assets from external domains are downloaded automatically
  • Sane default values

Limitations

  • No GUI version, console only

Installation

There are two options to install goscrape:

  1. Download and unpack a binary release from Releases, or
  2. Compile the latest release from source:

     go install github.com/cornelk/goscrape@latest

Compiling the tool from source requires a recent version of Golang to be installed.

Usage

Scrape a website by running

goscrape http://website.com

To serve the downloaded website from a local webserver, use

goscrape --serve website.com

Options

Scrape a website and create an offline browsable version on the disk.

Usage: goscrape [--include INCLUDE] [--exclude EXCLUDE] [--output OUTPUT] [--depth DEPTH] [--imagequality IMAGEQUALITY] [--timeout TIMEOUT] [--serve SERVE] [--serverport SERVERPORT] [--cookiefile COOKIEFILE] [--savecookiefile SAVECOOKIEFILE] [--header HEADER] [--proxy PROXY] [--user USER] [--useragent USERAGENT] [--verbose] [URLS [URLS ...]]

Positional arguments:
  URLS

Options:
  --include INCLUDE, -n INCLUDE
                         only include URLs with PERL Regular Expressions support
  --exclude EXCLUDE, -x EXCLUDE
                         exclude URLs with PERL Regular Expressions support
  --output OUTPUT, -o OUTPUT
                         output directory to write files to
  --depth DEPTH, -d DEPTH
                         download depth, 0 for unlimited [default: 10]
  --imagequality IMAGEQUALITY, -i IMAGEQUALITY
                         image quality, 0 to disable reencoding
  --timeout TIMEOUT, -t TIMEOUT
                         time limit in seconds for each HTTP request to connect and read the request body
  --serve SERVE, -s SERVE
                         serve the website using a webserver
  --serverport SERVERPORT, -r SERVERPORT
                         port to use for the webserver [default: 8080]
  --cookiefile COOKIEFILE, -c COOKIEFILE
                         file containing the cookie content
  --savecookiefile SAVECOOKIEFILE
                         file to save the cookie content
  --header HEADER, -h HEADER
                         HTTP header to use for scraping
  --proxy PROXY, -p PROXY
                         HTTP proxy to use for scraping
  --user USER, -u USER   user[:password] to use for HTTP authentication
  --useragent USERAGENT, -a USERAGENT
                         user agent to use for scraping
  --verbose, -v          verbose output
  --help, -h             display this help and exit
  --version              display version and exit

Cookies

Cookies can be passed using the --cookiefile parameter, which points to a file in the following format:

[{"name":"user","value":"123"},{"name":"sessioe","value":"sid"}]

goscrape's People

Contributors

aagat, aorfanos, cornelk, fanyang89, ikozinov, nizarcan, yuseferi


goscrape's Issues

Not grabbing all images

Example site: origami.guide
This tool is unable to download all images from this site and similar sites.

Basic Auth Not Working

While trying to scrape a website that uses basic auth, it throws a 401.

auth := base64.StdEncoding.EncodeToString([]byte(s.Config.Username + ":" + s.Config.Password))
s.browser.AddRequestHeader("Authorization", "Basic "+auth)

The same website returns a 200 response when requested with Postman.

Any ideas why this might not be working?
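For comparison, a minimal sketch that sends the same kind of request with plain net/http and its built-in SetBasicAuth helper, to check whether the server accepts the credentials outside of goscrape; the target URL and credentials are placeholders:

// basicauth_check.go - reproduce the basic-auth request with net/http (sketch).
package main

import (
	"fmt"
	"net/http"
)

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://example.com/protected", nil)
	if err != nil {
		panic(err)
	}
	// Equivalent to the manual base64 header above: SetBasicAuth builds
	// the "Authorization: Basic <base64(user:password)>" header itself.
	req.SetBasicAuth("user", "password")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status) // a 401 here points at credentials or missing headers rather than goscrape
}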

Failure when scraping data URIs

URL: https://ssd.eff.org/en/module/privacy-students

Log:

2021-06-29T06:10:25.020Z	INFO	External URL	{"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.025Z	INFO	Downloading	{"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.031Z	ERROR	Scraping failed	{"error": "Get \"data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D\": unsupported protocol scheme \"data\""}
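One possible direction, sketched with the standard library only: detect data: URIs and decode them locally instead of fetching them over HTTP, which is what causes the "unsupported protocol scheme" error above. The function name decodeDataURI is hypothetical:

// dataurl.go - decode a base64 data: URI locally (sketch).
package main

import (
	"encoding/base64"
	"fmt"
	"net/url"
	"strings"
)

// decodeDataURI returns the embedded bytes of a base64-encoded data: URI.
func decodeDataURI(uri string) ([]byte, error) {
	if !strings.HasPrefix(uri, "data:") {
		return nil, fmt.Errorf("not a data URI")
	}
	_, payload, ok := strings.Cut(uri, ";base64,")
	if !ok {
		return nil, fmt.Errorf("unsupported data URI encoding")
	}
	// The logged URI contains %3D, so percent-decode before base64 decoding.
	payload, err := url.QueryUnescape(payload)
	if err != nil {
		return nil, err
	}
	return base64.StdEncoding.DecodeString(payload)
}

func main() {
	b, err := decodeDataURI("data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D")
	if err != nil {
		panic(err)
	}
	fmt.Printf("decoded %d bytes\n", len(b)) // a 1x1 GIF; it could be written to disk instead of downloaded
}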

error on install

I got:

# github.com/cornelk/goscrape/scraper
../../../go/packages/src/github.com/cornelk/goscrape/scraper/images.go:23:24: invalid operation: kind == "gopkg.in/h2non/filetype.v1/types".Unknown (mismatched types "github.com/h2non/filetype/types".Type and "gopkg.in/h2non/filetype.v1/types".Type)

Can this be used with Chrome/Chromium/Vivaldi/Brave cookies?

Is your feature request related to a problem? Please describe.
I wish I could download the full online Calculus booklet from my university as a single HTML file, but I would need to log in first, so I may need to pass my cookies to goscrape.

Describe the solution you'd like
Well, could we have a way to use Chrome-style cookies with this program?
I'm using Vivaldi currently, so I'm willing to test the feature and even help out if needed; I'm currently programming in C, but I think I can move to Go (pun not intended) with little to no effort.

Describe alternatives you've considered
I've tried to download it with the WebScrapBook extension for Chrome, but it kind of breaks the page and still references the original URLs instead of rewriting them to match the local, relative paths.

Additional context
There are also videos on the page that WebScrapBook is incapable of downloading, so I use the hls downloader extension. I wouldn't mind downloading videos outside of goscrape if needed anyway.

Avoid too long filenames

2024-06-18 10:41:17  ERROR   Writing asset file failed {"url":"https://m.media-amazon.com/images/I/11EIQ5IGqaL._RC%7C01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css?AUIClients/AmazonUI#us.not-trident","file":"www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css","error":"creating file 'www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css': open www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css: file name too long"}

Inline CSS not parsed

Images referenced from inline CSS are not downloaded.

<!doctype html>
<html lang="en">
<head>

<style>
h1 {
  background-image: url('/background.jpg');
}
</style>

</head>
<body>

<h1>Example</h1>

</body>
</html>
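One possible approach, sketched below with a simple regular expression rather than goscrape's actual CSS handling, is to extract url(...) references from <style> blocks and queue them as assets:

// inline_css.go - extract url(...) references from inline CSS (sketch).
package main

import (
	"fmt"
	"regexp"
)

var cssURL = regexp.MustCompile(`url\(['"]?([^'")]+)['"]?\)`)

func main() {
	style := `h1 { background-image: url('/background.jpg'); }`
	for _, m := range cssURL.FindAllStringSubmatch(style, -1) {
		fmt.Println(m[1]) // prints /background.jpg, which would then be downloaded like any other asset
	}
}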

Runs out of memory on big scrapes

A simple scrape was started; after about 55 minutes it crashed with an OOM kill from the Linux kernel, using 3.5GB on a 4GB machine.

My guess is that this happens because the entire "queue" of what to download, and the information about which HTML files still need to be processed to discover more links, is stored in memory.

This is also problematic because, if the process is interrupted, it needs to rescrape everything from scratch. An on-disk queue or database could solve this. It does not need to be complex: a simple text file with a log of discovered URLs, plus a separate file holding the offset of the already processed entries, should work well. Adding to the queue then means checking whether the file already exists on disk and otherwise appending to the log file; processing means selecting from the head of the queue and updating the head pointer.
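A minimal sketch of such an append-only on-disk queue; the file names (queue.log, queue.offset) and the API are hypothetical, not goscrape's actual implementation:

// diskqueue.go - append-only on-disk URL queue with a persisted offset (sketch).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
)

type DiskQueue struct {
	log    *os.File // append-only list of URLs, one per line
	offset int64    // byte offset of the next unprocessed entry
	path   string   // base path for <path>.log / <path>.offset
}

func Open(path string) (*DiskQueue, error) {
	log, err := os.OpenFile(path+".log", os.O_CREATE|os.O_APPEND|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	q := &DiskQueue{log: log, path: path}
	if b, err := os.ReadFile(path + ".offset"); err == nil {
		q.offset, _ = strconv.ParseInt(string(b), 10, 64)
	}
	return q, nil
}

// Enqueue appends a URL to the log; deduplication against already
// downloaded assets would be done by checking the output directory first.
func (q *DiskQueue) Enqueue(url string) error {
	_, err := fmt.Fprintln(q.log, url)
	return err
}

// Next returns the URL at the current offset and persists the new offset,
// so an interrupted run can resume where it left off.
func (q *DiskQueue) Next() (string, error) {
	if _, err := q.log.Seek(q.offset, 0); err != nil {
		return "", err
	}
	line, err := bufio.NewReader(q.log).ReadString('\n')
	if err != nil {
		return "", err // io.EOF means the queue is drained
	}
	q.offset += int64(len(line))
	err = os.WriteFile(q.path+".offset", []byte(strconv.FormatInt(q.offset, 10)), 0o644)
	return line[:len(line)-1], err
}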

Statistics of what had finished downloading before the crash:

1.5GB total size
24721 files
26331 directories
19769 HTML and JavaScript files (99% is HTML)
at least 79198 unique outgoing links via href and src attributes (a very crude approximation using some sed, sort, uniq and wc)

Attachment(s) download

Hello, is there any option to download attachments (zip, pdf, ... files) and save them locally too? Some webpages contain such files; it would be a great feature to add.

Thank you.

Create a text file containing references to media in the scraped website and the filenames one can use to replace them locally

Is your feature request related to a problem? Please describe.
As a complement to #41, it would be useful if, after downloading the webpage, one could also download videos/audio using another program (such as puemos' hls downloader) and then replace them in the local copy using the filenames specified in a text file generated along with the website scrape.

Describe the solution you'd like
Just having a text file called embed.txt (or media.txt) alongside the downloaded HTML file would help a lot, or, when there are multiple HTML files, html-filename_media.txt.
If I'm not underestimating the effort needed, I think this could be achieved by fmt.Printf()'ing the src="" of every iframe found, together with a suitable name to replace it with (see the sketch at the end of this issue). I do not know about Windows, but on UNIX-compatible systems any filename would work, such as filepath.Base(iframe.src) + ".strm". It seems like we can't guess the file type, eh? But I think it could be applied with less effort to video and audio tags.
By the way, not directly related to this issue, but does goscrape unwrap pages that use an iframe to reference another page as a kind of shielding against piracy?

Describe alternatives you've considered
Currently I use the WebScrapBook extension for downloading, as I said in #41.

Additional context
Nothing, I think that's pretty much enough.
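A rough sketch of the idea described above, assuming golang.org/x/net/html for parsing; the media.txt name and the .strm suffix follow the suggestion in this issue and are not part of goscrape:

// media_refs.go - write a media.txt mapping embedded media URLs to local names (sketch).
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/net/html"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: media_refs <file.html>")
		os.Exit(1)
	}
	f, err := os.Open(os.Args[1]) // a downloaded HTML file
	if err != nil {
		panic(err)
	}
	defer f.Close()

	doc, err := html.Parse(f)
	if err != nil {
		panic(err)
	}

	out, err := os.Create("media.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Walk the DOM and record every iframe/video/audio source together
	// with a suggested local file name, as proposed above.
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && (n.Data == "iframe" || n.Data == "video" || n.Data == "audio") {
			for _, a := range n.Attr {
				if a.Key == "src" && a.Val != "" {
					fmt.Fprintf(out, "%s\t%s\n", a.Val, filepath.Base(a.Val)+".strm")
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
}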

Any plan to keep this up to date?

Hi there,

It looks interesting, but it seems like a lot of things have been deprecated and are out of date. Do you have a plan to keep it up to date?
