
goscrape's Issues

Not grabbing all images

Example site: origami.guide
This tool is unable to download all of the images on this site and on similar sites.

Basic Auth Not Working

While trying to scrape a website that uses basic auth, goscrape receives a 401 response.

auth := base64.StdEncoding.EncodeToString([]byte(s.Config.Username + ":" + s.Config.Password))
s.browser.AddRequestHeader("Authorization", "Basic "+auth)

The same website returns a 200 response when requested using Postman.

Any ideas why this might not be working?

Avoid overly long filenames

2024-06-18 10:41:17  ERROR   Writing asset file failed {"url":"https://m.media-amazon.com/images/I/11EIQ5IGqaL._RC%7C01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css?AUIClients/AmazonUI#us.not-trident","file":"www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css","error":"creating file 
'www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css': open www.amazon.com/_m.media-amazon.com/images/I/11EIQ5IGqaL._RC|01e5ncglxyL.css,01lF2n-pPaL.css,41SwWPpN5yL.css,31+Z83i6adL.css,01IWMurvs8L.css,01ToTiqCP7L.css,01qPl4hxayL.css,01ITNc8rK9L.css,413Vvv3GONL.css,11TIuySqr6L.css,01Rw4F+QU6L.css,11j54vTBQxL.css,01pbKJp5dbL.css,01IdKcBuAdL.css,01y-XAlI+2L.css,213SZJ8Z+PL.css,01oDR3IULNL.css,51qPa7JG96L.css,01XPHJk60-L.css,01dmkcyJuIL.css,01B9+-hVWxL.css,21Ol27dM9tL.css,11JRZ3s9niL.css,21wA+jAxKjL.css,11U8GXfhueL.css,01CFUgsA-YL.css,316CD+csp-L.css,116t+WD27UL.css,11uWFHlOmWL.css,11v8YDG4ifL.css,11otOAnaYoL.css,01FwL+mJQOL.css,11NDsgnHEZL.css,21RE+gQIxcL.css,11CLXYZ6DRL.css,012f1fcyibL.css,21w-O41p+SL.css,11XH+76vMZL.css,11hvENnYNUL.css,11FRI-QT39L.css,01890+Vwk8L.css,01864Lq457L.css,01cbS3UK11L.css,21F85am0yFL.css,01ySiGRmxlL.css,016Sx2kF1+L.css_.css: file name too long"}

error on install

I got:

# github.com/cornelk/goscrape/scraper
../../../go/packages/src/github.com/cornelk/goscrape/scraper/images.go:23:24: invalid operation: kind == "gopkg.in/h2non/filetype.v1/types".Unknown (mismatched types "github.com/h2non/filetype/types".Type and "gopkg.in/h2non/filetype.v1/types".Type)

Inline CSS not parsed

Images referenced from inline CSS are not downloaded.

<!doctype html>
<html lang="en">
<head>

<style>
h1 {
  background-image: url('/background.jpg');
}
</style>

</head>
<body>

<h1>Example</h1>

</body>
</html>

Attachment(s) download

Hello, is there an option to download attachments and save them locally too? Some webpages contain zip, PDF, and other files; this would be a great feature to add.

Thank you.

Any plan to keep this up to date?

Hi there,

It looks interesting, but it seems a lot of things have been deprecated and are out of date. Do you have a plan to keep it updated?

Failure to scrape data URIs

URL: https://ssd.eff.org/en/module/privacy-students

Log:

2021-06-29T06:10:25.020Z	INFO	External URL	{"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.025Z	INFO	Downloading	{"URL": "data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D"}
2021-06-29T06:10:25.031Z	ERROR	Scraping failed	{"error": "Get \"data:image/gif;base64,R0lGODlhAQABAAD/ACwAAAAAAQABAAACADs%3D\": unsupported protocol scheme \"data\""}

Can this be used with Chrome/Chromium/Vivaldi/Brave cookies?

Is your feature request related to a problem? Please describe.
I wish I could download the full online Calculus booklet from my university as a single HTML file, but I would need to log in first, so I may need to pass my cookies to goscrape.

Describe the solution you'd like
Well, could we have a way to use Chrome-style cookies with this program?
I'm using Vivaldi currently, so I'm willing to test the feature and even help out if needed. I'm currently programming in C, but I think I can move to Go (pun not intended) with little to no effort.

Describe alternatives you've considered
I've tried to download it with the WebScrapBook extension for Chrome, but it somewhat breaks the page and still references the original URLs instead of rewriting them to local, relative paths.

Additional context
Well, there are also videos on the page that WebScrapBook is incapable of downloading, so I use the hls downloader extension. I wouldn't mind downloading videos outside goscrape if needed anyway.

Runs out of memory on big scrapes

I started a simple scrape; after about 55 minutes it crashed with an OOM kill from the Linux kernel, using 3.5 GB on a 4 GB machine.

My guess is that this happens because the entire queue of what to download, plus the info about which HTML files to process to discover more links, is stored in memory.

This is also problematic because, if the process is interrupted, it needs to rescrape everything from scratch. An on-disk queue/db could solve this. It does not need to be complex: a simple text file with a log of processed files, plus a separate file pointing to the offset of already-processed entries, should work well. Adding to the queue then becomes: check whether the file already exists on disk, otherwise append to the log file. When everything is appended, select from the head of the queue and update the head pointer.

Statistics of what finished downloading before the crash:

1.5GB total size
24721 files
26331 directories
19769 HTML and JavaScript files (99% are HTML)
at least 79198 unique outgoing links via href and src attributes (a very crude approximation using some sed, sort, uniq and wc)

Create a text file listing media references in the scraped website and the filenames one can use to replace them locally

Is your feature request related to a problem? Please describe.
As a complement to #41, it would be useful if, after downloading the webpage, one could also download videos/audio using another program (such as puemos' hls downloader) and then replace them in the local copy using the filenames specified in a text file generated along with the website scrape.

Describe the solution you'd like
Just having a text file called embed.txt (or media.txt) alongside the downloaded HTML file would help a lot; or, if there are multiple HTML files, html-filename_media.txt.
I think, if I'm not underestimating the effort needed, that this could be achieved by fmt.Printf()'ing the src="" of every iframe found, along with a suitable name to replace it with. I do not know about Windows, but on UNIX-compatible systems any filename would work, such as filepath.Base(iframe.src) + ".strm". It seems we can't guess the file type, eh? But I think this could be done with less effort for video and audio tags.
By the way, not so related to this issue, but does goscrape unwrap pages that use an iframe to reference another page as a form of "shielding" against piracy?

Describe alternatives you've considered
Currently I use the WebScrapBook extension for downloading, as I said in #41.

Additional context
Nothing, I think that's pretty much enough.
