dosyago / downloadnet

πŸ’Ύ DownloadNet - All content you browse online available offline. Search through the full-text of all pages in your browser history. ⭐️ Star to support our work!

Home Page: https://localhost:22120

License: Other

JavaScript 91.82% HTML 4.93% Shell 2.28% CSS 0.97%
archive web-archive archiver internet web-browsing disk

downloadnet's Introduction

DownloadNet: An Archive of Your Online Journey


DownloadNet empowers you to be the master archivist of your own internet browsing. As a robust, lightweight tool, DownloadNet seamlessly connects to your browser, saving and organizing your online discoveries in real-time. With an option to archive everything or only bookmark-worthy content, DownloadNet places you in full control of your browsing history. No special plugins or extensions required.

Why DownloadNet?

  • Access: Keep track of your online finds without breaking a sweat.
  • Efficiency: Find your saved content fast, saving you time for more exploration.
  • Flexibility: Share your archive with others or maintain your digital solitude.
  • Simplicity: No frills, no fuss. DownloadNet is straightforward to use, requiring no extra tools or plugins.
  • Organization: Search through everything you've archived with full-text search. Your own personal search engine.

Latest Updates

Local SSL Certificates Now Supported! πŸ”’ πŸŽ‰

Ensure your DownloadNet server runs over TLS with our support for local SSL certificates.

Licensing

DownloadNet is licensed under the AGPL-3.0

Get DownloadNet

Download a release

or ...

Install via npm:

$ npm i -g diskernet@latest

or...

Build your own binaries:

$ git clone https://github.com/dosyago/DownloadNet
$ cd DownloadNet
$ npm i

Then:

$ ./scripts/build_setup.sh
$ ./scripts/compile.sh
$ cd bin/

Contributions!

Welcome! Get involved. :)


Navigate your digital world with DownloadNet. Download and start archiving today!


downloadnet's Issues

[Feature] Selectively track like bookmarks

Is there a way to selectively track where I browse? I'd like something similar to saving bookmarks, e.g. hit a button to save the current page so it becomes searchable later.

Handle SPA URL changes

When history.pushState is used, there is no new top-level fetch request for the URL, but the URL can still change.

This means the URL in your address bar may not be one we have saved, even though we have saved all the resources required to actually click through to that URL from the original URL.

Example:

  • start at CNN
  • Click a few articles, and a SPA updates the URL without reloading the top frame
  • Switch mode to serve and reload the top frame (the page),
  • See "we have not saved this data"

What's a good solution to this that doesn't impact people's browsing experience but gets them the saves they would expect from what their URL looks like?

The problem is really that URLs are no longer about "location", but people still think of them that way. With SPAs and history.pushState, a URL is more a symbolic representation of a location and view than an actual representation of a location.

How to do this?

Not sure right now.
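One hedged possibility (not something the project currently does) is to wrap history.pushState and replaceState in an injected page script so SPA URL changes can be reported back to the archiver. The sketch below takes the history object as a parameter, purely so the idea can be exercised outside a browser; all names are illustrative.

```javascript
// Hypothetical sketch: wrap a History-like object so SPA URL changes
// (pushState / replaceState) trigger a callback the archiver could use
// to record "this URL maps to the current saved view". In the real tool
// this would run as an injected page script against window.history.
function hookHistory(history, onUrlChange) {
  for (const method of ['pushState', 'replaceState']) {
    const original = history[method].bind(history);
    history[method] = (state, title, url) => {
      original(state, title, url);
      if (url != null) onUrlChange(String(url));
    };
  }
}
```

This only covers pushState-style navigation; a full solution would also listen for popstate events.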

Adding a domain blacklist

Hi, just took a look at your project, and would love to try it out, but I've noticed there's no mention of a configurable domain blacklist anywhere. It'd be pretty important for someone who uses this tool for most of their browsing, since many people I'd imagine wouldn't like to save content from certain... NSFW domains. It would be excellent if you could add something like that.
Good luck with the project!
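For illustration only (this feature doesn't exist yet), a blacklist check like the one requested could be a simple host-suffix match applied before a response is archived. The function name and list format here are assumptions, not the project's API.

```javascript
// Hypothetical sketch of a configurable domain blacklist: a URL is
// blocked if its hostname equals a listed domain or is a subdomain
// of one.
function isBlacklisted(url, blacklist) {
  const { hostname } = new URL(url);
  return blacklist.some(domain =>
    hostname === domain || hostname.endsWith('.' + domain));
}
```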

Documentation: how to execute binary?

How is the binary supposed to be executed? I can't see any documentation on it.

➜  file 22120.macos
22120.macos: Mach-O 64-bit executable x86_64
➜  open 22120.macos
No application knows how to open 22120.macos.

add full text search to archives

what indexing and search should I use?

I don't think it has to be too high performance at the start, since most archives are not that big.

Ideally it's easy to use, provides "instant" (or at least suggested) search, and can handle misspellings and languages other than English.

Make selected-save mode

In this mode, not all pages are saved. Only ones where you click on a specific button on the screen.

Helper extension to save a page ONLY when a bookmark is created

Chrome extensions can hook the onCreated event of the bookmarks API.

We could then communicate with the controller via some endpoint on localhost:22120 and instruct it to save resources whose fetches originated from this target (the active tab), and reload that tab to begin saving.

The button could remain in 'record' mode for that tab as long as the tab stays on the same domain, or until it is switched off.

If you change tabs, the record mode indicator may change, because each tab will have its own record mode state.

I don't think it's so simple as saying "only record the page load" because you can have dynamic content (single page app routes that don't create a navigation request, but change the view, and things like 'page preview' tooltips and hovercards that create views on the page from dynamic content that a person may want to explore from the saved copy, not to mention the 'when i scroll down lazily load images' behavior many pages now have).

It's not as simple as recording a single page load. In effect this feature is about defining the 'extent' of a single page, which, considering dynamic content and all the (possibly cross-origin) resources it can pull in, is a somewhat vague and ambiguous concept.

That being said, I think the current idea captures it in the best possible way so far: a record button that starts when you add a bookmark, and keeps going for that tab as long as the domain is the same and until you switch it off.

Actually, it's probably better to make it record until a top-level navigation is triggered. Then there's a closer alignment between the idea of a bookmark and what's saved. A bookmark doesn't stretch to all other pages on the same domain; you need to re-bookmark each one.
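The bookmark-triggered behavior described above can be sketched as a small per-tab state machine. Everything here is illustrative (the function and event names are assumptions, not the project's actual API): recording starts when a bookmark is created in a tab and stops on a top-level navigation or an explicit switch-off.

```javascript
// Hypothetical per-tab record state: a tab records from bookmark
// creation until a top-level navigation (or manual switch-off).
function createRecordState() {
  const recording = new Set(); // tabIds currently in record mode
  return {
    onBookmarkCreated(tabId) { recording.add(tabId); },
    onTopLevelNavigation(tabId) { recording.delete(tabId); },
    switchOff(tabId) { recording.delete(tabId); },
    isRecording(tabId) { return recording.has(tabId); },
  };
}
```

In a real extension, onBookmarkCreated would be driven by chrome.bookmarks.onCreated and the recording flag would decide whether fetches from that tab are forwarded to the archiver.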

This leads me to another idea: we have a blacklist, so perhaps we also need a whitelist.

Save everything from domains that you whitelist. You can add a domain to the whitelist at any time via the control panel @ http://localhost:22120

This was originally inspired by #17

@alber70g please jump in and give feedback if you think this is straying from what you want or if anything sounds weird to you here.

Issues installing over npm

Hey! Really interesting project. Had some issues installing the latest over NPM, but the previous version installed fine.

Running npm i archivist1 shows errors like:

npm ERR! code ENOENT
npm ERR! syscall chmod
npm ERR! path /path/to/folder/node_modules/archivist1/22120.js
npm ERR! errno -2
npm ERR! enoent ENOENT: no such file or directory, chmod '/path/to/folder/node_modules/archivist1/22120.js'
npm ERR! enoent This is related to npm not being able to find a file.
npm ERR! enoent

npm: 6.14.4
Node.js: v14.0.0
macOS

"Normal Browsing" mode

First, thanks a lot for open sourcing this. The direct integration with the browser is brilliant, it makes archiving effortless.

I noticed that pages load slightly slower when the browser is controlled by this app. I might be wrong, as I didn't do any measurement, but I wonder if some setting makes it so? If that's the case, I'd like to propose a "Normal Browsing" mode, i.e. saving is disabled and the app doesn't tweak or interfere with the browser unless save mode is enabled. That would also allow saving only the pages I care about.

xapian inclusion

Use the ideas from the 'sharp' install script to be able to pull either a binary or source and build for the platform we are installing to, so we can use Xapian.

Also, for Windows, Linux, and Mac, let's pre-build and include it in the binary.

Finish extension

Need to handle sessions/tabs independently, rather than attaching to the browser target and capturing everything as we do here, so there's a little more plumbing.

Also, we can't write to disk, so the format will be different and stored in chrome.storage.sync.

path 'crawl' mode

Basically, just point it at a domain (and optionally a path prefix) and it will pull out everything on that domain under that path.

with an option for "include subdomains": if you point it at hotness.com with include subdomains on, it will also follow goodness.hotness.com, etc.
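The scoping rule above can be sketched as a small predicate. This is a hedged illustration, not code from the project; the function name and options object are assumptions.

```javascript
// Hypothetical crawl-scope check: a URL is in scope if its host equals
// the seed domain (or is a subdomain of it, when enabled) and its path
// starts with the optional prefix.
function inCrawlScope(url, seedDomain, { pathPrefix = '/', includeSubdomains = false } = {}) {
  const { hostname, pathname } = new URL(url);
  const hostOk = hostname === seedDomain ||
    (includeSubdomains && hostname.endsWith('.' + seedDomain));
  return hostOk && pathname.startsWith(pathPrefix);
}
```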

append-only chronological archive

Currently, I think re-navigating to a page already in the archive overwrites the archive's copy.

It would be nice to timestamp it instead.

questions:

  • how can versions be organized in the index?
  • how will this make sense and not be confusing?
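One possible answer to the first question, offered purely as an assumption about the index shape (the project's actual format may differ): map each URL to a list of timestamped snapshots, newest last, so re-visiting appends rather than overwrites.

```javascript
// Hypothetical append-only index: each URL keeps a chronological list
// of { ts, resourcePath } snapshots instead of a single entry.
function createVersionedIndex() {
  const index = new Map();
  return {
    save(url, resourcePath, ts = Date.now()) {
      if (!index.has(url)) index.set(url, []);
      index.get(url).push({ ts, resourcePath });
    },
    latest(url) {
      const versions = index.get(url);
      return versions ? versions[versions.length - 1] : undefined;
    },
    versions(url) { return index.get(url) ?? []; },
  };
}
```

Serving the latest snapshot by default while exposing older versions on demand might address the "not confusing" concern.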

acquire all page targets as they appear

Currently, there's an issue with timing: some page targets appear too quickly for us to capture them the first time. They are ours when they reload... but we want them the first time.

how? and why not now?

SPA JavaScript

Are there plans to support apps like outlook.com, Reddit, and Twitter with complex JavaScript that looks at the current url?

Save redirects

Right now we're just discarding redirects. This breaks a lot of sites.

We should save them properly (just don't get a response body for them).

Video support

How much effort would it be to get a site like YouTube to work with this? I noticed the "What about streaming content?" section in your readme, but I'm wondering what it'd take and what the limitations are?

Currently the page loads, but the video does not play.

Performance idea: change hash function

We only hash the resource path, but even so, this can be extremely slow in aggregate because of the hash functions we use. I could switch to a faster but still well-distributed hash (like discohash). This might provide some speedup in archiving.
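The issue mentions discohash, which isn't in Node's standard library; as a stand-in for the idea of a fast non-cryptographic path hash, here is FNV-1a. This is an illustration of the technique, not a claim about what the project uses or should use.

```javascript
// FNV-1a, a fast non-cryptographic 32-bit hash, shown as an example of
// hashing a resource path cheaply. Not collision-resistant, so only
// suitable where adversarial collisions don't matter.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV 32-bit prime
    hash >>>= 0; // keep as unsigned 32-bit
  }
  return hash.toString(16);
}
```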

consider fetch for solving the early stage target interception unreliability issue

This is an issue with the Chrome DevTools Protocol that many are facing (including puppeteer).

But what if:

  1. Fetch is a browser-level domain (not specific to a page) --- I think it is, since I use it without a session here and it works, and
  2. We add a small millisecond delay to all fetch pauses to give the target time to be intercepted before we allow those requests to complete, and therefore before the first load

I think this could work

improve delay efficiency on fetch

the delay is there to let us:

  1. attach to a target before a page really loads, and
  2. determine if we added the injection script before the new document loads (or not and so require a reload to make sure the tab is running with the injection)

It's not possible to avoid all reloads. Some things (persistent caches, service workers, and so on) just get in the way of this; even when passing in a cache directory parameter and turning off cache and service workers, these can still have effects if they are engaged before we attach, which can happen.

Even so, a blanket 500ms delay adds lag. In fact, after the tab is set up, we can switch off the delay for all future fetches from that tab, because we know it is loaded.

How to determine this?

Well, a request (in the paused request data) comes with a frameId. If we store frameIds somehow, or alternatively correlate via a NetworkID (obtained through, say, requestWillBeSent), we can work out which requests come from tabs that are already set up, and let those continue with no delay.

This should decrease lag, without decreasing our interception performance.
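The frameId bookkeeping described above reduces to a tiny policy object. This is a sketch under assumed names (nothing here is the project's actual code): once a frame is known to be attached and set up, its paused requests skip the grace delay.

```javascript
// Hypothetical delay policy keyed by frameId: requests from frames we
// have already set up continue immediately; unknown frames still get
// the interception grace delay.
function createDelayPolicy() {
  const readyFrames = new Set();
  return {
    markReady(frameId) { readyFrames.add(frameId); },
    needsDelay(frameId) { return !readyFrames.has(frameId); },
  };
}
```

In practice, markReady would be called once attachment and script injection for a frame are confirmed, and needsDelay would be consulted in the Fetch.requestPaused handler.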

The delay basically only needs to be long enough (and exist until) we can attach to the page. It just prevents too much from happening in the page before we attach... but even so, as I said, sometimes things get through owing to caches and so on. Caches do not come through the Fetch domain.

Creating fresh cache doesn't seem to be working to avoid cache hits

Cache hits mean we don't save the resource, as no network request occurs.

We could always tap into the cache API domain in devtools, but I think it's simpler just to tell the browser not to use cache.

This manifests as, for example, no favicons on any page in serve mode (because favicons are mostly pulled via 304 responses or the browser cache).

How to solve this?

No idea right now.
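For reference, the "tell the browser not to use cache" approach maps onto the DevTools Protocol's Network.setCacheDisabled command. Since driving a live browser is out of scope here, this sketch just enumerates the CDP calls a session would issue; how the project actually wires its CDP session is not shown in this page.

```javascript
// Enumerate the DevTools Protocol commands needed to disable the
// browser cache for a session: enable the Network domain, then set
// cacheDisabled. A CDP client would send these over its session.
function cacheDisableCommands() {
  return [
    { method: 'Network.enable', params: {} },
    { method: 'Network.setCacheDisabled', params: { cacheDisabled: true } },
  ];
}
```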

Runs, does not save pages, errors in the shell [maintainer: Please rename this bug]

First, my browser is ungoogled-chromium, which might be too far from chrome?

Second, I don't think that this is the same as #55, but it might be(?)

Thirdly:

Browser:

  • Version 84.0.4147.135 (Developer Build) built on Debian buster/10, running on Debian buster/10 (64-bit)

but ungoogled-chromium fakes everything to be identical to 'googled' chromium, so make of that what you will

node.js

  • 15.2.0

nvm list
-> v15.2.0
system
default -> node (-> v15.2.0)
node -> stable (-> v15.2.0) (default)
stable -> 15.2 (-> v15.2.0) (default)
iojs -> N/A (default)
unstable -> N/A (default)
lts/* -> lts/fermium (-> N/A)
lts/argon -> v4.9.1 (-> N/A)
lts/boron -> v6.17.1 (-> N/A)
lts/carbon -> v8.17.0 (-> N/A)
lts/dubnium -> v10.23.0 (-> N/A)
lts/erbium -> v12.19.0 (-> N/A)
lts/fermium -> v14.15.0 (-> N/A)

Machine

  • Linux Isis 4.19.0-4-amd64 #1 SMP Debian 4.19.28-2 (2019-03-15) x86_64 GNU/Linux

Error


> npm start

> [email protected] start /bench/github_local/DOWN_ONLY/22120
> node index.js

Args usage: <server_port> <save|serve> <chrome_port> <library_path>
Running in node...
Importing dependencies...
Attempting to shut running chrome...
There was no running chrome.
Removing 22120's existing temporary browser cache if it exists...
Launching library server...
Library server started.
Waiting 1 second...
{"server_up":{"upAt":"2020-11-11T23:31:03.469Z","port":22120}}
Launching chrome...
(node:6104) UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:9222
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1107:14)
(node:6104) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6104) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:6104) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
    at saveCache (/bench/github_local/DOWN_ONLY/22120/archivist.js:436:8)
    at Object.changeMode (/bench/github_local/DOWN_ONLY/22120/archivist.js:416:3)
    at app.post (/bench/github_local/DOWN_ONLY/22120/libraryServer.js:59:21)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:137:13)
    at Route.dispatch (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:112:3)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at /bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:281:22
    at Function.process_params (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:335:12)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:275:10)
(node:6104) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:6104) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
    at saveCache (/bench/github_local/DOWN_ONLY/22120/archivist.js:436:8)
    at Object.changeMode (/bench/github_local/DOWN_ONLY/22120/archivist.js:416:3)
    at app.post (/bench/github_local/DOWN_ONLY/22120/libraryServer.js:59:21)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:137:13)
    at Route.dispatch (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:112:3)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at /bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:281:22
    at Function.process_params (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:335:12)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:275:10)
(node:6104) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 5)

binary release does not work on linux

I tried to download the latest .nix binary release, but when I tried to execute it on my Manjaro Linux box I got the following error:

internal/bootstrap/pre_execution.js:81
    throw 'Invalid Nexe binary';
    ^
Invalid Nexe binary

Am I missing something?

Thanks!

EDIT: add env info.

'self-running' portable archive

In other words: you can take an archive and package it into a binary.

When you run that binary, you can browse all pages saved in that archive.

So it gives you a way to distribute archives together with a way to view them.

Ignore browser cache option?

The issue is that when the browser uses its cache, I think we don't get a Fetch request, so we are not saving those things.

consider a conditional for checking if reload required

See if we can get ideas for how to do this from the puppeteer code for ._requiresReload (or something like that).

They use this in their code, I guess for the same reason I'm considering it: because early-stage target interception is unreliable.

Custom archive directory ?

Hi,

Great project!
Is there any way to customize the archive directory? I would like to use another directory on another disk.

add archive stats feature

You can then see:

  • how many pages you saved
  • how much data it takes
  • average size of page
  • spread between types (html, js, css, media, etc)

and so on

just interesting stuff from your archive
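The stats listed above can be computed in one pass over the saved resources. The record shape ({ type, bytes }) and function name are assumptions for illustration, not the project's data model.

```javascript
// Hypothetical archive stats: count pages, total size, rough average
// page size, and the spread of resource types, from a flat list of
// saved resources shaped like { type, bytes }.
function archiveStats(resources) {
  const byType = {};
  let totalBytes = 0;
  for (const { type, bytes } of resources) {
    byType[type] = (byType[type] ?? 0) + 1;
    totalBytes += bytes;
  }
  const pages = byType.html ?? 0;
  return {
    pages,
    totalBytes,
    // total archive bytes per saved page; a rough proxy for page size
    averagePageBytes: pages ? Math.round(totalBytes / pages) : 0,
    byType,
  };
}
```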

32bit version

Hey, is it maybe possible to make a 32-bit version too? Some archivist computers still run on it (like my own). :)

(node:27437) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined

Tried node 11 and 12, and also tried a clone of the repo, and via npx - but I get this message when I try to switch over to save mode:

(node:27437) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined

Here's the whole enchilada:

❯ npx archivist1
npx: installed 76 in 17.841s
Args usage: <server_port> <save|serve> <chrome_port> <library_path>
Running in node...
Importing dependencies...
Attempting to shut running chrome...
Running chrome shut down.
Waiting 1 second...
Removing 22120's existing temporary browser cache if it exists...
Launching library server...
Library server started.
Waiting 1 second...
{"server_up":{"upAt":"2020-02-09T06:18:06.414Z","port":22120}}
Launching chrome...
(node:27437) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
    at K (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:321:11456)
    at Object.changeMode (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:321:7502)
    at /Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:321:13528
    at s.handle_request (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:128:783)
    at s (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:121:879)
    at p.dispatch (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:121:901)
    at s.handle_request (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:128:783)
    at /Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:114:2530
    at Function.v.process_params (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:114:3433)
    at g (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:114:2473)
(node:27437) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
(node:27437) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:27437) UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:9222
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1128:14)
(node:27437) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
