dosyago / downloadnet

πŸ’Ύ DownloadNet - All content you browse online available offline. Search through the full-text of all pages in your browser history. ⭐️ Star to support our work!

Home Page: https://localhost:22120

License: Other

JavaScript 91.82% HTML 4.93% Shell 2.28% CSS 0.97%
archive web-archive archiver internet web-browsing disk

downloadnet's Introduction

DownloadNet: An Archive of Your Online Journey


DownloadNet empowers you to be the master archivist of your own internet browsing. As a robust, lightweight tool, DownloadNet seamlessly connects to your browser, saving and organizing your online discoveries in real-time. With an option to archive everything or only bookmark-worthy content, DownloadNet places you in full control of your browsing history. No special plugins or extensions required.

Why DownloadNet?

  • Access: Keep track of your online finds without breaking a sweat.
  • Efficiency: Find your saved content fast, saving you time for more exploration.
  • Flexibility: Share your archive with others or maintain your digital solitude.
  • Simplicity: No frills, no fuss. DownloadNet is straightforward to use, requiring no extra tools or plugins.
  • Organization: Search through everything you've archived with full-text search. Your own personal search engine.

Latest Updates

Local SSL Certificates Now Supported! πŸ”’ πŸŽ‰

Ensure your DownloadNet server runs over TLS with our support for local SSL certificates.

Licensing

DownloadNet is licensed under the AGPL-3.0

Get DownloadNet

Download a release

or ...

Install via npm:

$ npm i -g diskernet@latest

or...

Build your own binaries:

$ git clone https://github.com/dosyago/DownloadNet
$ cd DownloadNet
$ npm i

Then:

$ ./scripts/build_setup.sh
$ ./scripts/compile.sh
$ cd bin/

Contributions!

Welcome! Get involved. :)


Navigate your digital world with DownloadNet. Download and start archiving today!


downloadnet's Issues

[Feature] Selectively track like bookmarks

Is there a way to selectively track where I browse? I'd like something similar to saving bookmarks, e.g. hit a button to save the current page so it becomes searchable later.

Handle SPA URL changes

When history.pushState is used, there is no new top-level fetch request for the URL, but the URL can still change.

This means the URL in your address bar may not be one we have saved, even though we have saved all the resources required to actually click through to that URL from the original URL.

Example:

  • start at CNN
  • Click a few articles, and a SPA updates the URL without reloading the top frame
  • Switch mode to serve and reload the top frame (the page),
  • See "we have not saved this data"

What's a good solution to this that doesn't impact people's browsing experience but gets them the saves they would expect from what their URL looks like?

The problem is really that URLs are no longer about "location", but people still think of them that way. With SPAs and history.pushState, a URL is more a symbolic representation of a location and view than an actual representation of a location.

How to do this?

Not sure right now.
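One hedged possibility (not something the project currently does) is to wrap history.pushState and replaceState in an injected page script so SPA URL changes can be reported back to the archiver. The sketch below takes the history object as a parameter, purely so the idea can be exercised outside a browser; all names are illustrative.

```javascript
// Hypothetical sketch: wrap a History-like object so SPA URL changes
// (pushState / replaceState) trigger a callback the archiver could use
// to record "this URL maps to the current saved view". In the real tool
// this would run as an injected page script against window.history.
function hookHistory(history, onUrlChange) {
  for (const method of ['pushState', 'replaceState']) {
    const original = history[method].bind(history);
    history[method] = (state, title, url) => {
      original(state, title, url);
      if (url != null) onUrlChange(String(url));
    };
  }
}
```

This only covers pushState-style navigation; a full solution would also listen for popstate events.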

Adding a domain blacklist

Hi, just took a look at your project, and would love to try it out, but I've noticed there's no mention of a configurable domain blacklist anywhere. It'd be pretty important for someone who uses this tool for most of their browsing, since many people I'd imagine wouldn't like to save content from certain... NSFW domains. It would be excellent if you could add something like that.
Good luck with the project!
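For illustration only (this feature doesn't exist yet), a blacklist check like the one requested could be a simple host-suffix match applied before a response is archived. The function name and list format here are assumptions, not the project's API.

```javascript
// Hypothetical sketch of a configurable domain blacklist: a URL is
// blocked if its hostname equals a listed domain or is a subdomain
// of one.
function isBlacklisted(url, blacklist) {
  const { hostname } = new URL(url);
  return blacklist.some(domain =>
    hostname === domain || hostname.endsWith('.' + domain));
}
```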

Documentation: how to execute binary?

How is the binary supposed to be executed? I can't see any documentation on it.

➜  file 22120.macos
22120.macos: Mach-O 64-bit executable x86_64
➜  open 22120.macos
No application knows how to open 22120.macos.

add full text search to archives

what indexing and search should I use?

I don't think it has to be too high performance at the start, since most archives are not that big.

Ideally it's easy to use, provides "instant" (or at least suggested) search, and can handle misspellings and languages other than English.

Make selected-save mode

In this mode, not all pages are saved. Only ones where you click on a specific button on the screen.

Helper extension to save a page ONLY when a bookmark is created

Chrome extensions can hook the onCreated event of the bookmarks API.

We could then communicate with the controller via some endpoint on localhost:22120 and instruct it to save resources whose fetches originated from this target (the active tab), and reload that tab to begin saving.

The button could remain in 'record' mode for that tab as long as the tab stays on the same domain, or until it is switched off.

If you change tabs, the record mode indicator may change, because each tab will have its own record mode state.

I don't think it's so simple as saying "only record the page load" because you can have dynamic content (single page app routes that don't create a navigation request, but change the view, and things like 'page preview' tooltips and hovercards that create views on the page from dynamic content that a person may want to explore from the saved copy, not to mention the 'when i scroll down lazily load images' behavior many pages now have).

It's not as simple as recording a single page load. In effect this feature is about defining the 'extent' of a single page, which, considering dynamic content and all the (possibly cross-origin) resources it can pull in, is a somewhat vague and ambiguous concept.

That being said, I think the current idea captures it in the best possible way so far: a record button that starts when you add a bookmark, and keeps going for that tab as long as the domain is the same and until you switch it off.

Actually, it's probably better to make it record until a top-level navigation is triggered. Then there's a closer alignment between the idea of a bookmark and what's saved. A bookmark doesn't stretch to all other pages on the same domain; you need to re-bookmark each one.
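The bookmark-triggered behavior described above can be sketched as a small per-tab state machine. Everything here is illustrative (the function and event names are assumptions, not the project's actual API): recording starts when a bookmark is created in a tab and stops on a top-level navigation or an explicit switch-off.

```javascript
// Hypothetical per-tab record state: a tab records from bookmark
// creation until a top-level navigation (or manual switch-off).
function createRecordState() {
  const recording = new Set(); // tabIds currently in record mode
  return {
    onBookmarkCreated(tabId) { recording.add(tabId); },
    onTopLevelNavigation(tabId) { recording.delete(tabId); },
    switchOff(tabId) { recording.delete(tabId); },
    isRecording(tabId) { return recording.has(tabId); },
  };
}
```

In a real extension, onBookmarkCreated would be driven by chrome.bookmarks.onCreated and the recording flag would decide whether fetches from that tab are forwarded to the archiver.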

This leads me to another idea: we have a blacklist, so perhaps we also need a whitelist.

Save everything from domains that you whitelist. You can add a domain to the whitelist at any time via the control panel @ http://localhost:22120

This was originally inspired by #17

@alber70g please jump in and give feedback if you think this is straying from what you want or if anything sounds weird to you here.

Issues installing over npm

Hey! Really interesting project. Had some issues installing the latest over NPM, but the previous version installed fine.

Running npm i archivist1 shows errors like:

npm ERR! code ENOENT
npm ERR! syscall chmod
npm ERR! path /path/to/folder/node_modules/archivist1/22120.js
npm ERR! errno -2
npm ERR! enoent ENOENT: no such file or directory, chmod '/path/to/folder/node_modules/archivist1/22120.js'
npm ERR! enoent This is related to npm not being able to find a file.
npm ERR! enoent

npm: 6.14.4
Node.js: v14.0.0
macOS

"Normal Browsing" mode

First, thanks a lot for open sourcing this. The direct integration with the browser is brilliant, it makes archiving effortless.

I noticed that pages load slightly slower when the browser is controlled by this app. I might be wrong, as I didn't do any measurement, but I wonder if some setting makes it so? If that's the case, I'd like to propose a "Normal Browsing" mode, i.e. saving is disabled and the app doesn't tweak or interfere with the browser unless save mode is enabled. That would also allow saving only the pages I care about.

xapian inclusion

Use the ideas from the 'sharp' install script to be able to pull either a binary or source and build for the platform we are installing to, so we can use Xapian.

Also, for Windows, Linux, and Mac, let's pre-build and include it in the binary.

Finish extension

Need to handle sessions/tabs independently, rather than attaching to the browser target and capturing everything as we do here, so there's a little more plumbing.

Also, we can't write to disk, so the format will be different and stored in chrome.storage.sync.

path 'crawl' mode

Basically, just point it at a domain (and optionally a path prefix) and it will pull out everything on that domain under that path.

with an option for "include subdomains": if you point it at hotness.com with include subdomains on, it will also follow goodness.hotness.com, etc.
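The scoping rule above can be sketched as a small predicate. This is a hedged illustration, not code from the project; the function name and options object are assumptions.

```javascript
// Hypothetical crawl-scope check: a URL is in scope if its host equals
// the seed domain (or is a subdomain of it, when enabled) and its path
// starts with the optional prefix.
function inCrawlScope(url, seedDomain, { pathPrefix = '/', includeSubdomains = false } = {}) {
  const { hostname, pathname } = new URL(url);
  const hostOk = hostname === seedDomain ||
    (includeSubdomains && hostname.endsWith('.' + seedDomain));
  return hostOk && pathname.startsWith(pathPrefix);
}
```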

append-only chronological archive

Currently, I think re-navigating to a page already in the archive overwrites the archive's copy.

It would be nice to timestamp it instead.

questions:

  • how can versions be organized in the index?
  • how will this make sense and not be confusing?
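One possible answer to the first question, offered purely as an assumption about the index shape (the project's actual format may differ): map each URL to a list of timestamped snapshots, newest last, so re-visiting appends rather than overwrites.

```javascript
// Hypothetical append-only index: each URL keeps a chronological list
// of { ts, resourcePath } snapshots instead of a single entry.
function createVersionedIndex() {
  const index = new Map();
  return {
    save(url, resourcePath, ts = Date.now()) {
      if (!index.has(url)) index.set(url, []);
      index.get(url).push({ ts, resourcePath });
    },
    latest(url) {
      const versions = index.get(url);
      return versions ? versions[versions.length - 1] : undefined;
    },
    versions(url) { return index.get(url) ?? []; },
  };
}
```

Serving the latest snapshot by default while exposing older versions on demand might address the "not confusing" concern.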

acquire all page targets as they appear

Currently, there's an issue with timing: some page targets appear too quickly for us to capture them the first time. They are ours when they reload... but we want them the first time.

how? and why not now?

SPA JavaScript

Are there plans to support apps like outlook.com, Reddit, and Twitter with complex JavaScript that looks at the current url?

Save redirects

Right now we're just discarding redirects. This breaks a lot of sites.

We should save them properly (just don't get a response body for them).

Video support

How much effort would it be to get a site like YouTube to work with this? I noticed the "What about streaming content?" section in your readme, but I'm wondering what it'd take and what the limitations are?

Currently the page loads, but the video does not play.

Performance idea: change hash function

We only hash the resource path, but even so, this can be extremely slow in aggregate because of the hash functions we use. I could switch to a faster but still well-distributed hash (like discohash). This might provide some speedup in archiving.
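The issue mentions discohash, which isn't in Node's standard library; as a stand-in for the idea of a fast non-cryptographic path hash, here is FNV-1a. This is an illustration of the technique, not a claim about what the project uses or should use.

```javascript
// FNV-1a, a fast non-cryptographic 32-bit hash, shown as an example of
// hashing a resource path cheaply. Not collision-resistant, so only
// suitable where adversarial collisions don't matter.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV 32-bit prime
    hash >>>= 0; // keep as unsigned 32-bit
  }
  return hash.toString(16);
}
```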

consider fetch for solving the early stage target interception unreliability issue

This is an issue with the Chrome DevTools Protocol that many are facing (including puppeteer).

But what if:

  1. Fetch is a browser-level domain (not specific to a page) --- I think it is, since I use it without a session here and it works, and
  2. We add a small millisecond delay to all fetch pauses to give the target time to be intercepted before we allow those requests to complete, and therefore before the first load

I think this could work

improve delay efficiency on fetch

the delay is there to let us:

  1. attach to a target before a page really loads, and
  2. determine if we added the injection script before the new document loads (or not and so require a reload to make sure the tab is running with the injection)

It's not possible to avoid all reloads. Some things (persistent caches, service workers, and so on) just get in the way of this; even when passing in a cache directory parameter and turning off cache and service workers, these can still have effects if they are engaged before we attach, which can happen.

Even so, a blanket 500ms delay adds lag. In fact, after the tab is set up, we can switch off the delay for all future fetches from that tab, because we know it is loaded.

How to determine this?

Well, a request (in the paused request data) comes with a frameId. If we store frameIds somehow, or alternatively correlate via a NetworkID (obtained through, say, requestWillBeSent), we can work out which requests come from tabs that are already set up, and let those continue with no delay.

This should decrease lag, without decreasing our interception performance.
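The frameId bookkeeping described above reduces to a tiny policy object. This is a sketch under assumed names (nothing here is the project's actual code): once a frame is known to be attached and set up, its paused requests skip the grace delay.

```javascript
// Hypothetical delay policy keyed by frameId: requests from frames we
// have already set up continue immediately; unknown frames still get
// the interception grace delay.
function createDelayPolicy() {
  const readyFrames = new Set();
  return {
    markReady(frameId) { readyFrames.add(frameId); },
    needsDelay(frameId) { return !readyFrames.has(frameId); },
  };
}
```

In practice, markReady would be called once attachment and script injection for a frame are confirmed, and needsDelay would be consulted in the Fetch.requestPaused handler.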

The delay basically only needs to be long enough (and exist until) we can attach to the page. It just prevents too much from happening in the page before we attach... but even so, as I said, sometimes things get through owing to caches and so on. Caches do not come through the Fetch domain.

Creating fresh cache doesn't seem to be working to avoid cache hits

Cache hits mean we don't save the resource, as no network request occurs.

We could always tap into the cache API domain in devtools, but I think it's simpler just to tell the browser not to use cache.

This manifests as, for example, no favicons on any page in serve mode (because favicons are mostly pulled via 304 responses or the browser cache).

How to solve this?

No idea right now.
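For reference, the "tell the browser not to use cache" approach maps onto the DevTools Protocol's Network.setCacheDisabled command. Since driving a live browser is out of scope here, this sketch just enumerates the CDP calls a session would issue; how the project actually wires its CDP session is not shown in this page.

```javascript
// Enumerate the DevTools Protocol commands needed to disable the
// browser cache for a session: enable the Network domain, then set
// cacheDisabled. A CDP client would send these over its session.
function cacheDisableCommands() {
  return [
    { method: 'Network.enable', params: {} },
    { method: 'Network.setCacheDisabled', params: { cacheDisabled: true } },
  ];
}
```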

Runs, does not save pages, errors in the shell [maintainer: Please rename this bug]

First, my browser is ungoogled-chromium, which might be too far from chrome?

Second, I don't think that this is the same as #55, but it might be(?)

Thirdly:

Browser:

  • Version 84.0.4147.135 (Developer Build) built on Debian buster/10, running on Debian buster/10 (64-bit)

but ungoogled-chromium fakes everything to be identical to 'googled' chromium, so make of that what you will

node.js

  • 15.2.0

nvm list
-> v15.2.0
system
default -> node (-> v15.2.0)
node -> stable (-> v15.2.0) (default)
stable -> 15.2 (-> v15.2.0) (default)
iojs -> N/A (default)
unstable -> N/A (default)
lts/* -> lts/fermium (-> N/A)
lts/argon -> v4.9.1 (-> N/A)
lts/boron -> v6.17.1 (-> N/A)
lts/carbon -> v8.17.0 (-> N/A)
lts/dubnium -> v10.23.0 (-> N/A)
lts/erbium -> v12.19.0 (-> N/A)
lts/fermium -> v14.15.0 (-> N/A)

Machine

  • Linux Isis 4.19.0-4-amd64 #1 SMP Debian 4.19.28-2 (2019-03-15) x86_64 GNU/Linux

Error


> npm start

> [email protected] start /bench/github_local/DOWN_ONLY/22120
> node index.js

Args usage: <server_port> <save|serve> <chrome_port> <library_path>
Running in node...
Importing dependencies...
Attempting to shut running chrome...
There was no running chrome.
Removing 22120's existing temporary browser cache if it exists...
Launching library server...
Library server started.
Waiting 1 second...
{"server_up":{"upAt":"2020-11-11T23:31:03.469Z","port":22120}}
Launching chrome...
(node:6104) UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:9222
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1107:14)
(node:6104) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:6104) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:6104) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
    at saveCache (/bench/github_local/DOWN_ONLY/22120/archivist.js:436:8)
    at Object.changeMode (/bench/github_local/DOWN_ONLY/22120/archivist.js:416:3)
    at app.post (/bench/github_local/DOWN_ONLY/22120/libraryServer.js:59:21)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:137:13)
    at Route.dispatch (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:112:3)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at /bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:281:22
    at Function.process_params (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:335:12)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:275:10)
(node:6104) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
(node:6104) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
    at saveCache (/bench/github_local/DOWN_ONLY/22120/archivist.js:436:8)
    at Object.changeMode (/bench/github_local/DOWN_ONLY/22120/archivist.js:416:3)
    at app.post (/bench/github_local/DOWN_ONLY/22120/libraryServer.js:59:21)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:137:13)
    at Route.dispatch (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/route.js:112:3)
    at Layer.handle [as handle_request] (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/layer.js:95:5)
    at /bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:281:22
    at Function.process_params (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:335:12)
    at next (/bench/github_local/DOWN_ONLY/22120/node_modules/express/lib/router/index.js:275:10)
(node:6104) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 5)

binary release does not work on linux

I tried to download the latest .nix binary release, but when I tried to execute it on my Manjaro Linux box I got the following error:

internal/bootstrap/pre_execution.js:81
    throw 'Invalid Nexe binary';
    ^
Invalid Nexe binary

Am I missing something?

Thanks!

EDIT: add env info.

'self-running' portable archive

In other words: you can take an archive and package it into a binary.

When you run that binary, you can browse all pages saved in that archive.

So it gives you a way to distribute archives together with a way to view them.

Ignore browser cache option?

The issue is that when the browser uses its cache, I think we don't get a Fetch request, so we are not saving those things.

consider a conditional for checking if reload required

See if we can get ideas for how to do this from the puppeteer code for ._requiresReload (or something like that).

They use this in their code, I guess for the same reason I'm considering it: because early-stage target interception is unreliable.

Custom archive directory ?

Hi,

Great project!
Is there any way to customize the archive directory? I would like to use another directory on another disk.

add archive stats feature

You can then see:

  • how many pages you saved
  • how much data it takes
  • average size of page
  • spread between types (html, js, css, media, etc)

and so on

just interesting stuff from your archive
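The stats listed above can be computed in one pass over the saved resources. The record shape ({ type, bytes }) and function name are assumptions for illustration, not the project's data model.

```javascript
// Hypothetical archive stats: count pages, total size, rough average
// page size, and the spread of resource types, from a flat list of
// saved resources shaped like { type, bytes }.
function archiveStats(resources) {
  const byType = {};
  let totalBytes = 0;
  for (const { type, bytes } of resources) {
    byType[type] = (byType[type] ?? 0) + 1;
    totalBytes += bytes;
  }
  const pages = byType.html ?? 0;
  return {
    pages,
    totalBytes,
    // total archive bytes per saved page; a rough proxy for page size
    averagePageBytes: pages ? Math.round(totalBytes / pages) : 0,
    byType,
  };
}
```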

32bit version

Hey, is it maybe possible to make a 32-bit version too? Some archivist computers still run on it (like my own). :)

(node:27437) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined

Tried node 11 and 12, and also tried a clone of the repo, and via npx - but I get this message when I try to switch over to save mode:

(node:27437) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined

Here's the whole enchilada:

❯ npx archivist1
npx: installed 76 in 17.841s
Args usage: <server_port> <save|serve> <chrome_port> <library_path>
Running in node...
Importing dependencies...
Attempting to shut running chrome...
Running chrome shut down.
Waiting 1 second...
Removing 22120's existing temporary browser cache if it exists...
Launching library server...
Library server started.
Waiting 1 second...
{"server_up":{"upAt":"2020-02-09T06:18:06.414Z","port":22120}}
Launching chrome...
(node:27437) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
    at K (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:321:11456)
    at Object.changeMode (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:321:7502)
    at /Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:321:13528
    at s.handle_request (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:128:783)
    at s (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:121:879)
    at p.dispatch (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:121:901)
    at s.handle_request (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:128:783)
    at /Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:114:2530
    at Function.v.process_params (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:114:3433)
    at g (/Users/pierre/.npm/_npx/27437/lib/node_modules/archivist1/22120.js:114:2473)
(node:27437) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)
(node:27437) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:27437) UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:9222
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1128:14)
(node:27437) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 3)
