dat-ecosystem-archive / hypercloud

A hosting server for Dat. [ DEPRECATED - see github.com/beakerbrowser/hashbase for similar functionality. More info on active projects and modules at https://dat-ecosystem.org/ ]

Home Page: https://github.com/beakerbrowser/hashbase

Topics: dat, p2p, server

hypercloud's Introduction

Deprecated: see hashbase for similar functionality.

More info on active projects and modules at dat-ecosystem.org


Hypercloud ☁️

Hypercloud is a public peer service for Dat archives. It provides an HTTP-accessible interface for creating an account and uploading Dats.

Features:

  • Simple Dat uploading and hosting
  • Easy to replicate Dats, Users, or entire datasets between Hypercloud deployments
  • Configurable user management
  • Easy to self-deploy


Setup

Clone this repository, then run

npm install
cp config.defaults.yml config.development.yml

Modify config.development.yml to fit your needs, then start the server with npm start.

Configuration

Before deploying the service, you absolutely must modify the following config.

Basics

dir: ./.hypercloud            # where to store the data
brandname: Hypercloud         # the title of your service
hostname: hypercloud.local    # the hostname of your service
port: 8080                    # the port to run the service on
rateLimiting: true            # rate limit the HTTP requests?
defaultDiskUsageLimit: 100mb  # default maximum disk usage for each user
pm2: false                    # set to true if you're using https://keymetrics.io/

Let's Encrypt

You can enable Let's Encrypt to automatically provision TLS certificates using this config:

letsencrypt:
  debug: false          # debug mode? must be set to 'false' to use live config
  email: '[email protected]'  # email to register domains under

If enabled, port will be ignored and the server will register at ports 80 and 443.

Admin Account

The admin user's credentials are set from the config YAML when the server loads. If you change the password while the server is running and then restart the server, the password will be reset to whatever is in the config.

admin:
  email: '[email protected]'
  password: myverysecretpassword

UI Module

The frontend can be replaced with a custom npm module. The default is hypercloud-ui-vanilla.

ui: hypercloud-ui-vanilla

HTTP Sites

Hypercloud can host the archives as HTTP sites. This has the added benefit of enabling dat-dns shortnames for the archives. There are two possible schemes:

sites: per-user

The per-user scheme hosts archives at username.hostname/archivename, similar to GitHub Pages. If the archive name is the same as the username, the archive is hosted at username.hostname.

Note that, in this scheme, a DNS shortname is only provided for the user archive (username.hostname).

sites: per-archive

The per-archive scheme hosts archives at archivename-username.hostname. If the archive name is the same as the username, the archive is hosted at username.hostname.

By default, HTTP Sites are disabled.
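
As an illustration of the two schemes, here is a hypothetical helper (not part of the codebase) showing where a user alice's archive blog would end up under each setting:

// Illustration only: where an archive is served under each scheme.
// `scheme`, `hostname`, `username`, and `archivename` are hypothetical inputs.
function siteUrl (scheme, hostname, username, archivename) {
  if (archivename === username) {
    // in both schemes, the user archive lives at the bare subdomain
    return `https://${username}.${hostname}/`
  }
  if (scheme === 'per-user') {
    return `https://${username}.${hostname}/${archivename}`
  }
  return `https://${archivename}-${username}.${hostname}/` // per-archive
}

siteUrl('per-user', 'hypercloud.local', 'alice', 'blog')    // https://alice.hypercloud.local/blog
siteUrl('per-archive', 'hypercloud.local', 'alice', 'blog') // https://blog-alice.hypercloud.local/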

Closed Registration

For a private instance, use closed registration with a whitelist of allowed emails:

registration:
  open: false
  allowed:
    - [email protected]
    - [email protected]

Reserved Usernames

Use reserved usernames to blacklist usernames which collide with frontend routes, or which might be used maliciously.

registration:
  reservedNames:
    - admin
    - root
    - support
    - noreply
    - users
    - archives

Session Tokens

Hypercloud uses JSON Web Tokens (JWTs) to manage sessions. You absolutely must replace the secret with a random string before deployment.

sessions:
  algorithm: HS256                # probably don't need to change this
  secret: THIS MUST BE REPLACED!  # put something random here
  expiresIn: 1h                   # how long do sessions live?
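
One easy way to generate a suitable secret is Node's built-in crypto module (any source of ~32 random bytes will do):

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"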

Jobs

Hypercloud runs some jobs periodically. You can configure how frequently they run.

# processing jobs
jobs:
  popularArchivesIndex: 30s  # compute the index of archives sorted by num peers
  userDiskUsage: 5m          # compute how much disk space each user is using
  deleteDeadArchives: 5m     # delete removed archives from disk
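
Under the hood these are just periodic timers. A minimal sketch of how intervals written as 30s or 5m could be scheduled, assuming the ms package for parsing the duration strings (not necessarily what hypercloud actually uses):

// Sketch only: run a named job on an interval given as a string like '30s' or '5m'.
// Assumes the `ms` package for parsing duration strings.
const ms = require('ms')

function scheduleJob (name, interval, fn) {
  console.log('scheduling %s every %s', name, interval)
  return setInterval(fn, ms(interval))
}

scheduleJob('popularArchivesIndex', '30s', () => { /* recompute the popular-archives index */ })
scheduleJob('userDiskUsage', '5m', () => { /* recompute per-user disk usage */ })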

Emailer

Todo, sorry

Tests

Run the tests with

npm test

To run the tests against an already-running server, set the REMOTE_URL environment variable:

REMOTE_URL=http://{hostname}/ npm test

License

MIT

hypercloud's People

Contributors

joehand, ninabreznik, pfrazee, taravancil


hypercloud's Issues

Default admin user

We need a way to bootstrap server config (when no users are set) and to recover the server if something goes wrong. One solution might be to give admin rights to requests that come from the localhost. Another option is to have a special admin account which gets its credentials set by the config on load.

Production logging

We'll need to decide how much we want to log, and how. I think we'd be smart to log the heck out of all system changes, and even smarter to use a SaaS product that makes the logs easy to watch, filter, and so on. But we might need to support (and abstract over) multiple options, so let's figure out which ones we want to use.

Architecture decisions - initial thoughts

I think we should take a moment to talk out the architecture of the hypercloud system. It's tempting to build a simple core server around leveldb plus sharded archive servers (and maybe that's the right call), but I don't think that will scale well, and I get really nervous about using custom software at scale without a very experienced devops team.

My analysis here involves some guesswork. I may get some things wrong.

We should have these questions in mind:

  • What is our scale target?
  • How will we scale horizontally?
  • How will we monitor nodes, and the network?
  • How will we handle failover and redundancy?
  • When can we use off-the-shelf solutions?

Scale target

This determines almost every other decision.

If we don't use any off-the-shelf solutions (Postgres, Cassandra, Kafka, etc), I think we could handle 1k users with about 10k archives. If we expect that the archives will average ~5mb in size, that's 50GB of data.

But, 1k users is not that far off, and things get challenging pretty quickly. If we plan for it, and use some off-the-shelf software, I think we could reasonably target 10k-100k users / 100k-1mm archives (5TB of data), without having to significantly re-architect hypercloud.

The tradeoff is system complexity. A 1k target can probably be made into an easily-deployed bundle, whereas a 10-100k target requires some ops work. If we want hypercloud to be something that anybody can conveniently deploy ("your own hypercloud is easy!") then 1k is the right target. But, we'll have to rewrite a lot once we get to 1k.

As a quick reference, here's everything hypercloud will require, off the top of my head.

  • Load balancer
  • Caching server
  • HTTP server
  • User DB
  • Archive DB
  • Computed view DB
  • Changes feed
  • Indexer / Data-processing pipeline
  • Monitoring

Archive progress calculation issues

Two issues observed:

  1. Because hypercore will stall a .get() call until some replication occurs, an archive that isn't found can hang the request indefinitely. We need a version of .get() that returns immediately with an error if the archive isn't found (a workaround sketch follows below).
  2. The progress report can vacillate between 0 and 100 and back to 0, no idea why.
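
For issue 1, until there's a non-blocking variant of .get(), one workaround is to race the call against a timeout so the HTTP request fails instead of hanging. A rough sketch (generic wrapper, not existing hypercloud code):

// Sketch only: wrap a callback-style get() so callers receive an error
// instead of waiting forever when the archive never shows up.
function getWithTimeout (getFn, key, timeoutMs, cb) {
  let done = false
  const timer = setTimeout(function () {
    if (done) return
    done = true
    cb(new Error('timed out waiting for archive ' + key))
  }, timeoutMs)

  getFn(key, function (err, result) {
    if (done) return
    done = true
    clearTimeout(timer)
    cb(err, result)
  })
}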

User-management for admins

Admins need a toolset for viewing the users and enacting admin decisions. Will need to:

  1. Run sorting and filter queries on the userbase (e.g. "who recently joined?" and "who is past their quota limits?")
  2. View all user information, including payment history
  3. Suspend and resume accounts
  4. Change the user's plan or set custom quota limits
  5. Add/remove access scopes

Spec the API using Swagger

Swagger is a spec and toolset for specifying Web APIs using YAML. It's also part of the OpenAPI standard. The advantages of using it are:

  • Concise specification language, in yaml (yay)
  • Automatic code generation for both client and server, if you want it
  • Automatic documentation generation (very handy)
  • Automatic test generation

You can see example specs, and the auto-generated documentation (on the right), in their editor app: http://editor.swagger.io/#/

We could host the auto-generated docs as the HTML response to the /v1/ endpoint.
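
If we went this route, serving the generated docs at /v1/ could look roughly like the following. This is a sketch assuming the swagger-ui-express and js-yaml packages and a ./swagger.yaml spec file, none of which exist in the repo yet:

// Sketch only: serve auto-generated Swagger docs as the /v1/ HTML response.
const fs = require('fs')
const yaml = require('js-yaml')
const express = require('express')
const swaggerUi = require('swagger-ui-express')

const app = express()
const spec = yaml.load(fs.readFileSync('./swagger.yaml', 'utf8'))
app.use('/v1', swaggerUi.serve, swaggerUi.setup(spec))
app.listen(8080)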

User roles

Per #13 and others, users will need to have roles assigned which determine their rights to various APIs and datasets.

Need to close() and possibly pool hypercore instances

AFAICT, via random-access-file, hypercore opens a file-descriptor for every instance, and doesn't close the FD until .close() is called. We definitely don't want to leak those.

Any time hypercore-archiver's get() method is called (e.g. in archiver-api's archiveProgress() method), a hypercore instance is created. Thus, another FD is created and not cleaned up.

Two thoughts on that:

  1. We need to be sure to clean up all hypercores by .close()ing them
  2. Should we be pooling hypercore instances instead of creating them with .get() every time? (A rough sketch of a pool follows below.)
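
For reference, a very rough sketch of what a pool could look like. The createFeed function stands in for however the archiver actually constructs a hypercore; it's a hypothetical placeholder, not the real API:

// Sketch only: cache one hypercore instance per key and close them explicitly,
// so repeated lookups don't leak file descriptors. `createFeed` is hypothetical.
class FeedPool {
  constructor (createFeed) {
    this.createFeed = createFeed
    this.feeds = new Map()
  }

  get (key) {
    if (!this.feeds.has(key)) this.feeds.set(key, this.createFeed(key))
    return this.feeds.get(key)
  }

  close (key) {
    const feed = this.feeds.get(key)
    if (!feed) return
    this.feeds.delete(key)
    feed.close() // releases the underlying file descriptors
  }

  closeAll () {
    for (const key of this.feeds.keys()) this.close(key)
  }
}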

User quotas

Need to:

  1. Assign usage limits on diskspace to each user.
  2. Track diskspace used by each user (a naive sketch follows below).
  3. Provide an API for checking the current stats.
  4. Provide an API for querying which users are up against their limit.
  5. Automatically send email alerts to users who are nearing their limit.

I suggest we enforce the quotas using the admin dashboard, rather than automatically disabling accounts (#21)
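
For item 2 above, a naive way to compute a user's usage is to walk their directory and sum file sizes; the real implementation may well track sizes incrementally instead. A sketch:

// Sketch only: total bytes used under a user's data directory.
const fs = require('fs')
const path = require('path')

function dirSize (dir) {
  let total = 0
  for (const name of fs.readdirSync(dir)) {
    const fullPath = path.join(dir, name)
    const stat = fs.statSync(fullPath)
    total += stat.isDirectory() ? dirSize(fullPath) : stat.size
  }
  return total
}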

Update the Register -> Add user flow

In the spec we had register and read user archive as a single flow.

I think it makes sense to keep those as separate events. The register step could happen over HTTP eventually, but the read user archive step will always need to happen on the user's machine (where the dat is).

I made an add user endpoint we could use to read the archive and do all the other fun stuff.

/v1/login should rate limit to 5 POSTS per hour

Rate limit the login attempts by IP to avoid password probe attempts.


Old idea: The user record should track failed attempts, and lock the account after 5 failures. When it locks, it should email the user letting them know.
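
For the IP-based limit, a package like express-rate-limit would probably do the job. A sketch, assuming an Express app object and an existing handleLogin handler (both hypothetical names here):

// Sketch only: 5 login POSTs per IP per hour using the express-rate-limit package.
// `app` is the Express app and `handleLogin` the existing login handler (assumed).
const rateLimit = require('express-rate-limit')

const loginLimiter = rateLimit({
  windowMs: 60 * 60 * 1000, // 1 hour window
  max: 5,                   // 5 attempts per IP per window
  message: 'Too many login attempts, please try again later'
})

app.post('/v1/login', loginLimiter, handleLogin)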

Debugging transfer issues between Beaker and Hypercloud

In my debugging between beaker and a localhost hypercloud, I'm finding the first issue is that hypercloud only requests dats when the connection is first established.

What's the setup? I start Beaker with two dats being hosted from it. I open a fresh install of hypercloud, then POST dat1 to hypercloud. That replicates fine. Then I POST dat2. That does not replicate for an inconsistent number of minutes (2-5).

Why this breaks replication: Dat1 establishes a replication connection between Beaker and Hypercloud. Hypercloud only requests Dats at the beginning of the connection. Therefore it's not until the first connection is dropped, and a new connection is made, that replication of dat2 occurs. See https://github.com/mafintosh/hypercore-archiver/blob/master/index.js#L92-L99

Why isn't a second connection created for the newly-POSTed dat? I could be wrong, but discovery-swarm appears to avoid opening multiple connections between two peers, using the _peersSign hash. See https://github.com/mafintosh/discovery-swarm/blob/master/index.js#L200

Should the first connection take 2-5 minutes to end? I'm unsure. @mafintosh, should it? I can understand why the connection would stay alive for more data to come through. I could also understand if the connection was closed after all work was finished.

What are our options?

  1. Make connections close as soon as all work finishes. This isn't a very good solution. If you happen to be transferring a big archive, the connection won't close quickly.
  2. Have hypercloud ask for newly-added archives during replication. This is what I suggest in this issue. However, that only works for an active-replication strategy. As discussed in this issue, active replication is a bad strategy for interacting with clients. It means that, if a hypercloud has N archives, it will make N requests per connection. That becomes a problem really fast.
  3. Allow multiple connections to occur between peers, simultaneously. AKA, don't multiplex, and stop tracking peersSeen in discovery-swarm. I'll explain the advantage of this below.
  4. A variant of 3. If a peer is discovered by discovery-swarm that's in peersSeen, and it was for a different archive's swarm, emit an event so we can trigger replication. I think this is the winner

Why is option 4 the winner, in my opinion? Connections created by discovery are sort of like HTTP requests that have the GET /path at the head. They enable us to know exactly why the peers connected in the first place. Blindly multiplexing additional requests is pretty much a shot in the dark. It's akin to saying, "hey while we're at it, do you happen to have archive 2?"

Option 3 has the same advantage, but it involves creating more connections. Option 4 avoids that.

Thoughts? @mafintosh @joehand @maxogden

Can we expect connections initiated by hypercloud, to user devices, via the DHT, to succeed?

I'm currently debugging the connectivity between Beaker and Hypercloud, when Hypercloud is on a VPS.

Condition: Hypercloud discovered my device via the DHT, but my device did not discover Hypercloud. Hypercloud attempted to connect to my device, but failed. This is, presumably, because my device is behind a NAT.

Question: Since there's no hole-punching involved there, the connection simply cannot succeed. For DHT discovery, the only option is for my device to initiate the connection. Correct?

Rate limiting

Need to put in a basic IP-based throttle to avoid DoS issues.
