dat-ecosystem-archive / hypercloud

A hosting server for Dat. [ DEPRECATED - see github.com/beakerbrowser/hashbase for similar functionality. More info on active projects and modules at https://dat-ecosystem.org/ ]

Home Page: https://github.com/beakerbrowser/hashbase

Topics: dat, p2p, server

hypercloud's Introduction

Deprecated: see hashbase for similar functionality.

More info on active projects and modules at dat-ecosystem.org


Hypercloud ☁️

Hypercloud is a public peer service for Dat archives. It provides an HTTP-accessible interface for creating an account and uploading Dats.

Features:

  • Simple Dat uploading and hosting
  • Easy to replicate Dats, Users, or entire datasets between Hypercloud deployments
  • Configurable user management
  • Easy to self-deploy


Setup

Clone this repository, then run

npm install
cp config.defaults.yml config.development.yml

Modify config.development.yml to fit your needs, then start the server with npm start.

Configuration

Before deploying the service, you absolutely must modify the following config.

Basics

dir: ./.hypercloud            # where to store the data
brandname: Hypercloud         # the title of your service
hostname: hypercloud.local    # the hostname of your service
port: 8080                    # the port to run the service on
rateLimiting: true            # rate limit the HTTP requests?
defaultDiskUsageLimit: 100mb  # default maximum disk usage for each user
pm2: false                    # set to true if you're using https://keymetrics.io/

Let's Encrypt

You can enable Let's Encrypt to automatically provision TLS certificates using this config:

letsencrypt:
  debug: false          # debug mode? must be set to 'false' to use live config
  email: '[email protected]'  # email to register domains under

If enabled, port will be ignored and the server will register at ports 80 and 443.

Admin Account

The admin user's credentials are set from the config YAML when the server loads. If you change the password while the server is running and then restart the server, the password will be reset to whatever is in the config.

admin:
  email: '[email protected]'
  password: myverysecretpassword

UI Module

The frontend can be replaced with a custom npm module. The default is hypercloud-ui-vanilla.

ui: hypercloud-ui-vanilla

HTTP Sites

Hypercloud can host the archives as HTTP sites. This has the added benefit of enabling dat-dns shortnames for the archives. There are two possible schemes:

sites: per-user

The per-user scheme hosts archives at username.hostname/archivename, similar to GitHub Pages. If the archive name is the same as the username, the archive is hosted at username.hostname.

Note that, in this scheme, a DNS shortname is only provided for the user archive (username.hostname).

sites: per-archive

The per-archive scheme hosts archives at archivename-username.hostname. If the archive name is the same as the username, the archive is hosted at username.hostname.

By default, HTTP Sites are disabled.
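
As an illustration of the two schemes, here is a hypothetical helper (not part of the codebase) showing where a user alice's archive blog would end up under each setting:

// Illustration only: where an archive is served under each scheme.
// `scheme`, `hostname`, `username`, and `archivename` are hypothetical inputs.
function siteUrl (scheme, hostname, username, archivename) {
  if (archivename === username) {
    // in both schemes, the user archive lives at the bare subdomain
    return `https://${username}.${hostname}/`
  }
  if (scheme === 'per-user') {
    return `https://${username}.${hostname}/${archivename}`
  }
  return `https://${archivename}-${username}.${hostname}/` // per-archive
}

siteUrl('per-user', 'hypercloud.local', 'alice', 'blog')    // https://alice.hypercloud.local/blog
siteUrl('per-archive', 'hypercloud.local', 'alice', 'blog') // https://blog-alice.hypercloud.local/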

Closed Registration

For a private instance, use closed registration with a whitelist of allowed emails:

registration:
  open: false
  allowed:
    - [email protected]
    - [email protected]

Reserved Usernames

Use reserved usernames to blacklist usernames which collide with frontend routes, or which might be used maliciously.

registration:
  reservedNames:
    - admin
    - root
    - support
    - noreply
    - users
    - archives

Session Tokens

Hypercloud uses JSON Web Tokens (JWTs) to manage sessions. You absolutely must replace the secret with a random string before deployment.

sessions:
  algorithm: HS256                # probably don't need to change this
  secret: THIS MUST BE REPLACED!  # put something random here
  expiresIn: 1h                   # how long do sessions live?
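
One easy way to generate a suitable secret is Node's built-in crypto module (any source of ~32 random bytes will do):

node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"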

Jobs

Hypercloud runs some jobs periodically. You can configure how frequently they run.

# processing jobs
jobs:
  popularArchivesIndex: 30s  # compute the index of archives sorted by num peers
  userDiskUsage: 5m          # compute how much disk space each user is using
  deleteDeadArchives: 5m     # delete removed archives from disk
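
Under the hood these are just periodic timers. A minimal sketch of how intervals written as 30s or 5m could be scheduled, assuming the ms package for parsing the duration strings (not necessarily what hypercloud actually uses):

// Sketch only: run a named job on an interval given as a string like '30s' or '5m'.
// Assumes the `ms` package for parsing duration strings.
const ms = require('ms')

function scheduleJob (name, interval, fn) {
  console.log('scheduling %s every %s', name, interval)
  return setInterval(fn, ms(interval))
}

scheduleJob('popularArchivesIndex', '30s', () => { /* recompute the popular-archives index */ })
scheduleJob('userDiskUsage', '5m', () => { /* recompute per-user disk usage */ })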

Emailer

Todo, sorry

Tests

Run the tests with

npm test

To run the tests against an already-running server, set the REMOTE_URL environment variable:

REMOTE_URL=http://{hostname}/ npm test

License

MIT

hypercloud's People

Contributors

joehand, ninabreznik, pfrazee, taravancil


hypercloud's Issues

Default admin user

We need a way to bootstrap server config (when no users are set) and to recover the server if something goes wrong. One solution might be to give admin rights to requests that come from the localhost. Another option is to have a special admin account which gets its credentials set by the config on load.

Production logging

We'll need to decide how much we want to log, and how. I think we'd be smart to log the heck out of all system changes, and even smarter to use a SaaS product that makes the logs easy to watch, filter, and so on. But we might need to support (and abstract over) multiple options, so let's figure out which ones we want to use.

Architecture decisions - initial thoughts

I think we should take a moment to talk out the architecture of the hypercloud system. It's tempting to build a simple core server around leveldb plus sharded archive servers (and maybe that's the right call), but I don't think that will scale well, and I get really nervous about using custom software at scale without a very experienced devops team.

My analysis here involves some guesswork. I may get some things wrong.

We should have these questions in mind:

  • What is our scale target?
  • How will we scale horizontally?
  • How will we monitor nodes, and the network?
  • How will we handle failover and redundancy?
  • When can we use off-the-shelf solutions?

Scale target

This determines almost every other decision.

If we don't use any off-the-shelf solutions (Postgres, Cassandra, Kafka, etc), I think we could handle 1k users with about 10k archives. If we expect that the archives will average ~5mb in size, that's 50GB of data.

But, 1k users is not that far off, and things get challenging pretty quickly. If we plan for it, and use some off-the-shelf software, I think we could reasonably target 10k-100k users / 100k-1mm archives (5TB of data), without having to significantly re-architect hypercloud.

The tradeoff is system complexity. A 1k target can probably be made into an easily-deployed bundle, whereas a 10-100k target requires some ops work. If we want hypercloud to be something that anybody can conveniently deploy ("your own hypercloud is easy!") then 1k is the right target. But, we'll have to rewrite a lot once we get to 1k.

As a quick reference, here's everything hypercloud will require, off the top of my head.

  • Load balancer
  • Caching server
  • HTTP server
  • User DB
  • Archive DB
  • Computed view DB
  • Changes feed
  • Indexer / Data-processing pipeline
  • Monitoring

Archive progress calculation issues

Two issues observed:

  1. Because hypercore will stall a .get() call until some replication occurs, an archive that isn't found can hang the request indefinitely. We need a version of .get() that returns immediately with an error if the archive isn't found (a workaround sketch follows below).
  2. The progress report can vacillate between 0 and 100 and back to 0, no idea why.
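
For issue 1, until there's a non-blocking variant of .get(), one workaround is to race the call against a timeout so the HTTP request fails instead of hanging. A rough sketch (generic wrapper, not existing hypercloud code):

// Sketch only: wrap a callback-style get() so callers receive an error
// instead of waiting forever when the archive never shows up.
function getWithTimeout (getFn, key, timeoutMs, cb) {
  let done = false
  const timer = setTimeout(function () {
    if (done) return
    done = true
    cb(new Error('timed out waiting for archive ' + key))
  }, timeoutMs)

  getFn(key, function (err, result) {
    if (done) return
    done = true
    clearTimeout(timer)
    cb(err, result)
  })
}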

User-management for admins

Admins need a toolset for viewing the users and enacting admin decisions. Will need to:

  1. Run sorting and filter queries on the userbase (e.g. "who recently joined?" and "who is past their quota limits?")
  2. View all user information, including payment history
  3. Suspend and resume accounts
  4. Change the user's plan or set custom quota limits
  5. Add/remove access scopes

Spec the API using Swagger

Swagger is a spec and toolset for specifying Web APIs using YAML. It's also part of the OpenAPI standard. The advantages of using it are:

  • Concise specification language, in yaml (yay)
  • Automatic code generation for both client and server, if you want it
  • Automatic documentation generation (very handy)
  • Automatic test generation

You can see example specs, and the auto-generated documentation (on the right), in their editor app: http://editor.swagger.io/#/

We could host the auto-generated docs as the HTML response to the /v1/ endpoint.
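
If we went this route, serving the generated docs at /v1/ could look roughly like the following. This is a sketch assuming the swagger-ui-express and js-yaml packages and a ./swagger.yaml spec file, none of which exist in the repo yet:

// Sketch only: serve auto-generated Swagger docs as the /v1/ HTML response.
const fs = require('fs')
const yaml = require('js-yaml')
const express = require('express')
const swaggerUi = require('swagger-ui-express')

const app = express()
const spec = yaml.load(fs.readFileSync('./swagger.yaml', 'utf8'))
app.use('/v1', swaggerUi.serve, swaggerUi.setup(spec))
app.listen(8080)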

User roles

Per #13 and others, users will need to have roles assigned which determine their rights to various APIs and datasets.

Need to close() and possibly pool hypercore instances

AFAICT, via random-access-file, hypercore opens a file-descriptor for every instance, and doesn't close the FD until .close() is called. We definitely don't want to leak those.

Any time hypercore-archiver's get() method is called (e.g. in archiver-api's archiveProgress() method), a hypercore instance is created. Thus, another FD is created and not cleaned up.

Two thoughts on that:

  1. We need to be sure to clean up all hypercores by .close()ing them
  2. Should we be pooling hypercore instances instead of creating them with .get() every time? (A rough sketch of a pool follows below.)
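
For reference, a very rough sketch of what a pool could look like. The createFeed function stands in for however the archiver actually constructs a hypercore; it's a hypothetical placeholder, not the real API:

// Sketch only: cache one hypercore instance per key and close them explicitly,
// so repeated lookups don't leak file descriptors. `createFeed` is hypothetical.
class FeedPool {
  constructor (createFeed) {
    this.createFeed = createFeed
    this.feeds = new Map()
  }

  get (key) {
    if (!this.feeds.has(key)) this.feeds.set(key, this.createFeed(key))
    return this.feeds.get(key)
  }

  close (key) {
    const feed = this.feeds.get(key)
    if (!feed) return
    this.feeds.delete(key)
    feed.close() // releases the underlying file descriptors
  }

  closeAll () {
    for (const key of this.feeds.keys()) this.close(key)
  }
}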

User quotas

Need to:

  1. Assign usage limits on diskspace to each user.
  2. Track diskspace used by each user (a naive sketch follows below).
  3. Provide an API for checking the current stats.
  4. Provide an API for querying which users are up against their limit.
  5. Automatically send email alerts to users who are nearing their limit.

I suggest we enforce the quotas using the admin dashboard, rather than automatically disabling accounts (#21)
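
For item 2 above, a naive way to compute a user's usage is to walk their directory and sum file sizes; the real implementation may well track sizes incrementally instead. A sketch:

// Sketch only: total bytes used under a user's data directory.
const fs = require('fs')
const path = require('path')

function dirSize (dir) {
  let total = 0
  for (const name of fs.readdirSync(dir)) {
    const fullPath = path.join(dir, name)
    const stat = fs.statSync(fullPath)
    total += stat.isDirectory() ? dirSize(fullPath) : stat.size
  }
  return total
}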

Update the Register -> Add user flow

In the spec we had register and read user archive as a single flow.

I think it makes sense to keep those as separate events. The register step could happen over HTTP eventually, but the read user archive step will always need to happen on the user's machine (where the dat is).

I made an add user endpoint we could use to read the archive and do all the other fun stuff.

/v1/login should rate limit to 5 POSTS per hour

Rate limit the login attempts by IP to avoid password probe attempts.


Old idea: The user record should track failed attempts, and lock the account after 5 failures. When it locks, it should email the user letting them know.
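
For the IP-based limit, a package like express-rate-limit would probably do the job. A sketch, assuming an Express app object and an existing handleLogin handler (both hypothetical names here):

// Sketch only: 5 login POSTs per IP per hour using the express-rate-limit package.
// `app` is the Express app and `handleLogin` the existing login handler (assumed).
const rateLimit = require('express-rate-limit')

const loginLimiter = rateLimit({
  windowMs: 60 * 60 * 1000, // 1 hour window
  max: 5,                   // 5 attempts per IP per window
  message: 'Too many login attempts, please try again later'
})

app.post('/v1/login', loginLimiter, handleLogin)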

Debugging transfer issues between Beaker and Hypercloud

In my debugging between beaker and a localhost hypercloud, I'm finding the first issue is that hypercloud only requests dats when the connection is first established.

What's the setup? I start Beaker with two dats being hosted from it. I open a fresh install of hypercloud, then POST dat1 to hypercloud. That replicates fine. Then I POST dat2. That does not replicate for an inconsistent number of minutes (2-5).

Why this breaks replication: Dat1 establishes a replication connection between Beaker and Hypercloud. Hypercloud only requests Dats at the beginning of the connection. Therefore it's not until the first connection is dropped, and a new connection is made, that replication of dat2 occurs. See https://github.com/mafintosh/hypercore-archiver/blob/master/index.js#L92-L99

Why isn't a second connection created for the newly-POSTed dat? I could be wrong, but discovery-swarm appears to avoid opening multiple connections between two peers, using the _peersSign hash. See https://github.com/mafintosh/discovery-swarm/blob/master/index.js#L200

Should the first connection take 2-5 minutes to end? I'm unsure. @mafintosh, should it? I can understand why the connection would stay alive for more data to come through. I could also understand if the connection was closed after all work was finished.

What are our options?

  1. Make connections close as soon as all work finishes. This isn't a very good solution. If you happen to be transferring a big archive, the connection won't close quickly.
  2. Have hypercloud ask for newly-added archives during replication. This is what I suggest in this issue. However, that only works for an active-replication strategy. As discussed in this issue, active replication is a bad strategy for interacting with clients. It means that, if a hypercloud has N archives, it will make N requests per connection. That becomes a problem really fast.
  3. Allow multiple connections to occur between peers, simultaneously. AKA, don't multiplex, and stop tracking peersSeen in discovery-swarm. I'll explain the advantage of this below.
  4. A variant of 3. If a peer is discovered by discovery-swarm that's in peersSeen, and it was for a different archive's swarm, emit an event so we can trigger replication. I think this is the winner

Why is option 4 the winner, in my opinion? Connections created by discovery are sort of like HTTP requests that have the GET /path at the head. They enable us to know exactly why the peers connected in the first place. Blindly multiplexing additional requests is pretty much a shot in the dark. It's akin to saying, "hey while we're at it, do you happen to have archive 2?"

Option 3 has the same advantage, but it involves creating more connections. Option 4 avoids that.

Thoughts? @mafintosh @joehand @maxogden

Can we expect connections initiated by hypercloud, to user devices, via the DHT, to succeed?

I'm currently debugging the connectivity between Beaker and Hypercloud, when Hypercloud is on a VPS.

Condition: Hypercloud discovered my device via the DHT, but my device did not discover Hypercloud. Hypercloud attempted to connect to my device, but failed. This is, presumably, because my device is behind a NAT.

Question: Since there's no hole-punching involved there, the connection simply cannot succeed. For DHT discovery, the only option is for my device to initiate the connection. Correct?

Rate limiting

Need to put in a basic IP-based throttle to avoid DoS issues.
