The cmd from gliderlabs

builtin docker

* Optionally have commands come with their own Docker Engine instance. The instance would be stateless, like the command itself, but it lets you perform Docker operations from your command, such as builds and pushes, and run programs in Docker for the duration of the command.
* Similar implementation to Envy, but stateless and per command
* Implementation notes
    * Needs to start before command
    * Needs to be cleaned up after command
    * Label with user and whatever else for resource reporting
    * Enabled for a command via some per command settings?
    * Which image to use? Something based on Alpine...

tokens

Give access to non-GitHub users for automation purposes using tokens. Tokens can be used via HTTP or SSH (without SSH key).
Interchangeable with usernames (http and ssh)
UUID? fixed format in canonical form, easy to detect token over username
stored in dynamodb. new table
store last used timestamp, and IP
Token management CLI (see above, user level or command level?)
- $ ssh cmd.io cmd-tokens new command1 command2
- $ ssh cmd.io cmd-tokens rm
- $ ssh cmd.io cmd-tokens ls
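
The "easy to detect token over username" idea could be sketched as below, assuming tokens really are UUIDs in canonical lowercase form (the notes leave that as an open question). `isToken` and the sample values are hypothetical names for illustration, not existing Cmd.io code.

```go
package main

import (
	"fmt"
	"regexp"
)

// canonicalUUID matches the 8-4-4-4-12 lowercase hex form,
// e.g. 123e4567-e89b-12d3-a456-426614174000.
var canonicalUUID = regexp.MustCompile(
	`^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$`)

// isToken reports whether the name given over SSH or HTTP has the shape of
// a token rather than a username. Hypothetical helper: it only checks the
// format; the DynamoDB lookup would happen afterwards.
func isToken(name string) bool {
	return canonicalUUID.MatchString(name)
}

func main() {
	for _, name := range []string{"progrium", "123e4567-e89b-12d3-a456-426614174000"} {
		fmt.Printf("%s token=%v\n", name, isToken(name))
	}
}
```

A format check like this keeps the auth path cheap: only names that look like tokens need a lookup in the token table.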

Future improvement: Since tokens are "users", maybe tokens can optionally have SSH keys associated with them.

Estimated: 1d

config rename to env

resolve ambiguity about usage
leave config as alias w/ deprecation notice

Estimated: <1d

[doc] Failure planning

* identify potential failure scenarios
    * External services
        * EC2 / AZ
            * Host disappears
            * Network failure
            * Account limits
        * DynamoDB
            * Connectivity
            * Slow?
        * GitHub (login, SSH keys, pages, groups)
            * Connectivity
            * Unsupported key
            * API limits
            * Missing group
        * Route53 (cmd.io, dune)
            * DNS failure
        * Auth0
        * Honeycomb
        * Sentry
        * Gliderlabs.io (Heroku)
            * Ops notifications
        * S3
            * Connectivity
        * Docker Hub
            * Slow
            * Connectivity
        * Dune
            * No Dockers
            * Hosts gone
        * Convox, ELB etc
    * Uncaught panics
    * Host resources
        * Memory
        * Disk space
    * Bad TF deploy
* testing
    * should we have an automated way of testing failure scenarios?
        * how could they be run, and how often?
            * Unittests ideally
            * https://github.com/Shopify/toxiproxy
                * Injecting responses?
        * identify what happens when (x) service goes down

usage stats

* Google analytics for sessions
    * SSH session == pageview
    * SSH sessions within timeout == session

descriptions on commands

* Argument on cmd-add
* `desc` metacommand to get/set
    * Might deprecate later

aliases

* provides a short name for a long command
* Can be implemented as commands that resolve to another (alias field on command table?)
* Always resolved before others
* Not `install` but `alias`
* Does `rm` work on aliases? I guessssss...
* `alias` again with existing alias overwrites it
* Not shareable
* Most useful for expanding many subcommands
* Probably not used often??
    * Maaaaaybe hold off on this feature

root cmds

* Change from special "root meta" commands to something else ...
    * Make : specific to meta commands, which are always the same
* Shouldn't be an org (cmd) because cmd might need to be an org
* Name of namespace
    * cmd (conflicts with possible org/group)
    * meta
    * system
    * root
    * self
    * --
    * Another symbol prefix (like :)
    * ~ (not as a prefix)
    * cmd- (prefix) [current fav]
        * ssh cmd.io cmd-install
        * ssh cmd.io cmd-rm
        * ssh cmd.io cmd-info
    * account

Groups (aka orgs)

* Groups provide a namespace of shared commands for a set of users. Similar to GitHub orgs for repos
* Groups are non-user command "owners"
* Groups have members and admins in addition to per command admins and access
* Members all have access to commands in group
* Members have group added to their "PATH" so no need to specify group namespace
    * You can still add it to be explicit, esp when you have command(s) with same name
    * ssh cmd.io group-command # implicit
    * ssh cmd.io mygroup/group-command # explicit
* Listing commands in a group (with ~ convenience alias?)
    * ssh cmd.io ~mygroup
    * ssh cmd.io cmd-ls mygroup
* Groups are managed via builtin group command with subcommands
    * ssh cmd.io cmd-group
        * cmd-group create <group>
        * cmd-group destroy <group>
        * cmd-group users ls <group>
        * cmd-group users add <group> <user> [--admin]
            * Change admin by adding again w/ or w/o admin flag
        * cmd-group users rm <group> <user>
* Commands for groups are managed with existing builtin commands
    * ssh cmd.io cmd-add <group>/<name> <source>
    * ssh cmd.io cmd-rm <group>/<name>
* I think I prefer "groups" over "organization" because an actual organization might have several groups, and not all groups are organizations (such as for personal or project organization)...

Dune host rotation

* should be possible without downtime: 
    * Problem:
        * If rotation is from deploy done by TF in Cmd.io, it would kill the execution of TF
    * Autoscale groups
        * Just kill hosts to rotate in new one
        * Requires host registration
            * Docker via libkv+dynamodb
                * Manage our own fork until upstreams converge
            * Registrator+libkv
            * Script/daemon, DNS
                * Needs AWS host credentials, secondary network
                * Daemon, docker always restart (starts on boot)
                * Garbage collection / health checking
                    * Connect to existing
                    * Remove DNS if not responding
                * Regularly makes sure IP is in DNS
                * Generalizable (Beacon)
                    * Host beacon / cluster membership (like Serf)
                    * Backends
                        * Route53
                        * Etcd
                        * ...
            * Registrator with Route53 and Docker Socket as a service
                * No health checks
    * Solution:
        * Just terminate dune hosts manually individually after TF deploy
        * Or use CloudFormation: https://github.com/hashicorp/terraform/issues/1552#issuecomment-190864512
* regularly rotate for updates.
    * Unless all software is pinned
    * Scheduled rotation can wait until Triggers.io
* automated deploy. 
    * PR -> gitscript -> cmd.io -> TF
    * Manual instance rotation

Godocs

* After refactorings

SSH key forwarding

* Just use the agent package!

set USER to cmd.io user

* for shared commands, identifying the user
* commands use root user and probably always will. any case that doesn't is special?
* If there are problems, prefix with CMD_

refactoring pass

split up large functions, create issues for ones that can't yet
split out functionality that should be in own components
add hooks where necessary (auth, session setup)
replace cobra hacks with new cobra hacks

est: 3d

block internal net + aws creds

* block private IPv4 address ranges
    * prevents access to the AWS metadata service (169.254.169.254 is link-local, so block that range too)
    * internal systems
    * other containers
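
The address classification behind such a block could look like the sketch below. Actual enforcement would more likely live in firewall rules on the Dune hosts; this only shows which destinations count as internal. `blocked` is a hypothetical helper, and `net.IP.IsPrivate` needs Go 1.17 or later.

```go
package main

import (
	"fmt"
	"net"
)

// blocked reports whether a destination address should be unreachable from
// a command's container. It covers the RFC 1918 private ranges, link-local
// addresses (which include the EC2 metadata endpoint 169.254.169.254), and
// loopback. Addresses that fail to parse are blocked (fail closed).
func blocked(addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return true
	}
	return ip.IsPrivate() || ip.IsLinkLocalUnicast() || ip.IsLoopback()
}

func main() {
	for _, addr := range []string{"169.254.169.254", "10.0.0.5", "172.17.0.1", "8.8.8.8"} {
		fmt.Printf("%-16s blocked=%v\n", addr, blocked(addr))
	}
}
```

A table of addresses like the one in `main` doubles as a starting point for the tests this issue asks for.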

Test it blocks docker socket.

Est: <1d

spin down, aka image deletion timeout

* after period of no activity, delete image
* maybe just for free accounts

registration integration

* Registration library
    * Read in registration key from env
    * Lookup if registered

issue template

* Version output (release, channel, etc)
    * Bot can read and tag issue
* Actual vs expected (testable?)
* Steps to reproduce
    * Ideally source for command
* Debug output?

Resolve command help issue

* Problem: current usage and user expectation is to get help for commands with --help
* Solutions:
    * Fix usage text and make it more clear
    * Implement a hook for --help
    * Come up with some other standard help for commands

slack notification handling

Module to use for Slack notifications. All endpoints protected.

Est: 1d

runbooks

* steps on how to recover from a failure
* easily accessible
* tested regularly
* Make process for writing and maintaining runbooks
* Live under docs? Wiki?

Limit command run rate

* start with low cmd rate limit, free: 1 per sec?
* per command
* tests

[doc] Cost calculation

* Cost of manifold platform from EC2
* Cost of Cmd resources like ECR and Dynamo storage from others
* Convox billed based on % of mem per service
* Dune billed based on % of container duration per service
* Use tags

dune image management

* GC
    * daemon reaping outdated images on scheduled interval
    * Start by removing untagged: http://jimhoskins.com/2013/07/27/remove-untagged-docker-images.html
    * Also looking at active version in Cmd.io (delete non-active versions)
* "Spindown", further optimization
    * remove image 1 hour after container stops.
    * can be implemented on the daemon for GC

Limit cpu

no action necessary
specified in docker run api call, from limits.go
shares and quota for upper limit via docker
tests: inspect container

Estimated: <1d

Limit image size

free plan: 500 MB to start, specified in limits.go
does not include highly cacheable like official images
enforced after pull
check size, if over, delete and inform user
tests
future optimization: cache size in redis (make issue)
future optimization: use history to check layers for official image layers, subtract their size

Estimation: 1d

Hooks

* Hooks are user-definable implementations of internal extension interfaces
* In what form do users define hooks?
    * Webhooks
    * Commands as hooks
    * RPC/Duplex
    * ...
* auth hook
    * Only makes sense for shared commands
    * Same as sshfront auth hook
        * user and key fingerprint passed as args
        * Exit 0 means pass, exit non-zero means no auth
    * Use cases
        * Implementing groups and ACLs for org commands
        * Prevent commands being run in certain conditions
            * Not just specific to a single command
    * Mostly for authz. First builtin ACLs, then hook
* token hook
    * ???
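
For the "commands as hooks" form, the auth hook contract described above (user and key fingerprint as arguments, exit 0 passes) could be sketched as follows. `runAuthHook` is a hypothetical helper, and the example assumes a Unix-like host where a `true` binary is on the PATH.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runAuthHook runs hook with the user and key fingerprint as arguments.
// Exit status 0 allows the connection; any other exit status denies it.
// An error is returned only when the hook could not be run at all.
func runAuthHook(hook, user, fingerprint string) (bool, error) {
	err := exec.Command(hook, user, fingerprint).Run()
	if err == nil {
		return true, nil
	}
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return false, nil // hook ran and said no
	}
	return false, err // hook missing, not executable, etc.
}

func main() {
	// "true" stands in for a real hook; it ignores its arguments and exits 0.
	ok, err := runAuthHook("true", "alice", "SHA256:example")
	fmt.Println(ok, err)
}
```

Separating "denied" from "hook failed to run" lets the caller decide whether a broken hook should fail closed.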

PATH-like

* For making shared commands (users or orgs) easier by putting them in same namespace
* Account/user level setting similar to shell PATH, maybe different name?
    * resolve
    * order
* Maybe not user configurable yet, just an implementation detail for resolving commands
    * User commands, then shared commands, in alphabetical order of owner name?
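
The proposed resolution order (user commands first, then shared commands by owner name, with explicit `owner/name` skipping the search) could be sketched like this. `resolve` and the map layout are assumptions for illustration; `strings.Cut` needs Go 1.18 or later.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// resolve returns the "owner/name" that name refers to for user.
// commands maps each owner (user or group) to the set of command names
// the user can access.
func resolve(user, name string, commands map[string]map[string]bool) (string, bool) {
	// An explicit owner/name skips the search order entirely.
	if owner, cmd, ok := strings.Cut(name, "/"); ok {
		if commands[owner][cmd] {
			return name, true
		}
		return "", false
	}
	// The user's own commands win.
	if commands[user][name] {
		return user + "/" + name, true
	}
	// Then shared commands, in alphabetical order of owner name.
	owners := make([]string, 0, len(commands))
	for owner := range commands {
		if owner != user {
			owners = append(owners, owner)
		}
	}
	sort.Strings(owners)
	for _, owner := range owners {
		if commands[owner][name] {
			return owner + "/" + name, true
		}
	}
	return "", false
}

func main() {
	commands := map[string]map[string]bool{
		"alice":      {"deploy": true},
		"gliderlabs": {"deploy": true, "build": true},
	}
	fmt.Println(resolve("alice", "deploy", commands))
	fmt.Println(resolve("alice", "build", commands))
	fmt.Println(resolve("alice", "gliderlabs/deploy", commands))
}
```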

status page

* Ping command over http using route53 health checks
    * SNS to gl.io and it will update status
* show live system status
    * https://github.com/twilio/stashboard ?
        * Status repo with GH issues that sync to stashboard
    * latency

slack integration

* Ideally not builtin to Cmd, but a separate service
    * Using comlab we can incubate in Cmd and split out later
* Not all users in slack would be able to run Cmd.io commands via slack
    * They don't have Cmd.io account. Maybe not even Github account.
* Integration form: slash command
* Possible integration scoping
    * 1-1 mapping with SSH
    * Limited to particular owner / group
    * Maybe specific to command
* Slash command webhook URL picks scope
    * cmd.io/slack
    * cmd.io/slack/gliderlabs
    * cmd.io/slack/gliderlabs/deploy
* How to map users, workflow
    * Command that gives you a link to authorize both ends and records mapping
    * /cmd slack-link
        * Knows it's the Slack user, just needs to auth against Cmd.io (Auth0/GitHub)
* If Cmd.io somehow has a token to access the Slack API, that opens up possibilities
    * Put mapping into slack. Signed profile attribute.
    * Store everything in Slack as profile fields!
        * Global state can be in user profile that setup integration
* How to define permissions? Groups?
    * TBD
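
The "webhook URL picks scope" idea above could be sketched as a small path parser. The `scope` type and `parseScope` are hypothetical; only the three URL shapes come from the notes.

```go
package main

import (
	"fmt"
	"strings"
)

// scope describes what a Slack slash command may reach.
type scope struct {
	Owner   string // empty: 1-1 mapping with SSH
	Command string // empty: any command of Owner
}

// parseScope maps a webhook path to a scope:
//   /slack                    -> no restriction
//   /slack/gliderlabs         -> limited to one owner or group
//   /slack/gliderlabs/deploy  -> limited to one command
func parseScope(path string) (scope, bool) {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if parts[0] != "slack" || len(parts) > 3 {
		return scope{}, false
	}
	for _, p := range parts[1:] {
		if p == "" {
			return scope{}, false // reject paths like /slack//deploy
		}
	}
	var s scope
	if len(parts) > 1 {
		s.Owner = parts[1]
	}
	if len(parts) > 2 {
		s.Command = parts[2]
	}
	return s, true
}

func main() {
	for _, p := range []string{"/slack", "/slack/gliderlabs", "/slack/gliderlabs/deploy"} {
		s, ok := parseScope(p)
		fmt.Printf("%-26s owner=%q command=%q ok=%v\n", p, s.Owner, s.Command, ok)
	}
}
```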

alerts/monitoring

* using cloudwatch alerts. sns to gliderlabs.io
    * gl.io sends to whatever, slack, email
* Metrics dashboards: Grafana hosted on now?
* disk utilization
* memory utilization
* cpu utilization
* process flapping
    * Docker start event to metric per service. Alert on X restarts within period

web ui

* What does it do
    * List and manage commands. (think early Heroku)
    * Sign-in w/ Auth0
    * Link to change/cancel plan
    * Web console/terminal
        * Hterm, see termshare
* Implementation
    * No React to start
    * Semantic UI
    * Server-side rendered
    * Auth0
* Part of Cmd on Convox

Nags

"Unregistered software. Support open source: < Link >"

CLI help/usage
HTTP API: headers
Console: bottom nag

est: 1.5d

Limit commands

limit to 10 unique images/sources in free. different for personal, etc plans
tests
limits specified in limits.go struct/map for each plan. "hardcode" for now.
need to know plan for owner/user (migration?)
check during cmd-add

Estimated: <1d

Community Dashboard / Console

* Dev
    * Open Issues, avg time to close
    * Open PRs, PR merge rate?
    * Number of contributors/"authors"
    * Avg lines of code per component
    * Go report card stuff
    * Starting point for newcomers
* Ops
    * Ops metrics
    * Logs?
    * Etc...
* Biz
    * Google Analytics
    * Finances
* Placeholder to split into smaller tasks/projects

fix docker socket

* currently tcp socket is accessible from docker bridge

setup website deployment

hugo/deployment automation
- keep hugo filetree out of project tree (copy content dir from docs)
use www.cmd.io CNAME
redirect unhandled paths from app to github pages
set up CI for deploying pages from master

est: <1d

history

* could be used as an audit log. api?
* Web ui
    * History page updated in realtime 
* how will we capture them?
    * honeycomb
    * docker events
* Where do we store history?
    * DynamoDB
* Use cases
    * Audit log for shared commands (like for a company)
    * Reference, like bash history
        * Only needed across multiple machines
            * Assumes not using other interfaces (HTTP, web terminal, etc)
    * ...

S3-backed image registry

* store pulled/built images here
* Not necessary until we have non-registry commands
* Costs... ECR expensive, S3 cheap
* how will it affect image pull time compared to docker hub?
* Use local registry configured with s3 backend
* Optionally local passthrough proxy registry with s3 backend
* Configure docker daemon with --registry-mirror

Marketing Analytics (Google Analytics)

add google analytics to www and console

est: <1d

Limit memory

implemented within the docker daemon, kills if over
again, limit defined per plan in limits.go
specified in the docker run api call
OOM in docker api error? otherwise, docker events.. ehhh
display to user "Out of memory", close connection
tests

Estimation: <1d

Changelog

* Submit PR to master
* Bot squashes and aggregates commit messages. Checks for format etc
* Can the bot add the PR into changelog.md before the merge?
    * Use a tool like https://github.com/skywinder/github-changelog-generator

instrumentation

* honeycomb
    * identify useful datasets
        * http
        * cli
        * runs
        * misc logs?
    * keep note of useful queries
    * events to capture
        * runs separate from cli event (since could be http too)
        * cli events
        * docker events
        * http events/requests
        * warnings/debug (depends on environment)
* sentry
    * capture panics
    * try to wrap goroutines
* cloudwatch?
    * count stats in memory
        * expose over http? jmx?? prometheus format?
    * Expose and let telegraf pickup
    * store metrics such as
        * running cmds
        * active users
        * cpu usage
    * Cost of metrics???
* manifold (and dune) stats
    * Run Telegraf for shipping to cloudwatch
        * https://github.com/influxdata/telegraf
    * Convox and Dune hosts
    * metrics
        * memory
        * cpu
        * disk
    * Custom metrics
        * Disk utilization just for docker images

account settings

* Place to get/set account level settings for various purposes
* Use cases
    * PATH?
    * Git receive routing?
* Name
    * Cmd-settings?
* Needs more thought / use cases

non-docker authoring

* bash script as source
* possibly use nix pkgs and alpine to build images
    * NixOS/nix image is based on alpine
* arbitrary Dockerfile?
    * Nice to have to avoid docker hub
    * Easily abused?
        * Have a build timeout
        * Delete if too big after
        * Whitelisted base images? Official images?
* Implementation notes
    * "Types" for sources
* Brainstorm possible sources
    * Bash script source
        * Defining dependencies?
    * Nix scripts
        * Might be some work / waiting
    * Repo based alternative?
        * Maybe Dockerfile, maybe something else?
        * Multiple commands in repo, point to directory (like Go)
    * Point to a URL of a single file definition for a command (like Dockerfile, Nix script, etc)
* Use issue to brainstorm more ideas with users

You may have noticed we used GitHub Teams to model you as a group. Not only does Cmd.io use this team via the GitHub API to allow access, but it means you all have access to this repository. This was mostly important so you'd have access to the wiki where we're bootstrapping documentation.

We also realized you'll probably get notifications when we create GitHub issues. If you don't want this, feel free to change your notification settings on GitHub.

However, we'd invite you to stay so you can participate in our planning. We're about to share a Google Doc that we used as a living plan outline, and next week we'll be splitting it up into actionable GitHub issues. It's a big outline, so there will be a lot of issues. We hope you'll keep an eye on them as they come through because we welcome your feedback and questions.

Lastly, I'd like to think after all these years I understand how GitHub notifications work, but you never know. If you got a notification for this issue, via email or otherwise, maybe click through and add a reaction or comment. Thanks!
