axsh / openvdc Goto Github PK

View Code? Open in Web Editor NEW

12.0 14.0 5.0 5.21 MB

Extendable Tiny Datacenter Hypervisor on top of Mesos architecture. Wakame-vdc v2 Project.

License: GNU Lesser General Public License v3.0

Shell 22.44% Go 76.53% PowerShell 0.26% Groovy 0.17% HCL 0.09% Dockerfile 0.50%

mesos-architecture golang iaas cloud-computing virtual-machine container

openvdc's Introduction

openvdc

Extendable Tiny Datacenter Hypervisor on top of Mesos architecture. Wakame-vdc v2 Project.

Build

Ensure $GOPATH is set. $PATH needs to have $GOPATH/bin.

go get -u github.com/axsh/openvdc
cd $GOPATH/src/github.com/axsh/openvdc
go run ./build.go

Build with compile proto/*.proto.

go run ./build.go -with-gogen

openvdc's People

Contributors

Stargazers

Watchers

Forkers

akirapop alexxnica optimuse neo4reo athna

openvdc's Issues

Executor host key file

Problem

The openvdc console command connects via ssh to an executor node which then in turn connects an instance's console. The problem is executor's host key is generated on startup. That means if the executor is restarted, there will be a problem with the client's known_hosts file.

Solution

Keep the executor host key somewhere under /etc/openvdc
If the key is not present, generate it on executor startup.
After generation, the key should be written to its proper path under /etc/openvdc
Key generation should be done in go. We don't want to use Linux commands because we might run executor on Windows in the future.

The pem package can be used to handle private key pem files.

RPM suggestion

Suggestions for the openvdc rpm packages

Package separation

All packages should suggest zookeeper and mesos as an optional dependecy.

openvdc

This is a metapackage that doesn't install anything but depends on all other packages. It's basically a shortcut to install everything in one hots.

Dependecies: openvdc-cli, openvdc-executor, openvdc-scheduler

openvdc-cli

This is the openvdc command. The command itself can be openvdc but the package name is openvdc-cli in order not to get confused with the metapackage.

openvdc-executor

Dependencies: mesos-agent

openvdc-scheduler

This should not run as the root user. A new user openvdc-scheduler should be made to run this process.

Example: https://github.com/axsh/openvnet/blob/develop/deployment/packagebuild/packages.d/vnet/openvnet.spec#L140-L148

Version Vs. Release

In OpenVNet we made mistake where the timestamp+gitcommit in release while it should be in version. Openvdc should fix this.

🚫 openvdc-executor-0.1dev-20161212145756git37689cd.el6.noarch.rpm
✅ openvdc-executor-0.1dev.20161212145756git37689cd-1.el6.noarch.rpm

Version is the version of the software while Release is the version of the package. http://rpm.org/max-rpm-snapshot/s1-rpm-build-creating-spec-file.html

Config file location

Config files should go here:

/etc/systemd/system/openvdcservice.d

Write an actual test for the acceptance test

Problem

In the acceptance-test branch I've set up the acceptance test environment to run in docker. Currently the only test in there checks if the openvdc command was properly installed and in PATH. While it's a beginning, this is hardly an acceptance test.

Solution

This should be the first test case:

Start an LXC container through the openvdc command.
SSH into the executor and check if the container is running.
Terminate the container through the opdnvdc command.
SSH into the executor and check if the container is gone.

Dependencies

Can't be completed unless this is done first:

Yum repo garbage collection

Problem

Currently the CI just pushes yum repositories for every commit and they stay online forever. We need a periodically running jenkins job to clean those up.

Current state

In https://ci.openvdc.org/repos/ there are a lot of folders named something like 20161223124709gitb06bd3.

After #52 is merged that format will change and instead a directory will be made for every branch.

Tighten Docker restrictions for acceptance test

Problem

We are currently using the --privileged flag when running Docker in the acceptance test. This is done to run KVM inside but basically gives the container full root access on the host.

Solution

Use options such as --device and --cap-add to only give the container the exact permissions we need.

Remarks

moby/moby#9976

Access to the Docker API is effectively root access. Even lacking --privileged, there are numerous mechanisms to avoid system policy if one has access to the docker socket or API.

It seems that when a user has access to docker, that user essentially has root access. If we were going to have root access anyway, I figured it's better to make that obvious by using sudo so the next person touching the code will be aware of it.

It could be a good idea to also investigate if there are side-effects to that and if it that was a terrible idea.

Garbage Collection: Docker images

Problem

Unit tests, rpmbuild and acceptance test are all running in Docker containers. The containers themselves are being cleaned up after the test runs but their images remain.

Removing the images along with the containers would not be the right thing to do because Docker uses those as cache. Leaving them all alone would just keep eating up disk space. We have to find the correct middle ground.

Current state

The Jenkins slave where the unit tests and rpmbuild jobs are running currently has a bunch of images lying around.

[vagrant@localhost ~]$ sudo docker images
REPOSITORY                                          TAG                 IMAGE ID            CREATED             SIZE
citest.acceptance-test.el7                          latest              da2a9e39c5c9        6 days ago          682.8 MB
citest.console-service.el7                          latest              da2a9e39c5c9        6 days ago          682.8 MB
unit-tests.citest.acceptance-test.el7               latest              ed39a2d02d4d        6 days ago          890.1 MB
unit-tests.citest.console-service.el7               latest              ed39a2d02d4d        6 days ago          890.1 MB
unit-tests.citest.lxc-networking.el7                latest              ed39a2d02d4d        6 days ago          890.1 MB
unit-tests.citest.master.el7                        latest              ed39a2d02d4d        6 days ago          890.1 MB
unit-tests.citest.protoc-go-generate.el7            latest              ed39a2d02d4d        6 days ago          890.1 MB
unit-tests.citest.remove-old-integration-test.el7   latest              ed39a2d02d4d        6 days ago          890.1 MB

...

Some times @unakatsuo is logging in and deleting them manually.

On the acceptance test slave, we are only just starting to use docker so there's only a few images at the time of writing.

[18:26:04] metallion@phys028  (~) > sudo docker images
REPOSITORY                TAG                                        IMAGE ID            CREATED             SIZE
openvdc/acceptance-test   acceptance-test.20170119064535git276b558   2546af339d82        13 minutes ago      466 MB
docker.io/centos          7                                          67591570dd29        4 weeks ago         191.8 MB

Solution

It's very tempting to just remove all images that are (for example) two weeks old. Some things have to be considered though.

It's possible that containers are still up because the LEAVE_CONTAINER flag was set. Whenever somebody sets that flag, it should be their responsibility to clean up after they're done.
What if the docker rmi command fails because some things are still dependent on an old image?

Maybe we can just try to implement the garbage collection as simple as possible and then see what complications come up? For that I would suggest to just run nightly and remove every image that is over 2 weeks old. Just like in the other GC jobs, that time should be configurable.

In-ordered instance state update

A symptom was observed that the instance launching failed with the log message following:

2017-03-21 15:51:35 [INFO] github.com/axsh/openvdc/hypervisor/lxc/lxc.go:326 Starting lxc-container... hypervisor=lxc instance_id=i-0000000002
2017-03-21 15:51:35 [INFO] github.com/axsh/openvdc/hypervisor/lxc/lxc.go:331 Waiting for lxc-container to become RUNNING hypervisor=lxc instance_id=i-0000000002
2017-03-21 15:51:35 [INFO] openvdc-executor/main.go:139 Instance launched successfully hypervisor=lxc instance_id=i-0000000002 state=STARTING
2017-03-21 15:51:35 [ERROR] openvdc-executor/main.go:104 Failed Instances.UpdateState Error: Invalid next state: QUEUED -> RUNNING hypervisor=lxc instance_id=i-0000000002 state=RUNNING

It reports that unexpected instance state transition from QUEUED to RUNNING.

stderr.txt

openvdc log command

Add Instance.Log API
openvdc log sub-command
Explore how to retrieve slave logs under /var/lib/mesos/slaves

% openvdc log i-xxxxxxx

Naming Resource

Set custom unique name to instance/resource.

Proposed CLI usage:

% openvdc run centos/7/lxc --name=myinstance1
% openvdc show myinstance1
%openvdc rename myinstance1 myinstance2

TODO:

Bring unique index feature to model. Design zNode keys and data structure on Zookeeper.
Update model.proto to add name field. message Instance (message ResourceTemplate?)
gRPC API change
- Enable Instance.Run, Instance.Create APIs to set name.
- Modify Instance.Show to retrieve by name.
- Instance.List should have name if exists.
- Add Instance.Rename API
Add CLI sub-command & options:
- openvdc run --name
- openvdc show <name>
- openvdc rename

openvdc log fails on instances with FAILED state

Problem

When we have an instance with state FAILED.

> ./openvdc show i-0000000005
{
  "ID": "i-0000000005",
  "instance": {
    ...
    "last_state": {
      "state": "FAILED",
      "created_at": "2017-07-05T09:08:41.181030773Z"
    },
    ...
}

We are no longer able to access its log. This makes debugging very difficult.

> ./openvdc log i-0000000005
FATA[0000] Error streaming log                           error="rpc error: code = 2 desc = cl.GetLog: application could not be found"

Cause

When an instance reaches FAILED state its mesos job is cleared. Mesos's log API no longer allows us to access it.

Solution

If there is no way to use the log API, rewrite the openvdc log <instance-id> command to be able to fetch the mesos logs directly from the executor machines.

LXC networking

LXC network suggestions

Don't use lxc.network.link as it is only compatible with Linux bridge and we want to use Open vSwitch too.
Use lxc.network.script.up and lxc.network.script.down instead. The scripts called by these options will call brctl addif or ovs-vsctl add-port respectively.
Put two script pairs in place when installing openvdc.
- linux-bridge-up.sh / linux-bridge-down.sh
- ovs-up.sh / ovs-down.sh
~~The bridge name on every executor will be passed by a command line parameter.~~
- ~~--bridge-type ovs` possible options: [ovs, linux]~~
- ~~--bridge-name br0~~
OpenVNet will need to know what LXC's tap devices are called. Maybe use lxc.network.veth.pair? In any case, it needs to be able to use openvdc to query the tap device name.

Scheduler dies when trying to start instance with invalid state

Problem

The following scenario:

Start an LXC instance and wait until running.
Reboot executor machine.

Now the LXC container is stopped but OpenVDC thinks it's running.

Start the instance

Result:

Feb 20 14:38:24 ci openvdc-scheduler[2806]: 2017-02-20 14:38:24 [FATAL] github.com/axsh/openvdc/api/instance_service.go:86 BUGON: Detected un-handled state instance_id=i-0000000000 state=state:RUNNING created_at:<seconds:1487314564 nanos:237858284 >
Feb 20 14:38:24 ci systemd[1]: openvdc-scheduler.service: main process exited, code=exited, status=1/FAILURE
Feb 20 14:38:24 ci systemd[1]: Unit openvdc-scheduler.service entered failed state.
Feb 20 14:38:24 ci systemd[1]: openvdc-scheduler.service failed.

The openvdc-scheduler service dies.

# systemctl status openvdc-scheduler
● openvdc-scheduler.service - OpenVDC scheduler
   Loaded: loaded (/usr/lib/systemd/system/openvdc-scheduler.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Mon 2017-02-20 14:38:24 JST; 6min ago
  Process: 2806 ExecStart=/opt/axsh/openvdc/bin/openvdc-scheduler (code=exited, status=1/FAILURE)
 Main PID: 2806 (code=exited, status=1/FAILURE)

ci/tests: failed_state_test.go might need to allow additional state

https://ci.openvdc.org/blue/rest/organizations/jenkins/pipelines/citest/branches/master/runs/179/nodes/36/log/

got an error

=== RUN   TestFailedState_RebootInstance
--- FAIL: TestFailedState_RebootInstance (5.21s)
	00_run_cmd.go:115: Unexpected Instance State: i-0000000009 goal=FAILED found=RUNNING

Each scenario in the file waits for FAILED state with transitional states. But the failure is detected since the origin state is not added to WaitInstance(). Possible symptom is:

Once the instance became RUNNING, openvdc reboot is issued.
The command is not reached to the executor yet so the state is kept to RUNNING.
The first attempt from WaitInstance() sees RUNNING but it is not listed as intermidiate states like []string{"REBOOTING"}

func TestFailedState_RebootInstance(t *testing.T) {
	stdout, _ := RunCmdAndReportFail(t, "openvdc", "run", "centos/7/null", `{"crash_stage": "reboot"}`)
	instance_id := strings.TrimSpace(stdout.String())

	WaitInstance(t, 5*time.Minute, instance_id, "RUNNING", []string{"QUEUED", "STARTING"})
	RunCmdAndReportFail(t, "openvdc", "reboot", instance_id)
	WaitInstance(t, 5*time.Minute, instance_id, "FAILED", []string{"REBOOTING"})
}

Instances can't be destroyed before being run

When created, instances get the state "REGISTERED" and this state isn't allowed to be directly set to "TERMINATED". So at the moment, Instances can't be destroyed unless they've been started at least once.

./openvdc register centos/7/lxc
INFO[0000] Found template: centos/7/lxc ID:"r-0000000002" resource:<id:"r-0000000002" template:<template_uri:"https://raw.githubusercontent.com/axsh/openvdc/resource-handler/centos/7/lxc.json" lxc:<lxc_image:<download_url:"https://images.linuxcontainers.org/1.0/images/d767cfe9a0df0b2213e28b39b61e8f79cb9b1e745eeed98c22bc5236f277309a/export" > > > >

./openvdc create r-0000000002
instance_id:"i-0000000002"

./openvdc destroy i-0000000002
FATA[0000] Disconnected abnormally error=rpc error: code = 2 desc = Invalid goal state: REGISTERED -> TERMINATED

Garbage Collection: acceptance test cache

Problem

As the acceptance test runs, it caches machine images for every branch it has run on. These need to be garbage collected.

Current state

The cache is kept in /data2/openvdc-ci/branches/ on the machine where the acceptance test runs. A new directory is created for every branch. Here's an example of the current state.

[18:09:01] metallion@IRON_MAIDEN_RULES  (~) > ls /data2/openvdc-ci/branches/
acceptance-test              console-service  fix-cli-print     fix-PR67        master           multibox-openvdc-install  registry-fix                 resource-naming  show-version       upgrade-epel-release
ci-merge-master-locally-fix  fix-binfile-add  fix-multibox-ssh  lxc-networking  model-timestamp  protoc-go-generate        remove-old-integration-test  rpm_cleanup      teardown-ci-multi  user-config

Solution

We should do something similar to #54

Run garbage collection job every night
Is this branch still on github?
- If no: delete!
- if yes: Has this branch been comitted to in the last 2 weeks? (probably best to make the 2 weeks period configurable)
  - if no: delete!
  - if yes: do nothing.
We could even use the same script if that's the easier implementation. That I'll leave to the programmer in question. ;)

Question

What shall we do for master? That cache should probably be removed and rebuilt some time too. Gonna think about that for a while and suggestions are welcome.

Instance stuck on `STARTING` if no internet access on executor

Problem

When you start an instance but the executor instance does not have access to the internet, the instance will never come up and its database entry will be stuck in STARTING state forever.

Reproduction

Install openvdc-cli on machine A.
Install openvdc-executor-lxc on machine B.
Set it up so machine A and machine B have network access to each other but machine B does not have internet access. For example remove its default gateway from the routing table.
On machine A:

A $ openvdc run centos/7/lxc
INFO[0000] Updating registry cache from https://raw.githubusercontent.com/axsh/openvdc/master/templates
INFO[0002] Found template: centos/7/lxc
i-0000000000

On machine B:

B $ lxc-info -n i-0000000000
i-0000000000 doesn't exist

On machine A:

A $ openvdc show i-0000000000
{
  "iD": "i-0000000000",
  "instance": {
    "id": "i-0000000000",
    "slaveId": "91fb613c-2abf-477a-b41c-41b37e1845b9-S0",
    "lastState": {
      "state": "STARTING",
      "createdAt": "2017-04-17T06:48:54.339487647Z"
    },
    "createdAt": "2017-04-17T06:48:48.379337475Z",
    "template": {
      "templateUri": "https://raw.githubusercontent.com/axsh/openvdc/master/centos/7/lxc.json",
      "lxc": {
        "lxcTemplate": {
          "template": "download",
          "distro": "centos",
          "release": "7"
        }
      },
      "createdAt": "2017-04-17T06:48:48.379337475Z"
    }
  }
}

Instance i-0000000000 is now stuck in this state forever.

Invalid state transition: QUEUED => RUNNING

Problem

When starting an instance it is expected to transition through the following states: QUEUED => STARTING => RUNNING.

However, some times an instance comes up so fast that it transitions to RUNNING before the STARTING state gets registered. That will cause the following error and the instance will be stuck in QUEUED state forever.

2017-03-21 15:51:35 [INFO] openvdc-executor/main.go:132 Starting instance hypervisor=lxc instance_id=i-0000000002 state=STARTING
2017-03-21 15:51:35 [INFO] github.com/axsh/openvdc/hypervisor/lxc/lxc.go:326 Starting lxc-container... hypervisor=lxc instance_id=i-0000000002
2017-03-21 15:51:35 [INFO] github.com/axsh/openvdc/hypervisor/lxc/lxc.go:331 Waiting for lxc-container to become RUNNING hypervisor=lxc instance_id=i-0000000002
2017-03-21 15:51:35 [INFO] openvdc-executor/main.go:139 Instance launched successfully hypervisor=lxc instance_id=i-0000000002 state=STARTING
2017-03-21 15:51:35 [ERROR] openvdc-executor/main.go:104 Failed Instances.UpdateState Error: Invalid next state: QUEUED -> RUNNING hypervisor=lxc instance_id=i-0000000002 state=RUNNING
2017/03/21 15:51:35 Recv loop terminated: err=EOF
2017/03/21 15:51:35 Send loop terminated: err=<nil>

Solution

Because we needed a quick fix for a demo, @b0r6 has allowed transition from QUEUED to RUNNING in this branch: https://github.com/axsh/openvdc/tree/allow-queued-to-running

We used that patch in the demo but it hasn't been merged to master yet. We need to decide if this is an acceptable long-term solution or if more work is required.

LXC executor currently requires virbr0 bridge

Problem

On a standard installation /etc/lxc/default.conf contains the following line.

lxc.network.link = virbr0

The default.conf file is included in every /var/lib/lxc/<container-name>/config file. That means LXC will never work unless the user first create a bridge called virbr0.

Solution

We can not touch the /etc/lxc/default.conf file. Users might have created their own default conf for LXC and we can't just let OpenVDC modify that behind their back. We have to get containers started by OpenVDC to ignore the default configuration.

First figure out if there's an option to ignore default conf in the library we're using.
If not, completely overwrite /var/lib/lxc/<container-name>/config after it gets created with the default conf contents in it.

Our binaries are not stripped

Problem

Our binaries aren't stripped. This means unnecessary symbols and debug information is still in there. We should strip them to save space.

[kemumaki@executor bin]$ file openvdc
openvdc: ELF 64-bit LSB executable .... not stripped
[kemumaki@executor bin]$ file openvdc-executor
openvdc-executor: ELF 64-bit LSB executable .... not stripped

Solution

~~Run the strip command on them before packaging. It's just a small difference but we should still do it.~~

Read the blog @unakatsuo posted below. :)

multi-box CI suggestion

Some discussion was had about how to minimize build time while still using clean environments. Here's a suggestion I thought of.

Destroy and rebuild images nightly (or weekly?).
Store built images in qcow2 format.
Every time the CI is run, make a copy-on-write and start the VMs.
Do not install openvdc in the base images. When the CI runs, update yum repos and install the correct branch.
Every time a test finishes, the copy-on-write image gets deleted, leaving the original clean one in place. We could also include a flag to skip this deletion in case somebody wants to keep the environment for debugging.
After test, clean up all bridges etc. multibox env builds.

Some branches will require the CI to be rebuilt while others don't.

All branches have their own set of images in /data/openvdc-ci/<branch>/
When a branch is run in the CI, build its images if they don't exist yet.
There's a flag on the CI to force rebuild even if the branch's images already exist.

Nightly rebuilds

Nightly deletes and rebuilds master and develop branches.

Nightly Garbage collection

rm -r /data/openvdc-ci/<branch>/
When to remove?
- When not available on github any more
- When the last commit on a branch is older than for example 2 weeks

Rebuild master acceptance test cache periodically

Problem

#96 reuses cache from the master branch to test other branches. This greatly speeds up the CI cycle but a side-effect is that we keep using old images and it's possible that OpenVDC no longer works on the latest version.

Solution

We should rebuild master's cache periodically. I'd say a Jenkins job that builds the master branch with REBUILD set to "true" should be enough. I suggest running this job every weekend. It should keep a copy of the old cache in case things go wrong and we don't immediately have time to fix it.

Scheduler needs configuration file to run on multi-host environment

Problem

It is currently not possible to start scheduler through systemd and have it interact with zookeeper, api, etc. when those are installed on another host.

Also if openvdc-cli isn't installed on the same host, scheduler will not start at all because an EnvironmentFile was mistakenly added to the openvdc-cli package instead.

Solution

Have scheduler accept a configuration file similar to #91.
Remove the EnvironmentFile in favour of the new configuration file.

Acceptance test loopback mount qcow

Problem

Currently the acceptance test environment is making .raw images and then converting them to .qcow. The reasoning behind this is that .raw is easy to loopback mount and .qcow has copy-on-write.

However, if we could work with .qcow from the beginning, there would be no need for the conversion that does take up some time.

Solution

It looks like nowadays loopback mounting .qcow isn't as hard as it used to be.

From http://ask.xmodulo.com/mount-qcow2-disk-image-linux.html:

sudo modprobe nbd max_part=8
sudo qemu-nbd --connect=/dev/nbd0 /path/to/qcow2/image

Example

sudo qemu-nbd --connect=/dev/nbd0 /var/lib/libvirt/images/xenserver.qcow2
sudo fdisk /dev/nbd0 -l # <= check existing partitions
sudo mount /dev/nbd0p1 /mnt

Acceptance test re-use master cache

Problem

There was code in place to use the acceptance test cache for the master branch when building a new branch and the REBUILD variable isn't set.

I temporarily disabled that code because I wrongly assumed that we couldn't set the REBUILD variable manually while the CI is still kicked off automatically

Solution

Re-enable that code.

Write tests with Open vSwitch

Problem

We are supporting Open vSwitch but we do not have any tests for OpenVDC with Open vSwitch yet.

Dependencies

We currently have an Open vSwitch host on the CI but we have no means of making sure it gets used in specific tests yet. This feature is being developed in #159.

Solution

Finish #159 first and then write some tests.

Multibox seed image broken because of hard coded version

Problem

The multibox environment currently won't build because the Centos version is hard coded.

** DOING STEP: Install zookeeper on zookeeper
++ sudo chroot /home/metallion/work/go/src/github.com/axsh/openvdc/ci/acceptance-test/multibox/10.0.100.10-zookeeper/tmp_root /bin/bash -c 'yum install -y mesosphere-zookeeper'
Loaded plugins: fastestmirror
http://ftp.jaist.ac.jp/pub/Linux/CentOS/7.2.1511/os/x86_64/repodata/repomd.xml: [Errno 14] HTTP Error 404 - Not Found

This is a common problem with images generated by buildbook-rhel7.

Solution

There's two ways we could go about this.

Change buildbook to use 7 instead of 7.2.1511 in the yum repository.
Get the seed image from somewhere else.

Document the acceptance test environment

Problem

In the past we've had problems with complicated test environments. For the kind of distributed visualization software we make, it's impossible to avoid these but we should make it as easy as possible for other programmers to pick up where the last guy left.

The OpenVDC acceptance test environment is an example of these with 5 machines that run in a docker container and can run in parallel.

Solution

I need to document this environment in a readme file so people can figure out what's going on even when I'm not around.

Duplicate offers

Doing this too fast:

./openvdc run centos/7/lxc
./openvdc run centos/7/lxc
./openvdc run centos/7/lxc
./openvdc run centos/7/lxc

causes this duplicate offer bug:

INFO[0156] Framework Resource Offers from master &TaskStatus{TaskId:&TaskID{Value:*i-0000000004,XXX_unrecognized:[],},State:*TASK_LOST,Data:nil,Message:*Task launched with invalid offers: Duplicate offer 43e660c2-76d2-4fd2-4fdc-85bc-1dfa7c268d29-S0 at slave(1)@127.0.0.1:5051 (ldc-85bc-1dfa7c268d29-O46 in offer list,SlaveId:&SlaveID{Value:*43e660c2-76d2-4fdc85bc-1dfa7c268d29S0,XXX_unrecognized[],},Timestamp:*1.482140777666846e+09,ExecutorId:nil,Healthy:nil,Source:*SOURCE_MASTER,Reason:*REASON_INVALID_OFFERS,Uuid:nil,Labels:nil,ContainerStatus:nil,XXX_unrecognized:[],}

The scheduler will keep displaying this error over and over again until you manually remove /openvdc in zookeeper.

OVS networking scripts missing

Problem

When using Open vSwitch starting an instance crashes with the following error.

2017-05-23 07:23:49 [ERROR] openvdc-executor/main.go:158 Failed CreateInstance Error: Failed to parse script template: /etc/openvdc/scripts/ovs-up.sh.tmpl: open /etc/openvdc/scripts/ovs-up.sh.tmpl: no such file or directory hypervisor=lxc instance_id=i-0000000007 state=STARTING
ErrorStackTrace:
github.com/axsh/openvdc/hypervisor/lxc.(*LXCHypervisorDriver).renderUpDownScript
        /var/tmp/go/src/github.com/axsh/openvdc/hypervisor/lxc/lxc.go:230
github.com/axsh/openvdc/hypervisor/lxc.(*LXCHypervisorDriver).CreateInstance
        /var/tmp/go/src/github.com/axsh/openvdc/hypervisor/lxc/lxc.go:301

It looks like the /etc/openvdc/scripts directory is missing entirely.

# ls /etc/openvdc/
executor.toml  scheduler.toml  scheduler.toml.rpmnew

Solution

Put the required scripts for Open vSwitch back in place.

Automate Jenkins master and slave building

Problem

Currently the Jenkins master and the slave on which we run unit tests and rpmbuild have been built manually. If they break, we're fucked. We should write scripts to build and rebuild them.

Currently both machines are VirtualBox VMs. I don't think this is ideal because it is not possible to run KVM inside of VirtualBox. This is something we will likely want to do for OpenVDC in the future but as long as VirtualBox is running on bare-metal, it is not possible to run KVM on the same machine. The main advantage of VirtualBox is that it will run on any OS but I don't think that is relevant for Jenkins. We are not likely to run it locally.

I would suggest to rebuild them as either KVM machines or docker containers.

Solution

Have a meeting (or chat) to decide what hypervisor we should run Jenkins on.
Write the scripts to automate building these. For KVM I'd suggest something similar to multibox and for Docker a Dockerfile (of course).
We should probably also store the jenkins configuration on a private github repository that we periodically update so we can pull from there in the build script.

OpenVDC log doesn't work with remote mesos

Problem

The openvdc log command ignores the mesos master IP address set in the openvdc config file.

[kemumaki@executor ~]$ cat .openvdc/config.toml

[api]
endpoint = "10.0.100.12:5000"
[mesos]
address = "10.0.100.11:5050"

[kemumaki@executor ~]$ openvdc log i-0000000001
FATA[0000] Couldn't connect to Mesos master              error="dial tcp 127.0.0.1:5050: getsockopt: connection refused"

Solution

Get the openvdc log command to use the mesos address value set in the config file if it's defined.

Pull request labels

We've got people in our team who like to make PRs after all work is done and we've got people who like to have PR while they're working. Both ways of working are ok but if you have a PR that shouldn't be merged yet, we should add a WiP label to it.

WiP (Work in Progress): This pull request is work in process and shouldn't be merged yet

openvdc command fails if home and tmp directories are on other partitions.

Problem

If the temporary directory and the user's home directory are not on the same partition, openvdc run centos/7/lxc will fail with the following error.

> ./openvdc run centos/7/lxc
INFO[0000] Updating registry cache from https://raw.githubusercontent.com/axsh/openvdc/master/templates
FATA[0003] Invalid path: centos/7/lxc, rename /tmp/gh-images-reg941395916/openvdc-e77ed15f3b2ba582087afa226ace61a6756f65dd/templates /home/metallion/.openvdc/registry/github.com-axsh-openvdc/master: invalid cross-device link

> df
...
tmpfs            8099764        52   8099712   1% /tmp
/dev/sda3      469420496 421441484  24110752  95% /home
...

The reason is because os.Rename doesn't allow moving files between different file systems.

Temporary workaround

You can set a tmp directory on the same partition using the TMPDIR environment variable.

> TMPDIR=$HOME/.openvdc/tmp ./openvdc run centos/7/lxc
INFO[0000] Found template: centos/7/lxc
...

Solution

@unakatsuo told me that the mv command had to deal with the same problem and solved it by not renaming the file but rather make a copy first and then delete the original. We could try a similar thing in our go code.

Instances not cleaned up on failure

Problem

When I ran into #169, I noticed that instances are correctly being set to the FAILED state in OpenVDC but their files on disk are not cleaned up.

$ openvdc show i-0000000000
{
  "ID": "i-0000000000",
  "instance": {
    "id": "i-0000000000",
    "slave_id": "24afc003-a255-4f52-b146-9c8e71041b87-S0",
    "last_state": {
      "state": "FAILED",
      "created_at": "2017-05-23T07:22:43.246760347Z"
    },
...

# lxc-info -n i-0000000000
Name:           i-0000000000
State:          STOPPED

# ls /var/lib/lxc/i-0000000000
config  rootfs
# du -hs /var/lib/lxc/i-0000000000
422M    /var/lib/lxc/i-0000000000

The above files stay on disk forever.

Solution

If an instance fails, clean up its resources in a similar way that openvdc destroy <instance-id> does.

Misleading error message when `openvdc console` fails

Problem

When piping a command to openvdc console and said command fails, the error message reads: Failed to ssh to <ip address>. This would have you believe that the SSH connection failed while in reality that succeeded and it was the command afterwards that failed.

bash-4.2$ echo xxxx | openvdc console i-0000000017
/bin/bash: line 1: xxxx: command not found
FATA[0000] Failed ssh to 172.16.3.10:31195       error="Process exited with status 127"

Solution

Display different error messages for when the SSH connection failed and for when the command afterwards failed.

Provision a Physical Appliance

To ease handling many physical appliances

A template like the following defines a physical appliance (eg. Juniper SSG 5) on a switch appliance like a Liquid Metal.

% cat ./ssg5_config.json
{
  "title": "Juniper SSG 5",
  "template": {
    "type": "physical/appliance",
    "nics": {
      # mapping nics on SSG5 to switch ethers
      "eth0/0": "eth0",
      "eth0/1", "eth1",
      ...(snip)...
      "eth0/7", "eth7"
    }
  }
}

The appliance is registered as a group named "juniper/ssg5".

% openvdc register ./ssg5_config.json --group "juniper/ssg5"

The following command sequence will provision the resource and return the resource to reuse.

% openvdc run juniper/ssg5
abdcefg12345678

% openvdc ssh abdcefg12345678
Could not support a ssh connection.

% openvdc console abdcefg12345678
Could not support a serial console connection.

% openvdc destroy abdcefg12345678

If administrator want to append a same kind of resource, edit the configuration file and register it using a same group name on the switch appliance which is connected the target resource.

Executor has local configuration file.

openvdc-executor is requiring local specific configuration parameters, such as bridge/openvswitch name.

The change could be similar to #64.

Switch the flag library to spf13/cobra
Load /etc/openvdc/executor.conf at startup using spf13/viper. Remove global variables for the old flag and replace to viper.Get***().

axsh / openvdc Goto Github PK

openvdc's Introduction

openvdc

Build

openvdc's People

Contributors

Stargazers

Watchers

Forkers

openvdc's Issues

Problem

Solution

Package separation

Version Vs. Release

Config file location

Problem

Solution

Dependencies

Problem

Current state

Suggested solution

Problem

Solution

Remarks

Problem

Current state

Solution

Problem

Cause

Solution

LXC network suggestions

Problem

Suggested solution

Problem

Current state

Solution

Question

Problem

Reproduction

Suggested solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Dependencies

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Solution

Problem

Temporary workaround

Solution

Problem

Solution

Problem

Solution

To ease handling many physical appliances

Recommend Projects

Recommend Topics

Recommend Org