Plancton: opportunistic computing using Docker containers

Build Status PyPI version

Plancton continuously deploys pilot Docker containers running any application you want based on the amount of available system resources.

Main features

  • Upgrade pilot jobs to pilot containers. Plancton is meant to run "pilot" containers: your container starts and tries to fetch something to do. When the container exits, Plancton replaces it with a brand new one. An example of an application that is easy to containerize is WorkQueue from cctools.

  • Meant for clusters. Pilot applications are containerized and deployed on a cluster of nodes, each one running a Plancton instance. Plancton instances are totally independent, so the system scales naturally.

  • Monitoring. Sends monitoring data to InfluxDB, easy to plot via Grafana.

  • Containers for the masses. Plancton brings the features of Docker containers (environment consistency, isolation, sandboxing) to disposable cluster applications. Plancton is not a replacement for Apache Mesos or Kubernetes, but it is a very simple and lightweight alternative when you don't need all the extra features they offer.
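The "replace exited pilots" logic described above can be sketched as a simple top-up loop. `running_count` and `spawn_container` are hypothetical injected callables standing in for the real Docker API calls, not Plancton's actual internals:

```python
def respawn_loop(running_count, spawn_container, max_docks, iterations):
    """Keep the pool of pilot containers topped up to max_docks.

    running_count and spawn_container are injected stand-ins for the
    Docker API calls; an exited pilot is simply replaced on the next turn.
    """
    for _ in range(iterations):
        if running_count() < max_docks:
            spawn_container()

# Toy demo: containers never exit, so the pool fills up to the cap and stops.
spawned = []
respawn_loop(running_count=lambda: len(spawned),
             spawn_container=lambda: spawned.append("pilot"),
             max_docks=2, iterations=5)
print(len(spawned))  # 2
```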

Instant gratification

A recent Linux operating system and Docker are required.

Install the latest version with pip:

pip install plancton

If you want to install from the master branch (use at your own risk):

pip install git+https://github.com/mconcas/plancton

Plancton can be run as root or as any user with Docker privileges:

planctonctl start

Configure

The configuration file is located at /etc/plancton/config.yaml and can be modified while Plancton is running. By default Plancton starts with an empty configuration and runs dummy busybox containers.

You can get configurations with:

plancton-bootstrap <gh-user/gh-repo:branch>

and they'll be downloaded to the correct place. An example dry run configuration can be obtained with:

plancton-bootstrap <mconcas/plancton-conf:dryrun>
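Assuming the `<gh-user/gh-repo:branch>` spec maps onto GitHub's raw-content URLs (a guess about how plancton-bootstrap resolves it, not confirmed by the source; the `conf/config.yaml` path is also made up), parsing it could look like:

```python
def parse_bootstrap_spec(spec):
    """Split 'user/repo:branch' into its parts; branch defaults to 'master'."""
    repo_part, _, branch = spec.partition(":")
    user, _, repo = repo_part.partition("/")
    return user, repo, branch or "master"

def raw_url(spec, path="conf/config.yaml"):
    # Hypothetical layout: configurations live at a fixed path in the repo.
    user, repo, branch = parse_bootstrap_spec(spec)
    return f"https://raw.githubusercontent.com/{user}/{repo}/{branch}/{path}"

print(raw_url("mconcas/plancton-conf:dryrun"))
# https://raw.githubusercontent.com/mconcas/plancton-conf/dryrun/conf/config.yaml
```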

Credits

Credits for the name go to G.

plancton's Issues

Nameless containers make Plancton crash

For unknown reasons Docker sometimes spawns nameless containers instead of the expected plancton-slave-XXXXX ones. Plancton then crashes when it tries to subscript those NoneType keys.

TypeError: 'NoneType' object is unsubscriptable
2016-09-21 07:23:39 plancton CRITICAL [daemon.start] Terminating abnormally...
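Until the root cause is found, a defensive fix is to skip entries whose Names field is missing before subscripting it. This is a sketch, with container dicts mimicking the Docker API list-containers response shape:

```python
def plancton_slaves(containers):
    """Return only validly named plancton-slave containers.

    Entries with a missing (None) or empty Names field -- the crash
    trigger -- are skipped instead of being subscripted.
    """
    slaves = []
    for cont in containers:
        names = cont.get("Names")
        if not names:  # None or empty: the nameless-container case
            continue
        if names[0].lstrip("/").startswith("plancton-slave-"):
            slaves.append(cont)
    return slaves

sample = [{"Names": ["/plancton-slave-a1b2c"]},
          {"Names": None},               # would raise TypeError before
          {"Names": ["/something-else"]}]
print(len(plancton_slaves(sample)))  # 1
```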

Fix README

This is the first thing people see when they land on the page: it has to be much simpler and reflect our changes.

No cvmfs_cache available with CVMFS device mounted inside containers

Mounting a CVMFS repository via FUSE inside containers does not allow us to benefit from a shared cvmfs_cache: on first mount, CVMFS creates locks to avoid data races.

[root@container_1 /]# mount -t cvmfs alice-ocdb.cern.ch /cvmfs/alice-ocdb.cern.ch/
CernVM-FS: running with credentials 498:497
CernVM-FS: loading Fuse module... done
CernVM-FS: mounted cvmfs on /cvmfs/alice-ocdb.cern.ch/

and everything works fine. A second container, however, cannot mount the same Fully Qualified Repository Name and gets:

[root@container_2 /]# mount -t cvmfs alice.cern.ch /cvmfs/alice.cern.ch/
Repository alice.cern.ch is already mounted on /cvmfs/alice.cern.ch/

For reference, I spotted the lines of code where this check is defined.
#L142

Annoying behaviour when the maximum number of launchable containers equals 1

@dberzano
I found that line 187 produces undesirable behaviour when max_docks is set to 1.

  1. At the beginning of main_loop there are no running containers, so min(max(self._count_containers(), 1), self.conf["max_docks"]) clamps the count of 0 up to 1. This sends the program into a grace-spawn cooldown (it always finds a CPU efficiency greater than 0.00%), which means the execution starts with a delay proportional to the grace_spawn: quota in the config file.
  2. If self.conf["max_docks"] is set to 1, we fall again into the case where containers are spawned and then killed. With a limit higher than one container we would end up with max_docks - 1 containers, since the corresponding overhead fits within the threshold, and that would be fine. With max_docks equal to 1, however, Plancton simply spawns and kills a single container over and over.

For these reasons I will open two PRs to fix this; I will need your opinion, though.
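The clamping expression quoted in point 1 can be reproduced in isolation; `count` stands for self._count_containers() and `max_docks` for self.conf["max_docks"]:

```python
def containers_to_consider(count, max_docks):
    # The expression from the report: never below 1, never above max_docks.
    return min(max(count, 1), max_docks)

# Even with zero containers running the result is 1, never 0, so the
# CPU-efficiency check always sees at least "one container's worth" of
# activity -- the trigger for the spurious grace-spawn cooldown.
print(containers_to_consider(0, 1))   # 1
print(containers_to_consider(0, 4))   # 1
print(containers_to_consider(6, 4))   # 4
```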

Use CVMFS from a container without Parrot

CVMFS through parrot_run has a series of known problems, most of them by design (e.g. somewhat low performance, or the adoption of orphan processes). We should try to mount FUSE filesystems (like CVMFS) directly inside the container.

  • Find a way to do that
  • Change the Plancton run script to modprobe fuse in advance (I guess this is needed)
  • Fully document the solution

Max TTL for containers

Containers should have a maximum time to live, no matter what is going on inside them. Once the TTL has passed: docker kill.
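A sketch of the TTL check, assuming we track each container's start timestamp; `kill` is a hypothetical stand-in for the docker kill call:

```python
import time

def reap_expired(started_at, ttl_seconds, kill, now=None):
    """Kill every container whose age exceeds ttl_seconds.

    started_at maps container id -> start timestamp; kill is the injected
    docker-kill callable. Returns the ids that were reaped.
    """
    now = time.time() if now is None else now
    reaped = [cid for cid, t0 in started_at.items() if now - t0 > ttl_seconds]
    for cid in reaped:
        kill(cid)
    return reaped

killed = []
ages = {"old": 0, "young": 9_000}
print(reap_expired(ages, ttl_seconds=3_600, kill=killed.append, now=10_000))
# ['old']: started at t=0, its age of 10000 s exceeds the 3600 s TTL
```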

Behaviour to assume in case of an exception in talking with InfluxDB database

Question:

Basically we have to decide how to manage the requests.exceptions.ConnectionError that may arise while feeding the database.

Description:

There are two cases in which this may occur:

  1. Plancton attempts to create a database during its self.init() phase. This is done once, at startup time.
  2. Plancton periodically sends data to the configured host:port.

My idea:

Plancton should fail to start if we decide that database reachability is a necessary condition to run (yet another flag for this, or a forced design choice).
Plancton should be tolerant enough if the service becomes unreachable for some time, or forever. If the address changes we might want to be able to reconfigure Plancton to feed the correct new one; this is a second-order enhancement, IMHO.
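The tolerant behaviour could look like the sketch below: connection failures are counted and swallowed instead of killing the daemon. `send` is an injected stand-in for the actual InfluxDB write call (e.g. an HTTP POST):

```python
def feed_database(send, points):
    """Deliver monitoring points, tolerating an unreachable database.

    send is the injected write call; a ConnectionError is counted and
    swallowed -- monitoring is best-effort, the daemon keeps running.
    """
    delivered, dropped = 0, 0
    for point in points:
        try:
            send(point)
            delivered += 1
        except ConnectionError:
            dropped += 1
    return delivered, dropped

def flaky_send(point):
    if point % 2:  # simulate an intermittently unreachable database
        raise ConnectionError("influxdb unreachable")

print(feed_database(flaky_send, [0, 1, 2, 3]))  # (2, 2)
```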

Elasticsearch communication

Plancton must be able to communicate information (its status, its containers' statuses, etc.) to an Elasticsearch database.

Configurable policy for spawning containers by looking at host used resources

We need to have a plan for that. Some policy that includes a configurable, say, "mathematical formula" or "Python expression" that calculates the number of containers to be spawned.

You might periodically save a series of parameters. For instance:

  • disk usage
  • load
  • CPU usage

and also save an "averaged" version of them over a period of time (5 minutes, 30 minutes, 1 hour, 12 hours, 1 day).

Decisions can then be taken based on these values.

Note: this is just a plan, we need a draft to discuss before implementing anything.
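The "Python expression" idea could work like this sketch: the configured formula is evaluated against the sampled (and averaged) parameters, with builtins disabled so the config cannot call arbitrary functions. The parameter names and the formula are made up for illustration:

```python
def containers_from_policy(expression, params):
    """Evaluate a configured policy expression over measured parameters.

    Only the sampled parameters are exposed to the expression; builtins
    are disabled to keep the config from calling arbitrary functions.
    """
    n = eval(expression, {"__builtins__": {}}, dict(params))
    return max(0, int(n))

params = {"cpu_avg_5m": 0.40, "load": 1.2, "ncpus": 8}
policy = "ncpus * (1 - cpu_avg_5m)"  # hypothetical formula from config
print(containers_from_policy(policy, params))  # 4
```

Note that `eval` with stripped builtins is a containment measure, not a security boundary; since the configuration is already trusted to control the host, that trade-off seems acceptable here.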

Export JSON for Grafana dashboards

I will consider the possibility of exporting the dashboard's JSON file to a repository, in order to constitute a common baseline for new installations.

@dberzano, let me know if you agree.

Use remote configuration

Find a way to fetch the configuration from, say, a remote HTTP URL. HTTPS would be better, but bear in mind that the CA must be known. In that case it is better to keep the configuration on GitHub (it is even versioned there) than on one of the University servers (as long as it contains no sensitive data).

Add a force-start mode

Plancton should be able to start and remove the drainfile placeholder, if it is present.
force-start = resume + start

Make it possible to send InfluxDB data to configurable hostname/IP

At the moment Plancton defaults to sending data to localhost, which currently works, but we want a configurable InfluxDB target for installations.
The value in the default configuration dictionary currently points to localhost.
We just have to verify the behaviour when a custom value is set in config.yaml to override the default.

Stream data to an InfluxDB/ES database

We would like to monitor:

  • CPU efficiency on every host
  • Number of running containers on each host and overall
  • Average containers' lifetime
  • Plancton status
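The average containers' lifetime, for instance, could be derived from start/stop timestamp pairs before being streamed (the data shape here is illustrative, not Plancton's actual bookkeeping):

```python
def average_lifetime(events):
    """Mean container lifetime in seconds over (started, finished) pairs."""
    lifetimes = [finished - started for started, finished in events]
    return sum(lifetimes) / len(lifetimes) if lifetimes else 0.0

# Two containers: one lived 120 s, the other 240 s.
print(average_lifetime([(0, 120), (10, 250)]))  # 180.0
```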

Daemonize

Make this run as a daemon; have a look at elastiq, which has a Daemon class ready.

Improve configuration mode

General idea (TBD): overlay a set of files/dirs with Docker volumes. Make it independent from Condor. Make every reference to Condor and/or Parrot disappear from the Plancton codebase!

New installation procedure should be shorter

Refactoring the plancton-bootstrap script will lead to some changes:

  • The Plancton repository won't be cloned from GitHub anymore.
  • Prerequisite checks for Python modules won't be performed by the plancton-bootstrap script.
  • The plancton user and home directory will still be created at bootstrap time.

Code refactoring

Better organise the code structure, define common coding conventions, etc.

Make sure latest version of container is used

If you adopt this naming scheme for containers:

mconcas/slc6-container:v1
mconcas/slc6-container:v2
mconcas/slc6-container:latest

where latest always points to the latest version, you can tell Docker to run mconcas/slc6-container:latest.

With the current implementation, if the image does not exist locally it is silently pulled. However, no check is performed to make sure the intended latest version is actually used.

To do that with the command line you'd do:

docker pull mconcas/slc6-container:latest
docker run mconcas/slc6-container:latest [opts...]

The docker pull part downloads nothing if the latest version is already there, but it at least ensures you are using that one.

The exercise is to implement this via the REST API if possible, like you have done with the rest.
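In the Docker Engine REST API the pull corresponds to POST /images/create with fromImage and tag query parameters. The sketch below only builds that request path; actually issuing it over the Docker Unix socket (and then POSTing to /containers/create) is left out:

```python
from urllib.parse import urlencode

def pull_request_path(image):
    """Build the Engine-API path mirroring `docker pull image:tag`."""
    repo, _, tag = image.partition(":")
    query = urlencode({"fromImage": repo, "tag": tag or "latest"})
    return f"/images/create?{query}"

print(pull_request_path("mconcas/slc6-container:latest"))
# /images/create?fromImage=mconcas%2Fslc6-container&tag=latest
```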
