portworx / torpedo
A test suite to qualify storage providers for stateful containers running in a cluster.

License: Apache License 2.0


torpedo's Introduction

Torpedo


Torpedo is a test suite to qualify storage providers for stateful containers running in a distributed environment. It tests various scenarios that applications encounter when running in Linux containers and deployed via schedulers such as Kubernetes, Marathon or Swarm.


CSI

CSI (the Container Storage Interface) is a specification for storage interfaces for Linux containers. It defines the control plane interaction between a cloud native scheduler, such as Kubernetes, and a cloud native storage provider. The specification is available here.

The Torpedo test suite natively supports the CSI specification for providing external volumes to Kubernetes and Mesosphere. It operates as a CSI-enabled orchestrator (scheduler) that communicates with external storage providers supporting CSI.

Torpedo tests cover the various scheduler-storage integration points being addressed by the CSI specification (https://docs.google.com/document/d/1JMNVNP-ZHz8cGlnqckOnpJmHF-DNY7IYP-Di7iuVhQI/edit#) and verify how external volume providers like Portworx support production-level operational scenarios involving storage, server, software or network failures.

Legacy support

Since CSI is still a work in progress, most schedulers provide external volume support to Mesosphere or Kubernetes via DVDI or the Kubernetes native driver interface.

The Docker volume driver interface (DVDI) provides the control path operations to create, mount, unmount and eventually delete an external volume; it is documented here.

In order to support legacy storage drivers, Torpedo can also work with schedulers that still use the Docker volume driver interface.
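
For orientation, these control path operations map onto a small surface. The sketch below is illustrative only, assuming hypothetical Go names rather than Docker's or Torpedo's actual API:

```go
// An illustrative sketch of DVDI-style control path operations;
// these names are hypothetical, not Docker's or Torpedo's actual API.
package volume

// Driver captures the control path a DVDI-compatible plugin must support.
type Driver interface {
	// Create provisions a new external volume with the given options.
	Create(name string, opts map[string]string) error
	// Mount attaches and mounts the volume at path on the current node.
	Mount(name, path string) error
	// Unmount unmounts and detaches the volume from the current node.
	Unmount(name string) error
	// Remove eventually deletes the external volume.
	Remove(name string) error
}
```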

Scenarios to Consider when Deploying Stateful Applications

Deploying ephemeral applications requires less consideration than deploying stateful applications. When running stateful applications in production, administrators should take into account the various runtime scenarios that may occur and choose an external storage provider capable of dealing with them. Examples of these scenarios are:

Runtime software hangs and crashes

  • Container runtime engine (or scheduler) software failure, hang or crash: When a daemon like Docker crashes, it can induce errors with an application's connectivity to the external storage. This problem is compounded when the storage provider itself runs as a container. In general, you need to assume that user space code will either hang or crash, and the storage system needs to deal with this gracefully, without data loss, unavailability or corruption.
  • External storage driver software failure, hang or crash: When the storage software itself crashes, the overall solution needs to make sure that there are no lost IOs (data loss), unavailability or corruption.

Network and host issues

  • Network disconnect from the storage provider to the external environment: If a node on which the storage volume driver is running were to become disconnected from the network, the overall solution needs to make sure that the volume can be used on another node, and that there is no data loss or corruption.
  • A node running a stateful application becomes permanently (or for a prolonged period of time) unreachable: In many cases, a node can become permanently unusable. In environments such as AWS, when an EBS volume is attached to such a node, the overall solution needs to make sure that the volume or its data can still be used on some other node in the cluster.
  • A network partition in the cluster: When the scheduler cluster or the storage cluster gets partitioned in such a way that quorum is lost, the nodes that are still part of the quorum need to be able to use all of the data that was in the original cluster. Otherwise, this would lead to data unavailability.

Scheduler software issues

  • Scheduler software attempts to deploy a stateful container on a node that is not part of the storage cluster: It is possible that the storage cluster and the scheduler cluster do not comprise the same machines. The overall solution must either prevent this or ensure that when a stateful application is deployed on a non-storage node, the application's storage requirements are still fulfilled. Some approaches to handle this include the use of scheduler constraints and labels.
  • Scheduler software attempts to bring up a new container/pod/task to use a storage volume prior to properly terminating the previous container/pod/task on a different host: Scheduler software, perhaps due to bugs or timing issues, may launch a new application stack on a new set of nodes that refer to a volume currently in use by an application stack being torn down. The overall solution must be capable of dealing with these transition scenarios, without application data loss or corruption.

Test Cases Run by Torpedo

| Test/Scenario | Acceptance vs. Runtime | Expected Result |
| --- | --- | --- |
| Create dynamic volumes | Runtime | Expected to be able to create a volume with arbitrary parameters at runtime |
| Verify that the volume driver can deal with an uneven number of mounts and unmounts and allow the volume to get mounted on another node | Runtime | Expected to pass |
| Volume driver plugin is down or unavailable; the client container should not be impacted | Acceptance | Client container does not get an IO error |
| Volume driver plugin is down and the client container gets terminated. There is a lost unmount call in this case, but the container should be able to come up on another system and use the volume | Acceptance | Expected to pass |
| A container is using a volume on node X. Node X is now powered off | Acceptance | The system must be able to create a new container on node Y and use the same volume via pod replace |
| Storage plugin is down. Scheduler tries to create a container using the provider's volume | Acceptance | This should fail: the container should not start and the scheduler should receive an error |
| A container is running on node X. Node X loses network access and is partitioned away. Node Y, still in the cluster, can use the volume for another container | Acceptance | When node X rejoins the network and hence the cluster, the application still running there is expected to get I/O errors, since the block volume is now attached on another node |
| A container is running on node X. Node X can only see a subset of the storage cluster: it sees the entire DC/OS cluster, but the storage cluster itself suffers a network partition. Node Y, still in the cluster, can use the volume for another container | Acceptance | When node X rejoins the storage network and hence the cluster, the application still running there is expected to get I/O errors, since the block volume is now attached on another node |
| Docker daemon crashes and live restore is disabled | Acceptance | The agent detects that the task has died, brings it up on another node, and the task can re-use the volume |
| Docker daemon crashes and live restore is enabled. This scenario should be a no-op; the container does not crash | Acceptance | Expected to pass |

Qualified External Storage Providers

To submit an external storage provider, please submit a PR with the output of the Torpedo test program and the specifics of the environment used.

| Provider | Information | Test Coverage | Status |
| --- | --- | --- | --- |

Usage

Build

See How to build.

Run

See How to run.

Contributing

The specification and code are licensed under the Apache 2.0 license found in the LICENSE file of this repository.

See the Style Guide.

Sign your work

The sign-off is a simple line at the end of the explanation for the patch, which certifies that you wrote it or otherwise have the right to pass it on as an open-source patch. The rules are pretty simple: if you can certify the below (from developercertificate.org):

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

then you just add a line to every git commit message:

Signed-off-by: Joe Smith <[email protected]>

using your real name (sorry, no pseudonyms or anonymous contributions).

You can add the sign-off when creating the git commit via git commit -s.

torpedo's People

Contributors

adityadani, aghodke1312, ak-px, alicelyy, apimpalgaonkar, dbhatnagar-px, jainsmit, jkagliwal-px, kphalgun-px, kshithijiyer-px, lsrinivas-pure, madanagopal19, mborodin-px, mkoppal-px, nikolaypopov, piyush-nimbalkar, pure-adamukaapan, pureneelesh, ram-infrac, rohit-px, sayalasomayajula-px, sn-px, snigdha-px, soma-purestorage, spai-px, stgleb, thiguetta, tthurlapati-px, vinayakshnd, vprabhakar-px


torpedo's Issues

Generate application specs from yaml files

Currently the applications are programmatically defined in Go. This imposes a learning curve on anyone adding a new application.

End users are more familiar with yaml config files for describing k8s primitives, so we need a stub module that takes yaml files and auto-generates the Go objects that the torpedo framework subsequently uses.
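
A minimal sketch of such a stub module, assuming the k8s.io/client-go scheme package is vendored; the ParseSpecFile helper is hypothetical, not an existing torpedo API:

```go
package k8sutils

import (
	"io/ioutil"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/kubernetes/scheme"
)

// ParseSpecFile decodes a single-document yaml spec file into a typed
// Kubernetes object that the torpedo framework can then use directly.
func ParseSpecFile(path string) (runtime.Object, error) {
	data, err := ioutil.ReadFile(path)
	if err != nil {
		return nil, err
	}
	obj, _, err := scheme.Codecs.UniversalDeserializer().Decode(data, nil, nil)
	return obj, err
}
```

Each decoded object can then be type-switched (Deployment, Service, and so on) and registered with the app spec factory.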

Scale test does not work with DCOS

Currently the DCOS driver uses a hardcoded name for an app.
Before deploying each instance of the app, append a UUID to the name to make it unique.
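
A minimal sketch of that fix, assuming the github.com/google/uuid package is vendored; the function name is hypothetical:

```go
package apps

import (
	"fmt"

	"github.com/google/uuid"
)

// uniqueAppName appends a UUID so every deployed instance of an app
// gets a distinct, non-hardcoded name.
func uniqueAppName(base string) string {
	return fmt.Sprintf("%s-%s", base, uuid.New().String())
}
```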

Add support for StatefulSet

In terms of applications, support only exists for Deployment. We need support for StatefulSet and Service.

Support will mainly be in k8s driver implementation and k8sutils package.

Add scheduler support for upgrade of application

Scheduler driver should have an interface that upgrades the application; a possible shape is sketched after the list below.

In the k8s implementation,

  • Handle OnDelete and RollingUpgrades (OnDelete will require explicit delete of pods)
  • Simulate a spec update by changing env variables.
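
A possible shape for the interface, with hypothetical names rather than Torpedo's actual scheduler API:

```go
package scheduler

// Context identifies a deployed application instance (hypothetical type).
type Context struct {
	UID string
}

// Upgrader is the proposed addition to the scheduler driver.
type Upgrader interface {
	// UpgradeApplication triggers an upgrade of the given application,
	// e.g. by simulating a spec update through changed env variables.
	// For the OnDelete strategy the implementation must also delete
	// the pods explicitly.
	UpgradeApplication(ctx *Context) error
}
```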

DCOS - Volume operations for UCR / mesos containers

The current implementation handles volumes only for docker containers. For UCR, the volume parameters are passed differently; we need to extract them correctly and perform volume operations like inspect and delete.

Don't use k8s labels to enable and disable portworx

Currently torpedo uses k8s labels to stop and start portworx.

These k8s labels are consumed by the oci-mon Portworx pod that runs in the cluster. At times, docker can hang on the system and the oci-mon pod may not be running.

In this state, there is nothing to consume the start Portworx label, and one has to wait for docker to come back.

We should just use systemctl commands directly to control portworx.
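
A minimal sketch of that approach, assuming a hypothetical command runner in place of the real ssh node driver:

```go
package pxutil

import "fmt"

// runOnNode abstracts "run this shell command on that node"; in torpedo
// this role would be played by the ssh node driver (hypothetical type).
type runOnNode func(addr, cmd string) (string, error)

// setPxState starts or stops portworx directly via systemd, bypassing the
// k8s label that depends on the oci-mon pod (and hence on docker) being up.
func setPxState(run runOnNode, addr string, start bool) error {
	action := "stop"
	if start {
		action = "start"
	}
	_, err := run(addr, fmt.Sprintf("systemctl %s portworx", action))
	return err
}
```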

Fix make vendor

make vendor is not doing what it's supposed to.

It is complaining about packages already present in the vendor directory. It is also pulling additional packages.

Add new tests to test disk failures

This needs to be done after adding yank-out-disk support to the node driver:

  • start test with IO, yank out disk
  • yank out disk, start test with IO on that node

Make storage spec component in k8s configurable

Mainly, the following things should be configurable in the storage elements (StorageClass, PVC) in the k8s specs.

  1. Volume provisioner (currently hardcoded in the storage class). This needs to take the volume provisioner against which the torpedo instance is being run.
  2. Volume size (currently hardcoded in the PVC)
  3. Volume options (currently hardcoded in the storage class)

This will allow the same spec to be used for different flavors of storage.
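
A minimal sketch of the resulting knobs as a Go struct, with illustrative names:

```go
package spec

// StorageOptions parameterizes the StorageClass/PVC templates so that one
// spec works across storage flavors (names are illustrative).
type StorageOptions struct {
	// Provisioner replaces the value currently hardcoded in the
	// StorageClass, e.g. "kubernetes.io/portworx-volume".
	Provisioner string
	// VolumeSize replaces the request currently hardcoded in the PVC,
	// e.g. "1Gi".
	VolumeSize string
	// Parameters replaces the options currently hardcoded in the
	// StorageClass, e.g. {"repl": "2", "shared": "true"}.
	Parameters map[string]string
}
```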

Add support for multiple sections in the same k8s yaml spec file

When parsing yaml spec files, a file like the one below doesn't work, since it contains a yaml document separator (---). We need to handle that.

```yaml
##### Portworx storage class
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
    name: px-nginx-shared-sc
provisioner: kubernetes.io/portworx-volume
parameters:
   repl: "2"
   shared: "true"
---
##### Portworx persistent volume claim
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
   name: px-nginx-shared-pvc
   annotations:
     volume.beta.kubernetes.io/storage-class: px-nginx-shared-sc
spec:
   accessModes:
     - ReadWriteOnce
   resources:
     requests:
       storage: 1Gi
```
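
A minimal sketch of handling the separator: split the stream on "---" and feed each document to the single-document decoder sketched earlier; names are illustrative:

```go
package k8sutils

import (
	"bufio"
	"io"
	"strings"
)

// splitYAMLDocuments splits a multi-document yaml stream on the standard
// "---" separator and returns the non-empty documents.
func splitYAMLDocuments(r io.Reader) ([]string, error) {
	var docs []string
	var cur strings.Builder
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		line := sc.Text()
		if strings.TrimSpace(line) == "---" {
			if strings.TrimSpace(cur.String()) != "" {
				docs = append(docs, cur.String())
			}
			cur.Reset()
			continue
		}
		cur.WriteString(line)
		cur.WriteString("\n")
	}
	if strings.TrimSpace(cur.String()) != "" {
		docs = append(docs, cur.String())
	}
	return docs, sc.Err()
}
```

Alternatively, the yaml helpers in k8s.io/apimachinery already understand multi-document streams and may be preferable to hand-rolled splitting.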

Add scale test

Add a scale test that exercises the current application specs, scaled up by a given scale factor.

WaitForNode implementation for Portworx does not wait till portworx is up

Based on the logs below, one second after portworx was started on a node, the test proceeded to stop portworx on the node. There is a WaitForNode call between the start and the stop. This call must have returned with an incorrect status, and hence px was stopped before it could start completely.

```
13:23:37 time="2018-01-12 21:17:57" level=info msg="[mysql] Destroyed Namespace: mysql-voldriverappdown-01-12-21h12m18s"

13:23:37 STEP: wait for few seconds for app destroy to trigger

13:23:37 STEP: restarting volume driver

13:23:37 STEP: start volume driver on nodes: [{b503a941-8f19-4e80-b854-6d8fd9a50ccd minion2 minion2 [192.168.121.132]  Worker} {4881ca81-381e-4d4a-a73a-f04d0449f747 minion3 minion3 [192.168.121.243]  Worker}]

13:23:37 STEP: wait for volume driver to start on nodes: [{b503a941-8f19-4e80-b854-6d8fd9a50ccd minion2 minion2 [192.168.121.132]  Worker} {4881ca81-381e-4d4a-a73a-f04d0449f747 minion3 minion3 [192.168.121.243]  Worker}]

13:23:37 STEP: get nodes for nginx app

13:23:37 STEP: stop volume driver pxd on app nginx's nodes: [{4881ca81-381e-4d4a-a73a-f04d0449f747 minion3 minion3 [192.168.121.243]  Worker} {0a005856-6693-4e55-a83e-5fa2e64b8af9 minion1 minion1 [192.168.121.174]  Worker} {b503a941-8f19-4e80-b854-6d8fd9a50ccd minion2 minion2 [192.168.121.132]  Worker}]

Jan 12 21:18:07 minion2 dockerd-current[3881]: time="2018-01-12T21:18:07Z" level=info msg="Doing systemctl portworx START"
Jan 12 21:18:07 minion2 dockerd-current[3881]: time="2018-01-12T21:18:07Z" level=info msg="> run: /bin/sh -c systemctl start portworx"
Jan 12 21:18:07 minion2 systemd[1]: Starting Portworx OCI Container...
Jan 12 21:18:07 minion2 sh[16172]: container "portworx" does not exist
Jan 12 21:18:07 minion2 systemd[1]: Started Portworx OCI Container.
Jan 12 21:18:07 minion2 kubelet[5429]: E0112 21:18:07.973807    5429 kubelet_volumes.go:128] Orphaned pod "abbf8767-f7dd-11e7-974d-5254004f47e1" found, but volume paths are still present on disk. : There were a total of 1 errors similar to this.  Turn up verbosity to see them.
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="Rootfs found at /opt/pwx/oci/rootfs"
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="SPEC READ [21ab28d954598d512faca66748c9261c  /opt/pwx/oci/config.json]"
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="PX-RunC arguments: -a -c b91ffc94-2ca1-4736-88d3-d88599a5c036 -f -k etcd://70.0.5.211:2379,etcd://70.0.5.212:2379,etcd://70.0.5.213:2379 -x kubernetes"
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="PX-RunC mounts: /dev:/dev /var/lib/docker/containers/47d548f25173a549452a2f20afbc977dae8c6b4aefb733763b9f62c9c2056e7c/hosts:/etc/hosts:ro /etc/pwx:/etc/pwx /var/lib/docker/containers/47d548f25173a549452a2f20afbc977dae8c6b4aefb733763b9f62c9c2056e7c/resolv.conf:/etc/resolv.conf:ro /opt/pwx/bin:/export_bin /lib/modules:/lib/modules proc:/proc:nosuid,noexec,nodev /run/docker:/run/docker sysfs:/sys:nosuid,noexec,nodev cgroup:/sys/fs/cgroup:nosuid,noexec,nodev /var/lib/kubelet/pods/2c016f57-f7dc-11e7-974d-5254004f47e1/containers/portworx/24be706d:/tmp/px-termination-log /usr/src:/usr/src /var/cores:/var/cores /var/run:/var/host_run /var/lib/kubelet:/var/lib/kubelet:shared /var/lib/osd:/var/lib/osd:shared /var/lib/kubelet/pods/2c016f57-f7dc-11e7-974d-5254004f47e1/volumes/kubernetes.io~secret/px-account-token-fd52v:/var/run/secrets/kubernetes.io/serviceaccount:ro"
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="PX-RunC env: BTRFS_SOURCE=/home/px_btrfs GOMAXPROCS=64 GOTRACEBACK=crash KUBERNETES_PORT=tcp://10.96.0.1:443 KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443 KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1 KUBERNETES_PORT_443_TCP_PORT=443 KUBERNETES_PORT_443_TCP_PROTO=tcp KUBERNETES_SERVICE_HOST=10.96.0.1 KUBERNETES_SERVICE_PORT=443 KUBERNETES_SERVICE_PORT_HTTPS=443 KUBE_DNS_PORT=udp://10.96.0.10:53 KUBE_DNS_PORT_53_TCP=tcp://10.96.0.10:53 KUBE_DNS_PORT_53_TCP_ADDR=10.96.0.10 KUBE_DNS_PORT_53_TCP_PORT=53 KUBE_DNS_PORT_53_TCP_PROTO=tcp KUBE_DNS_PORT_53_UDP=udp://10.96.0.10:53 KUBE_DNS_PORT_53_UDP_ADDR=10.96.0.10 KUBE_DNS_PORT_53_UDP_PORT=53 KUBE_DNS_PORT_53_UDP_PROTO=udp KUBE_DNS_SERVICE_HOST=10.96.0.10 KUBE_DNS_SERVICE_PORT=53 KUBE_DNS_SERVICE_PORT_DNS=53 KUBE_DNS_SERVICE_PORT_DNS_TCP=53 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin PORTWORX_SERVICE_PORT=tcp://10.108.87.160:9001 PORTWORX_SERVICE_PORT_9001_TCP=tcp://10.108.87.160:9001 PORTWORX_SERVICE_PORT_9001_TCP_ADDR=10.108.87.160 PORTWORX_SERVICE_PORT_9001_TCP_PORT=9001 PORTWORX_SERVICE_PORT_9001_TCP_PROTO=tcp PORTWORX_SERVICE_SERVICE_HOST=10.108.87.160 PORTWORX_SERVICE_SERVICE_PORT=9001 PXMOD_SOURCE=/home/px-fuse PXMOD_VERSION=5 PX_IMAGE=harshpx/px:master PX_IMAGE_ID=sha256:23c641ae8971add0639efbb145d0350ec09906718a58c95e7518affc65d83c3f PX_RUNC=true PX_TEMPLATE_VERSION=v2 TERM=xterm"
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="portworx-reboot.service content unchanged [1dc97b965f3c6ad99aa3a92a02b2e8b1 /etc/systemd/system/portworx-reboot.service]"
Jan 12 21:18:07 minion2 px-runc[16177]: time="2018-01-12T21:18:07Z" level=info msg="Linking /var/host_run/docker.sock to /opt/pwx/oci/rootfs/run/docker.sock"
Jan 12 21:18:08 minion2 px-runc[16177]: Executing with arguments: -a -c b91ffc94-2ca1-4736-88d3-d88599a5c036 -f -k etcd://70.0.5.211:2379,etcd://70.0.5.212:2379,etcd://70.0.5.213:2379 -x kubernetes
```

Extract all retry times as constants

There are multiple places where we use a time-based retry for a task. We need a global or task-level constants file to store these retry intervals and timeouts.
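
A minimal sketch of such a constants file, with illustrative names and values:

```go
package torpedo

import "time"

const (
	// DefaultRetryInterval is how long to wait between attempts of a task.
	DefaultRetryInterval = 10 * time.Second
	// DefaultTimeout bounds how long a task is retried before failing.
	DefaultTimeout = 5 * time.Minute
)
```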

Add volume driver interface for ungraceful termination

Instead of only shutting down a volume driver gracefully, we should also test how a volume driver behaves when it is shut down ungracefully, e.g. by killing one of its processes directly with the kill command.

Also add Portworx implementation for this.
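
A possible shape for the addition, with hypothetical names:

```go
package volume

// UngracefulStopper is the proposed addition to the volume driver.
type UngracefulStopper interface {
	// KillDriver terminates a volume driver process on the given node
	// abruptly (e.g. kill -9), as opposed to a clean service stop.
	KillDriver(nodeAddr string) error
}
```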

Init dependency between drivers

There is currently an unnecessary dependency on the order in which the scheduler driver and the volume driver are initialized: the scheduler driver needs to be initialized before the volume driver, otherwise the node registry doesn't get populated correctly. There is no need for such a dependency.

Add a mysql application that does IO.

Add a PerformIO() interface in scheduler driver which will get invoked from torpedo.

Applications in the app spec factory also need their own implementations, since only an application knows what it means to perform IO on it (e.g., postgres will run pgbench or write a bunch of database tables).
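
A possible shape for the interface, with hypothetical names:

```go
package scheduler

// IOPerformer would be implemented by each application in the app spec
// factory, since only the application knows what IO means for it.
type IOPerformer interface {
	// PerformIO drives application-specific IO, e.g. inserting rows
	// into a set of mysql tables, and returns an error if the IO fails.
	PerformIO() error
}
```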

Add support for running fluentd and EFK stack

This ticket won't involve any torpedo changes; rather, the work is to create specs and a deployment workflow for a fluentd + EFK logging framework on a k8s cluster, capable of pushing all pod logs to the EFK instance.

Add wordpress + Mysql spec

For kubernetes, wordpress + MySQL is a popular application stack. Replace the existing nginx spec with a single wordpress deployment.

Add AWS node driver implementation

Currently we only have an ssh implementation for the node driver. This limits the node operations we can do. For example, we cannot shut down a node, since with ssh alone we would have no way of restarting it.
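
A minimal sketch of the kind of operation an AWS node driver enables, assuming the github.com/aws/aws-sdk-go package; names and error handling are illustrative:

```go
package awsdriver

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// ShutdownNode stops the EC2 instance backing a cluster node, something
// the ssh driver cannot do (it loses the very connection it would need
// to bring the node back).
func ShutdownNode(instanceID string) error {
	sess, err := session.NewSession()
	if err != nil {
		return err
	}
	svc := ec2.New(sess)
	_, err = svc.StopInstances(&ec2.StopInstancesInput{
		InstanceIds: []*string{aws.String(instanceID)},
	})
	return err
}
```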
