integration-k8s-kind's Introduction

integration-k8s-kind

How to run integration tests locally?

Single cluster tests

  1. Create kind cluster:
kind create cluster --config cluster-config.yaml --wait 120s
  2. Run tests:
export CLUSTER_CIDR="172.18.1.128/25" # for monolith suite
go test -count 1 -timeout 2h30m -race -v ./tests_single

Calico single cluster tests

  1. Create kind cluster:
kind create cluster --config cluster-config-calico.yaml
  2. Apply calico:
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/tigera-operator.yaml
kubectl create -f https://raw.githubusercontent.com/projectcalico/vpp-dataplane/ba374a0583d8ab7938d0e46056c148563ee911ec/yaml/calico/installation-default.yaml
kubectl apply -k calico
  3. Wait for a calico-vpp rollout:
kubectl rollout status -n calico-vpp-dataplane ds/calico-vpp-node --timeout=5m
  4. Run tests:
 go test -count 1 -timeout 1h30m -race -v \
    ./tests_single/basic_test.go          \
    ./tests_single/heal_test.go           \
    ./tests_single/memory_test.go         \
    ./tests_single/observability_test.go  \
    ./tests_single/feature_test.go        \
    -calico

Multiple cluster scenario (interdomain tests)

  1. Create 3 kind clusters:
kind create cluster --name kind-1 --config cluster-config-interdomain.yaml --wait 120s
kind create cluster --name kind-2 --config cluster-config-interdomain.yaml --wait 120s
kind create cluster --name kind-3 --config cluster-config-interdomain.yaml --wait 120s
  2. Save the kubeconfig of each cluster (you may choose an appropriate location):
kind get kubeconfig --name kind-1 > /tmp/config1
kind get kubeconfig --name kind-2 > /tmp/config2
kind get kubeconfig --name kind-3 > /tmp/config3
  3. Run interdomain tests with the necessary environment variables set:
export KUBECONFIG1=/tmp/config1
export KUBECONFIG2=/tmp/config2 
export KUBECONFIG3=/tmp/config3 
export CLUSTER1_CIDR="172.18.1.128/25" 
export CLUSTER2_CIDR="172.18.2.128/25"
export CLUSTER3_CIDR="172.18.3.128/25"
go test -count 1 -timeout 1h -race -v ./tests_interdomain


integration-k8s-kind's Issues

K8s should know actual status of nsm PODs

Description

Sometimes fake NSE/NSMGR pods fail because the POD with the registry is already deployed, but the registry service has not started yet.

Solution

Add k8s liveness and readiness probes based on grpc health services.
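
A minimal sketch of what such probes could look like in the registry manifest, assuming the container already serves the standard grpc_health_v1 service; the container name and port 5002 are placeholders, and native gRPC probes need Kubernetes 1.24+:

# Hypothetical fragment of a registry Deployment spec; container name and port are assumptions.
containers:
  - name: registry
    ports:
      - containerPort: 5002
    readinessProbe:
      grpc:
        port: 5002          # answered by the grpc_health_v1 service inside the container
      periodSeconds: 5
    livenessProbe:
      grpc:
        port: 5002
      initialDelaySeconds: 10
      failureThreshold: 3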

TestRunFeatureSuiteSingle/TestPolicy_based_routing fails

Looks like TestRunFeatureSuiteSingle/TestMutually_aware_nses affects TestPolicy_based_routing.

TestMutually_aware_nses runs before TestPolicy_based_routing and creates the following routes:

  • 172.16.1.100 from 172.16.1.101 dev nsm-1 table 1 uid 0
  • 172.16.1.100 from 172.16.1.101 dev nsm-2 table 2 uid 0

TestPolicy_based_routing creates three routes. We expect the first one to be:

172.16.3.1 from 172.16.2.201 via 172.16.2.200 dev nsm-1 table 1

but instead we get:

172.16.3.1 from 172.16.2.201 via 172.16.2.200 dev nsm-1 table 3

Run

https://github.com/networkservicemesh/integration-k8s-kind/runs/5602556952?check_suite_focus=true

Logs

ci_logs.zip

`Update dependent repositories` workflow performs the update incorrectly

Description

integration-k8s-kind re-uses the .github workflow.
This is wrong, because this workflow updates integration-k8s-kind itself instead of updating integration-tests in the dependent repositories (public clusters).
For example:
https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/5656999510/job/15325140836#step:7:1

Possible solution

Perhaps we should extend the workflow: keep using ${{ github.repository }} by default (as it is now), but allow this value to be overridden (using inputs).
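
A rough sketch of how the reusable workflow could expose such an input; the input name and defaults are assumptions, and the expression falls back to ${{ github.repository }} when no input is provided:

# Hypothetical fragment of the reusable update workflow.
on:
  workflow_call:
    inputs:
      repository:
        description: 'Repository to update instead of the caller repository'
        type: string
        required: false
        default: ''

jobs:
  update-dependent-repositories:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Fall back to the current repository when the input is empty.
          repository: ${{ inputs.repository || github.repository }}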

Add Interdomain testing on kind

Description

Add an interdomain testing scenario on ci to the integration-k8s-kind repository:

  1. Add a separate job in github ci for testing interdomain (a sketch of such a job follows this list):

a. In this job start and configure 3 kind clusters, using the latest kubernetes version (from secret)
b. Run the interdomain test suite

  2. Run all tests, check that everything passes
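
A hedged sketch of such a job, reusing the kind and go test commands from the README above; the runner image and action version are assumptions, and passing the kubernetes version from the secret (e.g. via a node image) is omitted:

# Hypothetical GitHub CI job for the interdomain suite.
jobs:
  interdomain-kind:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Create kind clusters
        run: |
          for i in 1 2 3; do
            kind create cluster --name "kind-${i}" --config cluster-config-interdomain.yaml --wait 120s
            kind get kubeconfig --name "kind-${i}" > "/tmp/config${i}"
          done
      - name: Run interdomain suite
        env:
          KUBECONFIG1: /tmp/config1
          KUBECONFIG2: /tmp/config2
          KUBECONFIG3: /tmp/config3
          CLUSTER1_CIDR: 172.18.1.128/25
          CLUSTER2_CIDR: 172.18.2.128/25
          CLUSTER3_CIDR: 172.18.3.128/25
        run: go test -count 1 -timeout 1h -race -v ./tests_interdomain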

Add initial registry kind tests

Description

We need to start thinking about integration testing via k8s. We can start with a simple approach:

  1. Store all deployments in this repository. For example, we can store all deployments in the folder /deployments and after some time move them to a separate repository.
  2. Use yaml deployments in this repository.
  3. Use extra needed applications in this repository. For example, we can add some client/endpoint applications into the folder /cmd and get rid of them when all nsm applications are completed.
  4. Use go tests that are based on user steps. For this we can use exechelper and k8s-go-client.

Add interdomain integration test

Description

Recently we've achieved a working nsmgr-proxy application, and now we can start to cover the interdomain scenario with an integration test.

Motivation

Proof that interdomain NSM use-case is working in k8s via 2 kind clusters.

Add IPv6 cluster for kind testing

Description

We need to add a kind cluster with ipv6 addresses.

To create it, we need to add a networking section to the cluster configuration:

...
networking:
  ipFamily: ipv6
...
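
For reference, a complete kind configuration with that section could look roughly like this; the file name and node layout are assumptions, only the networking section above is prescribed:

# Hypothetical cluster-config-ipv6.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  ipFamily: ipv6
nodes:
  - role: control-plane
  - role: worker
  - role: worker

The cluster would then be created the same way as the others, e.g. kind create cluster --config cluster-config-ipv6.yaml --wait 120s.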

Integration tests are not stable after adding OPA policy for registry feature

Description

Heal tests with nsmgr restart are not stable. When we delete the old nsmgr, it doesn't unregister its forwarder from the registry. The new nsmgr (with a new spiffeID) can't register this forwarder again, because it's already registered by the old nsmgr (with the old spiffeID).

How to reproduce

  1. Run basic nsm setup
  2. Delete one of the NSMGRs
  3. Wait for the new NSMGR to start
  4. Look at the logs in the registry pod

Expected behavior

NSMGR registers its forwarder successfully

Actual behavior

NSMGR can't register its forwarder, because it's already registered by the old, deleted nsmgr

[TRAC] [type:registry] (2.1)    register={"name":"forwarder-vpp-6tpfn","network_service_names":["forwarder"],"network_service_labels":{"forwarder":{"labels":{"nodeName":"kind-worker","p2p":"true"}}},"url":"tcp://10.244.1.109:5001","expiration_time":{"seconds":1663668686,"nanos":268959800},"initial_registration_time":{"seconds":1663668484,"nanos":824354244}}
[INFO] [type:registry] (2.2)    AUTHORIZE spiffieIDNSEsMap: map[spiffe://example.org/ns/nsm-system/pod/nsmgr-h7h5k:[forwarder-vpp-x565x nse-kernel-56c9ffbff5-bwpvs] spiffe://example.org/ns/nsm-system/pod/nsmgr-tcx5p:[forwarder-vpp-6tpfn]]
[INFO] [type:registry] (2.3)    AUTHORIZE spiffieID: spiffe://example.org/ns/nsm-system/pod/nsmgr-tkc59
[INFO] [type:registry] (2.4)    AUTHORIZE nseName: forwarder-vpp-6tpfn
[ERRO] [type:registry] (2.5)    rpc error: code = PermissionDenied desc = no sufficient privileges;	Error returned from sdk/pkg/registry/common/authorize/authorizeNSEServer.Register;	github.com/networkservicemesh/sdk/pkg/registry/core/trace.logError;		/build/local/sdk/pkg/registry/core/trace/common.go:38;	github.com/networkservicemesh/sdk/pkg/registry/core/trace.(*traceNetworkServiceEndpointRegistryServer).Register;		/build/local/sdk/pkg/registry/core/trace/nse_registry.go:129;	github.com/networkservicemesh/sdk/pkg/registry/core/next.(*nextNetworkServiceEndpointRegistryServer).Register;		/build/local/sdk/pkg/registry/core/next/nse_registry_server.go:59;	github.com/networkservicemesh/sdk/pkg/registry/common/begin.(*beginNSEServer).Register.func2;		/build/local/sdk/pkg/registry/common/begin/nse_server.go:64;	github.com/edwarnicke/serialize.(*Executor).process;		/go/pkg/mod/github.com/edwarnicke/[email protected]/serialize.go:68;	runtime.goexit;		/usr/local/go/src/runtime/asm_amd64.s:1571;	
[ERRO] [type:registry] (1.2)   Error returned from sdk/pkg/registry/common/authorize/authorizeNSEServer.Register: rpc error: code = PermissionDenied desc = no sufficient privileges

Possible solutions

  1. Wait until the forwarder entry in the registry expires
  2. Add a feature for heal servers: immediately delete the entry from the registry if the connection with the endpoint is lost
  3. Disable OPA authorization for cmd-registry-k8s
  4. Add a possibility for managers to bypass OPA policies
  5. Add roles to pods using spire. Add a special role "nsmgr". Accept all register requests in OPA policies if role == "nsmgr" (For example, we can change the spiffeID for some pods by adding special labels https://github.com/spiffe/spire/blob/main/support/k8s/k8s-workload-registrar/mode-crd/README.md#label-based-workload-registration)

Heal tests can fail on CI after TestRegistry_restart

--- FAIL: TestRunHealSuite (1271.70s)
    --- PASS: TestRunHealSuite/TestLocal_forwarder_death (54.59s)
    --- PASS: TestRunHealSuite/TestLocal_nse_death (61.82s)
    --- PASS: TestRunHealSuite/TestLocal_nsmgr_restart (62.74s)
    --- PASS: TestRunHealSuite/TestRegistry_restart (74.09s)
    --- FAIL: TestRunHealSuite/TestRemote_forwarder_death (114.81s)
    --- FAIL: TestRunHealSuite/TestRemote_forwarder_death_ip (114.56s)
    --- FAIL: TestRunHealSuite/TestRemote_nse_death (113.35s)
    --- FAIL: TestRunHealSuite/TestRemote_nse_death_ip (114.24s)
    --- FAIL: TestRunHealSuite/TestRemote_nsmgr_death (113.91s)
    --- FAIL: TestRunHealSuite/TestRemote_nsmgr_restart (111.83s)
    --- FAIL: TestRunHealSuite/TestRemote_nsmgr_restart_ip (112.22s)

Logs

TestRunHealSuite.zip

[Calico] TestKernel2Wireguard2Kernel_dual_stack is not stable

Description

TestKernel2Wireguard2Kernel_dual_stack is not stable with Calico.
As we can see in the logs, the IPv6 ping starts working but is very intermittent. And in the client's logs, we see that datapath healing kicks in.
From the order of the "dst_ip_addrs":["172.16.1.100/32","2001:db8::/128"], we can notice that the IPv4 address is checked first, and it fails.
This indirectly points to this problem: networkservicemesh/sdk-vpp#527.

Build:
https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/2376683993

Logs:

...
time=2022-05-24T11:57:40Z level=info msg=NSC=$(kubectl get pods -l app=alpine -n ${NAMESPACE} --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}') TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdin
time=2022-05-24T11:57:40Z level=info msg=NSE=$(kubectl get pods -l app=nse-kernel -n ${NAMESPACE} --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}') TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdin
time=2022-05-24T11:57:40Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 2001:db8:: TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdin
time=2022-05-24T11:57:45Z level=info msg=PING 2001:db8:: (2001:db8::): 56 data bytes
64 bytes from 2001:db8::: seq=0 ttl=62 time=2.332 ms
64 bytes from 2001:db8::: seq=3 ttl=62 time=0.367 ms

--- 2001:db8:: ping statistics ---
4 packets transmitted, 2 packets received, 50% packet loss
round-trip min/avg/max = 0.367/1.349/2.332 ms TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdout
time=2022-05-24T11:57:45Z level=info msg=Defaulted container "alpine" out of: alpine, cmd-nsc, coredns, cmd-nsc-init (init) TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stderr
time=2022-05-24T11:57:45Z level=info msg=kubectl exec ${NSE} -n ${NAMESPACE} -- ping -c 4 2001:db8::1 TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdin
time=2022-05-24T11:57:49Z level=info msg=PING 2001:db8::1 (2001:db8::1): 56 data bytes
64 bytes from 2001:db8::1: seq=1 ttl=62 time=0.858 ms
64 bytes from 2001:db8::1: seq=3 ttl=62 time=1.187 ms

--- 2001:db8::1 ping statistics ---
4 packets transmitted, 2 packets received, 50% packet loss
round-trip min/avg/max = 0.858/1.022/1.187 ms TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdout
time=2022-05-24T11:57:49Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdin
time=2022-05-24T11:58:03Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes

--- 172.16.1.100 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdout
time=2022-05-24T11:58:03Z level=info msg=Defaulted container "alpine" out of: alpine, cmd-nsc, coredns, cmd-nsc-init (init)
command terminated with exit code 1 TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stderr
time=2022-05-24T11:58:03Z level=info msg=1 TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=exitCode
time=2022-05-24T11:58:03Z level=info msg=kubectl exec ${NSC} -n ${NAMESPACE} -- ping -c 4 172.16.1.100 TestRunFeatureSuiteCalico/TestKernel2Wireguard2Kernel_dual_stack=stdin
time=2022-05-24T11:58:17Z level=info msg=PING 172.16.1.100 (172.16.1.100): 56 data bytes
...

TestRunFeatureSuiteCalico.zip

Add spire support

Motivation

All our cmd applications expect spire on the cluster, so we need to add spire deployments before we start to test nsm cmd applications.

Cover cmd-exclude-prefixes-k8s by integration tests

Description

We've completed cmd-exclude-prefixes-k8s, and now we can cover the application with integration tests.

Test scenarios

We need to cover the following scenarios:

  1. Verify that cmd-exclude-prefixes-k8s can correctly read cluster prefixes from the kube-system configmap.
  2. Verify that cmd-exclude-prefixes-k8s can correctly collect prefixes if the configmap from scenario 1 is missing.
  3. Verify that cmd-exclude-prefixes-k8s can correctly read the user's prefixes (see the ConfigMap sketch after this list).
    3.1. Verify deleting the user's prefixes.
    3.2. Verify updating the user's prefixes.
    3.3. Verify adding the user's prefixes.
  4. Verify that cmd-exclude-prefixes-k8s can correctly read prefixes from envs.
    4.1. Verify that cmd-exclude-prefixes-k8s works fine with correct prefixes from envs.
    4.2. Verify that cmd-exclude-prefixes-k8s fails with incorrect prefixes from envs.
  5. Verify prefix collision cases.
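
For scenario 3, a hedged sketch of what a user-provided prefixes ConfigMap could look like; the ConfigMap name, namespace, key, and prefix values are hypothetical and only illustrate the shape of the input:

# Hypothetical user prefixes ConfigMap for cmd-exclude-prefixes-k8s.
apiVersion: v1
kind: ConfigMap
metadata:
  name: excluded-prefixes-config
  namespace: default
data:
  excluded_prefixes.yaml: |
    prefixes:
      - 10.96.0.0/16
      - 10.244.0.0/16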

Testing notes

We have not completed tests for nsmgr + nsc yet, so as a first step we can collect prefixes into an alpine pod via a shared volume and check the resulting file via kubectl cp or kubectl exec -ti "pod name" -- cat "path".

Add performance testing scripts and job for release workflow

Motivation

It'd be very useful to monitor our performance metrics on release candidates. It could help us catch performance issues.

TODO

  • check scripts (AWS2AWS)
  • prepare scripts for the github workflow
  • add a performance testing step on releases (kind2kind)

integration-k8s-kind spams upstream repos on ci workflow canceling

Expected Behavior

integration-k8s-kind doesn't spam upstream repos on ci workflow canceling.

Current Behavior

integration-k8s-kind spams upstream repos on ci workflow canceling.

Failure Information (for bugs)

Steps to Reproduce

  1. Make an update in the SDK
  2. Look at https://github.com/networkservicemesh/integration-k8s-kind/actions/workflows/update-dependent-repositories-gomod.yaml

Note: we expect only one green update-dependent-repositories-gomod job per update from the sdk.
Currently, the number of updates equals the number of cmd applications.

Context

bdf4070

[Calico] Lack of cluster resources - TestNSE_Composition is unstable

Description

When we add TestRunFeatureSuiteCalico, we can see that TestNSE_Composition doesn't work properly.
The first Request (from the init-container) is usually fine, but the following Requests (refreshes) fail. Since the datapath has already been set up, this does not affect the ping.
But it does affect subsequent tests: since the new Requests (from cmd-nsc) can't complete successfully, we don't have a Connection to close when the test is completed (although a Connection actually exists because of the init-container).

Logs

Pay attention to TestNSE_Composition and TestSelect_Forwarder.
Logs from cmd-nsc TestSelect_Forwarder:

...
Mar 25 14:57:26.176 [TRAC] [id:alpine-0] [type:networkService] (1.1)   request={"connection":{"id":"alpine-0","network_service":"nse-composition","mechanism"...
...

Calico logs(5).zip

Our guess is that this is due to insufficient resources on the github host machine.

TestDns is not working with Calico on ci

Description

Run TestRunFeatureSuiteSingle/TestDns with Calico on ci.
We can see that nslookup works fine, but ping doesn't.
Logs:

...
time=2022-02-22T12:18:38Z level=info msg=kubectl exec ${NSC} -c dnsutils -n ${NAMESPACE} -- nslookup -norec -nodef my.coredns.service TestRunFeatureSuiteSingle/TestDns=stdin
time=2022-02-22T12:18:38Z level=info msg=Server:		127.0.0.1
Address:	127.0.0.1#53

Name:	my.coredns.service
Address: 172.16.1.100 TestRunFeatureSuiteSingle/TestDns=stdout
time=2022-02-22T12:18:38Z level=info msg=kubectl exec ${NSC} -c dnsutils -n ${NAMESPACE} -- ping -c 4 my.coredns.service TestRunFeatureSuiteSingle/TestDns=stdin
time=2022-02-22T12:18:44Z level=info msg=ping: bad address 'my.coredns.service'
command terminated with exit code 1 TestRunFeatureSuiteSingle/TestDns=stderr
...

Build example: https://github.com/networkservicemesh/integration-k8s-kind/runs/5287825293?check_suite_focus=true

Calico/VPP NSM integration

Calico allows for a choice of dataplanes. VPP is one of them.

Normally, cmd-forwarder-vpp starts its own instance of vpp in its own Pod.

In response to a request for integration between NSM and Calico/VPP, the process for integration was described.

This issue is about actually trying (and shaking the bugs out of) such integration.

This breaks down into a number of steps:

  • Calico/VPP NSM integration testing on kind
  • Calico/VPP NSM integration testing on gke
  • Calico/VPP NSM integration testing on aks
  • Calico/VPP NSM integration testing on aws
  • Calico/VPP NSM integration testing on packet
  • Calico/VPP NSM integration testing on packet with Calico/VPP binding to the interface with vfio.

Store containers logs as artifacts on ci

Motivation

Currently, investigating bugs can be difficult because logs are only pushed as text when a test fails. Also, that text doesn't include container logs.

Solution

  1. Add logic to https://github.com/networkservicemesh/integration-tests/blob/main/extensions/base/suite.go#L26 to store the current POD logs in a specific folder, for example ./artifacts

  2. Add storing of artifacts on ci (a workflow step sketch follows this list):
    https://docs.github.com/en/free-pro-team@latest/actions/guides/storing-workflow-data-as-artifacts
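
A minimal sketch of such a workflow step, assuming the suite dumps pod logs into ./artifacts as described in point 1; the artifact name and action version are assumptions:

# Hypothetical CI step that keeps container logs from every run, including failed ones.
- name: Upload container logs
  if: ${{ always() }}
  uses: actions/upload-artifact@v4
  with:
    name: container-logs
    path: ./artifacts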
