edge-cloud-platform's Introduction

Edge-Cloud Platform

The Edge-Cloud Platform is a set of services that allow for distributed and secure management of edge sites ("cloudlets"), featuring deployment of workloads, one-touch provisioning of new cloudlets, monitoring, metrics, alerts, events, and more.

The platform is intended to satisfy the architecture of an Operator Platform as described in the GSMA document, Operator Platform Telco Edge Proposal, and adhere to standardized interfaces developed by CAMARA.

Services

  • The Controller provides the API endpoint for, and manages creating, updating, deleting, and showing application definitions, cloudlets, clusters, application instances, policies, etc. It manages object dependencies and validation, stores objects in Etcd, and distributes objects to other services that need to be notified of data changes.

  • The Cloudlet Resource Manager (CRM) manages infrastructure on a cloudlet site, calling the underlying infrastructure APIs to instantiate virtual machines, docker containers, kubernetes clusters, or kubernetes applications, depending on what the actual infrastructure type supports.

  • The Central Cloudlet Resource Manager (CCRM) manages the lifecycle of CRMs deployed on edge sites.

  • The Distributed Matching Engine (DME) provides the API endpoint for mobile device clients to discover existing application instances, provides operator-specific services like location APIs, and pushes notifications to mobile devices via a persistent connection.

  • The ClusterSvc service automatically deploys additional applications to clusters to enable monitoring, metrics, and storage.

  • The EdgeTurn service is much like a TURN server, providing secure console and shell access to virtual machines and containers deployed on cloudlets.

  • Shepherd deploys alongside the CRM (Cloudlet Resource Manager) on cloudlet infrastructure for advanced metrics and alerts.

  • The AutoProv service monitors auto-provision policies and automatically deploys and undeploys application instances based on client demand.

Edge-Cloud Infrastructure Platform Code

Infrastructure specific code is packaged under an interface to allow for new infrastructures to be supported without needing to modify the edge-cloud platform code.

Currently supported infrastructures are:

VM-Based:

  • Openstack
  • VMWare VSphere
  • VMWare Cloud Director (VCD)
  • VMPool (a bunch of VMs)
  • Amazon Web Services (AWS) EC2

Kubernetes Based:

  • Amazon Web Services (AWS) EKS
  • Google Cloud Platform (GCP) GKE
  • K8S Bare Metal (primarily but not limited to Google Anthos)
  • Microsoft Azure Kubernetes Service (AKS)

Compiling

To build the platform services, you will need Go and protoc installed. Please see go.mod for the correct version of golang to install. Ensure that your GOPATH is set.

You will need to build the tools once first:

make tools

You will need to have the edge-proto repo checked out adjacent to this repo.

Then to compile the services:

make

Testing

There is an extensive set of unit tests. Some unit tests depend on local installations of supporting open source projects and databases. You will need to install docker, certstrap, Etcd, Redis, Vault, and Influxdb. See the test setup script.

make unit-test

Building Images

Scripts for building container images are in /build/docker. You will need to have docker installed on your machine. You may need to set the REGISTRY environment variable to the registry and parent path to push to, e.g.:

export REGISTRY=ghcr.io/edgexr
cd build/docker
make build-platform

To build without pushing:

make build-platform-local

edge-cloud-platform's Issues

TestController unit test fails intermittently because etcd sync updates lost

The cmd/controller/main_test.go TestController test fails intermittently with:

--- FAIL: TestController (31.75s)
    clusterinst_testutil.go:340:
                Error Trace:    clusterinst_testutil.go:340
                                                        clusterinst_testutil.go:260
                                                        main_test.go:133
                Error:          Expected nil, but got: &status.Error{s:(*status.Status)(0xc00072a8c8)}
                Test:           TestController
                Messages:       Create ClusterInst {"cluster_key":{"name":"Pillimos"},"cloudlet_key":{"organization":"UFGT Inc.","name":"San Jose Site"},"organization":"AtlanticInc"}

To reproduce, run

cd cmd/controller
go test -run TestController -count 30 -timeout 20m

I tested against both etcd 3.2.16 and etcd 3.5.4.

After debugging, it turns out some sync updates were lost, so changes watched from etcd were not making it into the Controller caches, and thus not being sent over notify to the fake CRM.

vcd: add support for OVA images

OVAs are a convenient way to package OVF+VMDK+whatever in one file, and are the way Ubuntu cloud images are provided for VMWare. We should support images in OVA format for VCD-based cloudlets. For one, our own conversion of qcow2 to ovf+vmdk gets the guest OS wrong, and there may be other issues with our generated OVF. It's better to use the OVF supplied with the image.

Add an ip transparency option to envoy proxy we deploy on the rootLB

By default the tcp proxy rewrites the src ip for all connections. This can be a problem for the app backend if it sees a lot of connections coming from the same ip (the ip of the rootLB). For instance, if the backend limits the max number of connections from a single ip, our current approach can cause problems. I propose we add an option to pass the src ip to the backend without rewriting it; there is a filter option in envoy that does exactly that:

https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/listener_filters/original_src_filter

Issue was seen by vspatial team.

mc: audit log filtering causes confusing logs

MC audit log filtering of sensitive data like this code:

		} else if strings.Contains(req.RequestURI, "/auth/federation/provider/create") ||
			strings.Contains(req.RequestURI, "/auth/federation/provider/setnotifykey") {
			fedReq := ormapi.FederationProvider{}
			err := json.Unmarshal(reqBody, &fedReq)
			if err == nil {
				// do not log partner federator's API key
				fedReq.PartnerNotifyClientKey = ""
				reqBody, err = json.Marshal(fedReq)
			}
			if err != nil {
				reqBody = []byte{}
			}

This causes the input json string to be converted to the output of a json-marshalled object. The problem here is that unspecified data in the original json now ends up in the logged request data, because we are marshalling from an object that does not know whether fields were specified or not. It is not sufficient to rely on "omitempty" json tags, because we want to be able to distinguish whether the original json specified an empty value (i.e. "0") or just did not specify the field at all.

I.e., the logged request from above is:

 request: |-
  {
    "ID": 0,
    "Name": "fedtest-host",
    "OperatorId": "edgexr",
    "Regions": [
      "US"
    ],
    "FederationContextId": "",
    "MyInfo": {
      "FederationId": "",
      "CountryCode": "US",
      "MCC": "123",
      "MNC": [
        "123"
      ],
      "FixedNetworkIds": null,
      "DiscoveryEndPoint": "",
      "InitialDate": "0001-01-01T00:00:00Z"
    },
    "PartnerInfo": {
      "FederationId": "",
      "CountryCode": "",
      "MCC": "",
      "MNC": null,
      "FixedNetworkIds": null,
      "DiscoveryEndPoint": "",
      "InitialDate": "0001-01-01T00:00:00Z"
    },
    "PartnerNotifyDest": "",
    "PartnerNotifyTokenUrl": "",
    "PartnerNotifyClientId": "",
    "PartnerNotifyClientKey": "",
    "DefaultContainerDeployment": "",
    "Status": "",
    "ProviderClientId": "",
    "CreatedAt": "0001-01-01T00:00:00Z",
    "UpdatedAt": "0001-01-01T00:00:00Z"
  }

This is confusing because the caller did not specify many of these fields, but they show up here.

Suggestion is to convert to a map[string]interface{} instead of an object, remove offending fields, and then marshal the map back to a json string.
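
A minimal sketch of that suggestion, assuming the helper lives in the MC's orm package; the function name is illustrative:

package orm // illustrative location

import "encoding/json"

// redactJSONFields removes the named top-level fields from the raw JSON body
// without round-tripping through a typed struct, so fields the caller never
// specified do not get added to the logged request.
func redactJSONFields(reqBody []byte, fields ...string) []byte {
	data := map[string]interface{}{}
	if err := json.Unmarshal(reqBody, &data); err != nil {
		return []byte{} // unparseable body: log nothing rather than risk leaking data
	}
	for _, f := range fields {
		delete(data, f)
	}
	out, err := json.Marshal(data)
	if err != nil {
		return []byte{}
	}
	return out
}

// Usage in the audit log filter shown above:
//   reqBody = redactJSONFields(reqBody, "PartnerNotifyClientKey")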

Add CRUD apis for the cloudlet access variables

Currently we don't show cloudlet access variables as part of the cloudlet show API. Since this is sensitive information, we should add a separate API to the controller to create/read/update/delete access variables for a given cloudlet. This should also change what the platform-dependent code deals with: the platform should only provide a list of supported access variable names back to the controller. That way Vault modifications are contained in the controller and the platform-dependent code does not have any direct Vault interactions.

nfs storage provisioner

The provisioner we currently use, stable/nfs-client-provisioner from https://charts.helm.sh/stable, is deprecated.
We need to update it to https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner

Remove hard-coded edgexr from stun reference

IP lookup via STUN server in pkg/platform/common/infracommon/uri.go refers to a hard coded "stun.edgexr.net". This needs to be parameterized via a deployment-specific parameter so that it can match the deployment domain name.
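
A minimal sketch of one way to parameterize this; the environment variable name is an assumption, not an existing deployment setting:

package infracommon

import "os"

// defaultStunServer preserves the current behavior when no override is set.
const defaultStunServer = "stun.edgexr.net"

// getStunServer lets the STUN server be overridden per deployment so it can
// match the deployment domain name.
func getStunServer() string {
	if s := os.Getenv("STUN_SERVER"); s != "" {
		return s
	}
	return defaultStunServer
}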

Base Image: minimum disk size for flavor seems too large

Trying to create a ClusterInst using a flavor with less than 20GB of disk fails due to this check:

./pkg/platform/common/vmlayer/cluster.go:                       return fmt.Errorf("Insufficient disk size, please specify a flavor with at least %dgb", MINIMUM_DISK_SIZE)

Note that this check does not apply to VM App Instances, only ClusterInsts (docker/kubernetes) using our own base image.

Does it really make sense that the minimum size for a VM's disk is 20GB? Can our base image get away with less than that?

Here is what is being used on the platform VM on Openstack:

ubuntu@abc-test-xyz-test-pf:~$ df -H --total
Filesystem      Size  Used Avail Use% Mounted on
udev            2.1G     0  2.1G   0% /dev
tmpfs           414M  902k  413M   1% /run
/dev/vda1        42G  5.0G   37G  12% /
tmpfs           2.1G     0  2.1G   0% /dev/shm
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           2.1G     0  2.1G   0% /sys/fs/cgroup
tmpfs           2.1G  8.2k  2.1G   1% /tmp
/dev/vda15      110M  4.6M  105M   5% /boot/efi
tmpfs           414M     0  414M   0% /run/user/1000
total            51G  5.0G   46G  10% -

Here is what is being used on a kubernetes cluster master VM on Openstack:

ubuntu@mex-k8s-master-abc-main-reservable8-edgecloudorg:~$ df -H --total
Filesystem      Size  Used Avail Use% Mounted on
udev            2.1G     0  2.1G   0% /dev
tmpfs           414M  1.7M  412M   1% /run
/dev/vda1        42G  7.5G   35G  18% /
tmpfs           2.1G     0  2.1G   0% /dev/shm
tmpfs           5.3M     0  5.3M   0% /run/lock
tmpfs           2.1G     0  2.1G   0% /sys/fs/cgroup
tmpfs           2.1G     0  2.1G   0% /tmp
/dev/vda15      110M  4.6M  105M   5% /boot/efi
tmpfs           414M     0  414M   0% /run/user/1000
total            51G  7.5G   44G  15% -

In any case, the minimum disk size is hard-coded. Perhaps it should at least be configurable.
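
A sketch of what a configurable minimum might look like, with the current 20GB as the default; the variable and function names are illustrative:

package vmlayer

import "fmt"

// MinRootDiskSizeGB defaults to the current hard-coded value, but could be
// overridden per deployment (or per cloudlet) via settings.
var MinRootDiskSizeGB uint64 = 20

func validateClusterDiskSize(flavorDiskGB uint64) error {
	if flavorDiskGB < MinRootDiskSizeGB {
		return fmt.Errorf("insufficient disk size, please specify a flavor with at least %dGB", MinRootDiskSizeGB)
	}
	return nil
}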

infra (all): show current/available gpu resources

Currently there's no way on any of the infras (but particularly Openstack) to see how many gpu resources are available, and how many are being used.

This makes it hard to determine if "no valid host found" errors are due to insufficient gpu resources or some other issue.

Add Zone objects

Zones are groups of cloudlets that have a similar geographical location and network performance. The user can choose to deploy to a Zone, and the platform will decide which cloudlet in the Zone to deploy to. The definition comes from the OPG definitions for operator platforms and federation APIs.

Currently, for federation, the operator must create a zone that corresponds to only one cloudlet. These zones are defined only for federation, and not for local users to use. There is no code for automating placement within a zone of more than one cloudlet.

CRM alert: access creds not valid

CRM should periodically (every 2 hrs?) check that openstack commands work, and generate an alert if not.

Sometimes the certificate for the openstack API gets refreshed either without our knowledge, or we updated it in Vault and the CRM didn't refresh it.

CRM can just run "openstack server list" periodically, and check for error

2023-08-27T12:29:52.583Z        INFO    10640a91a3eeac42        openstack/openstack-cmd.go:40   OpenStack Command Start {"name": "openstack", "parms": "server list -f json"}
2023-08-27T12:29:53.585Z        INFO    10640a91a3eeac42        openstack/openstack-cmd.go:45   Openstack command returned error        {"parms": "server list -f json", "err": "exit status 1", "out": "Failed to discover available identity versions when contacting https://abc:5000/v3. Attempting to parse version from URL.\nSSL exception connecting to https://abc.de:5000/v3/auth/tokens: HTTPSConnectionPool(host='abc', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))\n", "elapsed time": "1.001300648s"}

This should probably be extended to all platform types, and have the platform implement a platform-specific function that checks that access creds are still valid. If not valid, pull from Controller and try again. If still not valid, generate the alert.
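
A rough sketch of the periodic check; the interface and callback names are hypothetical, not existing code:

package crm

import (
	"context"
	"time"
)

// AccessChecker is a hypothetical platform-specific hook, e.g. the Openstack
// implementation would run "openstack server list" and return any error.
type AccessChecker interface {
	VerifyAccessCreds(ctx context.Context) error
}

func monitorAccessCreds(ctx context.Context, p AccessChecker, refreshCreds func(context.Context) error, raiseAlert func(error)) {
	ticker := time.NewTicker(2 * time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			err := p.VerifyAccessCreds(ctx)
			if err == nil {
				continue
			}
			// Creds failed; pull fresh creds from the Controller and retry once.
			if rerr := refreshCreds(ctx); rerr == nil {
				if err = p.VerifyAccessCreds(ctx); err == nil {
					continue
				}
			}
			raiseAlert(err)
		}
	}
}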

Consider making flavors required only for VMs

Flavors are really only needed for VMs. For docker/kubernetes, raw vcpu and mem (serverless config) should be used.

If the system needs to create a VM to host a cluster for a docker/kubernetes deployment, it can choose the best flavor to do so. In particular, it can choose a cloudlet specific flavor which may be closer to the actual resource requirements than first mapping to an edge-cloud regional flavor, and then mapping to a cloudlet-specific flavor.

MC: crash in ldap library code

MC crashed with this stack trace:

2023/05/14 07:04:03 len(packet.Children) < 2
2023/05/14 07:04:04 len(packet.Children) < 2
2023/05/14 07:04:14 len(packet.Children) < 2
2023/05/14 07:04:14 len(packet.Children) < 2
2023-05-14T07:04:25.786Z        INFO    3e5892bf5a494d6a        alertmgr/alertmgr.go:131        start alert-mgr
2023-05-14T07:04:25.786Z        INFO    3e5892bf5a494d6a        alertmgr/alertmgr.go:133        Sending Alerts to AlertMgr      {"AlertMrgAddr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:04:25.787Z        INFO    3e5892bf5a494d6a        alertmgr/alertmgr.go:635        Tls client config       {"addr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:04:25.831Z        INFO    3e5892bf5a494d6a        alertmgr/alertmgr.go:148        finish alert-mgr        {"lineno": "alertmgr/alertmgr.go:131"}
2023-05-14T07:04:25.839Z        DEBUG   [email protected]+incompatible/reporter.go:311   flushed 1 spans
2023-05-14T07:04:55.832Z        INFO    7a1e371bfcbf5374        alertmgr/alertmgr.go:131        start alert-mgr
2023-05-14T07:04:55.832Z        INFO    7a1e371bfcbf5374        alertmgr/alertmgr.go:133        Sending Alerts to AlertMgr      {"AlertMrgAddr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:04:55.832Z        INFO    7a1e371bfcbf5374        alertmgr/alertmgr.go:635        Tls client config       {"addr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:04:55.864Z        INFO    7a1e371bfcbf5374        alertmgr/alertmgr.go:148        finish alert-mgr        {"lineno": "alertmgr/alertmgr.go:131"}
2023-05-14T07:04:55.866Z        DEBUG   [email protected]+incompatible/reporter.go:311   flushed 1 spans
2023/05/14 07:05:09 len(packet.Children) < 2
2023/05/14 07:05:10 len(packet.Children) < 2
2023-05-14T07:05:25.865Z        INFO    35357ad4cc6831b alertmgr/alertmgr.go:131        start alert-mgr
2023-05-14T07:05:25.865Z        INFO    35357ad4cc6831b alertmgr/alertmgr.go:133        Sending Alerts to AlertMgr      {"AlertMrgAddr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:05:25.865Z        INFO    35357ad4cc6831b alertmgr/alertmgr.go:635        Tls client config       {"addr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:05:25.900Z        INFO    35357ad4cc6831b alertmgr/alertmgr.go:148        finish alert-mgr        {"lineno": "alertmgr/alertmgr.go:131"}
2023-05-14T07:05:25.903Z        DEBUG   [email protected]+incompatible/reporter.go:311   flushed 1 spans
2023/05/14 07:05:29 len(packet.Children) < 2
2023/05/14 07:05:30 len(packet.Children) < 2
2023-05-14T07:05:55.902Z        INFO    781b5b59643193e5        alertmgr/alertmgr.go:131        start alert-mgr
2023-05-14T07:05:55.902Z        INFO    781b5b59643193e5        alertmgr/alertmgr.go:133        Sending Alerts to AlertMgr      {"AlertMrgAddr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:05:55.902Z        INFO    781b5b59643193e5        alertmgr/alertmgr.go:635        Tls client config       {"addr": "https://alertmanager.tef.edgexr.org:9094"}
2023-05-14T07:05:55.937Z        INFO    781b5b59643193e5        alertmgr/alertmgr.go:148        finish alert-mgr        {"lineno": "alertmgr/alertmgr.go:131"}
2023-05-14T07:05:55.940Z        DEBUG   [email protected]+incompatible/reporter.go:311   flushed 1 spans
2023/05/14 07:05:56 len(packet.Children) < 2
2023/05/14 07:06:11 len(packet.Children) < 2
2023/05/14 07:06:12 len(packet.Children) < 2
2023/05/14 07:06:12 len(packet.Children) < 2
panic: runtime error: makeslice: len out of range

goroutine 1383 [running]:
github.com/nmcclain/asn1-ber.resizeBuffer(...)
        github.com/nmcclain/[email protected]/ber.go:169
github.com/nmcclain/asn1-ber.ReadPacket({0x459ac40, 0xc001836090})
        github.com/nmcclain/[email protected]/ber.go:238 +0x31f
github.com/nmcclain/ldap.(*Server).handleConnection(0xc000234150, {0x45c3830, 0xc001836090})
        github.com/nmcclain/[email protected]/server.go:230 +0x7b
created by github.com/nmcclain/ldap.(*Server).serve
        github.com/nmcclain/[email protected]/server.go:214 +0xc5

We may need to patch/update the ldap library.

Change platforms to use golang SDKs where possible

The base-image, which is used for the Controller and CRM because it needs to support the platform-specific code, clocks in at 3.69GB. This is way too big for a docker container.
The python base image (python:3.9.14-slim-bullseye) is only 275MB or so. Python is needed for openstack commands.

Space used up by others:
kubectl: 45 MB
google-cloud-sdk/cli: 706 MB
azure-cli: 1135 MB
vcd-cli: 110 MB
govc: 32 MB
awscli: 320MB
aws eksctl: 120 MB

We could split the CRM into separate containers, one for each platform, but we can't do that for the Controller. The controller needs to call platform-specific code for Cloudlet setup/destroy.

Instead, most of the above have golang SDKs:
google-cloud-sdk: https://cloud.google.com/go/docs/reference
aws: https://aws.amazon.com/sdk-for-go/
azure: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go
vcd: https://github.com/vmware/go-vcloud-director

What I would propose is that we replace the shell calls to cli programs with go sdk libraries, so we don't need to install these tools. I'd start with the biggest offenders (google/azure/aws).
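
As an example of the savings, listing EC2 instances via the AWS SDK for Go (v1) replaces a shell call to the 320MB awscli; region and error handling are simplified for illustration:

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	// Rough equivalent of "aws ec2 describe-instances".
	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-west-2")})
	if err != nil {
		log.Fatal(err)
	}
	out, err := ec2.New(sess).DescribeInstances(&ec2.DescribeInstancesInput{})
	if err != nil {
		log.Fatal(err)
	}
	for _, res := range out.Reservations {
		for _, inst := range res.Instances {
			fmt.Println(aws.StringValue(inst.InstanceId), aws.StringValue(inst.State.Name))
		}
	}
}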

Email verification sent for non-existent user

On the UI's login page, users can click on "verify email", which will prompt them to enter an email address to which to resend the email verification link. MC will send this email even if the user doesn't exist. When the link in the email is clicked, the UI displays an "Oops, this link is expired" error.

Instead, when MC gets a request to resend a email verification email, if the user does not exist, it should send an email that says something like:

A request for email verification was made for <email>, but no user exists for this email address. If this action was not requested by you, please ignore this email. If this action was requested by you, you first need to create an account by using the "Register" link at https://console.<domain>, or using 'mcctl --addr https://console.<domain> user create'.

Remember that we cannot return an error directly to the email verification request, because that can be used by malicious actors to determine what email addresses are registered in our platform.

infra: kubernetes cluster as a cloudlet

It would be useful to allow an existing kubernetes cluster to be onboarded as a cloudlet. The user would submit a k8s config that has the certificates for accessing the cluster's API, and we can then manage the cloudlet as a single kubernetes cluster. Alongside resource quotas, this makes it easy to allocate a chunk of an existing kubernetes cluster to be managed by edge-cloud.

Format access vars as a keys/values in the vault as well as in the cloudlet object

Currently AccessVars for the openstack cloudlet are formatted as a single string delimited by newline characters. In addition, we store them as a single key/value in Vault. Example:
accessVarsTestGood["OPENRC_DATA"] = "OS_AUTH_URL=https://openstacktest.mobiledgex.net:5000/v3\nOS_PROJECT_ID=12345\nOS_PROJECT_NAME=\"mex\"\nOS_USER_DOMAIN_NAME=\"Default\"\nOS_PROJECT_DOMAIN_ID=\"default\"\nOS_USERNAME=\"mexadmin\"\nOS_PASSWORD=password123\nOS_REGION_NAME=\"RegionOne\"\nOS_INTERFACE=public\nOS_IDENTITY_API_VERSION=3"

And in vault this is what it looks like:

$ vault kv list secret/EU/cloudlet/openstack/TDG-OTC-TEST/berlin
Keys
----
openrc.json
$ vault kv get secret/EU/cloudlet/openstack/TDG-OTC-TEST/berlin/openrc.json
========================== Secret Path ==========================
secret/data/EU/cloudlet/openstack/TDG-OTC-TEST/berlin/openrc.json

======= Metadata =======
Key                Value
---                -----
created_time       2023-04-01T12:32:26.317510149Z
custom_metadata    <nil>
deletion_time      n/a
destroyed          false
version            1

=== Data ===
Key    Value
---    -----
env    [map[name:OS_AUTH_URL value:XXXXX] map[name:OS_PASSWORD value:XXXX] map[name:OS_USER_DOMAIN_NAME value:XXXX] map[name:OS_IDENTITY_API_VERSION value:3] map[name:OS_CACERT_DATA value:-----BEGIN CERTIFICATE-----<cert here>
-----END CERTIFICATE-----] map[name:OS_USERNAME value:username] map[name:OS_REGION_NAME value:RegionOne] map[name:OS_PROJECT_NAME value:osproj] map[name:OS_PROJECT_DOMAIN_NAME value:default]]

We should instead handle them as key value pairs at the controller level and store them as individual key values in the vault like this:

$ vault kv list secret/EU/cloudlet/openstack/TDG-OTC-TEST/berlin
Keys
----
OS_AUTH_URL 
OS_PASSWORD 
OS_USER_DOMAIN_NAME 
OS_IDENTITY_API_VERSION 
...

And on the UI side, we should populate the list of AccessVars similar to how we deal with EnvVars.
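
A small sketch of parsing the current newline-delimited OPENRC_DATA blob into the proposed key/value form; the helper name is illustrative:

package accessvars

import "strings"

// parseOpenRCData splits the legacy OPENRC_DATA string into key/value pairs
// that can be stored as individual Vault keys.
func parseOpenRCData(openrc string) map[string]string {
	vars := make(map[string]string)
	for _, line := range strings.Split(openrc, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		kv := strings.SplitN(line, "=", 2)
		if len(kv) != 2 {
			continue
		}
		// strip optional quotes, e.g. OS_PROJECT_NAME="mex"
		vars[kv[0]] = strings.Trim(kv[1], `"`)
	}
	return vars
}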

AppInst stuck in "In Progress"

From UI, an AppInst is stuck in "In Progress" state.

Creating new auto-cluster named reservable3 to deploy AppInst

Defaulting IpAccess to IpAccessShared for deployment kubernetes

Creating

Creating Heat Stack for frankfurt-main-reservable3-edgecloudorg

Creating Heat Stack for frankfurt-main-reservable3-edgecloudorg, Heat Stack Status: CREATE_IN_PROGRESS

Creating Heat Stack for frankfurt-main-reservable3-edgecloudorg, Heat Stack Status: CREATE_COMPLETE

Waiting for Cluster to Initialize

Waiting for Cluster to Initialize, Checking Master for Available Nodes

This appears to be due to Redis stream state not being cleared properly.

Use TLS for PostgreSQL DB connection

Currently MC assumes the postgres DB is local, so no TLS is being used. However, in the future it may not be local, and even if it's local, it is still recommended to use TLS.

Also not using TLS gets flagged in security audits.
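
For reference, enabling TLS is mostly a matter of the connection string, assuming the standard lib/pq driver; the host, user, and certificate paths below are placeholders:

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // postgres driver
)

func main() {
	// sslmode=verify-full enforces encryption and verifies the server cert.
	dsn := "host=postgres.internal port=5432 user=mc dbname=mc " +
		"sslmode=verify-full sslrootcert=/secrets/postgres-ca.crt"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("connected to postgres over TLS")
}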

certgen fails to update public cert

Certgen failed to update the public cert for the controller's access api endpoint.

This could be seen with the following command:

echo | openssl s_client -showcerts -connect eu.ctrl.{DOMAIN}:41001 2>/dev/null | openssl x509 -inform pem -noout -text

The fix was to go into the certgen pod and run the renew command manually. The renew command runs as part of cron:

# On global k8s:
kubectl exec -it certgen-644f57f6f6-7h7qd -- bash
## In pod:
/etc/letsencrypt/live# openssl x509 -in eu.ctrl.DOMAIN/cert.pem -text

more /etc/crontab
9 4,16 * * * root certbot renew >/proc/1/fd/1 2>/proc/1/fd/2

certbot renew
Saving debug log to /var/log/letsencrypt/letsencrypt.log

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/_.abcdef.edgexr.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Certificate not yet due for renewal

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/_.dme.abc.edgexr.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Renewing an existing certificate for *.dme.abc.edgexr.org
Waiting 10 seconds for DNS changes to propagate

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/_.abc-01-abc.eu.app.abc.edgexr.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Renewing an existing certificate for *.abc-01-acb.eu.app.abc.edgexr.org
Waiting 10 seconds for DNS changes to propagate

Certbot failed to authenticate some domains (authenticator: dns-cloudflare). The Certificate Authority reported these problems:
  Domain: abc-01-abc.eu.app.abc.edgexr.org
  Type:   dns
  Detail: DNS problem: NXDOMAIN looking up TXT for _acme-challenge.abc-01-abc.eu.app.abc.edgexr.org - check that a DNS record exists for this domain

Hint: The Certificate Authority failed to verify the DNS TXT records created by --dns-cloudflare. Ensure the above domains are hosted by this DNS provider, or try increasing --dns-cloudflare-propagation-seconds (currently 10 seconds).

Failed to renew certificate _.abc-01-abc.eu.app.abc.edgexr.org with error: Some challenges have failed.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/_.abc-gpu-abc.eu.app.abc.edgexr.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Certificate not yet due for renewal

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Processing /etc/letsencrypt/renewal/eu.ctrl.abc.edgexr.org.conf
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Renewing an existing certificate for eu.ctrl.abc.edgexr.org
Waiting 10 seconds for DNS changes to propagate

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The following certificates are not due for renewal yet:
  /etc/letsencrypt/live/_.acb-abc.eu.app.abc.edgexr.org/fullchain.pem expires on 2023-02-19 (skipped)
  /etc/letsencrypt/live/_.abc-gpu-abc.eu.app.abc.edgexr.org/fullchain.pem expires on 2023-03-16 (skipped)
The following renewals succeeded:
  /etc/letsencrypt/live/_.dme.abc.edgexr.org/fullchain.pem (success)
  /etc/letsencrypt/live/eu.ctrl.abc.edgexr.org/fullchain.pem (success)

The following renewals failed:
  /etc/letsencrypt/live/_.abc-01-abc.eu.app.abc.edgexr.org/fullchain.pem (failure)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 renew failure(s), 0 parse failure(s)
Ask for help or search for solutions at https://community.letsencrypt.org. See the logfile /var/log/letsencrypt/letsencrypt.log or re-run Certbot with -v for more details.

May need to increase 10sec timeout for validation.

infra (openstack): cryptic no host found error message

On Openstack, creating a cluster via a Heat stack often returns a generic error message:

error: 'rpc error: code = Unknown desc = Encountered failures: Create failed:
Cluster VM create Failed: Heat Stack failed: Resource CREATE failed:
ResourceInError:
resources.mex-k8s-master-frankfurt-main-jon-gpu-test-jongainsley-tester: Went
to status ERROR due to "Message: No valid host was found. There are not enough
hosts available., Code: 500"'

It's not entirely clear from this message what the problem is - are there no more vcpus/gpus/mem/disk or is this some other error? We should capture this error and provide a better error message to the user to inform them how to rectify this problem (change their cluster config or just use another cloudlet, etc.).
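
A sketch of the kind of error mapping that could help; the wording and where the hook would live are open questions:

package vmlayer

import (
	"fmt"
	"strings"
)

// improveInfraError translates known cryptic infra errors into actionable
// messages for the user, while preserving the original error for debugging.
func improveInfraError(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "No valid host was found") {
		return fmt.Errorf("the cloudlet does not have enough capacity (vcpu/gpu/mem/disk) to place the requested VMs; try a smaller cluster configuration or a different cloudlet (infra error: %v)", err)
	}
	return err
}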

CRM should reload accessvars if they are updated

If the user does a CloudletUpdate and updates the access vars, the CRM should notice that and reload the access vars, since it caches them in memory at start up. Otherwise we need to manually restart the CRM to get it to reload the new access vars.

Remove version.go from compile

If the repo tag changes, this causes the compiled binary to change even if none of the source code has changed. This presents a problem for docker builds, which would like to avoid uploading a new version of the binary if nothing has actually changed. This is especially true for the edge-cloud-platform, which has 7 or 8 different services that are statically compiled Go binaries, each clocking in at anywhere from 20MB-80MB.

Instead we could add the version tag as an external file, that would be a very small layer change and would not affect the binary.
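
A sketch of reading the version from a file at runtime instead of compiling it in; the file path is an assumption:

package version

import (
	"os"
	"strings"
)

// Get reads the version string from a small file that is added as its own
// (tiny) docker layer, so re-tagging a release does not change the large
// binary layers.
func Get() string {
	data, err := os.ReadFile("/etc/edge-cloud-version")
	if err != nil {
		return "unknown"
	}
	return strings.TrimSpace(string(data))
}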

e2e test failure: etcdserver: mvcc: required revision is a future revision

In the Controller's logs for e2e-tests and unit-tests which run an etcd process, intermittently we see errors from etcd:

etcdserver: mvcc: required revision is a future revision.

The root cause is having multiple etcd processes hammering a single disk. The issue is resolved by having etcd use a ramdisk for testing.

On MacOS Darwin, specifying the env var ETCD_RAMDISK_SIZE=1.5 will automatically create a 1.5G ramdisk for both unit and e2e tests.
On Linux, creating a ramdisk requires sudo permissions, so setting ETCD_RAMDISK_SIZE will enforce that a ramdisk exists, but it must be created manually via:

mkdir -p $HOME/ramdisk
sudo mount -t tmpfs -o size=2g tmpfs $HOME/ramdisk

Hopefully this should not be seen in production due to etcd processes being spread across different servers, but if it is, it may be due to a slow or busy disk.

pod logs/terminal access for helm deployments

For standard kubernetes deployments we allow log/terminal access to pods. However, we do not do the same for pods deployed via Helm charts. We should support this as well for helm deployments.

infra (vmlayer): allow configuring whether pods run on master node

Right now I believe we taint the master node so no pods can run on it, unless the cluster is just a single master node. This is fairly wasteful for small clusters with only one or two worker nodes, especially since, if there is no regional setting for a master node flavor, the master node flavor defaults to the worker node flavor.

We may want to consider allowing the user to decide whether the master node can run pods for their own clusters.

cluster-svc: alertPolicy clobbering for multiple Apps in the same cluster

In cluster-svc-main.go:createAppInstCommon(), the cluster-svc creates Alerts based on the user App's alert policy. A second App with another alert policy will overwrite the first App's alert policy for the cluster.

When fixing this, we should consider how to also deal with multiple instances in a multi-tenant Kubernetes cluster.

infra (openstack): monitor network quotas

Openstack implements quotas on resources we can use. We have some monitoring and alerting around cpu/mem quotas, but none on network quotas. We recently hit errors on deploying AppInsts because we hit a security group rule quota limit:

    "message": "Encountered failures: Create App Inst failed: Heat Stack failed: Resource CREATE failed: OverQuotaClient: resources.win1010-mytestorg.*****.eu.app.****.edgexr.org-sg: Quota exceeded for resources: ['security_group_rule'].\nNeutron server returns request_ids: ['req-77ff1c7a-e6de-4d7e-b7b6-c436c1cd86e9']"

Unfortunately at this time it appears we can't even see what those quota limits are.

# openstack quota list --network
You are not authorized to perform the requested action: identity:list_projects. (HTTP 403) (Request-ID: req-e4cf610b-d760-42d9-b301-3aadba75687c)

We should add listing quotas to the list of openstack commands needed for onboarding a new openstack cloudlet.

We should add monitoring and alerting when we are close to using up any of our network quotas.

federation: support resource reservation

Federation EWBI APIs include APIs around resource reservation and resource reporting which we have not implemented yet. This issue tracks the need to implement them.

mcctl: suppress status code on stream api failure

For stream APIs, the HTTP status code is always 200 to be able to start the stream. If the stream fails midway, mcctl prints out the status code in addition to the error. This is confusing because mcctl prints a successful status code with an error.

Instead, mcctl should avoid printing the 200 status code during a stream error.

Example to fix:

mcctl --addr https://console.xyz.edgexr.org cloudlet create region=EU cloudlet=abc-test \
  cloudletorg=TDG location.latitude=30.72248 location.longitude=17.1422 numdynamicips=20 \
  platformtype=Openstack physicalname=main
message: Sourcing access variables
message: 'Creating VM Image from URL: edgecloud-v4.10.0'
message: no flavors found
message: Deleting Cloudlet due to failures
message: Deleting cloudlet
message: Deleting RootLB abc-test.xyz.app.xyz.edgexr.org
message: Deleting Platform VMs abc-test-xyz-pf
message: Deleting Cloudlet Security Rules abc-test.xyz.app.xyz.edgexr.org
Error: OK (200), No flavors found

Federation: show jsonToDbNames not honoring gorm embedded tag

The jsonToDbNames function converts show command input from json names to db names so that postgres can filter for show commands. However, for federation host/guest objects, which use a gorm embedded tag, the function does not properly convert the json "myinfo" name to the gorm embedded "my_federator" database name. The function needs to take the gorm embedded tag into account.

 request: |-
  {
    "name": "fedtest-host",
    "operatorid": "edgexr",
    "regions": [
      "US"
    ],
    "myinfo": {
      "countrycode": "US",
      "mcc": "123",
      "mnc": [
        "123"
      ]
    }
  }
 respheaders: |-
  {
    "Content-Type": "application/json; charset=UTF-8"
  }
 response: >-
  {
    "message": "Failed to parse input data: JSON field myinfo not found in database object FederationProvider"
  }
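
A rough sketch of how the json-to-db name lookup could take the gorm embedded tag into account; the name conversion is simplified here (the real code would reuse gorm's naming helpers):

package ormapi

import (
	"reflect"
	"strings"
)

// dbNameForJSON finds the struct field matching a json name and returns its
// db column name, or the embedded prefix if the field is a gorm-embedded
// struct so that nested json names can be resolved against it.
func dbNameForJSON(t reflect.Type, jsonName string) (string, bool) {
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		name := strings.Split(f.Tag.Get("json"), ",")[0]
		if name == "" {
			name = strings.ToLower(f.Name)
		}
		if !strings.EqualFold(name, jsonName) {
			continue
		}
		gormTag := f.Tag.Get("gorm")
		if strings.Contains(gormTag, "embedded") {
			for _, part := range strings.Split(gormTag, ";") {
				if strings.HasPrefix(part, "embedded_prefix:") {
					return strings.TrimPrefix(part, "embedded_prefix:"), true
				}
			}
		}
		return strings.ToLower(f.Name), true // simplified; use gorm's ToDBName
	}
	return "", false
}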

admin: allow access to k8s cluster

For debugging, it is useful for the admin to be able to access a kubernetes cluster via kubectl. This is currently only possible with knowledge of how to access the cloudlet directly (or by going via CRM access, which is clunky). For security we don't allow users to directly access their own clusters, but for admins this access is useful to help debug issues.

Support for Red Hat Openstack DCN

This request came from a customer to support cloudlets with a small compute footprint (2-3 servers).
The Red Hat Openstack platform supports a DCN (distributed compute node) model, where the compute nodes can be remote from the control plane.
We should be able to provide similar support by running the CRM/Shepherd outside of the small cloudlet.

Reference:
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/distributed_compute_node_and_storage_deployment/understanding-dcn

infra (vcd): VM instances get their own LB VM, even for shared LB mode

On VCD (and perhaps other VM infra), deploying a shared-access VM instance also instantiates a dedicated LB for the VM, even though we've specified shared-access via the shared root LB.

This is an inefficient use of resources because we should be using the shared LB instead. This was implemented as part of mobiledgex/edge-cloud-infra#1619.

Talking with Jim, he mentioned:

we used to have direct or lb access. But since there is no way to define a clusterInst for VM apps, we had to pick one or the other. So we went with LB as it also gives us envoy/stats/etc
we are also using the lb for dhcp in some cases to get ips on the vm
Q: But stats can be gotten through the shared root LB. DHCP can also be run from the shared Root LB.
VM apps do not support shared lb, there is no way to specify it because there is no cluster created

This sounds like just a code config issue, and not a technical issue. Perhaps it can be fixed.

MC: websocket write broken pipe

A long idle command resulted in a broken pipe between MC and the UI via a websocket. I'm not sure if this is an issue with keepalives, the browser (Thomas was running the command), or the network in between. Probably hard to determine. In any case filing an issue with the relevant MC logs in case it happens again.

2023-04-18T20:53:39.849Z        INFO    24a95aa8dc9903e2        orm/auditlog.go:85      start /ws/api/v1/auth/ctrl/CreateAppInst
2023-04-18T20:53:39.850Z        INFO    24a95aa8dc9903e2        gormlog/logger.go:69    Call sql        {"sql": "SELECT * FROM \"configs\"  WHERE \"configs\".\"id\" = $1 ORDER BY \"configs\".\"id\" ASC LIMIT 1", "vars": [1], "rows-affected": 1, "took": "1.174115ms"}
2023-04-18T20:53:39.850Z        INFO    24a95aa8dc9903e2        ormutil/auth.go:90      get claims: no user
2023-04-18T20:53:39.889Z        INFO    24a95aa8dc9903e2        gormlog/logger.go:69    Call sql        {"sql": "SELECT * FROM \"organizations\"  WHERE (\"organizations\".\"name\" = $1) ORDER BY \"organizations\".\"name\" ASC LIMIT 1", "vars": ["mytestorg"], "rows-affected": 1, "took": "969.013µs"}
2023-04-18T20:53:39.891Z        INFO    24a95aa8dc9903e2        gormlog/logger.go:69    Call sql        {"sql": "SELECT * FROM \"configs\"  WHERE \"configs\".\"id\" = $1 ORDER BY \"configs\".\"id\" ASC LIMIT 1", "vars": [1], "rows-affected": 1, "took": "1.293217ms"}
2023-04-18T20:53:39.892Z        INFO    24a95aa8dc9903e2        gormlog/logger.go:69    Call sql        {"sql": "SELECT * FROM \"org_cloudlet_pools\"  WHERE (\"org_cloudlet_pools\".\"org\" = $1) AND (\"org_cloudlet_pools\".\"region\" = $2)", "vars": ["mytestorg","EU"], "rows-affected": 2, "took": "846.911µs"}
2023-04-18T20:53:39.892Z        INFO    24a95aa8dc9903e2        ctrlclient/cloudletpool.ctrl.go:67      start controller api
2023-04-18T20:53:39.894Z        INFO    24a95aa8dc9903e2        ctrlclient/cloudletpool.ctrl.go:94      finish controller api
2023-04-18T20:53:39.894Z        INFO    24a95aa8dc9903e2        ctrlclient/appinst.ctrl.go:34   start controller api

...no logs for the trace in the interim...

2023-04-18T21:04:55.312Z        INFO    24a95aa8dc9903e2        ctrlclient/appinst.ctrl.go:51   finish controller api
2023-04-18T21:04:55.312Z        INFO    24a95aa8dc9903e2        orm/server.go:1280      Failed to write error to websocket stream       {"err": "write tcp 10.244.0.176:9900->10.244.2.4:33212: write: broken pipe", "writeErr": "write tcp 10.244.0.176:9900->10.244.2.4:33212: write: broken pipe"}
2023-04-18T21:04:55.313Z        INFO    24a95aa8dc9903e2        orm/auditlog.go:368     finish /ws/api/v1/auth/ctrl/CreateAppInst       {"remote-ip": "62.143.93.203", "app": "win10", "error": "write tcp 10.244.0.176:9900->10.244.2.4:33212: write: broken pipe", "cluster": "DefaultVMCluster", "cloudlet": "***", "federatedorg": "", "email": "***", "org": "mytestorg", "request": "{\"region\":\"EU\",\"appinst\":{\"key\":{\"app_key\":{\"organization\":\"mytestorg\",\"name\":\"win10\",\"version\":\"1.0\"},\"cluster_inst_key\":{\"cloudlet_key\":{\"name\":\"***\",\"organization\":\"***\"},\"cluster_key\":{\"name\":\"DefaultVMCluster\"},\"organization\":\"mytestorg\"}},\"flavor\":{\"name\":\"m4.xlarge\"}}}", "response": "{\"code\":200,\"data\":{\"message\":\"Creating\"}}\n{\"code\":200,\"data\":{\"message\":\"Creating VM Image from URL: win10.-300c8c5865096018b2857b806f42b9f9\"}}\n{\"code\":200,\"data\":{\"message\":\"Deploying App\"}}\n{\"code\":200,\"data\":{\"message\":\"Creating Heat Stack for mytestorgwin1010-***-***\"}}\n{\"code\":400,\"data\":{\"message\":\"Write tcp 10.244.0.176:9900-\\u003e10.244.2.4:33212: write: broken pipe\"}}", "method": "GET", "username": "***", "clusterorg": "mytestorg", "status": 200, "level": "audit", "region": "EU", "lineno": "orm/auditlog.go:85", "apporg": "mytestorg", "appver": "1.0", "cloudletorg": "***"}
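
If the root cause turns out to be missing keepalives, one mitigation is a periodic ping on the otherwise idle websocket while the long-running controller call is in flight. This sketch assumes the gorilla/websocket package (an assumption about MC's websocket library); intervals are illustrative:

package orm

import (
	"time"

	"github.com/gorilla/websocket"
)

// wsKeepAlive sends periodic pings so browsers, proxies, and load balancers
// do not silently drop an idle websocket during a long-running command.
func wsKeepAlive(conn *websocket.Conn, done <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			deadline := time.Now().Add(10 * time.Second)
			if err := conn.WriteControl(websocket.PingMessage, nil, deadline); err != nil {
				return // connection is gone; the caller will see the write error
			}
		}
	}
}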

infra (vcd): manage edge router and its firewall/NAT rules

In previous VCD deployments, the edge router and its firewall/NAT rules were pre-configured by the operator. However, in recent deployments, the operator has created the edge router but left the management of firewall and NAT rules on the router to us. Currently we have set the rules manually, but this should be done programmatically by the platform.

Rules include:

  • firewall ingress any for tcp 80/443/53 udp 53
  • firewall egress any for specific routing network
  • appinst ingress any for specific ip range of externally-routable IPs
  • DNAT/SNAT rules for each LB, mapping a public IP address to an internal 10.10.0.x IP address

mcctl better support for multiple deployments

Mcctl login stores a single token regardless of which deployment you log into. This means if you are switching between multiple deployments, you need to log in again every time you switch.

Instead, it would be better if mcctl stored the token per address.
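
A sketch of a per-address token store; the file location and format are illustrative:

package cli

import (
	"encoding/json"
	"os"
	"path/filepath"
)

// tokens maps MC address -> auth token, persisted as JSON.
type tokens map[string]string

func tokenFile() string {
	home, _ := os.UserHomeDir()
	return filepath.Join(home, ".mcctl", "tokens.json")
}

func saveToken(addr, token string) error {
	all := tokens{}
	if data, err := os.ReadFile(tokenFile()); err == nil {
		json.Unmarshal(data, &all) // ignore a corrupt file and start fresh
	}
	all[addr] = token
	data, err := json.MarshalIndent(all, "", "  ")
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(tokenFile()), 0700); err != nil {
		return err
	}
	return os.WriteFile(tokenFile(), data, 0600)
}

func loadToken(addr string) (string, bool) {
	all := tokens{}
	data, err := os.ReadFile(tokenFile())
	if err != nil || json.Unmarshal(data, &all) != nil {
		return "", false
	}
	token, ok := all[addr]
	return token, ok
}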

general and federation: add App Group feature

A few customers have asked for this in the past, and this is baked into the federation EWBI (although it doesn't appear to be well thought out). But having a way for developers to create a new App definition that is a combination of existing App definitions to be able to deploy them together is a useful feature. How to define their interaction with each other (if it's supported) is something that needs to be explored.

container runcommand: copy files to and from container

RunCommand allows getting container logs, and running commands/shell in the container. A customer was trying to copy out a medium sized file from their container to their local machine. It would be nice if there was a way to transfer a binary file securely in or out of the target container.

Currently the customer would need to use ssh/scp to do this, which requires enabling ssh/scp access to their container/local machine, which is not feasible in many situations.

Support App secrets

Support user-supplied secrets on App definitions. These should be stored in Vault, and applied as env vars during deployment for docker, and as env vars or k8s secret objects for kubernetes.
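
For the kubernetes side, a minimal sketch of turning user-supplied secrets into a Secret object; the Vault read and naming conventions are out of scope and the helper name is illustrative:

package k8smgmt

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// appSecretsToK8sSecret builds an Opaque Secret from the App's secrets; the
// StringData field lets us pass plain strings and have the API server handle
// the base64 encoding into Data.
func appSecretsToK8sSecret(appName, namespace string, secrets map[string]string) *corev1.Secret {
	return &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{
			Name:      appName + "-secrets",
			Namespace: namespace,
		},
		Type:       corev1.SecretTypeOpaque,
		StringData: secrets,
	}
}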

vcd: security: cross talk between tenants via additional networks

This may be a general issue beyond just VCD.
During VCD testing, it was noted that when clusters were created from separate developer tenants, both with additional networks defined in VCD as:
10.241.0.1/16 isolated
10.199.0.1/24 isolated

one tenant's VM could ping the other tenant's VM over the network. Because users assign networks to ClusterInsts, it is possible for a developer to reach another developer's VM via the additional network.

We need to prevent developers from being able to reach other developers' VMs via additional networks, as this poses a security risk.

vcd: apptemplate code tries to upload converted vmdk to public repo

The VCD platform converts qcow2 images to vmdk, then tries to upload the result to the vm-registry for storage. However, if the imagepath comes from a publicly available image repository, e.g. https://cloud-images.ubuntu.com/, it tries to upload the ovf+vmdk to ubuntu.com and fails.

Instead, VCD should detect that the image does not come from the vm-registry, skip the upload-to-vm-registry step, and just create the image directly in the target VCD.

MC: config option to log most show commands

Currently there is an MC config option to log show commands, which is very useful for debugging. However, there are probably a few commands that we don't want to log for most debug sessions, such as:

  • Login
  • audit/event/event terms show commands (their output can fill up the log file)
  • ShowAlert (UI runs it every 30 secs)

I suggest we add another option, "log show for debug" which logs all show commands except the aforementioned ones.
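
A small sketch of how the new option could decide whether to log a given show request; the paths in the skip list are illustrative stand-ins for the commands listed above, not the actual route names:

package orm

import "strings"

// noisyShowPaths are show-style requests we never want in debug audit logs.
var noisyShowPaths = []string{
	"/login",
	"/events/show",
	"/events/terms",
	"/audit/show",
	"/ctrl/ShowAlert",
}

func shouldLogShow(requestURI string, logShowForDebug bool) bool {
	if !logShowForDebug {
		return false
	}
	for _, p := range noisyShowPaths {
		if strings.Contains(requestURI, p) {
			return false
		}
	}
	return true
}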

Change platform type to string

Change infra platform type to string. This will make it easier to add new infra platform support, without needing to modify the protobuf definitions. We would need some way to query what platforms are supported. This should come from the platform/plugin code.
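
A sketch of a string-keyed platform registry that could back such a query; the Platform stand-in and function names are illustrative:

package platform

import (
	"fmt"
	"sort"
)

// Platform stands in for the existing platform interface.
type Platform interface{}

type Builder func() Platform

var registry = map[string]Builder{}

// Register is called by each infra plugin, e.g. Register("openstack", newOpenstackPlatform).
func Register(name string, b Builder) {
	registry[name] = b
}

func New(name string) (Platform, error) {
	b, ok := registry[name]
	if !ok {
		return nil, fmt.Errorf("unknown platform type %q, supported: %v", name, Supported())
	}
	return b(), nil
}

// Supported returns the platform type strings, which an API could expose so
// clients can query which platforms a deployment supports.
func Supported() []string {
	names := make([]string, 0, len(registry))
	for n := range registry {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}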
