cloudnativesdwan / cnwan-operator Goto Github PK

View Code? Open in Web Editor NEW

21.0 7.0 12.0 695 KB

License: Apache License 2.0

Dockerfile 0.87% Makefile 1.00% Go 88.27% Smarty 1.29% Shell 8.56%

kubernetes service-registry sd-wan aws-cloudmap gcp service-directory service-discovery

cnwan-operator's People

Contributors

Stargazers

Watchers

Forkers

joeyjarosz asimpleidea lwlxwang ljakab varadarajana joearul09 hayleyshim lulianbing58 bhandiwad hecg119 jgomezve kodesam

cnwan-operator's Issues

Remove dependency from mounted secrets for service directory

Right now we are create a Kubernetes Secret with Google's service account as its content and we mount it in the operator's pod's container.

This was fine until now, but with the support of etcd this is not a scalable solution anymore:

We may create a file with the username and pass for etcd there and keep it mounted with the same name leaving the operator in charge of unmarshal it properly (it knows how to do that because it knows what service registry we are using)...
but if user has no authentication in etcd, we have nothing to mount and so we have to modify the deployment
if user uses TLS we know have two files to mount and yet again

Mounting the settings file is fine because that never changes and we always expect it and we expect it to be the same way.
But this is no longer a good solution for credentials.

With etcd we are using a different approach: according to the authentication method, a Secret must still be created of course, but now it's not mounted anymore, as the operator will pull the Secret from Kubernetes on its own.

Service directory will experience the same thing: user creates the Secret with their service account and it won't be mounted, but rather it will be pulled by the operator automatically.

Create script to automate etcd installation

Maybe we should add these steps as a script to scripts/etcd_init.sh or similar...? Anything that decreases friction and gets the user up and running quickly is great! :D

Originally posted by @ljakab in #25 (comment)

Improve namespace allow/block system

Brief

Improve the way to make the operator monitor namespaces with an easier, more "semantic" and less error-prone way.
Note: this is not an urgent issue, but can improve usability of the operator. Code-wise it won't be hard to change, but depending on how users view the current system (easy or confusing) this maybe long or short term.

Current way of doing it

An administrator/user can define which namespaces the operator can monitor by putting namespaces inside an allowlist or blocklist.

If settings are set as listPolicy: allowlist then the operator will only monitor namespaces that have label operator.cnwan.io/allowed: yes (but really the value can be anything, more on this later).
On the contrary, listPolicy: blocklist will make the operator monitor all namespaces and avoid those that have label operator.cnwan.io/blocked: yes (again, the value can really be anything here).

Issue

This system is very convenient but can certainly be improved, as it has some shady parts that can be considered bad practice or misleading.

Labels

We have two labels to filter a namespace (maybe filter is not the right word here, but it explains weel), instead of one. What happens when a user - by mistake - puts both labels operator.cnwan.io/allowed and operator.cnwan.io/blocked in the same namespace? The operator will know which one is the one that it has two check. But nonetheless this can confuse users, who can potentially mis-configure the operator, which in turn can have unwanted/unexpected behaviors.
- Users may be tempted to do operator.cnwan.io/allowed: no. But this will have no effect as the key is actually the important part (read next point) and will bring even more confusion.
operator.cnwan.io/allowed: yes and operator.cnwan.io/blocked: yes are a bad use of labels and a bad practice as a consequence: in labels and metadata what is actually meaningful is the value, not the key. They key is supposed to be the differentiator and holder (after all, you can think of it as a YAML key), while the value is supposed be the important thing because it tells how key must be configured. Instead, we are actually considering the key here (../allowed or ../blocked) as the actual value, which is a bad practice because it is not what its job is.
- As a consequence, the value is really meaningless because it can be whatever thing since we consider the key to be important. Making it more meaningful by forcing it to have yes or true is useless, because in that case it might just as well not be there entirely. Additionally, I think it would be pointless to have a key that can have only two values (synonyms) and the others are "invalid" (for lack of other word) in light of the previous sentence.
Minor thing as it is just a language issue: but the operator is not really allowed or blocked to do anything here so it is just a minor confusion point.

List Policy

The current way to set the default list as listPolicy: allowlist/blocklist is good and can even stay like this, but I think that it can be improved and made easier to understand.

As said above, there is no real allow/block here as there is no real concept of "permissions" that you can give to the operator.
The way it works can be re-worked in a easier way.
Minor thing: is "policy" a good/fitting word here?
Minor thing (but it may just be me): when re-writing the documentation and describing this part, I had the feeling that listPolicy: allowlist may be wrongly interpreted as "namespaces are in the allowlist by default" if one does not read the documentation.

Changes

I intend/propose to change the operator to make it better and more understandable with following ways.

From ListPolicy to DefaultBehavior

The listPolicy setting can be moved to a better description.

For example: listPolicy can be changed to defaultBehavior or defaultAction to make it clear that it modifies the default way the operator works.

listPolicy: allowlist can be changed to defaultBehavior: ignore which makes the operator ignore (I think this is a better fitting word) a namespace by default.
listPolicy: blocklist can be changed to defaultBehavior: watch which makes the operator watch (again, better fitting word) a namespace by default.

Labels

This is more pressing issue.

Instead of having two different keys, have only one that is more good practice: operator.cnwan.io/action. The prefix operator.cnwan.io again proves that what comes next (action) only makes sense for the cnwan operator (just as k8s.io/something makes sense for kubernetes internal purposes).
- There is no risk of having two keys and maybe messing up something.
To watch a namespace use values watch, include, enable and maybe something else. I think that operator.cnwan.io/action: watch is more fitting and better understandable. watch is of course enough, but multiple synonyms allow a user not to have to look at the documentation if they don't remember the valid word.
On the contrary, ignore, disable, exclude, etc. will replace operator.cnwan.io/blocked.

Upgrade

For people that used versions where the old names were used, an upgrade.sh shell be provided that will automatically write the correct values and change the labels accordingly.
A DEPRECATION NOTICE will be included in the documentation.

Support

The old mechanics will be supported for x (like, 2?) major version cycles with the DEPRECATED notice.

Update service directory package to v1

Recently, the golang package for service directory has been updated to v1 from v1beta1 (what we are using now).

The operator will be updated to include this change. Since this is not critical, it will probably come within other PRs.

Create kubernetes probes

At the beginning of the Operator project I didn't really find a reason to create probes for the operator -- e.g. readiness probes and/or liveness probes. But now, since much effort is going to be done to improve the operator and move away from some component of kube-builder and operators in general, I am thinking of creating probes to the operator.

For example, since the operator will not have dependencies on mounted files, the operator will be able to to start even without any settings or secrets deployed to kubernetes and it will not crash (what it does now), but will merely wait for it to happen, sending not ready values to the probes. This will make the operator more flexible as it will wait for users to deploy the needed file and this does happen the probe will receive a ready state. The normal behavior will occur, which is that after x consecutive not ready values, the operator will be restarted, which currently happens immediately right now.
There are many other use cases, which I will show when this is implemented.

Update documentation with current K8s and GCP requirements

The Operator requires a particular version of K8s that should be documented. It also requires as of now access to GCP public APIs, which needs to be taken into account specially in private deployments behind a proxy. These should be documented.

Create a Helm chart for the operator

Installing the operator ideally should be as simple as helm install cnwan-operator. We should create a Helm chart that:

Can install the operator with sensible defaults
Allows customization of all the options currently available when installing manually
Optionally pulls in and installs etcd using the bitnami/etcd chart, with sane defaults, but allowing customization of the most critical values

Support Ingress

I will start experimenting with Ingress resources, so that we could start supporting it, since it makes sense for many users to use an ingress to expose their services inside the cluster. And because it's about time :)

Add `latest` tag too when pushing GitHub Package

The workflow from https://github.com/CloudNativeSDWAN/cnwan-operator/blob/master/.github/workflows/publish-github-package.yml pushes to GitHub Packages with the tag created from the version number. In addition to that it should also tag the same image as latest.

Implement workload identity for AWS

Instead of providing secrets with tokens, the operator could use AWS's v2 SDK (which is already using) for logging in, in case it is running on EKS.

For reference: https://blog.jimmyray.io/kubernetes-workload-identity-with-aws-sdk-for-go-v2-927d2f258057

Update go mod

Right now, we use go version 1.13.

A new update will make the operator use 1.15 and will re-run init to clean a lot dependencies.

Remove viper dependency

On next versions, dependency from Viper will be completely removed in favor of local settings passed by value and only the ones that are actually needed by the package/struct.

Update controllers

Currently the project uses version v0.6.3 of controller runtime, which is 1.5 years old.

Updating controller runtime to a more recent version requires many changes to our controllers as versions after v0.6.3 introduce breaking changes.

As such, the controllers will be revised and updated in order to utilize a recent version of the controller runtime (ideally always the latest of course).

Explore auto-release notes

Github has a feature to do Auto-release notes.

Need to look into this but I think that it means that a PR can be opened and when tagging you can link that PR to automatically get release notes.

Modify versioning format

To better reflect its status as beta, versioning will be changed from current 2.0.0 to 0.<something>.0.

Cloud Map tags are not being removed properly

While making tests today I realized that Cloud Map treats tags as incremental, so if an object has tags a=1 and you want to update them as a=1;b=2 it will work fine. But if you want to update them as b=3; then you will first have to explicitly call a delete operation for a=1, as that will not be removed with an update only.

This is not really urgent and is being fixed in the new API but will be fixed in the operator as well if request is there as this will inevitably mean implementing a diff algorithm, which -- as of current state of code -- could mean having to remove everything and re-adding everything again.

Make packages stable

When etcd package will be merged, we will have public packages that will be used by other repositories in a convenient way (first example: the cnwan reader).

So, public packages will undergo some modifications (in some cases very lightly) to make them more usable and organized and reach some kind of a v1 for them. These changes are heavily inspired by other stable packages.

This is just a draft of how they will be

.
└── pkg
    ├── serviceregistry
    │   ├── core
    │   ├── broker
    │   ├── etcd
    │   ├── gcp
    │   │   └── servicedirectory
    │   ├── aws
    │   │   └── cloudmap
    │   └── coredns
    ├── controllers
    ├── internal
    └── settings

Package path	Description
`pkg`	This is where all packages will live. Having them under `pkg` is a best practice in golang. NOTE: I will move controllers here as well.
`pkg/serviceregistry`	Main package that contains all the others. Contains no code and only has a `doc.go` that documents the package as a whole for `pkg.go.dev`. Right now it is called `servregistry` but will be called `serviceregistry` with this change.
`pkg/serviceregistry/core`	Contains just the definitions: the `ServiceRegistry` interface, the KeyBuilder interface, and all the objects (Namespace, Service and Endpoint), common errors and a new error type which will be used by all other subpackages here. No implementation. All functions will have full names, i.e. `CreateEndpoint` instead of `CreateEndp`, and will use short names as golang suggests only for internal and not exposed stuff.
`pkg/serviceregistry/broker`	Broker will be promoted to its own package
`pkg/serviceregistry/etcd`	Implementation of `ServiceRegistry` with etcd
`pkg/serviceregistry/gcp`	no code package, just a container for sub packages. i.e Service Directory and maybe dns with service directory or something
`pkg/serviceregistry/gcp/servicedirectory`	implementation of `ServiceRegistry` with service directory
etc....

Better errors

On next versions, more descriptive and idiomatic errors will be implemented for all packages, probably with custom structs that inherit from errors. This will make it much more easier to understand why a feature didn't work and what object experienced the error.

Fix context deadline exceeded issues

Right now, there is an issue that doesn't allow the operator to understand when an error is due to time out expiration. The operator can still terminate a call due to time out, but when it checks the error to see if the timeout expired, that check never happens because of mis-matched types. In fact this mostly happen when grpc, because it encapsulates the error and thus is not of type context deadline exceeded anymore.

This is not a critical bug but will be fixed on next versions anyway.

Utilize --ignore-not-found on scripts

as per title (this is just a reminder for next commits)

Remove default annotations

Currently, settings for the operator have cnwan.io/profile as default allowed annotations.

A new update will remove this, leaving the user complete choice during deployment phase.

Add `latest` tag too when pushing Docker Hub image

The workflow from https://github.com/CloudNativeSDWAN/cnwan-operator/blob/master/.github/workflows/push_image.yml pushes to Docker Hub with the tag created from the version number. In addition to that it should also tag the same image as latest.

Config to define mapping between k8s namespace and service registry namespace

Currently, the Operator takes the name of the namespace in k8s and uses it as the namespace for the service registry. It might be interesting to have a configuration option that would allow to specify which k8s namespace maps to which service registry namespace.

[CI] use `govulncheck` to check for known vulnerabilities upon PR

govulncheck is a new tool released by the go team that analyses the code for known vulnerabilities. I will check this and see if we can integrate it on our CI.

For reference: https://go.dev/blog/vuln

Workflow `push_image.yml` is broken

Images cannot be built & published correctly on DockerHub because the repository name and namespace is pulled from a secret, and it is converted as ***/cnwan-operator:something.
I'm going to fix it by turning it into an environment and use secrets just for authentication.

Please don't create any releases/tags while this is broken.

Modify Operator RBAC

Right now, the operator has basically all permission for the resources it needs.
This is not accurate as it never actually performs some actions on Kubernetes.
For example: the operator has permissions to create or delete namespaces, although this is something that it never does.

The operator will be updated with only the correct permissions.
After that, permissions will be added as features are introduced, not upfront.
Note that this doesn't change anything in the way that the operator works, so its semantic version will not be updated.

Link check workflow might be broken or bugged

Go here: https://github.com/CloudNativeSDWAN/cnwan-operator/blob/master/docs/etcd/quickstart.md#2---deploy-a-test-service and search for this line

Please notice that the namespace has this label: operator.cnwan.io/watch: enabled which inserts the namespace in the opeartor's allowlist. Also notice that the service has annotations that will be registered as metadata:

This link: ./concepts.md#namespace-lists does not really exist but the link check workflow is not catching this, probably because it doesn't work well with anchors :(

GKE: auto network detection fails because of deprecated API

While testing on a GKE cluster the auto network detection -- for the Cloud Metadata feature -- I realized that the current way to get them is deprecated, as you can see here https://pkg.go.dev/google.golang.org/[email protected]/container/v1#ProjectsZonesClustersService.Get

This will be fixed, maybe in #98.

[ENHANCEMENT] Major package reorganization

The code of the operator will be completely rethought and reorganized.

The update can be thought of as a new major version that will bring enhancements, fixes and stronger testing, among other fixes.
A full list will be included in the pull request as well.

Document etcd installation in Kubernetes

Since new users testing the operator already have a Kubernetes cluster, it's easier if etc is installed into the same cluster rather than a separate machine. Add a quickstart style documentation for doing just that, with the caveat that it won't be a secure by default installation.

Better exits

A new exiting mechanism will be implemented.

Currently we inherit os.Exit(n) from Kube-builder which is fine, but it does not honor defers and is too annoying to provide exit codes.

Coming on next version but not critical.

Update: this will probably use custom errors since it basically asks for an implementation like that, so it may be done with #28

Support IPv6 and dual-stack services

As per title.

The latest versions of Kubernetes graduated to GA the new IPv6 and dual-stack services and the operator will need to support this change.

Though not urgent, this feature will come on a future version.

Sign container

As per title, the operator's container image can/should be signed: https://docs.docker.com/engine/security/trust/

Extend documentation with specific CNWAN examples

It would be interesting to have examples in the documentation that tie with the rest of the CNWAN project. In particular I think we should have some examples that use "cnwan.io/traffic-profile" through the documentation. That would be specially useful for those that come to the project after watching some CNWAN demo.

Improve deploying section on documentation

README will be updated with a list of components that must be installed prior to deploying the operator.

Additionally, some sections will be improved, i.e. by changing the title to a more descriptive one and moving them to follow a more natural installation/deploying flow.

Pull Request #3 brings the aforementioned changes to the project.

Should Kubernetes have semantics for traffic profiles?

As part of the CN-WAN presentation in KubeCon NA, we raised the question: "Should Kubernetes have semantics in general for traffic profiles?".

Our motivation for asking this question comes from our experience with CN-WAN. We have found really helpful to have annotations that specify the traffic profile for an application (i.e. the type of traffic that the application generates or consumes). In the CN-WAN case, those traffic profiles help the CN-WAN components to optimize the application traffic over the WAN connection. We wonder if the model could be generalized and apply to Kubernetes applications in general, so that application traffic can be optimized not only on the WAN but in any network (e.g. cloud, DC, campus, etc) based on the traffic profile.

We would love to hear opinions on this idea, so we are opening this GitHub issue to collect the thoughts from the community.

Avoid mounting GCP service account into the operator's pod

Once PR #43 is merged, adopt the new code to get GCP service account from the Kubernetes secret, and avoid mounting it into the pod.

Publish Operator in OperatorHub

Explore publish the CNWAN Operator in OperatorHub and update the repo as needed. Update documentation to describe the process of installing the CNWAN Operator through OperatorHub.

More descriptive logs on service removal

Current Operator logs the removal of services from the Service Registry as follows:

2021-02-11T18:38:59.492Z DEBUG ServiceRegistryBroker.RemoveServ going to remove service from service registry {"serv-name": "web-training"}

However, it would be very interesting to get some extra context on why the Operator is removing the service.

Service Directory: auto data detection

Just as it currently happens with the Cloud Metadata feature, where the operator automatically detects the platform, network and sub-network where it is running -- e.g. GKE, a new feature will be introduced that automatically infers the Region and Project ID in case the cluster is GKE.

To do this write auto as a value, e.g.:

projectID: auto
defaultRegion: auto

I'd prefer to do it with an explicit auto rather than keeping them empty and just try it anyways, but I will welcome different opinions and feedback! :)

EKS: load balancer services need to be resolved prior to registering

When creating a service of type LoadBalancer in EKS, this service is assigned a hostname. See example:

[truncated for brevity...]
status:
  loadBalancer:
    ingress:
    - hostname: abc123674efg-648438329.us-west-1.elb.amazonaws.com

Obviously this won't be accepted by either Cloud Map or Service Directory, as they will only accept IP addresses, whether v4 or v6. So, in conclusion, whenever we see hostnames instead of IPs we need to resolve them in some way.

Specifically, for this scenario (EKS) I am thinking of these two ways of working:

Ask CoreDNS pods to resolve it for us
- Benefit of this: will probably work for any managed k8s provider that work like this
- Downside: don't really want to send DNS queries for this, API approach is preferable
Study AWS SDK v2 to see if there is an endpoint that does this for us.
- API based
- Will only work for EKS.

Please provide ideas/comments/feedback if you think/know of an alternative way, as they are more than welcome!

Will keep posted.

Provide ID to resources

Different service registries have different method to store an object and index it.

For example:

Google Service Directory indexes resemble a path: locations/<location>/projects/<project-id>/namespace/<namespace/etc/<etc.>
ETCD indexes are similar to the above: /prefix/namespaces/<namespace>/etc/<etc.>
DNS "indexes" (for lack of better word) can be my-endpoint.my-service.my-namespace.cnwan.local
AWS are in the format of aws:doeq9431ns if I recall correctly.
And so on...

While knowing the ID of the resource may not be useful for end users - i.e. reader or adaptor, they are necessary for other entities like the operator's broker because it needs to find them.

An update to the operator's service registry package will be brought which also includes an ID string field, which differs from the already existing Name string as the former will make sense for the specific service registry, while the latter for the user.
From the update on, Name will only contain the user-friendly name of the resource and what we are currently referring to as Name will be moved to ID.

Other names for ID:

ResourceName

Introduce support for ETCD

ETCD is a distributed key-value store and having moved from incubation to graduation for CNCF it makes sense to leverage on it to create a sort of reliable "in-house" service registry.

A PR will come that will enable the Operator to use etcd as a service registry.

Remove advanced installation methods

Advanced installation is mostly useless: most people that want to contribute already know how to do that as they probably know Kubernetes and that the simple installation method is actually way more than enough even for them.

Its existence is basically only a burden, as it is just one more thing to support and update while being rarely used. This last point is also more valid since we're about to have helm installation.

cloudnativesdwan / cnwan-operator Goto Github PK

cnwan-operator's People

Contributors

Stargazers

Watchers

Forkers

cnwan-operator's Issues

Brief

Current way of doing it

Issue

Labels

List Policy

Changes

From ListPolicy to DefaultBehavior

Labels

Upgrade

Support

Recommend Projects

Recommend Topics

Recommend Org