charon-k8s-distributed-validator-cluster's People

Contributors

aly-obol, db2510, gsora, haroldsphinx, lukehackett12, oisinkyne, sugh01, thomasheremans, xenowits

charon-k8s-distributed-validator-cluster's Issues

Add charon relay support

Problem to be solved
In release v0.13.0 we introduced charon relay, a new replacement for the bootnodes. We need to update the test and canary clusters to use it.

Proposed solution
Update the charonxk8s templates and configmap with the new charon relays flag, and bump the deployment versions from (0.11.0, 0.12.0, latest) to (0.12.0, 0.13.0, latest).
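A minimal sketch of what the configmap change could look like, assuming charon picks up its `--p2p-relays` flag from the `CHARON_P2P_RELAYS` environment variable (the same variable name used elsewhere on this page). The ConfigMap name and relay URL below are hypothetical placeholders, not the repository's actual values.

```python
import json


def relay_configmap(name: str, relays: list[str]) -> dict:
    """Build a Kubernetes ConfigMap manifest exposing CHARON_P2P_RELAYS.

    Assumption: charon maps --p2p-relays to CHARON_P2P_RELAYS and accepts
    a comma-separated list of relay URLs.
    """
    if not relays:
        raise ValueError("at least one relay URL is required")
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"CHARON_P2P_RELAYS": ",".join(relays)},
    }


if __name__ == "__main__":
    # Hypothetical ConfigMap name and relay URL, for illustration only.
    cm = relay_configmap("charon-config", ["https://relay.example.com"])
    print(json.dumps(cm, indent=2))
```

The charon pods would then load this ConfigMap through `envFrom`, so switching relays becomes a data change rather than a template change.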

Test QBFT with a 2/3 Cluster setup

🎯 Problem to be solved

Extract from https://discord.com/channels/930779837218562078/1109760461186015273/1110102612411416576
Users have asked us in the past for size-3 clusters.
These are usually institutional stakers that want to run single-operator clusters of size 3.
They want crash fault tolerance (CFT) to de-risk their infrastructure.
They do not need Byzantine fault tolerance (BFT), so they want to reduce infrastructure costs (3 nodes are cheaper than 4).

๐Ÿ› ๏ธ Proposed solution

Create an internal cluster of 3 nodes and monitor it with one node down. This cluster can be deployed on GCP like all canary clusters.
We can set up 2 k8s canary clusters: one with a mix of VCs and one with Lodestar VCs.
Run 2/3 nodes and test for functional completeness and performance.
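To make the CFT versus BFT trade-off concrete, here is a small calculation, assuming charon's default signing threshold is ceil(2n/3) and using the usual BFT bound n >= 3f + 1 for the number of Byzantine nodes f that a protocol such as QBFT tolerates. It shows why a 3-node cluster can lose one node and keep running, while tolerating no Byzantine node.

```python
def threshold(n: int) -> int:
    """Signing threshold, assuming charon's default of ceil(2n/3)."""
    return -(-2 * n // 3)  # integer ceiling of 2n/3


def crash_tolerance(n: int) -> int:
    """Nodes that can be down while the rest still meet the threshold."""
    return n - threshold(n)


def byzantine_tolerance(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


for n in (3, 4):
    print(f"n={n} threshold={threshold(n)} "
          f"crash={crash_tolerance(n)} byzantine={byzantine_tolerance(n)}")
# n=3 threshold=2 crash=1 byzantine=0
# n=4 threshold=3 crash=1 byzantine=1
```

Both sizes survive one crashed node, which is what these users want; only the 4-node cluster also survives one misbehaving node.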

🧪 Tests

No missed proposals
No missed attestations
Exits gracefully
Aggregation?

Monitor logs and metrics for a period of 2-3 weeks.

Update README doc

Update the cluster deployment README doc with the updated instructions to use a GCS bucket backend.

Add lodestar VC

Problem to be solved
Add support for lodestar VC. This needs to be done in order to deploy lodestar VC in canary clusters.

Proposed solution
Add a new template file, templates/lodestar-vc.yaml. Also configure the deploy-cluster.sh script to include the lodestar VC in the canary cluster VC types. Test the lodestar VC in canary cluster deployments.

Reenable bootnodes template configuration

Problem to be solved

We need to override bootnodes config with the new relay endpoint

Solution

Update the charon k8s template with the bootnodes flag and env var.

Create k8s templates for VCs voluntary exit

Problem to be solved

We need our charon-k8s templates to support the VCs voluntary exit for Lighthouse and Lodestar.

Proposed solution

  • Create k8s templates for lighthouse and lodestar voluntary exit
  • Update the readme doc

Automate DKG/Relay Load tests

🎯 Problem to be solved

Automated DKG load tests built here - PR

Suggestion: add a repository trigger with a relay URL argument? Then we could potentially automate tests after a dev relay deploy.

🛠️ Proposed solution

Use a GitHub workflow to run the tests.
Send CHARON_P2P_RELAYS as the run parameter.
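A minimal sketch of how the workflow input could be validated before it reaches the load test. It assumes the relay list arrives as a single comma-separated string (for example from a hypothetical `workflow_dispatch` input named `relays`) and that the test reads it from `CHARON_P2P_RELAYS`.

```python
from urllib.parse import urlparse


def relay_env(raw: str) -> dict[str, str]:
    """Turn a comma-separated workflow input into the load test's env.

    Blank entries are dropped; every relay must be an http(s) URL.
    """
    relays = [r.strip() for r in raw.split(",") if r.strip()]
    if not relays:
        raise ValueError("no relay URL provided")
    for relay in relays:
        parsed = urlparse(relay)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"invalid relay URL: {relay!r}")
    return {"CHARON_P2P_RELAYS": ",".join(relays)}
```

Failing fast on a malformed URL keeps a typo in the dispatch form from being reported as a relay failure.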

🧪 Tests

The workflow should be able to trigger load tests by taking the relay as a user-defined input.

❌ Out of Scope

Generating definition files with SDK

Remove the data-dir flag

Problem to be solved

data-dir flag is deprecated and we should remove it.

Proposed solution

Remove the data-dir flag from the k8s template and add the private-key-file flag
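A sketch of the argument migration, assuming the ENR private key sits inside the old data directory under charon's default file name `charon-enr-private-key` (verify this against the charon version in use). It only handles the `--flag=value` form.

```python
import posixpath

# Assumption: charon's default ENR key file name inside the data dir.
KEY_FILE_NAME = "charon-enr-private-key"


def migrate_args(args: list[str]) -> list[str]:
    """Replace --data-dir=<dir> with --private-key-file=<dir>/<key file>."""
    out = []
    for arg in args:
        if arg.startswith("--data-dir="):
            data_dir = arg.split("=", 1)[1]
            out.append("--private-key-file=" + posixpath.join(data_dir, KEY_FILE_NAME))
        else:
            out.append(arg)
    return out
```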

Charon canary deployments

When we deploy charon to GCP, we need to do a canary rollout to fewer nodes than the threshold, to ensure the cluster will not operate on a buggy charon version if one is shipped.
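A sketch for sizing the canary batch, assuming charon's default threshold of ceil(2n/3). The first bound (canaries below the threshold) comes from the issue; the second bound (untouched nodes still meet the threshold, so the cluster keeps attesting if the canary is broken) is an additional assumption about what a safe rollout needs.

```python
def threshold(n: int) -> int:
    """Assumed charon default threshold: ceil(2n/3)."""
    return -(-2 * n // 3)


def max_canary_nodes(n: int) -> int:
    """Largest canary batch for an n-node cluster.

    Safety: canaries stay below the threshold, so the new version alone
    cannot reach it. Liveness: the untouched nodes still meet the threshold.
    """
    t = threshold(n)
    return min(t - 1, n - t)


for n in (3, 4, 7, 10):
    print(n, max_canary_nodes(n))
```

For the common 4-node canary clusters this works out to upgrading a single node first.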

The sh command inside Nimbus and Lodestar VC template does not resolve variables

๐Ÿž Bug Report

Description

The InitContainer step needs to run the following command:

init_container:
  - name: init-nimbus
    image: statusim/nimbus-eth2:$NIMBUS_VERSION
    command:
      - sh
      - -ac
      - "tmp=/data/nimbus/nimbus-keystores \n mkdir -p $${tmp}\n for f in /validator_keys/keystore-*.json; do\n  cp $${f} $${tmp}\n  cat $${f%.*}.txt | \\\n  /home/user/nimbus_beacon_node deposits import \\\n  --log-level=debug \\\n --data-dir=/data/nimbus \\\n  /data/nimbus/nimbus-keystores \n  rm $${tmp}/$(basename $${f}) \n done\n"
    volume_mount:
      - name: data
        mount_path: /data/nimbus
      - name: validators
        mount_path: /validator_keys
    security_context:
      run_as_user: 0

However, this does not succeed: the InitContainer crashes.

🔬 Minimal Reproduction

  1. Set a kubectl namespace. You can use a namespace with a running cluster, such as canary-goerli-exit-3.
  2. Add a Nimbus VC to the .env file: VC_TYPES=0,1,2,3 (3=nimbus).
  3. Upload the file to GCP storage.
  4. Run deploy cluster.

🔥 Error

The Nimbus VC pod crashes.
Use kubectl describe pod to check for the error inside the pod.
Notice that the variables inside the initContainer sh command are not getting resolved.
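The issue does not say which tool renders this template, so the following is only a hypothesis worth checking. The `$${...}` escapes follow the convention of Terraform and of Kubernetes `$(VAR)` expansion, where `$$` collapses to a single `$`; the snake_case keys such as `init_container` and `volume_mount` also resemble the Terraform Kubernetes provider schema. If the file is instead rendered by an envsubst-style pass that substitutes `$NIMBUS_VERSION` but does not treat `$$` as an escape, `$${tmp}` becomes a bare `$` before `sh` ever runs. The sketch contrasts the two behaviors; `naive_envsubst` is a hypothetical emulation, not GNU envsubst itself.

```python
import re
from string import Template

# Matches ${VAR} or $VAR, like envsubst; "$$" is NOT treated as an escape.
_VAR = re.compile(r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))")


def naive_envsubst(text: str, env: dict[str, str]) -> str:
    """Replace $VAR and ${VAR}; unset variables become empty strings."""
    return _VAR.sub(lambda m: env.get(m.group(1) or m.group(2), ""), text)


def dollar_escaping_subst(text: str, env: dict[str, str]) -> str:
    """string.Template reads "$$" as a literal "$", like Terraform does."""
    return Template(text).safe_substitute(env)


line = "mkdir -p $${tmp}"
print(naive_envsubst(line, {}))         # mkdir -p $
print(dollar_escaping_subst(line, {}))  # mkdir -p ${tmp}
```

If `kubectl describe pod` shows the command with bare `$` characters where `${tmp}` and `${f}` should be, the first behavior is the likely culprit.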

Add charon sepolia cluster to the CD workflow

🎯 Problem to be solved

We run a charon-sepolia cluster, and we need to ensure its deployment is automated and managed with the canary workflows.

🛠️ Proposed solution

Add the charon-sepolia-1780 cluster to the canary clusters CD workflow.

🧪 Tests

  • Cluster pods are up and healthy (check with kubectl and the Grafana dashboard)
  • Has attested on a testnet at least once

Push logs to Loki

🎯 Problem to be solved

Canary clusters should send logs to Loki.

🛠️ Proposed solution

Add config keys for Loki to charon.yaml.
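A sketch of the env entries charon.yaml might carry, assuming the charon version in use exposes `--loki-addresses` and `--loki-service` (read here as `CHARON_LOKI_ADDRESSES` and `CHARON_LOKI_SERVICE`). The `/loki/api/v1/push` path is Loki's standard push endpoint; the base URL and service label are hypothetical.

```python
def loki_env(loki_base_urls: list[str], service: str) -> dict[str, str]:
    """Env entries for shipping charon logs to Loki (assumed flag names)."""
    push_urls = [u.rstrip("/") + "/loki/api/v1/push" for u in loki_base_urls]
    return {
        "CHARON_LOKI_ADDRESSES": ",".join(push_urls),
        "CHARON_LOKI_SERVICE": service,
    }
```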

🧪 Tests

Verify that logs are being sent by checking Grafana.

  • Cluster is deployed successfully to k8s
  • Has attested on a testnet at least once

👍 Additional acceptance criteria

❌ Out of Scope

Update canary clusters BN and Relays config

Problem to Solve

Canary clusters were failing as BNs were syncing.

Solution

  • Update canary clusters configuration to use alternative BNs
  • Reconfigure canary clusters to use the new Relays from Figma and HashQark
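Since the failures came from syncing BNs, a pre-deploy check could keep only beacon nodes that report themselves as synced. This sketch queries the standard Beacon Node API endpoint `GET /eth/v1/node/syncing` and assumes the result is passed to charon's `--beacon-node-endpoints` flag as `CHARON_BEACON_NODE_ENDPOINTS`. The issue does not say how the alternative BNs were actually chosen.

```python
import json
from typing import Callable
from urllib.request import urlopen

# Returns the parsed JSON body of GET <bn>/eth/v1/node/syncing.
SyncFetcher = Callable[[str], dict]


def http_fetch(bn_url: str) -> dict:
    """Query the standard Beacon API sync endpoint of one beacon node."""
    with urlopen(bn_url.rstrip("/") + "/eth/v1/node/syncing", timeout=5) as resp:
        return json.load(resp)


def synced_beacon_nodes(endpoints: list[str], fetch: SyncFetcher = http_fetch) -> list[str]:
    """Keep only endpoints that are reachable and report is_syncing == false."""
    synced = []
    for url in endpoints:
        try:
            status = fetch(url)["data"]
        except (OSError, ValueError, KeyError):
            # Unreachable node, invalid JSON, or unexpected response shape.
            continue
        if status.get("is_syncing") is False:
            synced.append(url)
    return synced


def beacon_env(endpoints: list[str], fetch: SyncFetcher = http_fetch) -> dict[str, str]:
    """Comma-separated value for charon's --beacon-node-endpoints flag."""
    synced = synced_beacon_nodes(endpoints, fetch)
    if not synced:
        raise RuntimeError("no synced beacon node available")
    return {"CHARON_BEACON_NODE_ENDPOINTS": ",".join(synced)}
```

Injecting `fetch` keeps the filtering logic testable without network access.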

Add nimbus VC

Problem to be solved
We recently added support for nimbus VC to the cdvc repo. Now we want to test it by running the new VC in our canary clusters.

Proposed solution
Add a new template file templates/nimbus-vc.yaml. You can take inspiration from [https://github.com/ObolNetwork/charon-distributed-validator-cluster/tree/main/nimbus] and also from the templates/lighthouse-vc.yaml.

Out of scope
None.

Migrate clusters' config management backend to GCS

Problem

We need a versatile but secure way to persist clusters' config (validator keys, cluster-lock, and cluster.env) to be used by CI/CD tools and team members.

Solution

There are a few reasonable options, such as HashiCorp Vault, GCP Secret Manager, GCS, and GitHub secrets. We decided to use GCS because it introduces the least operational complexity while being secure and inheriting GCP RBAC rules. In the mid-term, we will reconsider this choice in favor of Vault.

Refactor clusters deployment

Problem to be solved
Enable the cluster to flexibly run multiple VCs with mixed charon versions.

Proposed solution

  • Update the deployment scripts to get a list of VCs and charon versions from the config backend of the cluster and deploy the k8s templates.
  • Merge the canary and regular deployments into one script
  • Remove the relay and bootnodes configuration, and let it use the charon defaults
  • Update the CD workflows with the new script names
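A sketch of the first step: turning the cluster's env file from the config backend into a per-node deployment plan. The `VC_TYPES` key and `3=nimbus` come from the Nimbus bug report on this page; the other index mappings, the teku template, the `CHARON_VERSIONS` key and the round-robin assignment of versions to nodes are all hypothetical.

```python
# Hypothetical mapping; only 3=nimbus is confirmed elsewhere on this page.
VC_TEMPLATES = {
    "0": "templates/teku-vc.yaml",
    "1": "templates/lighthouse-vc.yaml",
    "2": "templates/lodestar-vc.yaml",
    "3": "templates/nimbus-vc.yaml",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"')
    return env


def deployment_plan(env_text: str) -> list[tuple[int, str, str]]:
    """Return (node index, VC template, charon version) for each node."""
    env = parse_env(env_text)
    vc_types = [v.strip() for v in env["VC_TYPES"].split(",")]
    versions = [v.strip() for v in env["CHARON_VERSIONS"].split(",")]
    return [
        (node, VC_TEMPLATES[vc], versions[node % len(versions)])
        for node, vc in enumerate(vc_types)
    ]
```

A deploy script could then render one set of templates per tuple, which is what lets a single cluster mix VC clients and charon versions.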

Add a flag to DKG/Relay Load tests for publish

🎯 Problem to be solved

Add a flag to the DKG load tests so the user can choose whether or not to publish after the DKG.

🛠️ Proposed solution

Update the GitHub workflow to run the tests.
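A sketch of how the publish choice could be passed through. `workflow_dispatch` boolean inputs reach scripts as the strings `"true"` or `"false"` when interpolated into env, so the script parses them explicitly. `DKG_PUBLISH` is a hypothetical variable name for whatever the load test actually reads.

```python
def parse_bool_input(value: str) -> bool:
    """Parse a workflow boolean input that arrives as text."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no", ""):
        return False
    raise ValueError(f"not a boolean: {value!r}")


def load_test_env(relays: str, publish: str) -> dict[str, str]:
    """Env for one load-test run; DKG_PUBLISH is a hypothetical name."""
    return {
        "CHARON_P2P_RELAYS": relays,
        "DKG_PUBLISH": "true" if parse_bool_input(publish) else "false",
    }
```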

🧪 Tests

The workflow should be able to trigger load tests by taking the relay as a user-defined input.

❌ Out of Scope

Generating definition files with SDK

Call the key upload utilities inside the create-web3signer-secrets.sh

🎯 Problem to be solved

The key utility is a Go utility that:

  1. Iterates through the node's validator set, decrypts the keystores, and uploads the hex into Vault
  2. Creates web3signer key configuration files
    This is a prerequisite for the web3signer setup

At the moment we call it manually. However, we could add it to create-web3signer-secrets.sh to avoid the manual step, along with a flag that lets the operator opt out of the upload in case Vault already has the keys.
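A sketch of the opt-out flow, not the Go utility itself. Decryption is out of scope here, so the input is already a map of pubkey to hex key; the uploader is injected, and the Vault path layout is hypothetical. The config fields follow Web3Signer's HashiCorp Vault key type (`type`, `keyType`, `keyPath`, `keyName`), with the connection fields such as `serverHost`, `serverPort` and `token` omitted.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical uploader signature: (vault_key_path, hex_private_key) -> None.
Uploader = Callable[[str, str], None]


def web3signer_config(vault_key_path: str) -> dict:
    """Web3Signer key config for a BLS key held in HashiCorp Vault."""
    return {"type": "hashicorp", "keyType": "BLS", "keyPath": vault_key_path, "keyName": "value"}


def prepare_keys(keys: dict[str, str], out_dir: Path, upload: Uploader,
                 skip_upload: bool = False) -> list[Path]:
    """Optionally upload each key, then write one Web3Signer config per key."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pubkey, key_hex in keys.items():
        vault_path = f"/v1/secret/data/validators/{pubkey}"  # hypothetical layout
        if not skip_upload:
            upload(vault_path, key_hex)
        cfg = out_dir / f"{pubkey}.yaml"
        # JSON output is also valid YAML, avoiding a PyYAML dependency.
        cfg.write_text(json.dumps(web3signer_config(vault_path), indent=2))
        written.append(cfg)
    return written
```

The config files are written either way, so an operator who opts out of the upload still ends up with a complete web3signer setup pointing at the keys already in Vault.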

๐Ÿ› ๏ธ Proposed solution

Update the shell script

๐Ÿงช Tests

The script should take the cluster name as an input and prepare all needed secrets inside the namespace

๐Ÿ‘ Additional acceptance criteria

Ensure cluster is up and running

Remove unnecessary volume mounts from VCs templates

🎯 Problem to be solved

Remove unnecessary volume mounts from the VC templates

🛠️ Proposed solution

Update the VCs yaml templates
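A first-pass check that could help find candidates, assuming each template's pod spec has been parsed into a dict (for example with PyYAML). It only flags volumes that are declared but never mounted; deciding whether a mount that is used is still necessary remains a manual review.

```python
def unused_volumes(pod_spec: dict) -> set[str]:
    """Volumes declared in a pod spec that no container or init container mounts."""
    declared = {v["name"] for v in pod_spec.get("volumes", [])}
    mounted = set()
    for key in ("containers", "initContainers"):
        for container in pod_spec.get(key, []):
            mounted.update(m["name"] for m in container.get("volumeMounts", []))
    return declared - mounted
```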

🧪 Tests

  • Deployed successfully to k8s
  • Has attested on a testnet at least once

👍 Additional acceptance criteria

❌ Out of Scope
