dmwm / cmskubernetes


Set of instructions and examples to deploy CMS data-services to Kubernetes cluster

Shell 80.47% Dockerfile 9.95% Go 1.49% CSS 0.12% HTML 0.49% Smarty 6.68% JavaScript 0.11% Python 0.10% PLSQL 0.58%

cmskubernetes's People

Contributors

aamiralidev, abrinke1, amaltaro, anpicci, arooshap, belforte, bockjoo, dmielaikaite, ericvaandering, fabioespinosa, fernandogarzon, goughes, ivmfnal, jrotter2, khurtado, kyrylogy, leggerf, mapellidario, micsucmed, mrceyhun, muhammadimranfarooqi, nikodemas, novicecpp, nsmith-, panos512, pmandrik, snyk-bot, todor-ivanov, vkuznet, yuyiguo


cmskubernetes's Issues

Implement FluxCD for CMS Monitoring cron jobs

Deployment of the monitoring cron jobs will be automated with FluxCD. The first implementation will cover only CronJobs; all other services will follow afterwards.

We may need a test GitHub repository to try this out easily.

Helm charts issue: #1214 and its PR by @kyrylogy #1215

How to install the FluxCD CLI on lxplus:

curl -s -O https://fluxcd.io/install.sh
# remove sudo command
sed -i 's/sudo ${CMD_MOVE}/${CMD_MOVE}/g' install.sh

# Change bin directory
sed -i 's+DEFAULT_BIN_DIR="/usr/local/bin"+DEFAULT_BIN_DIR="/afs/cern.ch/user/c/cuzunogl/private/bin"+g' install.sh

# Create user bin directory, because /usr/local/bin is not allowed
mkdir -p /afs/cern.ch/user/c/cuzunogl/private/bin
chmod -R +x /afs/cern.ch/user/c/cuzunogl/private/bin
# Add /afs/cern.ch/user/c/cuzunogl/private/bin to PATH in your `.bashrc` and source it

# Run script
bash install.sh

# All set

Ref: https://fluxcd.io/flux/installation/#install-the-flux-cli
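The "add the directory to your `.bashrc`" step can be made repeatable. The following is a minimal sketch, assuming a helper named `add_path_line` (hypothetical, not part of the FluxCD installer) and a bin directory passed in by the caller; it appends the PATH line only when it is not already there.

```shell
# Sketch: append a directory to PATH in an rc file exactly once.
# add_path_line is a hypothetical helper; the directory is whatever
# DEFAULT_BIN_DIR was set to in install.sh.
add_path_line() {
  local dir="$1" rcfile="$2"
  local line="export PATH=\"${dir}:\$PATH\""
  # Append only when the exact line is not already present.
  grep -qxF "$line" "$rcfile" 2>/dev/null || echo "$line" >> "$rcfile"
}

# Usage (not run automatically):
#   add_path_line /afs/cern.ch/user/c/cuzunogl/private/bin "$HOME/.bashrc"
#   . "$HOME/.bashrc"
#   command -v flux    # prints the installed path when the CLI is found
```

Running the helper twice leaves a single line in the file, so it is safe to keep in a setup script.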

Incorporating new robot certificates for crab and dmwm namespaces.

Hi,

Following the discussion we had at CMSKUBERNETES-229, I created separate robot accounts for CRAB and DMWM. After obtaining the certificates, I made sure that they are configured properly and have the same permissions as they had previously.

Since we deploy these certificates as cluster secrets through the deploy-cluster-secrets.sh script, I changed the script so that it picks the proper certificate according to the namespace. The changes can be reviewed here: #1351. If you have any suggestions to make it more efficient, let me know.

We can deploy these secrets in the dev clusters, and once we are sure that the services are working properly, we can then deploy the changes in testbed, and eventually production. Let me know your suggestions in this regard as well.

Note that this change does not require any of the teams to modify their deployment or kubernetes manifest files.
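The per-namespace selection can be pictured as a small lookup. This is only a sketch of the idea, not the code from #1351: `CERT_DIR`, the directory layout, the file names and the secret name `robot-secrets` are all assumptions made for illustration.

```shell
# Sketch: pick the robot certificate location for a namespace.
# CERT_DIR and the sub-directory names are hypothetical; the real
# layout is whatever deploy-cluster-secrets.sh uses.
CERT_DIR="${CERT_DIR:-/path/to/robot-certs}"

robot_cert_prefix() {
  case "$1" in
    crab) echo "$CERT_DIR/crab/robot" ;;
    dmwm) echo "$CERT_DIR/dmwm/robot" ;;
    *)    echo "$CERT_DIR/default/robot" ;;  # other namespaces keep the shared certificate
  esac
}

# Example for one namespace (commented out; needs cluster access):
#   ns=crab; prefix=$(robot_cert_prefix "$ns")
#   kubectl create secret generic robot-secrets -n "$ns" \
#     --from-file=robotcert.pem="${prefix}cert.pem" \
#     --from-file=robotkey.pem="${prefix}key.pem" \
#     --dry-run=client -o yaml | kubectl apply -f -
```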

CMSWeb Troubleshooting Documentation for DMWM.

  • How to check that host certificates have not expired, and who is in charge of renewing them if they have.
  • List of kubernetes components and secret areas that control the DMWM deployment outside of the DMWM namespace.
  • A troubleshooting guide of how to debug when there are problems in the components above.
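For the certificate check in the first item, one possible building block is `openssl x509 -checkend`. The sketch below assumes the certificate is available as a PEM file; the function name and the example path are placeholders.

```shell
# Sketch: succeed when a PEM certificate stays valid for at least N more days.
cert_valid_for_days() {
  local pem="$1" days="$2"
  # -checkend exits 0 when the certificate will NOT expire within the given seconds.
  openssl x509 -in "$pem" -noout -checkend "$(( days * 86400 ))" >/dev/null
}

# Usage (path is a placeholder):
#   cert_valid_for_days /path/to/hostcert.pem 30 || echo "hostcert.pem expires within 30 days"
#   openssl x509 -in /path/to/hostcert.pem -noout -enddate   # print the expiry date
```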

Deploy Spark crontab jobs as K8s CronJobs

As CMS Monitoring, we need to move our cron jobs, which currently run as crontab entries both in K8s and on virtual machines, to the K8s CronJob kind. All our K8s CronJobs that run Spark/Hadoop should live in the kubernetes/monitoring/cron-hdfs directory.
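To illustrate the move from crontab to the CronJob kind, here is a minimal sketch that turns one crontab entry into a manifest. The function name, job name and image are hypothetical, and the policy values are placeholders rather than the settings used by CMS Monitoring.

```shell
# Sketch: render a Kubernetes CronJob manifest from one crontab entry.
# Name and image are hypothetical parameters. The command is embedded
# verbatim, so it must not contain double quotes.
crontab_to_cronjob() {
  local name="$1" image="$2" entry="$3"
  local m h dom mon dow cmd
  # The first five fields are the schedule; the remainder is the command.
  read -r m h dom mon dow cmd <<EOF
$entry
EOF
  cat <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ${name}
spec:
  schedule: "${m} ${h} ${dom} ${mon} ${dow}"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ${name}
              image: ${image}
              command: ["/bin/bash", "-c", "${cmd}"]
EOF
}

# Usage (names are placeholders):
#   crontab_to_cronjob demo-cron registry.example/demo:1 '15 3 * * * /data/run.sh --daily' > demo-cron.yaml
```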

Assignee: @kyrylogy

Documentation for a cmsweb-like dev environment

Valentin, as we discussed last week, it would be extremely useful for many CMSWEB application developers to have documentation on how to run a full deployment as we have it nowadays.
That means deploying the service we want together with the frontend, the frontend rules, the frontend accounts/certificates mapping and so on, so that we have the complete cycle and can test services as they would be deployed in a production environment.

Something along the lines as https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb-nginx/docs/end-to-end.md , but really end-to-end ;)

how to change CRABServer installation variant

@vkuznet @muhammadimranfarooqi with reference to

PKGS="admin backend crabserver/preprod"

Currently the procedure in this repository builds a crabserver preprod flavor.
The 'preprod' tag there refers to one specific Oracle database instance.
There are 3 such instances: prod, preprod and dev, which allow us to
test and validate things, and we will eventually need to have all of them in K8s.

What is the suggested way to manage this?
So far @ddaina has switched cmsweb-test2 from preprod to test by building
the container with a modified file, but this is a fragile procedure in the long term.
I see that we have some cluster-specific files in
https://gitlab.cern.ch/cmsweb-k8s
while in this github.com/dmwm/CMSKubernetes repository the files are generic for all clusters.
But I do not see a place where one could, e.g., customize the k8s yaml files.

One possibility that comes to mind is to allow building different containers: crabserver-prod, crabserver-preprod, crabserver-dev with different versions of

PKGS="admin backend crabserver/preprod"
and use the appropriate one as the image in the various yaml files.
What do you think?
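The proposal of one container per flavor could look roughly like the sketch below. The helper names and the `FLAVOR` build argument are assumptions; the Dockerfile in this repository would need to read such an argument for the loop to have any effect.

```shell
# Sketch: derive the PKGS line and an image name from the CRAB flavor.
crab_pkgs() {
  case "$1" in
    prod|preprod|dev) echo "admin backend crabserver/$1" ;;
    *) echo "unknown flavor: $1 (expected prod, preprod or dev)" >&2; return 1 ;;
  esac
}

# Image naming crabserver-<flavor> follows the proposal in this issue.
crab_image() {
  crab_pkgs "$1" >/dev/null && echo "crabserver-$1"
}

# Usage (FLAVOR is a hypothetical build argument):
#   for f in prod preprod dev; do
#     docker build --build-arg FLAVOR="$f" -t "$(crab_image "$f")" .
#   done
```

Rejecting unknown flavors early avoids silently building an image that points at the wrong database instance.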

New endpoint in cmsweb which replaces NodePort

The cmsmon-rucio-mon-goweb service is deployed on the cmsweb-test1 cluster for now, and we are using a NodePort to access the web service. Can we replace this NodePort with an endpoint like cmsmon-rucio? The service web page is http://cmsweb-test1.cern.ch:31280/, so it would become http://cmsweb-test1.cern.ch/cmsmon-rucio

I think we need an ingress rule similar to the one for DAS. However, no authentication is currently implemented in this service. It is on our TODO list but will not be ready in the near future. So, can we implement this without authentication (I mean without APS/XPS/SPS; a grid certificate requirement is okay)?

When everything is ready on the ingress side, I can change the NodePort to a container port.

fyi @arooshap @muhammadimranfarooqi
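For reference, a plain Kubernetes Ingress rule that maps a URL path to a service has the shape sketched below. This is a generic illustration, not the cmsweb frontend configuration: the function name is hypothetical, the port is a placeholder, and the service would still have to serve its pages under the `/cmsmon-rucio` prefix (or the ingress would need a rewrite rule).

```shell
# Sketch: render an Ingress rule that exposes a service under a URL path.
render_ingress() {
  local host="$1" path="$2" service="$3" port="$4"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${service}
spec:
  rules:
    - host: ${host}
      http:
        paths:
          - path: ${path}
            pathType: Prefix
            backend:
              service:
                name: ${service}
                port:
                  number: ${port}
EOF
}

# Usage (8080 is a placeholder for the container port):
#   render_ingress cmsweb-test1.cern.ch /cmsmon-rucio cmsmon-rucio-mon-goweb 8080 | kubectl apply -f -
```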

Create helm charts for cms monitoring spark running cron jobs

Eventually we will automate our deployment process with FluxCD and helm charts. Currently, our k8s CronJobs are ready to run and we need to prepare their Helm charts.

How

CronJobs have common definitions like concurrencyPolicy, failedJobsHistoryLimit , backoffLimit . And also there are changing values and they should be arranged in helm. In templates, we can have cronjob.yaml and service.yaml initially. We can evaluate to get rid of ConfigMaps and to provide commands directly in values.yaml for each cron.

We can use values.yaml structure like in this SO entry https://stackoverflow.com/a/73571078/6123088 . An example can be found in here https://github.com/mrceyhun/CMSKubernetes/blob/f-cmsmon-crons-helm/helm/cmsmon-cron/values.yaml .

Tasks

  1. Create first version of helm charts for cmsmon cronjobs (kyrylo)
  2. Parametrize for prod and test (kyrylo)
  3. Saving cron logs in S3 using fluentd (ceyhun)
  4. Conditioning EOS volume mount according to cronjob (some of them do not need it) (kyrylo)
  5. Implement ingress for port management (ceyhun)
  6. Evaluate using SOPS secret management using cmsweb sops service (ceyhun)
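Task 2 (parametrizing for prod and test) is often handled with one values file per environment layered on top of the common one. The sketch below only builds and prints the helm command; the release name and the `values-<env>.yaml` file names are assumptions, while the chart path follows the example branch linked above.

```shell
# Sketch: build the helm command for one environment.
# Release name and values-<env>.yaml are hypothetical conventions.
helm_cmd_for_env() {
  local env="$1" chart="${2:-helm/cmsmon-cron}"
  case "$env" in
    prod|test) ;;
    *) echo "unknown environment: $env" >&2; return 1 ;;
  esac
  # Later -f files override earlier ones, so the environment file wins.
  echo "helm upgrade --install cmsmon-cron-$env $chart -f $chart/values.yaml -f $chart/values-$env.yaml"
}

# Usage: print first, run once the output looks right.
#   helm_cmd_for_env test
#   eval "$(helm_cmd_for_env test)"
```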

Helpful documentations:

@kyrylogy Let's share the tasks in this issue. As we discussed, just create the initial version of the helm chart and we can decide our direction later. I thought that we may give priority to the first task over CMSSpark tasks ;)

Placeholder: Hadoop access in k8s

This is a placeholder for the consistency checking scripts. How can we get access, from a Kubernetes pod, to the same Hadoop in the analytic cluster in order to read in the results of a job?

Add logstash parsers for cmsweb services

This is the placeholder ticket to add cmsweb service-specific parsers to logstash. For simplicity, let's collect here sample messages gathered by filebeat for each service:

  • workqueue
  • reqmon
"message":"[14/Oct/2019:16:39:19] reqmon-df69c6598-tk58z 127.0.0.1 \"GET /wmstatsserver/data/info HTTP/1.1\" 200 OK [data: 346 in 39 out 1982 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-reqmon\" ]"
  • reqmgr2
[15/Oct/2019:02:00:39] reqmgr2-89f7df4fd-95g9q 137.138.33.200 "GET /reqmgr2/data/request?name=amaltaro_TaskChain_InclParents_Oct2019_Val_191010_125845_9547 HTTP/1.1" 200 OK [data: 2369 in 1862 out 530492 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0192.cern.ch" "" ] [ref: "" "WMCore.Services.Requests/v002" ]
  • reqmgr2ms
"message":"[14/Oct/2019:16:40:37] reqmgr2ms-6796895484-7vzj6 127.0.0.1 \"GET /ms-transferor/data/status HTTP/1.1\" 200 OK [data: 351 in 158 out 683 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-reqmgr2ms\" ]"
  • couchdb
"message":"[Mon, 14 Oct 2019 14:39:59 GMT] [info] [<0.26674.38>] 127.0.0.1 - - GET /workqueue/_design/WorkQueue/index.html 200"
  • phedex
"message":"::1 - - - [14/Oct/2019:16:44:48 +0200] \"GET /phedex/datasvc/doc HTTP/1.1\" 200 15229 \"-\" \"ServerMonitor-phedex\""
  • dbs
"message":"INFO:cherrypy.access:[14/Oct/2019:16:43:52] dbs-phys03-w-5fd9b6ffc-9kqxb 127.0.0.1 \"GET /dbs/ HTTP/1.1\" 200 OK [data: 298 in 468 out 1320 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor/2.0\" ]"
"message":"127.0.0.1 - - [14/Oct/2019:16:45:47] \"GET / HTTP/1.1\" 200 22 \"\" \"ServerMonitor/2.0\""
  • das
{"DASQuery":{"query":"dataset=/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/MINIAODSIM","hash":"705c64bd8a82ee31aa9607dada4a6208","spec":{"dataset":"/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/MINIAODSIM"},"fields":["dataset"],"pipe":"","instance":"prod/global","detail":false,"system":"","filters":{},"aggregators":[],"error":"","tstamp":1570577955},"PID":"705c64bd8a82ee31aa9607dada4a6208","ProcessTime":10.645284837,"Unix":1570577956,"level":"info","msg":"ready","time":"2019-10-09T01:39:16+02:00"}
  • crabserver
"message":"[14/Oct/2019:16:45:22]  RESTSQL:ovUDzLxBPpVb RELEASED cmsweb_analysis_preprod@devdb11 timeout=300 inuse=0 idle=1"
  • crabcache
"message":"[14/Oct/2019:16:46:21] crabcache-6c7f6559d6-5ksbz 127.0.0.1 \"GET /crabcache/info HTTP/1.1\" 200 OK [data: 340 in 69 out 608 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-crabcache\" ]"
  • dqmgui
  • t0wmadatasvc
"message":"[14/Oct/2019:16:44:21] t0wmadatasvc-74d87c769b-p8zcx 127.0.0.1 \"GET /t0wmadatasvc/prod/hello HTTP/1.1\" 200 OK [data: 352 in 25 out 64701 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-t0wmadatasvc\" ]"

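Several of the samples above (reqmon, reqmgr2, reqmgr2ms, crabcache, t0wmadatasvc and the first dbs line) share the same access-log layout. As a way to pin down the fields before writing a logstash pattern, here is a sketch that extracts them with a `sed` regular expression. It is not a logstash grok pattern, and the function and field names are chosen for illustration.

```shell
# Sketch: extract the common fields from a cmsweb access-log line.
# Prints "time=... pod=... client=... method=... path=... status=..."
# and returns non-zero when the line does not follow this layout.
parse_access_line() {
  local out
  out=$(printf '%s\n' "$1" | sed -nE \
    's/^[^[]*\[([^]]+)\] ([^ ]+) ([^ ]+) "([A-Z]+) ([^ ]+) HTTP\/[0-9.]+" ([0-9]+) .*/time=\1 pod=\2 client=\3 method=\4 path=\5 status=\6/p')
  [ -n "$out" ] && printf '%s\n' "$out"
}

# Usage:
#   parse_access_line '[14/Oct/2019:16:39:19] reqmon-df69c6598-tk58z 127.0.0.1 "GET /wmstatsserver/data/info HTTP/1.1" 200 OK [data: 346 in 39 out 1982 us ]'
```

Lines in other layouts, such as the couchdb, phedex, das and crabserver samples, do not match and would need their own patterns.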
Cannot have two different directories of helm charts

The chart releaser action overwrites things when the charts are split across two directories. The Rucio charts are overwritten if you update a cmsweb one, and vice versa.

So there are only two possibilities. Either we all put our charts in the same directory (and if that's what we want, I'll make the PR since it's a little bit tricky) or we move the Rucio stuff to its own repository. Maybe CMSKubernetes has outgrown its usefulness. Up to you. @arooshap @muhammadimranfarooqi @vkuznet

Migrate/Copy all MongoDB databases to the new MongoDB as a Service

Migrating MongoDB to a New Cluster.

April 18, 2023

Overview

WMCore#11534
We are migrating the MongoDB cluster from the old load-balancer setup to a newer architecture that does not need a load balancer.

Checklist for CMSWEB team

  • Provision of new clusters for production and testbed.

  • Create necessary roles and distribute them to the WMCore team.

  • Validate the cluster setup.

  • Create necessary mount points for backing up and restoring the databases.

  • Update documentation to reflect the latest changes.

  • Backup, and then eventually restore the databases to the new cluster.

Further Action Items to address the Undone Checklist Items

  • Check the list of databases:

    cms-db:PRIMARY> show dbs
    admin                      0.000GB
    config                     0.000GB
    ddm_monitoring             0.456GB
    local                      0.058GB
    msOutDB                    0.000GB
    msOutputDBPreProd          0.000GB
    msOutputDBProd             0.103GB
    msUnmergedDBPreProd        0.000GB
    msUnmergedDBProd           0.005GB
    msUnmergedDBcmsweb-test10  0.000GB
    msUnmergedDBcmsweb-test8   0.000GB
    msUnmergedDBcmsweb-test9   0.000GB
    rchauhan                   3.668GB
    
  • Create the backup of the necessary databases in vocms0750:

    #!/bin/bash

    # Define the MongoDB connection URI
    uri="mongodb://mongodb-cms.cern.ch:27017"

    # Define the MongoDB authentication options
    username="cmssw"
    password="xxxx"
    authdb="admin"

    # Define the output directory
    output_dir="/cephfs/product/mongodb"

    # Loop through each database and run mongodump
    for db_name in "ddm_monitoring" "msOutDB" "msOutputDBProd" "msUnmergedDBProd" "rchauhan"
    do
        echo "Dumping database: $db_name"
        mongodump -vvv --uri="$uri/$db_name?replicaSet=cms-db" --username="$username" --password="$password" --authenticationDatabase="$authdb" --out="$output_dir" --db="$db_name"
    done
    
    
  • Restoring the databases:

    
    #!/bin/bash

    # Define the MongoDB connection URI (all three replica-set members)
    uri="mongodb://cms-mongo-prod-node-0.cern.ch:32001,cms-mongo-prod-node-1.cern.ch:32002,cms-mongo-prod-node-2.cern.ch:32003"

    # Define the MongoDB authentication options
    username="cmssw"
    password="xxx"
    authdb="admin"

    # Define the backup directory
    backup_dir="/cephfs/product/mongodb"

    # Loop through each database and run mongorestore
    # (--dryRun only validates the run; remove it to actually restore)
    for db_name in "ddm_monitoring" "msOutDB" "msOutputDBProd" "msUnmergedDBProd" "rchauhan"
    do
        echo "Restoring database: $db_name"
        mongorestore -vvv --uri="$uri/$db_name?replicaSet=mongodb-prod" --username="$username" --password="$password" --authenticationDatabase="$authdb" "$backup_dir/$db_name" --dryRun
    done
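Before restoring, it may be worth confirming that every database actually has a dump on disk. A minimal sketch, assuming the one-sub-directory-per-database layout that `mongodump --out` produces; the function name is hypothetical.

```shell
# Sketch: list databases whose dump directory is missing or empty.
missing_dumps() {
  local backup_dir="$1" db
  shift
  for db in "$@"; do
    # mongodump --out writes one sub-directory per database.
    if [ -z "$(ls -A "$backup_dir/$db" 2>/dev/null)" ]; then
      echo "$db"
    fi
  done
}

# Usage before restoring:
#   missing=$(missing_dumps /cephfs/product/mongodb ddm_monitoring msOutDB msOutputDBProd msUnmergedDBProd rchauhan)
#   [ -z "$missing" ] || { echo "missing dumps: $missing" >&2; exit 1; }
```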
    
    

Helmchart publication is not working

From the latest action:

myrepo https://registry.cern.ch/chartrepo/cmsweb
Successfully packaged chart and saved it to: /home/runner/work/CMSKubernetes/CMSKubernetes/helm/rucio-consistency-0.4.2.tgz
Error: unknown flag: --username

And I confirm that none of the Rucio charts have been updated on the CERN repo in months. Can one of you fix this ASAP? We need this to be able to fully clean up the CERN site.

WMCore service, MSUnmerged, reported being slow.

We were contacted by the DMWM team, who reported the following:

For the last few days we started observing huge delays in DNS lookup queries from the MSUnmerged pods in the K8 production cluster. This results in an infinite polling cycle of the service and in practice prevent it from ever completing the iteration through all sites. Yesterday, @germanfgv alarmed us that it became noticeable for T2_CH_CERN.

When do we plan to test it?
I could not reproduce this issue yesterday but after the migration of some of the nodes, I will continue with it.

Any alternate comments?
It looks related to the issue dmwm/WMCore#11330.

Update the rolling upgrade procedure, and add automation to it.

Our main focus is to make the procedure descriptive enough that it does not cause any major outages after the migration. We also want to add scripts that automate more of it, e.g. deployments based on namespaces and automatic configuration of fluentd.

I am adding all the points that we need to focus on/include in the documentation, so that we don't miss anything.

  • Add more endpoint checks for the services. Some new ones that I have discovered are for das-server, dbs, and rucio monitor.
  • Include the nginx settings in the rolling upgrade document.
  • Create a separate directory for storing the secrets of each individual cluster. The .pem files can be encrypted (the procedure already followed for the DBS cluster).
  • Improve the procedure for stress testing the cluster.
  • Remove IT services that are not being used. One particular example is the fluentd service that was causing major issues with the nodes.
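The endpoint checks from the first item could be scripted along the following lines. This is a sketch under assumptions: the host and paths in the usage note are placeholders, and real cmsweb endpoints may additionally require a client certificate (`curl --cert` and `--key`).

```shell
# Sketch: probe service endpoints and flag those that do not answer with 2xx/3xx.
status_ok() {
  case "$1" in
    2??|3??) return 0 ;;
    *) return 1 ;;
  esac
}

check_endpoint() {
  local url="$1" code
  # %{http_code} prints only the HTTP status; 000 means no response at all.
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  if status_ok "$code"; then
    echo "OK   $code $url"
  else
    echo "FAIL $code $url"
  fi
}

# Usage (host and paths are placeholders):
#   for p in /service-a/status /service-b/info; do
#     check_endpoint "https://cluster.example$p"
#   done
```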

I will add more points to this.
