dmwm / cmskubernetes

Set of instructions and examples to deploy CMS data-services to Kubernetes cluster

Shell 80.47% Dockerfile 9.95% Go 1.49% CSS 0.12% HTML 0.49% Smarty 6.68% JavaScript 0.11% Python 0.10% PLSQL 0.58%

cmskubernetes's People

Contributors

ericvaandering nataliaratnikova nsmith- smuzaffar goughes amaltaro yuyiguo belforte muhammadimranfarooqi fabioespinosa cronosnull leggerf costahep ddaina panos512 dmielaikaite khurtado fernandogarzon mrceyhun todor-ivanov ivmfnal arooshap mapellidario pmandrik jrotter2 jhonatanamado novicecpp micsucmed bockjoo aamiralidev kyrylogy vkuznet guyzsarun nikodemas rukaiahbadran germanfgv abrinke1 d-ylee linarestoine anpicci

cmskubernetes's Issues

Implement FluxCD for CMS Monitoring cron jobs

Deployment of the monitoring cron jobs will be automated with FluxCD. After the cron jobs, all services will be automated, but the first implementation will include only CronJobs.

We may need a test GitHub repo to make testing easier.

Helm charts issue: #1214 and its PR by @kyrylogy #1215

How to install the FluxCD CLI on lxplus:

curl -s -O https://fluxcd.io/install.sh
# remove sudo command
sed -i 's/sudo ${CMD_MOVE}/${CMD_MOVE}/g' install.sh

# Change bin directory
sed -i 's+DEFAULT_BIN_DIR="/usr/local/bin"+DEFAULT_BIN_DIR="/afs/cern.ch/user/c/cuzunogl/private/bin"+g' install.sh

# Create user bin directory, because /usr/local/bin is not allowed
mkdir -p /afs/cern.ch/user/c/cuzunogl/private/bin
chmod -R +x /afs/cern.ch/user/c/cuzunogl/private/bin
# Add /afs/cern.ch/user/c/cuzunogl/private/bin to PATH in your `.bashrc` and source it

# Run script
bash install.sh

# All set

Ref: https://fluxcd.io/flux/installation/#install-the-flux-cli
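
The "add the bin directory to your `.bashrc`" step can be made repeatable with a small helper. This is only a sketch: the function name `add_to_path_once` and the example directory are hypothetical, and it assumes the rc file is a plain bash file that you are allowed to append to.

```shell
#!/bin/bash
# Append an "export PATH" line for a directory to an rc file, but only once.
# add_to_path_once is a hypothetical helper, not part of the Flux installer.
add_to_path_once() {
    local rc_file="$1" bin_dir="$2"
    local line="export PATH=\"${bin_dir}:\$PATH\""
    # grep -F matches the text literally; -x requires a whole-line match.
    if ! grep -qxF "$line" "$rc_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}

# Example (adjust the directory to your own account):
# add_to_path_once "$HOME/.bashrc" "$HOME/private/bin" && source "$HOME/.bashrc"
```

Running it twice leaves a single line in the file, so it is safe to keep in a setup script.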

Incorporating new robot certificates for crab and dmwm namespaces.

Hi,

Following the discussion we had at CMSKUBERNETES-229, I created separate robot accounts for CRAB and DMWM. After obtaining the certificates, I made sure that they are configured properly and have the same permissions as before.

Since we deploy these certificates as cluster secrets through the deploy-cluster-secrets.sh script, I changed this script to pick the proper certificate for each namespace. The changes can be reviewed here: #1351. If you have any suggestions to make it more efficient, let me know.

We can deploy these secrets in the dev clusters first and, once we are sure that the services are working properly, deploy the changes in testbed and eventually in production. Let me know your suggestions in this regard as well.

Note that this change does not require any of the teams to modify their deployment or Kubernetes manifest files.
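
To illustrate the idea of choosing a certificate per namespace, here is a minimal sketch. It is not the code from #1351: the function name `robot_cert_dir`, the directory layout, and the `default` fallback are all assumptions made for this example.

```shell
#!/bin/bash
# Map a namespace to the directory holding its robot certificate.
# The directory names are hypothetical placeholders, not the real layout.
robot_cert_dir() {
    local namespace="$1" base="${2:-/path/to/certificates}"
    case "$namespace" in
        crab) echo "${base}/crab" ;;
        dmwm) echo "${base}/dmwm" ;;
        *)    echo "${base}/default" ;;   # every other namespace shares one certificate
    esac
}

# Example: show which certificate directory each namespace would use.
for ns in crab dmwm das; do
    echo "${ns} -> $(robot_cert_dir "$ns")"
done
```

Keeping the mapping in one function means that adding another dedicated robot account later is a one-line change.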

Make a helm chart for statsd exporter

I or @nsmith- should do this. It will make the different versions of Rucio much easier to manage as there is only one hostname in the k8s files

CMSWeb Troubleshooting Documentation for DMWM.

How to check that host certificates have not expired, and who is in charge of renewing them if they have.
List of Kubernetes components and secret areas that control the DMWM deployment outside of the DMWM namespace.
A troubleshooting guide on how to debug problems in the components above.
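
For the certificate check, a sketch along these lines could go into the guide. It assumes the certificate is available as a PEM file and that `openssl` is installed; the function names are made up for this example.

```shell
#!/bin/bash
# Succeed (exit 0) if the certificate is still valid N days from now.
# openssl's -checkend takes a number of seconds and exits 1 if the
# certificate expires within that window.
cert_valid_for_days() {
    local cert_file="$1" days="$2"
    openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert_file" >/dev/null
}

# Print the expiry date (notAfter) of a certificate.
cert_end_date() {
    openssl x509 -enddate -noout -in "$1"
}

# Example (the path is a placeholder):
# cert_valid_for_days /path/to/hostcert.pem 30 || echo "renew soon: $(cert_end_date /path/to/hostcert.pem)"
```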

Deploy Spark crontab jobs as K8s CronJobs

As CMS Monitoring, we need to move our cron jobs that currently run as crontab entries, both in K8s and on virtual machines, to the K8s CronJob kind. All our K8s CronJobs that run Spark/Hadoop should be in the kubernetes/monitoring/cron-hdfs directory.

An example test for cmsmon-rucio-ds can be found at https://github.com/mrceyhun/CMSKubernetes/blob/f-cron-hdfs/kubernetes/monitoring/cron-hdfs/cron-test.yaml
Helpful commands and files for tests can be found in /afs/cern.ch/user/c/cuzunogl/public/kyrylo.
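
When migrating, the five-field schedule of a crontab entry can be reused as the `schedule` of a CronJob, and the rest of the line becomes the container command. The sketch below splits an entry accordingly; the function names and the example entry are hypothetical, and it assumes plain five-field entries (no `@daily`-style shortcuts or environment lines).

```shell
#!/bin/bash
# Print the schedule part (first five fields) of a crontab entry.
crontab_schedule() {
    local m h dom mon dow _rest
    read -r m h dom mon dow _rest <<< "$1"
    echo "$m $h $dom $mon $dow"
}

# Print the command part (everything after the five schedule fields).
crontab_command() {
    local m h dom mon dow rest
    read -r m h dom mon dow rest <<< "$1"
    echo "$rest"
}

# Example with a made-up entry:
entry='0 3 * * * /data/run_spark.sh --output /tmp/out'
echo "schedule: $(crontab_schedule "$entry")"
echo "command:  $(crontab_command "$entry")"
```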

Assignee: @kyrylogy

Documentation for a cmsweb-like dev environment

Valentin, as we discussed in the past week, it would be extremely useful for many CMSWEB application developers to have documentation on how to run a full deployment like the one we have nowadays.
That means deploying the service we want to test together with the frontend, frontend rules, frontend accounts/certificates mapping and so on, so that we can run the complete cycle and test services as they would be deployed in a production environment.

Something along the lines of https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb-nginx/docs/end-to-end.md , but really end-to-end ;)

add redirection rules for CRAB schedds to FrontEnd

as done in current frontend:
https://github.com/dmwm/deployment/blob/27580b5863583abcd7557003008d3c1a66737b48/frontend/backends-prod.txt#L15-L34

as discussed in
https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1699/1.html

I know that such an explicit list of VM names is not nice, but if it can be accommodated, it is surely much less work than the alternatives. All in all, that list changes very rarely.

clean-up helm area (rucio tar balls)

We need to remove all tar balls from the helm area. @ericvaandering could you please take care of that?

how to change CRABServer installation variant

@vkuznet @muhammadimranfarooqi with reference to

CMSKubernetes/docker/crabserver/install.sh

Line 7 in 52bbeee

PKGS="admin backend crabserver/preprod"

Currently the procedure in this repository builds a crabserver preprod flavor.
The 'preprod' tag there refers to one specific Oracle database instance.
There are 3 such instances: prod, preprod and dev, which allow us to
test and validate things, and we will eventually need to have all of them in K8s.

What is the suggested way to manage this?
So far @ddaina has switched cmsweb-test2 from preprod to test by building
the container with a modified file, but this is a fragile procedure in the long term.
I see that we have some cluster-specific files in
https://gitlab.cern.ch/cmsweb-k8s
while in this github.com/dmwm/CMSKubernetes repository the files are generic for all clusters.
But I do not see a place where one could, e.g., customize the k8s yaml files.

One possibility that comes to mind is to allow building different containers: crabserver-prod, crabserver-preprod, crabserver-dev with different versions of

CMSKubernetes/docker/crabserver/install.sh

Line 7 in 52bbeee

PKGS="admin backend crabserver/preprod"

and use the appropriate one as image in the various yaml files.
What do you think ?
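
One way to avoid maintaining several copies of install.sh would be to derive the package list from a single parameter. The sketch below is only an illustration of that idea: `CRAB_FLAVOR` is a hypothetical variable (for example set through a Docker build argument), install.sh does not read it today, and the accepted values are the three instances named above.

```shell
#!/bin/bash
# Build the package list for a given CRABServer flavor.
# Rejects anything other than the three database instances listed above.
crab_pkgs() {
    local flavor="${1:-preprod}"
    case "$flavor" in
        prod|preprod|dev)
            echo "admin backend crabserver/${flavor}" ;;
        *)
            echo "unknown CRABServer flavor: ${flavor}" >&2
            return 1 ;;
    esac
}

# In install.sh this could replace the hard-coded line, assuming the
# (hypothetical) CRAB_FLAVOR variable is set at build time:
# PKGS="$(crab_pkgs "${CRAB_FLAVOR:-preprod}")" || exit 1
```

With no argument the function returns the current hard-coded value, so existing builds would behave as before.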

[Introduction.md] The proxy.sh mentioned is missing.

The introduction.md tutorial mentions a proxy.sh from the frontend folder, linked to docker/frontend/proxy.sh, which is missing.
I guess it should be docker/proxy/proxy.sh

New endpoint in cmsweb which replaces NodePort

The cmsmon-rucio-mon-goweb service is deployed on the cmsweb-test1 cluster for now, and we are using a NodePort to access the web service. Can we replace this NodePort with an endpoint like cmsmon-rucio? The service web page is http://cmsweb-test1.cern.ch:31280/, so it would become http://cmsweb-test1.cern.ch/cmsmon-rucio

I think we need an ingress rule similar to the one for DAS. However, no authentication is currently implemented in this service. It is on our TODO list, but it will not be ready in the near future. So, can we implement this without authentication (I mean without APS/XPS/SPS; a grid certificate requirement is okay)?

When everything is ready on the ingress side, I can change the NodePort to a container port.

fyi @arooshap @muhammadimranfarooqi

Start using pull request/merge model

Now that you have a fork, it'd be good to start using a more formal development model for the repo. :-)

Create helm charts for cms monitoring spark running cron jobs

Eventually we will automate our deployment process with FluxCD and helm charts. Currently, our k8s CronJobs are ready to run and we need to prepare their Helm charts.

How

CronJobs have common definitions like concurrencyPolicy, failedJobsHistoryLimit and backoffLimit. There are also values that change per cron job, and they should be arranged in the Helm values. In templates, we can have cronjob.yaml and service.yaml initially. We can evaluate getting rid of ConfigMaps and providing commands directly in values.yaml for each cron.

We can use a values.yaml structure like in this SO entry: https://stackoverflow.com/a/73571078/6123088 . An example can be found here: https://github.com/mrceyhun/CMSKubernetes/blob/f-cmsmon-crons-helm/helm/cmsmon-cron/values.yaml .
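
The split between common definitions and per-cron values can be tried out in plain shell before writing the chart. The sketch below prints a CronJob manifest in which the shared settings are fixed and only name, schedule and image vary. The function name, the chosen values (`Forbid`, `1`, `0`, `Never`) and the example image are assumptions for illustration; a real Helm template would read them from values.yaml.

```shell
#!/bin/bash
# Print a CronJob manifest: shared settings are fixed, the rest are arguments.
render_cronjob() {
    local name="$1" schedule="$2" image="$3"
    cat <<EOF
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ${name}
spec:
  schedule: "${schedule}"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      backoffLimit: 0
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: ${name}
            image: ${image}
EOF
}

# Example with made-up values; pipe into
# "kubectl apply --dry-run=client -f -" to validate against a cluster.
# render_cronjob cmsmon-demo "0 3 * * *" registry.example/cmsmon:latest
```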

Tasks

Create first version of helm charts for cmsmon cronjobs (kyrylo)
Parametrize for prod and test (kyrylo)
Saving cron logs in S3 using fluentd (ceyhun)
Make the EOS volume mount conditional per cronjob (some of them do not need it) (kyrylo)
Implement ingress for port management (ceyhun)
Evaluate using SOPS secret management using cmsweb sops service (ceyhun)

Helpful documentations:

@kyrylogy Let's share the tasks in this issue. As we discussed, just create the initial version of the helm chart and we can decide our direction later. I thought that we could give priority to the first task over the CMSSpark tasks ;)

Placeholder: Hadoop access in k8s

This is a placeholder for the consistency checking scripts. How can we get access, from a Kubernetes pod, to the same Hadoop as in the analytics cluster, in order to read in the results of a job?

Add to end-to-end documentation the OS project export requirement

From the training, please add instructions requesting the openstack project environment to be properly set.

Add logstash parsers for cmsweb services

This is the placeholder ticket to add cmsweb service-specific parsers to logstash. For simplicity, let's gather here sample messages collected by filebeat for each service:

workqueue
reqmon

"message":"[14/Oct/2019:16:39:19] reqmon-df69c6598-tk58z 127.0.0.1 \"GET /wmstatsserver/data/info HTTP/1.1\" 200 OK [data: 346 in 39 out 1982 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-reqmon\" ]"

reqmgr2

[15/Oct/2019:02:00:39] reqmgr2-89f7df4fd-95g9q 137.138.33.200 "GET /reqmgr2/data/request?name=amaltaro_TaskChain_InclParents_Oct2019_Val_191010_125845_9547 HTTP/1.1" 200 OK [data: 2369 in 1862 out 530492 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0192.cern.ch" "" ] [ref: "" "WMCore.Services.Requests/v002" ]

reqmgr2ms

"message":"[14/Oct/2019:16:40:37] reqmgr2ms-6796895484-7vzj6 127.0.0.1 \"GET /ms-transferor/data/status HTTP/1.1\" 200 OK [data: 351 in 158 out 683 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-reqmgr2ms\" ]"

couchdb

"message":"[Mon, 14 Oct 2019 14:39:59 GMT] [info] [<0.26674.38>] 127.0.0.1 - - GET /workqueue/_design/WorkQueue/index.html 200"

phedex

"message":"::1 - - - [14/Oct/2019:16:44:48 +0200] \"GET /phedex/datasvc/doc HTTP/1.1\" 200 15229 \"-\" \"ServerMonitor-phedex\""

"message":"INFO:cherrypy.access:[14/Oct/2019:16:43:52] dbs-phys03-w-5fd9b6ffc-9kqxb 127.0.0.1 \"GET /dbs/ HTTP/1.1\" 200 OK [data: 298 in 468 out 1320 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor/2.0\" ]"
"message":"127.0.0.1 - - [14/Oct/2019:16:45:47] \"GET / HTTP/1.1\" 200 22 \"\" \"ServerMonitor/2.0\""

{"DASQuery":{"query":"dataset=/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/MINIAODSIM","hash":"705c64bd8a82ee31aa9607dada4a6208","spec":{"dataset":"/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/MINIAODSIM"},"fields":["dataset"],"pipe":"","instance":"prod/global","detail":false,"system":"","filters":{},"aggregators":[],"error":"","tstamp":1570577955},"PID":"705c64bd8a82ee31aa9607dada4a6208","ProcessTime":10.645284837,"Unix":1570577956,"level":"info","msg":"ready","time":"2019-10-09T01:39:16+02:00"}

crabserver

"message":"[14/Oct/2019:16:45:22]  RESTSQL:ovUDzLxBPpVb RELEASED cmsweb_analysis_preprod@devdb11 timeout=300 inuse=0 idle=1"

crabcache

"message":"[14/Oct/2019:16:46:21] crabcache-6c7f6559d6-5ksbz 127.0.0.1 \"GET /crabcache/info HTTP/1.1\" 200 OK [data: 340 in 69 out 608 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-crabcache\" ]"

dqmgui
t0wmadatasvc

"message":"[14/Oct/2019:16:44:21] t0wmadatasvc-74d87c769b-p8zcx 127.0.0.1 \"GET /t0wmadatasvc/prod/hello HTTP/1.1\" 200 OK [data: 352 in 25 out 64701 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-t0wmadatasvc\" ]"

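Several of the samples above (reqmon, reqmgr2, reqmgr2ms, crabcache, t0wmadatasvc) share one access-log layout, so a single pattern could cover them. Before writing the logstash parser, the field boundaries can be checked with a bash regular expression. This is a sketch, not a grok pattern: the function name and the output format are made up, and it expects the unescaped message text (without the surrounding "message":"..." wrapper and backslashes).

```shell
#!/bin/bash
# Extract common fields from a CherryPy-style cmsweb access log line.
# Prints key=value pairs; returns 1 if the line has a different layout.
parse_access_line() {
    local line="$1"
    # [timestamp] pod client "METHOD path HTTP/x.y" status ...
    local re='^\[([^]]+)\] ([^ ]+) ([^ ]+) "([A-Z]+) ([^ ]+) HTTP/[0-9.]+" ([0-9]+) '
    if [[ $line =~ $re ]]; then
        printf 'timestamp=%s pod=%s client=%s method=%s path=%s status=%s\n' \
            "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" \
            "${BASH_REMATCH[4]}" "${BASH_REMATCH[5]}" "${BASH_REMATCH[6]}"
    else
        return 1
    fi
}

# Example using the crabcache sample from above:
parse_access_line '[14/Oct/2019:16:46:21] crabcache-6c7f6559d6-5ksbz 127.0.0.1 "GET /crabcache/info HTTP/1.1" 200 OK [data: 340 in 69 out 608 us ]'
```

The couchdb, phedex and crabserver samples use other layouts and would need their own patterns.
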
Add docker login instructions on the building documentation

As just discussed on the training, please add instructions highlighting how one can upload the images to docker hub to:
https://github.com/dmwm/CMSKubernetes/blob/master/docker/README.md#how-to-build-docker-image-for-cms-data-service
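
As a starting point for those instructions, the usual sequence is `docker login`, `docker build`, `docker push`. The sketch below wraps them in a helper that only prints the commands by default, so it can be reviewed safely; the `run` and `publish_image` names, the `DRY_RUN` switch and the example repository are assumptions for this example.

```shell
#!/bin/bash
# Print commands instead of running them unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Log in to the registry, build the image and push it.
# Arguments: repository (e.g. user/service), tag, build directory.
publish_image() {
    local repo="$1" tag="$2" dir="$3"
    run docker login
    run docker build -t "${repo}:${tag}" "$dir"
    run docker push "${repo}:${tag}"
}

# Example with a placeholder repository; set DRY_RUN=0 to really execute.
# publish_image myuser/myservice 1.0 ./myservice
```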

Cannot have two different directories of helm charts

The chart releaser action overwrites things when it's like that. The Rucio ones are overwritten if you update a cmsweb or vice versa.

So there are only two possibilities. Either we all put our charts in the same directory (and if that's what we want, I'll make the PR since it's a little bit tricky), or we move the Rucio stuff to its own repository. Maybe CMSKubernetes has outgrown its usefulness. Up to you. @arooshap @muhammadimranfarooqi @vkuznet

Migrate/Copy all MongoDB databases to the new MongoDB As A Service

Migrating MongoDB to a New Cluster.

April 18, 2023

Overview

WMCore#11534
We are migrating the MongoDB cluster from the old load-balancer setup to a newer architecture that does not need a load balancer.

Checklist for CMSWEB team

Provision of new clusters for production and testbed.
Create necessary roles and distribute them to the WMCore team.
Validate the cluster setup.
Create necessary mount points for backing up and restoring the databases.
Update documentation to reflect the latest changes.
Backup, and then eventually restore the databases to the new cluster.

Further Action Items to address the Undone Checklist Items

Check the list of databases:

cms-db:PRIMARY> show dbs
admin                      0.000GB
config                     0.000GB
ddm_monitoring             0.456GB
local                      0.058GB
msOutDB                    0.000GB
msOutputDBPreProd          0.000GB
msOutputDBProd             0.103GB
msUnmergedDBPreProd        0.000GB
msUnmergedDBProd           0.005GB
msUnmergedDBcmsweb-test10  0.000GB
msUnmergedDBcmsweb-test8   0.000GB
msUnmergedDBcmsweb-test9   0.000GB
rchauhan                   3.668GB

Create the backup of the necessary databases in vocms0750:

#!/bin/bash

# Define the MongoDB connection URI
uri="mongodb://mongodb-cms.cern.ch:27017"

# Define the MongoDB authentication options
username="cmssw"
password="xxxx"
authdb="admin"

# Define the output directory
output_dir="/cephfs/product/mongodb"

# Loop through each database and run mongodump
for db_name in "ddm_monitoring" "msOutDB" "msOutputDBProd" "msUnmergedDBProd" "rchauhan"
do
    echo "Dumping database: $db_name"
    mongodump -vvv --uri="$uri/$db_name?replicaSet=cms-db" --username="$username" --password="$password" --authenticationDatabase="$authdb" --out="$output_dir" --db="$db_name"
done

Restoring the databases:


#!/bin/bash

# Define the MongoDB connection URI (all three replica set members)
uri="mongodb://cms-mongo-prod-node-0.cern.ch:32001,cms-mongo-prod-node-1.cern.ch:32002,cms-mongo-prod-node-2.cern.ch:32003"

# Define the MongoDB authentication options
username="cmssw"
password="xxx"
authdb="admin"

# Define the backup directory
backup_dir="/cephfs/product/mongodb"

# Loop through each database and run mongorestore
# (--dryRun only checks the restore; remove it to actually write the data)
for db_name in "ddm_monitoring" "msOutDB" "msOutputDBProd" "msUnmergedDBProd" "rchauhan"
do
    echo "Restoring database: $db_name"
    mongorestore -vvv --uri="$uri/$db_name?replicaSet=mongodb-prod" --username="$username" --password="$password" --authenticationDatabase="$authdb" "$backup_dir/$db_name" --dryRun
done
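
Before restoring, it may be worth confirming that every database actually has a dump directory, since mongodump writes each database to its own subdirectory under the `--out` path. The following is a small sketch with a hypothetical helper name; it only checks that the directories exist, not that their contents are complete.

```shell
#!/bin/bash
# Print the names of databases that have no dump directory under backup_dir.
missing_dumps() {
    local backup_dir="$1"; shift
    local db_name
    for db_name in "$@"; do
        [ -d "$backup_dir/$db_name" ] || echo "$db_name"
    done
}

# Example with the backup location and database list used above:
# missing="$(missing_dumps /cephfs/product/mongodb ddm_monitoring msOutDB msOutputDBProd msUnmergedDBProd rchauhan)"
# [ -z "$missing" ] || { echo "missing dumps: $missing" >&2; exit 1; }
```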

Helmchart publication is not working

From the latest action:

myrepo https://registry.cern.ch/chartrepo/cmsweb
Successfully packaged chart and saved it to: /home/runner/work/CMSKubernetes/CMSKubernetes/helm/rucio-consistency-0.4.2.tgz
Error: unknown flag: --username

And I confirm that none of the Rucio charts have been updated on the CERN repo in months. Can one of you fix this ASAP? We need this to be able to fully clean up the CERN site.

WMCore service, MSUnmerged, reported being slow.

We were contacted by the DMWM team, who reported:

For the last few days we have been observing huge delays in DNS lookup queries from the MSUnmerged pods in the K8s production cluster. This results in an infinite polling cycle of the service and in practice prevents it from ever completing the iteration through all sites. Yesterday, @germanfgv alerted us that it had become noticeable for T2_CH_CERN.

When are the plans to test it?
I could not reproduce this issue yesterday but after the migration of some of the nodes, I will continue with it.

Any alternate comments?
It looks related to the issue dmwm/WMCore#11330.
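
To try to reproduce the delays, lookup times can be sampled from inside a pod. This is a rough sketch, assuming a Linux image with GNU `date` (for nanosecond timestamps) and `getent`; the function name is made up, and `getent` goes through the system resolver, so entries in /etc/hosts or a local cache will make lookups look faster than a real DNS query.

```shell
#!/bin/bash
# Print how long one name lookup takes, in milliseconds.
lookup_ms() {
    local host="$1" start end
    start=$(date +%s%N)                 # nanoseconds since the epoch (GNU date)
    getent hosts "$host" >/dev/null
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example: sample the same name a few times and look for outliers.
# for i in 1 2 3 4 5; do lookup_ms cmsweb.cern.ch; done
```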

Avoid updating all packages every time a new service is built

As discussed in this HN thread: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1693/1.html

it would be useful to move those yum update & yum clean all steps to the cmsweb base image only, so that services use a consistent set of software versions and also benefit from faster builds.

Update the rolling upgrade procedure, and add automation to it.

Our main focus is to make the procedure descriptive enough that it does not cause any major outages after the migration. We should also add some scripts to make it more automated, e.g. deployments based on namespaces, automatic configuration of fluentd, etc.

I am adding all the points that we need to focus on/include in the documentation, so that we don't miss anything.

Add more endpoint checks for the services. Some new ones that I have discovered are for das-server, dbs, and rucio monitor.
Include the nginx settings in the rolling upgrade document.
Create a separate directory for storing the secrets of each individual cluster. The .pem files can be encrypted (the procedure that was already being followed for the DBS cluster).
Improve the procedure for stress testing the cluster.
Remove IT services that are not being used. One particular example is the fluentd service that was causing major issues with the nodes.

I will add more points to this.

Review and remove code placed in CMSKubernetes/kubernetes/rucio and docker/rucio

Finish CMSRucio #287
Do the review, move what's needed to CMSRucio