dmwm / CMSKubernetes
Set of instructions and examples to deploy CMS data-services to a Kubernetes cluster
Deployment of the monitoring cron jobs will be automated with FluxCD. After the cron jobs, all services will be automated, but the first implementation will include only CronJobs.
We may need a test GitHub repo for easy testing.
Helm charts issue: #1214 and its PR by @kyrylogy #1215
curl -s -O https://fluxcd.io/install.sh
# Remove the sudo command
sed -i 's/sudo ${CMD_MOVE}/${CMD_MOVE}/g' install.sh
# Change the bin directory
sed -i 's+DEFAULT_BIN_DIR="/usr/local/bin"+DEFAULT_BIN_DIR="/afs/cern.ch/user/c/cuzunogl/private/bin"+g' install.sh
# Create a user bin directory, because writing to /usr/local/bin is not allowed
mkdir -p /afs/cern.ch/user/c/cuzunogl/private/bin
chmod -R +x /afs/cern.ch/user/c/cuzunogl/private/bin
# Add /afs/cern.ch/user/c/cuzunogl/private/bin to your PATH in `.bashrc` and source it
# Run the script
bash install.sh
# All set
Ref: https://fluxcd.io/flux/installation/#install-the-flux-cli
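The effect of the two sed edits above can be checked on a small stand-in file; the two lines below only imitate the relevant parts of install.sh, and the target directory is a placeholder:

```shell
# Demo of the two sed edits on a stand-in for install.sh (not the real installer)
cat > /tmp/install-demo.sh <<'EOF'
DEFAULT_BIN_DIR="/usr/local/bin"
sudo ${CMD_MOVE} flux "$DEFAULT_BIN_DIR"
EOF
# Same substitutions as above, with a placeholder user bin directory
sed -i 's/sudo ${CMD_MOVE}/${CMD_MOVE}/g' /tmp/install-demo.sh
sed -i 's+DEFAULT_BIN_DIR="/usr/local/bin"+DEFAULT_BIN_DIR="$HOME/private/bin"+g' /tmp/install-demo.sh
# The file now moves the binary without sudo, into the user directory
cat /tmp/install-demo.sh
```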
Hi,
Following the discussion we had at CMSKUBERNETES-229, I created separate robot accounts for CRAB and DMWM. After obtaining the certificates, I made sure that they are configured properly and have the same permissions as they had previously.
Since we deploy these certificates as cluster-secrets through the deploy-cluster-secrets.sh script, I made some changes to this script to pick the proper certificate according to the namespace; the changes can be reviewed here: #1351. If you have any suggestions to make it more efficient, let me know.
We can deploy these secrets in the dev clusters, and once we are sure that the services are working properly, we can then deploy the changes in testbed, and eventually production. Let me know your suggestions in this regard as well.
Note that this change does not require any of the teams to modify their deployment or kubernetes manifest files.
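For illustration, the per-namespace selection could look like the small helper below; the function name and secret names are hypothetical, not the actual logic of PR #1351:

```shell
# Hypothetical helper: pick the robot certificate secret for a namespace.
# All names here are illustrative, not the real deploy-cluster-secrets.sh logic.
cert_for_namespace() {
  case "$1" in
    crab*) echo "robot-cert-crab" ;;   # CRAB robot account
    dmwm*) echo "robot-cert-dmwm" ;;   # DMWM robot account
    *)     echo "robot-cert-cmsweb" ;; # default shared certificate
  esac
}

cert_for_namespace crab
```

The deployment script would then create the cluster secret from the selected certificate, e.g. `kubectl -n crab create secret generic "$(cert_for_namespace crab)" ...` (also hypothetical).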
I or @nsmith- should do this. It will make the different versions of Rucio much easier to manage as there is only one hostname in the k8s files
As CMS Monitoring, we need to move our cron jobs, currently running via crontab both in K8s and on virtual machines, to the K8s CronJob kind. All our K8s CronJobs that run Spark/Hadoop should be in the kubernetes/monitoring/cron-hdfs directory.
cmsmon-rucio-ds can be found at: https://github.com/mrceyhun/CMSKubernetes/blob/f-cron-hdfs/kubernetes/monitoring/cron-hdfs/cron-test.yaml
/afs/cern.ch/user/c/cuzunogl/public/kyrylo
Assignee: @kyrylogy
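For reference, the CronJob kind being migrated to has roughly the shape below; the name, namespace, schedule, and image are placeholders, not the contents of cron-test.yaml:

```yaml
# Minimal K8s CronJob sketch (placeholder values throughout)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cmsmon-example-cron
  namespace: monitoring
spec:
  schedule: "0 * * * *"        # run hourly
  concurrencyPolicy: Forbid    # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: cron
            image: registry.example.org/cmsmon/example:latest
            command: ["/bin/sh", "-c", "echo run the Spark/Hadoop job here"]
```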
Valentin, as we discussed in the past week, it would be extremely useful to many CMSWEB application developers to have instructions on how one can run a full deployment as we have nowadays.
That means deploying the service we want together with the frontend, the frontend rules, the frontend accounts/certificates mapping and so on, such that we can cover the complete cycle and test services as they would get deployed in a production environment.
Something along the lines of https://github.com/dmwm/CMSKubernetes/blob/master/kubernetes/cmsweb-nginx/docs/end-to-end.md , but really end-to-end ;)
as done in the current frontend:
https://github.com/dmwm/deployment/blob/27580b5863583abcd7557003008d3c1a66737b48/frontend/backends-prod.txt#L15-L34
as discussed in
https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1699/1.html
I know that such an explicit list of VM names is not nice, but if it can be accommodated, it surely is much less work than the alternatives. All in all, that list changes very rarely.
We need to remove all tarballs from the helm area. @ericvaandering, could you please take care of that?
@vkuznet @muhammadimranfarooqi with reference to
currently the procedure in this repository builds a crabserver preprod flavor.
The 'preprod' tag there references one specific Oracle database instance.
There are 3 such instances: prod, preprod and dev, which allow us to
test and validate things, and we will eventually need to have all of them in K8s.
What is the suggested way to manage this?
So far @ddaina has switched cmsweb-test2 from preprod to test by building
the container with a modified file, but this is a fragile procedure in the long term.
I see that we have some cluster-specific files in
https://gitlab.cern.ch/cmsweb-k8s
while in this github.com/dmwm/CMSKubernetes repository the files are generic for all clusters.
But I do not see a place where one can e.g. customize the k8s yaml files.
One possibility that comes to mind is to allow building different containers: crabserver-prod, crabserver-preprod, crabserver-dev with different versions of
and use the appropriate one as the image in the various yaml files.

In the introduction.md
tutorial there is a mention of a proxy.sh from the frontend folder, linked to
docker/frontend/proxy.sh, which is missing.
I guess it should be docker/proxy/proxy.sh
The cmsmon-rucio-mon-goweb service is deployed on the cmsweb-test1 cluster for now, and we are using a NodePort to access the web service. Can we replace this NodePort with an endpoint like cmsmon-rucio?
The service web page is http://cmsweb-test1.cern.ch:31280/, so it would become http://cmsweb-test1.cern.ch/cmsmon-rucio
I think we need an ingress rule similar to DAS. However, currently no authentication is implemented in this service. It is on our TODO list but will not be ready in the near future. So, without authentication (I mean without APS/XPS/SPS, but a grid certificate requirement is okay), can we implement this?
When everything is ready on the ingress side, I can change the NodePort to a container port.
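A DAS-like, path-based ingress rule might look like the sketch below; the service name and port number are assumptions based on the description above, not a tested configuration:

```yaml
# Hypothetical ingress rule exposing the service under /cmsmon-rucio
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cmsmon-rucio
spec:
  rules:
  - host: cmsweb-test1.cern.ch
    http:
      paths:
      - path: /cmsmon-rucio
        pathType: Prefix
        backend:
          service:
            name: cmsmon-rucio-mon-goweb   # assumed service name
            port:
              number: 8080                 # assumed container port
```

With such a rule in place, the NodePort service could be switched to a ClusterIP service exposing only the container port.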
Now that you have a fork, it'd be good to start using a more formal development model for the repo. :-)
Eventually we will automate our deployment process with FluxCD and helm charts. Currently, our k8s CronJobs are ready to run and we need to prepare their Helm charts.
CronJobs have common definitions like concurrencyPolicy, failedJobsHistoryLimit and backoffLimit. There are also values that change per job, and those should be arranged in helm. In templates, we can have cronjob.yaml and service.yaml initially. We can evaluate getting rid of ConfigMaps and providing the commands directly in values.yaml for each cron.
We can use a values.yaml structure like the one in this SO entry: https://stackoverflow.com/a/73571078/6123088 . An example can be found here: https://github.com/mrceyhun/CMSKubernetes/blob/f-cmsmon-crons-helm/helm/cmsmon-cron/values.yaml .
prod and test (kyrylo)
@kyrylogy Let's share the tasks in this issue. As we talked, just create the initial version of the helm chart and later we can decide our direction. I thought we might give priority to the first task over the CMSSpark tasks ;)
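Following the structure from the SO answer linked above, a hypothetical values.yaml could keep the common definitions in one block and a per-cron map for the values that change (all names, schedules, and images below are illustrative):

```yaml
# Hypothetical values.yaml layout for a shared cronjob chart
common:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 3
  backoffLimit: 0
cronjobs:
  cmsmon-rucio-ds:
    schedule: "0 */6 * * *"
    image: registry.example.org/cmsmon/rucio-ds:latest
    command: ["/bin/sh", "-c", "run.sh"]
  another-cron:
    schedule: "30 2 * * *"
    image: registry.example.org/cmsmon/other:latest
    command: ["/bin/sh", "-c", "other.sh"]
```

A templates/cronjob.yaml would then `range` over `.Values.cronjobs`, merging each entry with `.Values.common`, which would also remove the need for per-cron ConfigMaps.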
This is a placeholder for the consistency checking scripts. How do/can we get access to the same Hadoop in the analytic cluster from a kubernetes pod to read in the results of a job?
From the training: please add instructions on requesting that the openstack project environment be properly set.
This is the placeholder ticket to add cmsweb service-specific parsers to logstash. For simplicity, let's collect all messages collected by filebeat:
"message":"[14/Oct/2019:16:39:19] reqmon-df69c6598-tk58z 127.0.0.1 \"GET /wmstatsserver/data/info HTTP/1.1\" 200 OK [data: 346 in 39 out 1982 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-reqmon\" ]"
[15/Oct/2019:02:00:39] reqmgr2-89f7df4fd-95g9q 137.138.33.200 "GET /reqmgr2/data/request?name=amaltaro_TaskChain_InclParents_Oct2019_Val_191010_125845_9547 HTTP/1.1" 200 OK [data: 2369 in 1862 out 530492 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0192.cern.ch" "" ] [ref: "" "WMCore.Services.Requests/v002" ]
"message":"[14/Oct/2019:16:40:37] reqmgr2ms-6796895484-7vzj6 127.0.0.1 \"GET /ms-transferor/data/status HTTP/1.1\" 200 OK [data: 351 in 158 out 683 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-reqmgr2ms\" ]"
"message":"[Mon, 14 Oct 2019 14:39:59 GMT] [info] [<0.26674.38>] 127.0.0.1 - - GET /workqueue/_design/WorkQueue/index.html 200"
"message":"::1 - - - [14/Oct/2019:16:44:48 +0200] \"GET /phedex/datasvc/doc HTTP/1.1\" 200 15229 \"-\" \"ServerMonitor-phedex\""
"message":"INFO:cherrypy.access:[14/Oct/2019:16:43:52] dbs-phys03-w-5fd9b6ffc-9kqxb 127.0.0.1 \"GET /dbs/ HTTP/1.1\" 200 OK [data: 298 in 468 out 1320 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor/2.0\" ]"
"message":"127.0.0.1 - - [14/Oct/2019:16:45:47] \"GET / HTTP/1.1\" 200 22 \"\" \"ServerMonitor/2.0\""
{"DASQuery":{"query":"dataset=/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/MINIAODSIM","hash":"705c64bd8a82ee31aa9607dada4a6208","spec":{"dataset":"/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v2/MINIAODSIM"},"fields":["dataset"],"pipe":"","instance":"prod/global","detail":false,"system":"","filters":{},"aggregators":[],"error":"","tstamp":1570577955},"PID":"705c64bd8a82ee31aa9607dada4a6208","ProcessTime":10.645284837,"Unix":1570577956,"level":"info","msg":"ready","time":"2019-10-09T01:39:16+02:00"}
"message":"[14/Oct/2019:16:45:22] RESTSQL:ovUDzLxBPpVb RELEASED cmsweb_analysis_preprod@devdb11 timeout=300 inuse=0 idle=1"
"message":"[14/Oct/2019:16:46:21] crabcache-6c7f6559d6-5ksbz 127.0.0.1 \"GET /crabcache/info HTTP/1.1\" 200 OK [data: 340 in 69 out 608 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-crabcache\" ]"
"message":"[14/Oct/2019:16:44:21] t0wmadatasvc-74d87c769b-p8zcx 127.0.0.1 \"GET /t0wmadatasvc/prod/hello HTTP/1.1\" 200 OK [data: 352 in 25 out 64701 us ] [auth: OK \"\" \"\" ] [ref: \"\" \"ServerMonitor-t0wmadatasvc\" ]"
As just discussed at the training, please add instructions highlighting how one can upload the images to docker hub, at:
https://github.com/dmwm/CMSKubernetes/blob/master/docker/README.md#how-to-build-docker-image-for-cms-data-service
The chart releaser action overwrites things when it's set up like that: the Rucio charts are overwritten if you update a cmsweb one, or vice versa.
So there are only two possibilities. We all put our charts in the same directory (and if that's what we want, I'll make the PR, since it's a little bit tricky), or we just move the Rucio stuff to its own repository. Maybe CMSKubernetes has outgrown its usefulness. Up to you. @arooshap @muhammadimranfarooqi @vkuznet
April 18, 2023
WMCore#11534
We are carrying out the migration of the MongoDB cluster from the old load-balancer setup to a newer architecture that removes the need for a load-balancing setup.
Provision of new clusters for production and testbed.
Create necessary roles and distribute them to the WMCore team.
Validate the cluster setup.
Create necessary mount points for backing up and restoring the databases.
Update documentation to reflect the latest changes.
Backup, and then eventually restore the databases to the new cluster.
Check the list of databases:
cms-db:PRIMARY> show dbs
admin 0.000GB
config 0.000GB
ddm_monitoring 0.456GB
local 0.058GB
msOutDB 0.000GB
msOutputDBPreProd 0.000GB
msOutputDBProd 0.103GB
msUnmergedDBPreProd 0.000GB
msUnmergedDBProd 0.005GB
msUnmergedDBcmsweb-test10 0.000GB
msUnmergedDBcmsweb-test8 0.000GB
msUnmergedDBcmsweb-test9 0.000GB
rchauhan 3.668GB
Create the backup of the necessary databases on vocms0750:
#!/bin/bash
# Define the MongoDB connection URI
uri="mongodb://mongodb-cms.cern.ch:27017"
# Define the MongoDB authentication options
username="cmssw"
password="xxxx"
authdb="admin"
# Define the output directory
output_dir="/cephfs/product/mongodb"
# Loop through each database and run mongodump
for db_name in "ddm_monitoring" "msOutDB" "msOutputDBProd" "msUnmergedDBProd" "rchauhan"
do
echo "Dumping database: $db_name"
mongodump -vvv --uri="$uri/$db_name?replicaSet=cms-db" --username="$username" --password="$password" --authenticationDatabase="$authdb" --out="$output_dir"
done
Restoring the databases:
#!/bin/bash
# Define the MongoDB connection URI
uri="mongodb://cms-mongo-prod-node-0.cern.ch:32001,cms-mongo-prod-node-1.cern.ch:32002,cms-mongo-prod-node-2.cern.ch:32003"
# Define the MongoDB authentication options
username="cmssw"
password="xxx"
authdb="admin"
# Define the backup directory
backup_dir="/cephfs/product/mongodb"
# Loop through each database and run mongorestore
for db_name in "ddm_monitoring" "msOutDB" "msOutputDBProd" "msUnmergedDBProd" "rchauhan"
do
echo "Restoring database: $db_name"
mongorestore -vvv --uri="$uri/$db_name?replicaSet=mongodb-prod" --username="$username" --password="$password" --authenticationDatabase="$authdb" "$backup_dir/$db_name" --dryRun
done
From the latest action:
myrepo https://registry.cern.ch/chartrepo/cmsweb
Successfully packaged chart and saved it to: /home/runner/work/CMSKubernetes/CMSKubernetes/helm/rucio-consistency-0.4.2.tgz
Error: unknown flag: --username
And I confirm that none of the Rucio charts have been updated on the CERN repo in months. Can one of you fix this ASAP? We need this to be able to fully clean up the CERN site.
We were contacted by the DMWM team, who reported that:
For the last few days we have been observing huge delays in DNS lookup queries from the MSUnmerged pods in the K8s production cluster. This results in an infinite polling cycle of the service and in practice prevents it from ever completing the iteration through all sites. Yesterday @germanfgv alerted us that it became noticeable for T2_CH_CERN.
What are the plans to test it?
I could not reproduce this issue yesterday, but after the migration of some of the nodes I will continue with it.
Any other comments?
It looks related to the issue dmwm/WMCore#11330.
As discussed in this HN thread: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/1693/1.html
it would be useful to move those yum update && yum clean all steps
to the cmsweb base image only, such that services can use a consistent set of software versions and also benefit from a build speed-up.
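As an illustration of the idea, the yum steps would run once in a shared base image and the service images would simply inherit from it; the image names below are placeholders:

```dockerfile
# Hypothetical cmsweb base image: run the yum steps once here
FROM cern/cc7-base:latest
RUN yum update -y && yum clean all && rm -rf /var/cache/yum

# A service image would then start from the base and skip yum entirely:
# FROM registry.example.org/cmsweb/base:latest
# COPY . /data/srv/current
```

Service builds would then get a consistent package set and avoid repeating the slow yum update in every image.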
Our main focus should be to make the documentation descriptive enough not to cause any major outages after the migration. Also, we should add some additional scripts to make the process more automated, i.e. deployments based on namespaces, automatic configuration of fluentd, etc.
I am adding all the points that we need to focus on/include in the documentation, so that we don't miss anything.
I will add more points to this.