Would like to open discussion on this topic. Some suggestions:

Cassandra: monitoring about firecamp HOT 30 CLOSED

cloudstax commented on May 18, 2024

Cassandra: monitoring

from firecamp.

Comments (30)

JuniusLuo commented on May 18, 2024 1

Sounds good. We could add these 3 metrics.

cloudstax commented on May 18, 2024 1

Custom metrics are now supported for Cassandra. Closing this issue.

JuniusLuo commented on May 18, 2024

Cassandra metrics can be collected using nodetool, JConsole, or JMX. See Cassandra Monitoring. The Datadog blog you posted mentions the same methods, so Datadog would also collect the Cassandra metrics in one of these ways.

The initial plan is to follow the standard way (nodetool) to get Cassandra metrics, and send the metrics/alerts to CloudWatch Metrics/Alarms. We will not add any additional library to the Cassandra container. We could have a general policy-based framework. The framework allows the customer to customize the policy, such as the metric collection interval, the metrics to collect, etc. The framework will schedule the task(s) accordingly. The task(s) will use nodetool to connect to the Cassandra nodes, get the metrics and send them to CloudWatch. The task ends after that, so it only consumes resources while it is running. Each service could define its own monitoring task.

We will define the standard metrics/alarm APIs, and have one implementation for CloudWatch. In the future, we could easily add implementations for Azure/GCP, as well as other implementations, which may use TICK.
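
A minimal sketch of how such a policy plus a cloud-neutral metrics API could fit together. Every name here (`MonitorPolicy`, `MetricsBackend`, `InMemoryBackend`, `run_once`) is hypothetical and only illustrates the idea; none of it is firecamp code, and the in-memory backend merely stands in for a real CloudWatch implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class MonitorPolicy:
    """Customer-tunable policy (hypothetical field names)."""
    interval_seconds: int = 60
    metric_names: list = field(default_factory=list)


class MetricsBackend(ABC):
    """Cloud-neutral API; one subclass per provider (CloudWatch, Azure, GCP, ...)."""

    @abstractmethod
    def put_metric(self, service: str, name: str, value: float) -> None:
        ...


class InMemoryBackend(MetricsBackend):
    """Stand-in for a CloudWatch implementation; it only records what was sent."""

    def __init__(self):
        self.sent = []

    def put_metric(self, service, name, value):
        self.sent.append((service, name, value))


def run_once(policy, collect, backend, service):
    """One run of the monitoring task: collect the metrics the policy selects, then forward them."""
    for name in policy.metric_names:
        backend.put_metric(service, name, collect(name))


policy = MonitorPolicy(interval_seconds=60, metric_names=["Load"])
backend = InMemoryBackend()
# The lambda replaces a real collector (nodetool, JMX, ...) with a fixed value.
run_once(policy, lambda name: 1234.0, backend, "mycas")
print(backend.sent)
```

A scheduler would call `run_once` every `interval_seconds`; swapping the backend class is the only change needed to target another provider.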

jazzl0ver commented on May 18, 2024

Thanks for the detailed explanation of your point! I agree that injecting a side library does not sound great, but:

The Jolokia library is a wrapper for JMX counters. It just adds an option to collect them through simple HTTP requests. No local Java is needed to query them, so the service that queries C* metrics needs far fewer resources. And it's open source.
nodetool is a Java app, so it starts slowly and is unable to query multiple metrics at once.
nodetool's output has to be parsed before the metrics can be injected into CloudWatch. How are you going to deal with that? Also, the output might change slightly in future C* releases, which would require extra work to adapt the parser.
I'm not sure DataStax uses nodetool to collect metrics for OpsCenter, since it installs a proprietary agent on every C* node and might query that agent to get metrics.
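
To illustrate the Jolokia point, here is a rough sketch of a bulk read: several MBeans requested in one HTTP POST, with the JSON answer mapped straight to values and no text parsing. The host name is an assumption, 8778 is the default port of the Jolokia JVM agent, and the sample response is hand-written for the example rather than captured from a real node. No network call is made.

```python
import json

# Assumed endpoint; the payload below would be POSTed here.
JOLOKIA_URL = "http://cassandra-node:8778/jolokia/"


def bulk_read_payload(mbeans):
    """Build one Jolokia bulk request: a JSON array of read operations."""
    return json.dumps([{"type": "read", "mbean": m} for m in mbeans])


def parse_bulk_response(body):
    """Map each MBean name to its value, skipping entries whose status is not 200."""
    values = {}
    for item in json.loads(body):
        if item.get("status") == 200:
            values[item["request"]["mbean"]] = item["value"]
    return values


mbeans = [
    "org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
    "org.apache.cassandra.metrics:type=Storage,name=Load",
]
payload = bulk_read_payload(mbeans)

# Hand-written sample in the shape of a Jolokia bulk response (not real output).
sample = json.dumps([
    {"request": {"type": "read", "mbean": mbeans[0]},
     "value": {"Count": 10, "Mean": 512.5}, "status": 200},
    {"request": {"type": "read", "mbean": mbeans[1]},
     "error": "not found", "status": 404},
])
values = parse_bulk_response(sample)
print(values)
```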

Regarding the monitoring task: do you really think it's a good idea to continuously start and stop it? For example, with a 1-minute metrics collection interval, it seems to me it might be started again just a moment after it was stopped, simply because a lot of metrics must be collected, processed and injected into CloudWatch.

JuniusLuo commented on May 18, 2024

Good point! There are many existing monitoring solutions. We will explore them first, and leverage open source solutions as much as possible. We will only consider building our own solution if we cannot find a suitable one.

Jolokia is a good tool. It actually supports the proxy mode. We could test to see if the proxy mode works for Cassandra.

Telegraf is a good project. It supports collecting metrics from many services, and can send the metrics to CloudWatch. It may be a better framework than CollectD. This blog has a good comparison.

JuniusLuo commented on May 18, 2024

For the monitoring task, it would be OK to keep it running. "Monitoring service" would be a better name than "task". This assumes a framework such as Telegraf has only a small memory footprint.

It would not be a problem to keep the task short-lived either. Collecting the metrics of one Cassandra or other service node will be fast unless something goes wrong, for example Cassandra itself being stuck in GC. The metrics data will be small, and processing and sending it to a backend like CloudWatch would be fast as well. Metric collection and handling would not take more than a few seconds. But a short-lived task requires more work in the scheduling framework. We could start with the long-running monitoring service first.

JuniusLuo commented on May 18, 2024

It turns out adding Jolokia into the Cassandra container is the simplest way. Telegraf is also supported. Monitoring of Cassandra, Redis and ZooKeeper is supported. You could create a Telegraf service for the Cassandra service and see the metrics on CloudWatch. Please take a look and share your comments/suggestions.

Note: currently Cassandra keyspaces and tables are not monitored. The system keyspaces introduce more than 1000 metrics. Further enhancements will be added to monitor the user keyspaces.

jazzl0ver commented on May 18, 2024

That's great news!! What is the upgrade path? If possible, I would rather not re-create our Cassandra services.

Please add a Telegraf service creation tutorial to the Wiki.

And how can Telegraf's container memory be restricted?

JuniusLuo commented on May 18, 2024

Yep, upgrade will be supported for services created in 0.9.4 and 0.9.3.

Telegraf itself does not restrict its memory. We could leverage the container max memory/cpu limits. You could set max-memory and max-cpuunits when creating the Telegraf service. This sets the max memory and cpu for the container. If Telegraf exceeds the max memory, the container will be killed.

jazzl0ver commented on May 18, 2024

Could you please share the options for setting max-memory in the Telegraf service creation command?

JuniusLuo commented on May 18, 2024

"max-memory" and "max-cpuunits"

JuniusLuo commented on May 18, 2024

Looks like the CLI help does not include these options. Will add them.

jazzl0ver commented on May 18, 2024

Have you updated the manage server and cli? Looks like the cli is still the old one:

-rwxrwxr-x junius/junius 7648808 2018-03-14 04:32 firecamp-service-cli

JuniusLuo commented on May 18, 2024

The CLI does include the "max-memory" option. It is only the help output, such as firecamp-service-cli -op=create-service --help, that does not show it. You could still use it.

jazzl0ver commented on May 18, 2024

I'm sorry, I meant that the telegraf service type is missing:

# ./firecamp-service-cli -region=us-east-1 -cluster=firecamp-prod -op=create-service -help
Usage: firecamp-service-cli -op=create-service
...
  -service-type string
        The catalog service type: mongodb|postgresql|cassandra|zookeeper|kafka|kafkamanager|redis|couchdb|consul|elasticsearch|kibana|logstash
...

JuniusLuo commented on May 18, 2024

oops, uploaded the latest cli.

jazzl0ver commented on May 18, 2024

Works very well, thank you!

1. It would be great to have some storage metrics, like free/total space available.
2. It would also be great to have some summarized metrics per cluster (like Latency, for example). Is that possible?

JuniusLuo commented on May 18, 2024

Unfortunately, Cassandra does not provide them. For #1, the Cassandra storage load metric provides "Total disk space used (in bytes) for this node", but does not provide the free space for the node. In a later release, we could integrate with CloudWatch to create an alarm when the used space reaches some threshold of the total data volume size.
For #2, Cassandra only provides per-node metrics; it does not aggregate across all nodes. You could easily create a dashboard showing "cassandraClientRequest_Latency_Mean" for all nodes. Would this be enough?
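
For illustration only, both ideas can be sketched outside Cassandra: a used-space check against the data volume size, and a cluster-wide latency derived from per-node values. The function names, the 80% threshold and the sample numbers are assumptions made up for this example. The cluster mean is weighted by each node's request count, since a plain average of per-node means would over-represent idle nodes.

```python
def used_space_alarm(used_bytes, volume_bytes, threshold=0.8):
    """True when the used space reaches the given fraction of the data volume size."""
    return used_bytes / volume_bytes >= threshold


def cluster_mean_latency(per_node):
    """per_node maps node name -> (request count, mean latency).

    Returns the request-weighted mean latency over all nodes.
    """
    total = sum(count for count, _ in per_node.values())
    if total == 0:
        return 0.0
    return sum(count * mean for count, mean in per_node.values()) / total


# Made-up sample values, one entry per Cassandra node.
nodes = {"node-0": (100, 2.0), "node-1": (300, 4.0), "node-2": (0, 0.0)}
cluster_mean = cluster_mean_latency(nodes)
alarm = used_space_alarm(used_bytes=85, volume_bytes=100)
print(cluster_mean, alarm)
```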

jazzl0ver commented on May 18, 2024

Yeah, that's a good solution, thank you! I'll create a separate issue for CloudWatch alarm on the used space.

jazzl0ver commented on May 18, 2024

Are you aware why some metrics are not available?
For example, Streaming metrics (http://cassandra.apache.org/doc/latest/operating/metrics.html)

JuniusLuo commented on May 18, 2024

Yes, not all metrics are monitored, such as Streaming metrics, CQL metrics, DroppedMessage metrics, etc. If you think some metrics are important and want them added, please let us know. Thanks.

jazzl0ver commented on May 18, 2024

Well, I think the aggregation of metrics across all keyspaces and tables is good to have. Streaming and DroppedMessage metrics also seem important.

jazzl0ver commented on May 18, 2024

It would be great to have an option to update the list of currently fetched metrics according to one's requirements. For example, at the moment around 100 metrics per node are fetched from Cassandra while I need just a few.

JuniusLuo commented on May 18, 2024

There are lots of Cassandra metrics. How do you want to configure it?

I am not sure this is really necessary. Collecting 100 metrics per node would not impact Cassandra, as the metric data is very small. If you only care about a few, you could easily filter them on CloudWatch. It would be better to collect the important metrics, so that when something goes wrong, we may get some hints from them.

jazzl0ver commented on May 18, 2024

It's not about impacting C*, it's about money: custom metrics cost ($0.30 x 100 = $30 per node per month) and it's not very wise to pay for the things you don't really need.
I thought we could have a file with the list of metrics (one per line) that could be uploaded to the Telegraf service thru a firecamp-manager-cli update call. It would replace the current metrics list with the new one. Another "get" call might return the current list of metrics.
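
The cost argument can be spelled out as a small calculation. It uses the $0.30 per custom metric per month quoted above as its only price input; the function name and the three-metric comparison are just illustrative.

```python
PRICE_CENTS = 30  # per custom metric per month, i.e. the $0.30 quoted above


def monthly_cost_usd(metrics_per_node, nodes=1):
    """Monthly custom-metric cost in USD; integer cents avoid float rounding noise."""
    return metrics_per_node * nodes * PRICE_CENTS / 100


all_metrics = monthly_cost_usd(100)       # every metric currently fetched
few_metrics = monthly_cost_usd(3)         # only a handful of selected metrics
three_nodes = monthly_cost_usd(100, nodes=3)
print(all_metrics, few_metrics, three_nodes)
```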

JuniusLuo commented on May 18, 2024

I see. This makes sense. CloudWatch is not cheap.

JuniusLuo commented on May 18, 2024

The commit is in. You could put all the custom metrics in one file, and pass the file with "-tel-metrics-file=pathtofile" when creating the service.

Please pay attention to the data format in the metrics file. Each line holds one metric. Every metric must be enclosed in quotation marks and end with a comma, except the last metric, which must not end with a comma. Example:

    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
    "/org.apache.cassandra.metrics:type=Storage,name=Load"
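
As a convenience, a file in this format could be generated from a plain list instead of being edited by hand. This is only a sketch of the formatting rule described above; the helper name `format_metrics_file` is made up and is not part of the firecamp CLI.

```python
def format_metrics_file(metrics):
    """Quote every metric and end each line with a comma, except the last one."""
    return ",\n".join('"%s"' % m for m in metrics)


metrics = [
    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency",
    "/org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency",
    "/org.apache.cassandra.metrics:type=Storage,name=Load",
]
content = format_metrics_file(metrics)
print(content)
```

The resulting text can be written to a file and passed through the "-tel-metrics-file" option mentioned above.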

JuniusLuo commented on May 18, 2024

We just published 0.9.5 release, which supports Telegraf. You could try the latest firecamp quickstart.

If you have a Cassandra service in the 0.9.4 release, you could follow the upgrade guide to upgrade the cluster. There is one limitation, though: you have to stop all services before the upgrade. The upgrade takes around 10 minutes. Upgrade will be further enhanced in the next release.

jazzl0ver commented on May 18, 2024

That's great! Thanks for the implementation as well as for the upgrade feature!
If I'm running the "latest" release, what are my steps to upgrade correctly?

JuniusLuo commented on May 18, 2024

Upgrade is not supported for the "latest" release. There is no way to know what needs to be upgraded between commits of the latest release.
