scalaconsultants / mesmer

OpenTelemetry agent for Scala applications

Home Page: https://mesmer.io

License: Apache License 2.0

Scala 80.86% Java 17.06% JavaScript 1.58% CSS 0.50%
akka javaagent opentelemetry scala zio

mesmer's Introduction

Mesmer

Mesmer is an OpenTelemetry instrumentation library for Scala applications.

Compatibility:

  • Scala: 2.13.x
  • JVM: 11+

See the docs for more information.

Contributors

Local testing

The examples subproject contains a test application that uses Akka Cluster. Go here for more information.

Contributor setup

  1. You're encouraged to use the sbt native client. It will speed up your builds and your pre-commit checks (below). Just set export SBT_NATIVE_CLIENT=true and sbt will use the native client.
  2. Install pre-commit
  3. Run pre-commit install
  4. If you're using IntelliJ IDEA:
    • Download "google-java-format" plugin and use it
    • Go to "Editor" -> "Code Style" -> "YAML". Uncheck "Indent sequence value" and "Brackets" (in the "Spaces" menu)

Documentation

The Mesmer project uses Docusaurus v2 with mdoc to produce type-checked documentation. Everything is configured with the sbt-mdoc plugin according to this document.

There are 3 directories relevant to the process:

  • website/ - Docusaurus application
  • docs/ - markdown pages with the documentation
  • mesmer-docs/ - markdown pages compiled by mdoc

To run Docusaurus locally:

  • install node (version >= 14) and yarn
  • go to the "website" directory:
cd website
  • run the following:
yarn
yarn run start

To see the documentation changes in your running Docusaurus instance you need to recompile with the following command:

sbt docs/mdoc

This will put the compiled pages into mesmer-docs/target/mdoc, where Docusaurus can pick them up (the location where Docusaurus looks for these pages is configured in website/docusaurus.config.js).

The homepage (in case you need to make changes to it) resides in website/src/pages/index.js.

mesmer's Issues

PoC: is it possible to instrument akka metrics without actors, using otel api only?

The OpenTelemetry API defines spans, context and instruments that in theory could be used directly in Byte Buddy instrumentations. The PoC should try instrumenting metrics this way and show whether, and how, the result differs.

Best possible outcome/PoC Goal: we use only otel API and adding metrics is a lot easier. It is also possible to instrument other libraries without the use of akka actors.

Bonus: Try to instrument other libraries, e.g. ZIO + Quill + Otel Agent. EDIT: This was moved to another issue: #352

Refactor agent classes

  1. Agent - it is clearly a Monoid itself, yet we wrap it in the Option[] monoid, which does not seem to make sense. Option[Agent] could be flattened everywhere to just Agent (with a value of either a working agent or Agent.empty).

  2. We don't have to check ifSupported every time we define a new agent instance; we can do it once at the beginning and then short-circuit if the agent is not supported. None of the ifSupported implementations rely on a specific instrumentation. They only take jar versions into account. Moreover, after we migrate to the otel extension, classpath scanning will not be needed at all, so this will become redundant.

  3. Note there is also #301 that should simplify how the config is checked. So it seems that this issue should be done after #301 is done/together with it.

@worekleszczy let's discuss the above findings before implementing #301
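
A minimal sketch of the flattening proposed in point 1, assuming a heavily simplified agent. AgentSketch, its fields and the ifSupported signature are illustrative stand-ins, not Mesmer's actual API. It shows how an explicit empty value removes the need for Option[Agent]:

```scala
// Hypothetical, simplified stand-in for Mesmer's Agent: the real class carries
// instrumentation definitions, here reduced to a set of names.
final case class AgentSketch(instrumentations: Set[String]) {
  // Combining two agents merges their instrumentations.
  def ++(other: AgentSketch): AgentSketch =
    AgentSketch(instrumentations ++ other.instrumentations)
}

object AgentSketch {
  // Identity element: combining with it changes nothing.
  val empty: AgentSketch = AgentSketch(Set.empty)

  // Support is decided once, up front; unsupported agents collapse to `empty`
  // (assumed signature, for illustration only).
  def ifSupported(supported: Boolean)(agent: => AgentSketch): AgentSketch =
    if (supported) agent else empty

  // No Option needed: folding from `empty` covers the "no agent" case.
  def combineAll(agents: Seq[AgentSketch]): AgentSketch =
    agents.foldLeft(empty)(_ ++ _)
}
```

Callers then combine agents unconditionally, and an unsupported agent simply contributes nothing.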

Error when starting the example app

When I start the example according to the documentation on Windows 10 (sbt "project example" runWithAgent), I get an error:
running (fork) example.Boot
[error] Exception in thread "main" java.lang.reflect.InvocationTargetException
[info] FATAL ERROR in native method: processing of -javaagent failed
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.base/java.lang.reflect.Method.invoke(Method.java:566)
[error] at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:513)
[error] at java.instrument/sun.instrument.InstrumentationImpl.loadClassAndCallPremain(InstrumentationImpl.java:525)
[error] Caused by: java.lang.NoClassDefFoundError: com/typesafe/config/ConfigFactory
[error] at io.scalac.mesmer.agent.Boot$.premain(Boot.scala:23)
[error] at io.scalac.mesmer.agent.Boot.premain(Boot.scala)
[error] ... 6 more
[error] Caused by: java.lang.ClassNotFoundException: com.typesafe.config.ConfigFactory
[error] at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
[error] at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
[error] at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
[error] ... 8 more
[error] Nonzero exit code returned from runner: 1
[error] (Compile / run) Nonzero exit code returned from runner: 1

HOSTNAME required to start example application

When HOSTNAME is not set as an environment variable, my example Akka cluster fails to start because it cannot find seed nodes. I think we should provide a sensible default for it and then override the value if the environment variable is set. Something like:

  remote {
    artery {
      enabled = on
      transport = tcp
      canonical.hostname = ${?HOSTNAME}
      canonical.port = 2551
    }
  }
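
In HOCON the usual way to get this behaviour is a literal default line followed by the optional ${?HOSTNAME} override line. The same default-then-override rule can be sketched in plain Scala. The "127.0.0.1" default and the HostnameDefault name are assumptions made for illustration:

```scala
object HostnameDefault {
  // `env` is passed in so the lookup can be exercised without touching the real
  // environment; in an application it would be `sys.env`.
  // The default value is an assumption, not something the project prescribes.
  def resolve(env: Map[String, String], default: String = "127.0.0.1"): String =
    env.get("HOSTNAME").filter(_.nonEmpty).getOrElse(default)
}
```

With this rule an unset or empty HOSTNAME falls back to the default instead of leaving the setting undefined.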

example/docker/docker-compose.yaml does not start the otel collector properly

The otel collector fails to start due to:

mesmer_example_otel_collector  | 2021-12-13T12:44:13.097Z       info    service/collector.go:303        Starting otelcol...     {"Version": "v0.33.0-50-g0594aa1a", "NumCPU": 5}
mesmer_example_otel_collector  | 2021-12-13T12:44:13.101Z       info    service/collector.go:242        Loading configuration...
mesmer_example_otel_collector  | Error: cannot load configuration: unknown extensions type "health_check" for health_check
mesmer_example_otel_collector  | 2021/12/13 12:44:13 collector server run finished with error: cannot load configuration: unknown extensions type "health_check" for health_check
mesmer_example_otel_collector exited with code 1

Some initial investigation shows that it's due to the fact that the health_check extension was moved to: https://github.com/open-telemetry/opentelemetry-collector-contrib

The PR that moved the extension: open-telemetry/opentelemetry-collector-contrib#4894

Possible solutions:

  1. use the image otel/opentelemetry-collector-contrib-dev in docker-compose.yaml. BTW: should we keep using the latest version, as we do now?
  2. stop using the healthcheck extension

Add formatting check in a git hook and document it

As the title states. In the future this can be improved so that the formatting is automatically applied during commit, but this is not necessary right now.

Additional item: add the "contributor setup" section in README.md to describe what needs to be done in order to have all the checks.

Exporter push interval time vs cluster ping offset

I started wondering what happens if we set the export interval to NR to 5 sec but the ping offset is 10 sec. The correct behavior would be for the system to present stale cluster state on every 2nd push to NR, but I fear that currently we would push no data, which will confuse users. I guess this requires more investigation, but if we prove that this is the case I'd suggest using opentelemetry's LongValueObserver as a solution.

Don't rely on sorting a set of AgentInstrumentations when it comes to order of the instrumentation installation

In the Agent.installOn method we determine the order of installation by sorting a set. I think we need a more explicit way of defining the order, because right now it's very brittle and prone to refactoring errors.

Example error: it's enough to change the hashCode implementation of the AgentInstrumentation and the order is completely different. Imho changing the order should not be that easy.
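
One possible shape for a more explicit order, as a sketch only. InstrumentationSketch and its priority field are hypothetical and do not exist in Mesmer. The idea is that the order is derived from declared data rather than from hashCode or Set iteration order:

```scala
// Hypothetical stand-in for AgentInstrumentation; `priority` is an assumed field.
final case class InstrumentationSketch(name: String, priority: Int)

object InstallOrder {
  // The order depends only on the declared priority, with the name as a
  // tie-breaker, so changing hashCode cannot reshuffle the installation.
  def installationOrder(all: Set[InstrumentationSketch]): List[InstrumentationSketch] =
    all.toList.sortBy(i => (i.priority, i.name))
}
```

Two instrumentations with the same priority still get a stable relative order because of the name tie-breaker.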

Optimization: Actors Tree Server

The problem: traversing the actors tree isn't a cheap operation.
The idea: build an actor that serves the actors tree. It traverses the tree periodically and serves it to the other actors in a request-response fashion.
Further work:

  • Serialization issues;
  • Storage impact;
  • Predefined specialized queries on the server (e.g., number of elements)

Make the agent exporter configurable

I can't seem to find where the java agent can be configured to use different protocols and ports. It would be nice if the java agent could take command line arguments like the core OpenTelemetry java agent does [https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/agent-config.md] so we can at least set the otlp protocol and port used between the agent and the collector.

Rename Counter / UpCounter to follow conventions

For now we use UpCounter for monotonically incrementing counters and Counter for gauges. This might confuse users: OT uses UpDownCounter for gauges and Counter for monotonically increasing values. We could follow their convention or the Prometheus one (counters and gauges).

Add automated regression tests (research task)

Mesmer currently has no automated way of detecting if some metrics are missing/broken for some reason. We could use some automated solution to detect such situations and run them on every PR/release.

No persistency data in dashboards

After running the example project on current main, generating traffic for the /api/v1/account/UUID/deposit/10.0 endpoint and opening a Grafana dashboard, no data is shown on the page.

Configuration always set to default values due to naming error

When MesmerModule gets initialized, the mesmerConfig value gets initialized to "module.null". Later in the code this value is used as the path to a config section in the application.conf file. Since there is no such config entry, default values are always used. Effectively, the modules are not configurable.
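
One common way a value ends up as "module.null" in Scala is trait initialization order: an eager val in a trait is computed before the subclass has assigned its own fields. Whether this is the actual cause here is an assumption. The sketch below uses hypothetical names and only demonstrates the mechanism and one possible remedy:

```scala
// Hypothetical reduction; the trait and class names are illustrative only.
trait ModuleSketch {
  def name: String
  // Evaluated while the trait is initialized, before subclasses assign their fields.
  val eagerConfigPath: String = s"module.$name"
  // Evaluated on first access, after construction has finished.
  lazy val lazyConfigPath: String = s"module.$name"
}

final class AkkaHttpModuleSketch extends ModuleSketch {
  // Assigned only after ModuleSketch's initializer has already run.
  val name: String = "akka-http"
}
```

Here eagerConfigPath sees name while it is still null, whereas lazyConfigPath is computed after construction and sees the real value. Declaring the path as a def would have the same effect.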

Create mesmer otel extension with HttpExtRequestsAdvice

This is an inaugural task for: #272

Goals/TODO:

  • Create new "instrumentations" module and make it visible for the mesmer agent
  • Move HttpExtRequestsAdvice and all necessary instrumentation code to it. The Mesmer agent should not lose the existing functionality provided by this advice (a refactoring only, so far)
  • Add an "otel-extension" module that also depends on "instrumentations".
  • Use the HttpExtRequestsAdvice in TypeInstrumentation and InstrumentationModule
  • Configure the instrumentation via otel config (not via akka configuration). The configuration should allow turning request-related metrics on/off just as it is possible thanks to application.conf now. Document the configuration. => See #288 - there's a bug
  • Add a README.md file describing how to run Mesmer as an otel extension

Investigate potential data loss with unbinds

What is the correct outcome after calling unbind? If we destroy the whole history for a counter, this would mean counting starts again from 0. Prometheus can deal with this scenario, but what about other vendors?

Rename API_KEY environment variable

I would like to recommend us to rename API_KEY environment variable for New Relic's API key. It's a pretty common name and can conflict with other clients' settings.

WDYT?

Reorganize project structure after decommissioning the Mesmer Agent

This discussion sums it up: #295 (comment)

  • otel-extension should only hold extension related code. It should not depend on the agent module
  • instrumentations should only have the instrumentation code that's utilized by agent and otel-extension
  • agent should have only mesmer agent code and is a subject to deletion in the future (once we fully migrate to otel agent)

Potential race condition in CachedQueryResult

Currently we have double-checked locking, but the implementation seems to be wrong:

class CachedQueryResult[T] private (q: => T, validBy: FiniteDuration = 1.second) {
  private var lastUpdate: Option[Timestamp] = None
  private var currentValue: Option[T]       = None

  def get: T = {
    // Disclaimer: this double check exists to:
    // 1. have more throughput when update is not needed
    // 2. ensure secure updates
    if (needUpdate) { // if one thread access this it's possible that lastUpdate was updated but currentValue was not
      synchronized { // everything in synchronized block is allowed to be reordered by JMM
        if (needUpdate) {
          lastUpdate = Some(now)
          currentValue = Some(q)
        }
      }
    }
    currentValue.get // can return stale data
  }

  private def needUpdate: Boolean = lastUpdate.forall(lu => now > lu + validBy)
  private def now: Timestamp      = Timestamp.create()
}
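
A possible direction for a fix, sketched under assumptions: the value and its timestamp are kept in one immutable pair and published through a single @volatile reference, so a reader can never observe a fresh timestamp paired with a stale value. CachedQuerySketch, the injectable nowNanos clock and the use of nanosecond timestamps instead of Mesmer's Timestamp type are illustrative choices, not the project's implementation:

```scala
import scala.concurrent.duration._

// Sketch only: value and timestamp travel together in one immutable pair.
final class CachedQuerySketch[T](query: () => T,
                                 validBy: FiniteDuration = 1.second,
                                 nowNanos: () => Long = () => System.nanoTime()) {

  // (value, time the value was computed); @volatile gives safe publication.
  @volatile private var snapshot: Option[(T, Long)] = None

  // Keeps the snapshot only if it is still within the validity window.
  private def fresh(s: Option[(T, Long)]): Option[(T, Long)] =
    s.filter { case (_, takenAt) => nowNanos() - takenAt <= validBy.toNanos }

  def get: T =
    fresh(snapshot) match {
      case Some((value, _)) => value // fast path, no lock taken
      case None =>
        synchronized {
          fresh(snapshot) match { // second check, now under the lock
            case Some((value, _)) => value
            case None =>
              val value = query()
              snapshot = Some((value, nowNanos())) // one volatile write publishes both
              value
          }
        }
    }
}
```

Every call returns a value read from a single snapshot, so the result is at most validBy old, apart from the time spent running the query itself.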

Blogpost summarising our release

Since the release contains a direction shift to opentelemetry agent, it makes sense to write a blogpost describing the shift and next plans for the project.

Scala 2.12 support

In the codebase we use scala.jdk.* converters that will limit our capability for scala 2.12 support. One solution for this would be to write a wrapper for those converters and implement it separately for 2.12 and 2.13.
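
A sketch of what the 2.13 side of such a wrapper could look like. The CollectionCompat object and its method names are hypothetical. The assumption is that this file lives in a version-specific source directory such as src/main/scala-2.13, and that a twin in src/main/scala-2.12 exposes the same methods by delegating to scala.collection.JavaConverters:

```scala
// Scala 2.13 variant of a hypothetical compatibility wrapper.
import scala.jdk.CollectionConverters._

object CollectionCompat {
  // Copies a Java list into an immutable Scala List.
  def asScalaList[A](javaList: java.util.List[A]): List[A] =
    javaList.asScala.toList

  // Wraps a Scala Seq as a java.util.List view.
  def asJavaList[A](scalaSeq: Seq[A]): java.util.List[A] =
    scalaSeq.asJava
}
```

The rest of the codebase would then import only CollectionCompat, keeping the version-specific imports in one place. Other scala.jdk converters (options, durations, functions) would each need a similar wrapper.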

The example confuses me

The work you guys are doing is great, there is no doubt about it.

I'm trying to use this repo to monitor my Akka application. The example worked well, but what confuses me is the OTLP exporter in the example.

Why not just use the Prometheus exporter? It should be simple and easier to read the code, and the docker container environment could be simpler.

If needed, I can push a commit to "fix" the issue, just like my application does.

Error on example project startup

Steps to reproduce:

cd example/docker; docker-compose up

Then, in the project root:

sbt "project example" runWithAgent

This at some point yields the following console log:

[error] Could not load Logmanager "wvlet.log.AirframeLogManager"
[error] java.lang.ClassNotFoundException: wvlet.log.AirframeLogManager
[error]         at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
[error]         at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
[error]         at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
[error]         at java.logging/java.util.logging.LogManager$1.run(LogManager.java:239)
[error]         at java.logging/java.util.logging.LogManager$1.run(LogManager.java:223)
[error]         at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
[error]         at java.logging/java.util.logging.LogManager.<clinit>(LogManager.java:222)
[error]         at java.logging/java.util.logging.Logger.demandLogger(Logger.java:649)
[error]         at java.logging/java.util.logging.Logger.getLogger(Logger.java:718)
[error]         at java.logging/java.util.logging.Logger.getLogger(Logger.java:702)
[error]         at io.grpc.ManagedChannelRegistry.<clinit>(ManagedChannelRegistry.java:40)
[error]         at io.grpc.ManagedChannelProvider.provider(ManagedChannelProvider.java:41)
[error]         at io.grpc.ManagedChannelBuilder.forTarget(ManagedChannelBuilder.java:76)
[error]         at io.opentelemetry.exporter.otlp.internal.grpc.DefaultGrpcExporterBuilder.build(DefaultGrpcExporterBuilder.java:117)
[error]         at io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporterBuilder.build(OtlpGrpcMetricExporterBuilder.java:128)
[error]         at io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporter.getDefault(OtlpGrpcMetricExporter.java:30)
[error]         at example.Boot$.initOpenTelemetryMetrics(Boot.scala:47)
[error]         at example.Boot$.startUp(Boot.scala:77)
[error]         at example.Boot$.delayedEndpoint$example$Boot$1(Boot.scala:126)
[error]         at example.Boot$delayedInit$body.apply(Boot.scala:33)
[error]         at scala.Function0.apply$mcV$sp(Function0.scala:39)
[error]         at scala.Function0.apply$mcV$sp$(Function0.scala:39)
[error]         at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:17)
[error]         at scala.App.$anonfun$main$1(App.scala:76)
[error]         at scala.App.$anonfun$main$1$adapted(App.scala:76)
[error]         at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
[error]         at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
[error]         at scala.collection.AbstractIterable.foreach(Iterable.scala:919)
[error]         at scala.App.main(App.scala:76)
[error]         at scala.App.main$(App.scala:74)
[error]         at example.Boot$.main(Boot.scala:33)
[error]         at example.Boot.main(Boot.scala)

Probably a missing dependency that needs to be added to example deps.

[discussion] OpenTelemetry

@jczuchnowski @worekleszczy Hi!

I've been looking at the OpenTelemetry docs and some examples, and right now I'm not really sure if it's the right tool for our case. Let's discuss it.

According to our requirements (and related PoC dashboard for panopticon that we've been working on) we want to provide insight into events occurring in a cluster on membership level (see joining and leaving nodes) and cluster status (like provided by Akka Cluster HTTP Management endpoints), so cluster monitoring functionality.

OpenTelemetry provides us with tracing for requests by collecting traces across services (e.g. which services a request visited and how much time it spent there), metrics for quantitative measurements (like avg. request duration, request sizes, number of failures) (here's the official specification of the metrics' instruments), and logs, which should be available in the future according to the official site:

OpenTelemetry will not initially support logging, though we aim to incorporate this over time.

I can't see how the current version can fit our case of collecting custom, inspectable events about cluster condition and displaying the current cluster status. I think our cluster monitoring is a specific case that doesn't match the current OpenTelemetry.
But I could have misunderstood it. WDYT?

P.S. In comparison, the Grafana dashboard provided by Lightbend Cinnamon in the Cluster Fundamentals course, for the Akka Cluster tab, is comprised of a "reachable/unreachable nodes" plot and a "Split brain resolver events" plot only, which I think we can easily recreate with events or Akka Cluster HTTP Management.

Questions

Hi 👋

sorry about the shitty title, I just have some questions regarding the architecture but on different levels, happy to separate them in any way if that's better for you.

What's the plan/motivation/bigger idea behind the separation into a JVM agent and an Akka extension?

  • Akka extension - that runs in the background and is responsible for exporting the metrics to your chosen backend
  • JVM agent - that instruments Akka classes to expose metrics for the extension

The OpenTelemetry Instrumentation for Java provides a JVM agent that does both, it hooks into Akka and you can expose exporters like Prometheus. The setup was trivial, we're currently running it in our system but only for the tracing for now, haven't looked into the metrics yet.

It seems Mesmer provides more metrics which would be nice of course, and especially this makes me think:

As Mesmer uses OpenTelemetry underneath [...]

Theoretically, could the core of Mesmer be wrapped into an extension for their agent 🤔

Besides that, I'm a bit confused by this whole paragraph because I don't understand Mesmer's intended architecture (the provided overview isn't detailed enough):

As Mesmer uses OpenTelemetry underneath to export data to metric backend you need to set up an exporter. All exporters require OpenTelemetry SDK present, so make sure you have one added to your project - without this all measurement operations will be NoOp. You can for example set up your project with OTLP metrics exporter that also includes the SDK

While writing this text I'm now understanding more and more but it's quite hard to see where Mesmer fits in, a detailed architecture or data flow diagram would really help.

Hope you can help out 🙏

Remove deprecated & failing deploy job

The deploy job was used to deploy Mesmer on AWS so that manual testing could be done. Since there is no need for this now, let's delete it so we don't waste resources.

reinstate/modify/remove max metric values in Grafana dashboard

After bumping the opentelemetry version to version > 1.6 we no longer have a value recorder. It was replaced with histograms. The problem is that we do not interpret the data provided by the histograms properly in the grafana dashboard. This is why all the "max" value graphs show "no data".

Goal: investigate what we should do. Should we:

  • remove the max graphs?
  • (or) fix them somehow

Fixing/removing the graphs is part of the solution.

No metrics getting exported

Hello!

I set up the example on my system and was able to see the Akka metrics generated by Mesmer and exported to Prometheus via the OTel collector. However, when I try exactly the same setup with our Akka application, no metrics get exported. The logs seem to be very similar to the example application (I can see the ByteBuddy & Mesmer extension log messages). I also tried the logging exporter instead of the OTel exporter, but again 0 metrics were logged. So it doesn't seem to be an exporting issue. I'm not sure whether the metrics are being generated and propagated the way they should. How can I debug this further?

Correct documentation

Hi,

in your documentation the following needs to be corrected:

akka.actor.typed.extensions= ["io.scalac.mesmer.extension.Mesmer"] --> this class does not exist. It should read:

akka.actor.typed.extensions= ["io.scalac.mesmer.extension.AkkaMonitoring"]

Fix README.md after opentelemetry version bump (1.7.0)

For sure, the exporter initialization is now outdated and is done differently:

IntervalMetricReader
  .builder()
  .setMetricExporter(metricExporter)
  .setMetricProducers(Collections.singleton(meterProvider))
  .setExportIntervalMillis(exportInterval)
  .buildAndStart()

see: https://github.com/ScalaConsultants/mesmer-akka-agent/blob/0b9207fb27932b483c6dba694679a66a1c1a29c6/example/src/main/scala/example/Boot.scala#L49

Objective: Try to find other places like this and fix them along with the one mentioned above.

artifact is missing from Maven central

It appears that there is a problem publishing the akka core artifact to Maven repository. Perhaps this is from recent repackaging from io.scalac to io.scalac.mesmer? Haven't tried an SBT build yet, but with Maven:

 Could not find artifact io.scalac:mesmer-akka-core_2.13:jar:0.4 in central (https://repo.maven.apache.org/maven2)

When I look in Maven central, this artifact is indeed missing.

Akka Static Dispatcher Metrics

Implement the following static akka dispatcher metrics

  1. Executor threads min
  2. Executor threads max
  3. Executor parallelism

Should be viewable in prometheus or even better in grafana

New configuration approach

  • Use opentelemetry configuration instead of akka config everywhere in our codebase
  • Make sure it's picked up properly (write tests for that)

Support Akka persistence in otel extension

Similar to #295. The task is to enable akka persistence metrics in our otel extension. It can be done similarly to: #291.

NOTE: please enable -Dotel.javaagent.debug=true when running with the otel agent as there might be some errors (given my initial investigation).
