open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains

License: Apache License 2.0



OpenTelemetry Semantic Conventions


Semantic Conventions define a common set of (semantic) attributes which provide meaning to data when collecting, producing and consuming it.

Read the docs

The human-readable version of the semantic conventions resides in the docs folder. Major parts of these Markdown documents are generated from the YAML definitions located in the model folder.

Contributing

See CONTRIBUTING.md

Approvers (@open-telemetry/specs-semconv-approvers):

Find out more about the approver role in the community repository.

Maintainers (@open-telemetry/specs-semconv-maintainers):

Find out more about the maintainer role in the community repository.

semantic-conventions's People

Contributors: alanwest, alexanderwert, arminru, bertysentry, bogdandrutu, carlosalberto, chalin, chrsmark, gregkalapos, jack-berg, joaopgrassi, jsuereth, jtmalinowski, justinfoote, lmolkova, mralias, mx-psi, oberon00, pyohannes, rakyll, reyang, sergeykanzhelev, songy23, tedsuo, thisthat, tigrannajaryan, trask, trisch-me, tylerbenson, zeitlinger


semantic-conventions's Issues

Semantic conventions for database should be explicit about parameters / placeholder values

What are you trying to achieve?

I would like to be able to see parameter values for SQL queries. I was about to raise an issue against the SQLAlchemy instrumentation, but noticed no specific guidance in the database conventions page.

The specs do mandate that the statement should be captured, but that statement may or may not be complete, depending on whether placeholders are used.

Additional context.

The full path (= with parameter values) is captured for HTTP requests, e.g. the flask instrumentation (for example) captures:

"http.route": "/pet/<int:helloid>"  # placeholders
"http.target": "/pet/123"  # parameterized

However, if that HTTP request triggers a SQL request, it will show up as:

"db.statement": "SELECT * FROM pets WHERE id = %(param_1)s"

This seems inconsistent.

In addition, parameter values are exposed in systems that don't use SQL syntax, and are shown explicitly in the Redis example in the docs ("HMSET myhash field1 'Hello' field2 'World'"), which increases confusion.

There might be security considerations (see also open-telemetry/oteps#100), but they are probably not fully mitigated as data not using placeholders will still appear in the trace (see also open-telemetry/opentelemetry-specification#1659).

I think in any case, there should be some guidance in the spec, even if it is the opposite of the recommendation I would personally want. The current spec does not mention placeholders at all, even though they seem to be commonly considered when making decisions in this bug tracker.

Add a PR check to enforce Schema file presence/content if a semantic convention is changed

To ensure the Schema files correctly capture changes we make to semantic conventions, I suggest adding an automatic check that verifies that any PR changing a semantic convention in a way that matters from a Schema file perspective also includes the corresponding Schema file change (and ideally verifies that the changes match).

The only supported Schema file change is currently the renaming of attributes, so the check needs to detect whether any attribute name is changed in a semantic convention YAML file and verify that a corresponding change is recorded in the Schema file.

DB Convention does not cover batch/multi/envelope operations

Our API has a small alphabet of relatively simple key-value operations (get, put, delete, etc.). For these, the operation names seem clear. We also have a set of operations that support bulk/batching of operations; these can be homogeneous or heterogeneous. For example, batch can accept any combination of get, put, delete, etc. We also support a generic system for server-side compare-and-mutate, where some predicate based on a query over existing data is provided, and when the predicate returns true, some operation is applied; that operation can be a simple or a batch operation. For these collections of heterogeneous operations, how should we annotate the span?

Better guidance on semantic conventions for database client call span names in case of missing information

What are you trying to achieve?
When tracing an activity from Npgsql (the .NET database provider for PostgreSQL), we want to set a valid span name according to the semantic conventions for database client calls, without having to parse the database statement on the client side and without having to guess (or rather infer from our knowledge of the current default behavior of PostgreSQL) information that we don't have when issuing a database client call.

What did you expect to see?
Guidance on creating a valid span name in cases where none of db.operation, db.name, and db.sql.table is available to the database client.
Per the discussion following npgsql/npgsql#4757 (comment), we think that it is probably pretty safe to infer the database name from our knowledge of PostgreSQL's default behavior, but we may still want to research/discuss what good alternatives could be.

Additional context.
https://opentelemetry.io/docs/reference/specification/trace/semantic_conventions/database/
npgsql/npgsql#4757 (comment)

Create prototypes to validate the proposed semantic conventions for messaging

The messaging workgroup has already merged several changes to the existing semantic conventions and is still working on more; the most important one is a proposal for span structures for messaging scenarios (open-telemetry/oteps#220).

In order to ground and illustrate the proposed changes, prototypes for several messaging scenarios utilizing those changes need to be created. This is a prerequisite for declaring stability for any messaging-related conventions.

Clarify possible incompatibilities with CloudEvents Distributed Tracing extension

CloudEvents is a specification for describing event data in common formats to provide interoperability across services, platforms and systems. It is a CNCF project which is very popular with messaging systems.

CloudEvents provides an extension for Distributed Tracing. This extension specifies an additional channel for context propagation (in addition to context propagation via e.g. HTTP or AMQP). It is not clear whether and how this extension should be used when instrumenting with OpenTelemetry and when utilizing OpenTelemetry semantic conventions.

When stabilizing OpenTelemetry semantic conventions for messaging and HTTP, it needs to be worked out how this CloudEvents extension can be utilized in OpenTelemetry scenarios, and whether it's needed at all.

Add semantic conventions for Elasticsearch client instrumentation

What are you trying to achieve?

There is a detailed specification and semantic conventions provided for AWS technologies (for example, for DynamoDB-related spans here), but there isn't one for Elasticsearch.

In the last week, I implemented instrumentation of the Elasticsearch Ruby client that I will soon open as a pull request to the OpenTelemetry Ruby project. I found that the existing Elasticsearch client instrumentations I referenced (mainly the Python and Java implementations) differed from each other in terms of what span attributes were set, what values were used for the attributes, and what custom attributes were set. For example, the Python implementation sets the request body as db.statement, while the Java implementation only sets the request method and URL as db.statement. The Python implementation also sets custom Elasticsearch span attributes.
I also know that we have someone from Elastic working on an Elasticsearch PHP client instrumentation to propose to the PHP OpenTelemetry project. That effort would benefit from a detailed spec as well.

I think having a similar set of semantic conventions for Elasticsearch as are provided for AWS technologies would be valuable so that we don't have more instrumentations of Elasticsearch clients that differ from each other.

Additional context.

One concern when setting Elasticsearch span attributes is cardinality when the URL path is used. Some endpoints contain index names and/or document IDs that can greatly increase the cardinality of the attribute values derived from the URL path. The specification will propose rules, like those listed here, for how to refer to a given API endpoint.
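
For illustration, a minimal sketch of such normalization rules in Python (the patterns and placeholder names here are hypothetical, not taken from any spec):

```python
import re

# Hypothetical normalization rules: collapse index names and document IDs in
# common Elasticsearch endpoint paths to keep attribute cardinality low.
_PATTERNS = [
    (re.compile(r"^/[^/_][^/]*/_doc/[^/]+$"), "/{index}/_doc/{id}"),
    (re.compile(r"^/[^/_][^/]*/_search$"), "/{index}/_search"),
    (re.compile(r"^/[^/_][^/]*/_bulk$"), "/{index}/_bulk"),
]

def normalize_es_path(path: str) -> str:
    """Return a low-cardinality form of an Elasticsearch URL path."""
    for pattern, replacement in _PATTERNS:
        if pattern.match(path):
            return replacement
    return path

assert normalize_es_path("/pets/_doc/123") == "/{index}/_doc/{id}"
```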

If adding a specification for Elasticsearch client instrumentation is approved, I'll open a pull request with proposed span attributes to set and their values.

Add CosmosDb Otel Specification

Context

The Cosmos DB SDK has two modes, Gateway (HTTP) and Direct (RNTBD). In either mode, a single operation call can result in multiple network calls behind the scenes for failover/retries/replica selection; that logic runs in the SDK itself.
The Cosmos DB SDK is going to generate Activities at the operation level and at the network level, carrying attributes with relevant information.

Proposed Semantic conventions for SDK operation level calls

Azure/azure-cosmos-dotnet-v3#3058

Proposed Semantic conventions for SDK network (RNTBD) calls to CosmosDb

| Attribute | Value | Comment |
|---|---|---|
| rntbd.url | | RNTBD URL with partition ID and replica ID |
| rntbd.operation_type | | Operation type |
| rntbd.resource_type | | Resource type |
| rntbd.status_code | 201/200/204 | Network status code |
| rntbd.sub_status_code | 1000/1002 | Cosmos DB sub-status code |

Status

Work is in progress to instrument the .NET Cosmos DB SDK with OpenTelemetry support.

What I need

I need approval from the OpenTelemetry community to include a Cosmos DB specification in the official OpenTelemetry specifications. As soon as I get approval here, I will put a PR out with the specification.

Note: I am new to this community; please let me know if I need to schedule a call or something in order to discuss this.

Add attribute `db.values` to database semantic conventions

What are you trying to achieve?
In some DBs, the query can contain placeholders that are filled with given values:
in MySQL: connection.query('SELECT * FROM `books` WHERE `author` = ?', ['David'])
in Postgres: client.query('SELECT $1::text as message', ['Hello world!'])

These values are not part of the db.statement attribute and are added to the span when the user sets the flag enhancedDatabaseReporting: true in the instrumentation configuration. However, there is no suitable attribute for them in the semantic conventions.

What did you expect to see?
I expected each relevant DB to have a suitable attribute in the semantic conventions, something like MYSQL_VALUES or PG_VALUES.
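
A minimal sketch of what this could look like in an instrumentation, assuming the hypothetical db.values attribute proposed here and an enhancedDatabaseReporting-style opt-in:

```python
from opentelemetry import trace

tracer = trace.get_tracer("db-instrumentation-sketch")

def traced_query(cursor, statement, params, enhanced_reporting=False):
    # "db.values" is the attribute proposed in this issue; it is hypothetical
    # and not part of the current semantic conventions.
    with tracer.start_as_current_span("SELECT") as span:
        span.set_attribute("db.statement", statement)
        if enhanced_reporting:
            span.set_attribute("db.values", [str(p) for p in params])
        return cursor.execute(statement, params)
```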

db.name should be broken down into individual layers

What are you trying to achieve?
The semantic conventions for db client calls (database.md) state that for db.name "the database name to be used is the more specific layer". This may cause problems because, as a user, I don't know beforehand what the tag actually describes: for one db engine the tag may describe the "schema name" and for another a "database name".

This approach also makes it difficult to distinguish e.g. between two schemas with the same schema name running under two different instances.

I believe that db.name should behave consistently regardless of the database engine used (always store the "database name"). Moreover, it might be helpful to add more tags to describe database-specific layers (like db.schema_name or something similar).

Additional context.
https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md

Capture request and response bodies

What are you trying to achieve?

I want to capture request and response bodies. The content can be used for business transaction monitoring and security.

To move this forward we need:

  • data semantics, e.g. http.request.body/http.response.body, or transport-agnostic names, e.g. request.body, response.body
  • a configuration option for auto-instrumentations, e.g. OTEL_INSTRUMENTATION_CAPTURE_REQUEST_BODY=bool
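
A minimal sketch of how an instrumentation might gate body capture on such a flag (both the attribute name and the environment variable are proposals from this issue, not existing conventions):

```python
import os

from opentelemetry import trace

# Hypothetical opt-in flag from this issue; not an existing OTel config option.
CAPTURE_BODY = os.getenv(
    "OTEL_INSTRUMENTATION_CAPTURE_REQUEST_BODY", "false"
).lower() == "true"
MAX_BODY_BYTES = 4096  # truncate to bound attribute size

def record_request_body(span: trace.Span, body: bytes) -> None:
    if CAPTURE_BODY and span.is_recording():
        # "http.request.body" is one of the candidate names discussed above.
        text = body[:MAX_BODY_BYTES].decode("utf-8", errors="replace")
        span.set_attribute("http.request.body", text)
```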

Additional context.

Add Geo fields from Elastic Common Schema

What are you trying to achieve?

Some observability use cases require localization of resources. Examples are:

  • Observability of mobile devices: Geographically localizing the mobile devices provides valuable insights into how mobile apps perform depending on their location.
  • For applications that are geographically distributed across multiple regions and data centers, the geo location can provide useful insights as well.

Having an additional geo namespace in the resource attributes serves use cases like those outlined above and allows filtering and grouping data by geo location.

As part of the OTEP to support the Elastic Common Schema (ECS), we propose to start with adopting the rich set of the ECS geo fields as a new namespace in the resource attributes.
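
A minimal sketch of what such resource attributes could look like with the Python SDK, assuming ECS-style geo field names (the exact OTel attribute names are still to be decided in the proposal):

```python
from opentelemetry.sdk.resources import Resource

# ECS-style geo fields used as resource attributes; the attribute names below
# are assumptions pending the outcome of this proposal.
resource = Resource.create({
    "service.name": "mobile-app-backend",
    "geo.country_iso_code": "DE",
    "geo.city_name": "Berlin",
    "geo.location.lat": 52.52,
    "geo.location.lon": 13.405,
})
```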

Additional context.

This is related to this OTEP: open-telemetry/oteps#199

See attached Draft PR for detailed proposal / context: open-telemetry/opentelemetry-specification#2835

Adding aws.region span attribute to the spec

We noticed that the AWS SDK instrumentation in various languages sets a (client) span attribute named aws.region which is currently not part of the spec.

I found the following instrumentation libraries that set aws.region:

Currently, the spec contains the cloud.region resource attribute intended for server-side spans and faas.invoked_region for client-side spans of FaaS invocations. The latter is therefore not suitable for e.g. calls to DynamoDB using the AWS SDK.

I'd like to propose the following:

  • document aws.region in the spec as an existing legacy span attribute
  • introduce a more generic alternative such as cloud.invoked_region that fits client spans for multiple cloud platforms and SDK call types
  • deprecate faas.invoked_region in favor of the new, more generic attribute

I am happy to create a PR after the discussion of this issue.

Messaging: which semantics auxiliary operations should follow

Messaging libraries have multiple operations unrelated to publishing/consuming or settling.

For example, Azure ServiceBus supports the following operations:

  • producer side:
    • send message(s)
    • schedule message
    • cancel scheduled message
  • consumer side:
    • peek
    • receive
    • process callback
    • settlement
  • aux operations, not strictly related to message flow
    • start/end session
    • renew locks
    • renew tokens
    • configure routing, filtering and other things (more on a control plane side)

Should we have language in the spec limiting the applicability of semconv to a message-related set of defined operations?

Proposal:

If messaging instrumentations want to cover other operations, they are free to use messaging attributes, but there is no guidance or guarantee there, and backends can distinguish message-flow-related operations via the messaging.operation attribute value.

Dealing with batching

In many situations, especially database calls, it is common for the server to support some sort of batching mechanism. For example, one JDBC statement can actually hold a batch of SQL statements, each with completely different syntax. Similarly, multiple Redis commands can be sent in a single request. As there is only a single request, I think we can only create a single CLIENT span for these cases. But this makes it unclear how to fill attributes such as the span name, db.*, etc.

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/database.md#call-level-attributes

DB sanitization uniform format

What are you trying to achieve?

According to the db spec, the db.statement value can be sanitized, but it is not defined how to do so.

Currently, sanitization is dealt with differently in a few places.
I suggest adding a uniform format that describes how to do the sanitization.
(Ideally, this format would apply to all the different DBs and syntaxes.)

Different implementations examples:

  • JS mongo db - implements the sanitization by replacing the information with question marks
  • Python pymongo - implements the sanitization by deleting the information completely, and leaving the query method name only
  • Python elasticsearch (WIP) - suggests to replace the information with a string that will explain that the data is sanitized

I suggest a few options to replace the value with:

  • Keep the method name and add a sanitized-text marker.
    For example: db.statement = "SELECT {query information is sanitized}"
    Advantages: quite easy to implement, easy to keep consistent across libraries.
    Disadvantages: requires research into whether all the different libraries can handle this format effectively.

  • Simple text that describes that the value is sanitized.
    For example: db.statement = "query information is sanitized"
    Advantages: easy to implement, easy to keep consistent across different libraries.
    Disadvantages: doesn't supply basic information about the query that could be useful.

  • Replace the values with question marks.
    For example: db.statement = "SELECT ? FROM ?"
    Advantages: keeps more information, while still not exposing sensitive or private data.
    Disadvantages: harder to implement, harder to keep consistent across libraries.

I would like to hear opinions on the suggested solutions, or different ideas.

Additional context.

open-telemetry/opentelemetry-specification#3104 - Issue regarding changing the recommendation to sanitize the information by default.
#708 - Issue about missing examples for sanitization in specs.

RPC `message` namespace is too generic

Names of RPC message attributes (message.id, message.compressed_size, etc) are too generic, while their semantics are very narrow.

https://github.com/open-telemetry/opentelemetry-specification/blob/b5c6a6dc48752a3ea740f08bed0c2975bae19673/semantic_conventions/trace/rpc.yaml#L193

We also have messaging.message.* namespace which contains messaging-specific attributes with quite similar names, but different semantics.

Assuming we create a registry of attributes and suggest reusing them across signals, we should either have a general-purpose message namespace, or, preferably, put RPC messages under rpc.message.*.

Remove copies of metric-requirement-level and attribute-requirement-level from semconv repository.

See: #5 (comment)

TL;DR: we are currently soft-linking to requirement levels for attributes/metrics defined by the specification.

  • These should become deep links to the specification
  • We need more automation around updating the specification version
  • The version semconv relies on (1.21) has not been released yet.

This is an issue to remind us to clean this up once semconv 1.21 is released and we can link to stable versions.

Related see #9

Add db.statement sanitization/masking examples

What are you trying to achieve?

The trace DB spec mentions that the db.statement attribute value may be sanitized to exclude sensitive information, but provides no example of such sanitization.
The aim of this issue is to add several examples of how we should sanitize database statements, e.g. INSERT INTO payment_cards (CC, EXP_DATE) values (?, ?) for SQL or HMSET cards cc ? exp_date ? for Redis.
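
For illustration, a minimal sketch of a question-mark-style sanitizer (a real implementation would have to handle dialect quirks such as escape sequences, comments, and dollar quoting):

```python
import re

# Replace quoted string literals and bare numeric literals with '?'.
_STRING = re.compile(r"'(?:''|[^'])*'")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")

def sanitize(statement: str) -> str:
    statement = _STRING.sub("?", statement)
    return _NUMBER.sub("?", statement)

print(sanitize("INSERT INTO payment_cards (CC, EXP_DATE) values ('4111111111111111', '12/26')"))
# INSERT INTO payment_cards (CC, EXP_DATE) values (?, ?)
```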

Additional context.

See: open-telemetry/opentelemetry-java-instrumentation#1405

Clarify relationship between messaging, faas, and RPC

I want to clarify the relationship between messaging, RPC, and FaaS, which is confusing to me right now. Let me give an example based on my understanding of the current spec, and let me know if this aligns with expectations. This is an example of Lambda + SQS, something I'm working on now; it is sufficiently complicated that hopefully other cases are simpler.

SQS - an HTTP-based RPC API for publishing and receiving messages
Lambda - a FaaS that can be integrated with SQS. Lambda runtime polls for messages using SQS's RPC API, and when a message comes in, it runs a function with the received batch of messages

SQS Queue - NiceQueue
Lambda function - process_function

So I can envision many spans. These are all for the same request, so everything up to the producer spans is in the same trace:
Request - Request span for the "trace"; it is the current span until the end of production
Message1 - Message 1
Message2 - Message 2
Producer1 - Producer Span for Message 1
Producer2 - Producer Span for Message 2
APIPublish1 - Client span using SQS API to send Message1
APIPublish2 - Client span using SQS API to send Message2
PollMessages - Internal, infinite? span that corresponds to the lambda runtime process itself
APIReceive - Client span using SQS API to poll for messages and receive Message 1 and 2
FaaSInitialize - Span encompassing lambda function initialization and invocation. May have cold start
FaaSInvoke - Span encompassing lambda function invocation by runtime
FaasFunction - Span encompassing lambda function execution in the user's app
Receive - Consumer span encompassing the received batch
Process1 - Processing span for Message1
Process2 - Processing span for Message2

Trace 1
|--------------------------OrderService.ProcessOrders-------------------------|
 |NiceQueue send|                                    |NiceQueue send|
                     |SQS.SendMessage|                               |SQS.SendMessage|


Trace 2
|-------------------------pollmessages------------------------------------------------------------------------------------------|
|APIReceive||APIReceive||APIReceive||APIReceive||APIReceive||APIReceive||APIReceive|
                                                                                    |-----FaaSInitialize-----------------|                                                                                                                  
                                                                                               |---FaasInvoke--------------|
                                                                                               |---FaasFunction------------|
                                                                                                  |--Receive--------------|
                                                                                                   |Process1||Process2|

Phew, lots of spans. So does it look like this?

| id | name | kind | duration | parent | link | semantic types | owner |
|---|---|---|---|---|---|---|---|
| Request | OrderService.ProcessOrders | SERVER | 50ms | none | none | rpc, http | user |
| Producer1 | NiceQueue send | PRODUCER | 0 | Request | none | messaging | user |
| ApiPublish1 | SQS.SendMessage | CLIENT | 20ms | Producer1 | none | rpc, http | user |
| Producer2 | NiceQueue send | PRODUCER | 0 | Request | none | messaging | user |
| ApiPublish2 | SQS.SendMessage | CLIENT | 20ms | Producer2 | none | rpc, http | user |
| PollMessages | pollmessages | INTERNAL | infinity | none | none | none | infra |
| APIReceive | SQS.ReceiveMessage | CLIENT | 30ms | PollMessages | none | rpc, http | infra |
| FaaSInitialize | Lambda::Initialize | INTERNAL | 500ms | none | none | none | infra |
| FaaSInvoke | Lambda::Invoke | INTERNAL | 300ms | FaaSInitialize | none | none | infra |
| FaaSFunction | process_function | INTERNAL | 300ms | FaaSInvoke | none | faas | user |
| Receive | NiceQueue receive | CONSUMER | 250ms | FaaSFunction | none | messaging | user |
| Process1 | NiceQueue process | CONSUMER | 100ms | Receive | Producer1 | messaging | user |
| Process2 | NiceQueue process | CONSUMER | 100ms | Receive | Producer2 | messaging | user |

My currently open questions

  • This is an ideal case, but especially when using auto instrumentation, it may not be possible to have Process1 and Process2 - Lambda presents the entire batch to the user. Should Receive also have links to Producer1 and Producer2? Should Receive be a process span instead of a receive span?

  • How to connect the API calls and the messaging spans? Completely separate, as I listed? That means there are mysterious 0-duration producer spans, and the APIPublish spans are children of PRODUCER spans. Or combine, e.g., Producer1 and ApiPublish1? Would the name be NiceQueue.send while we still include rpc.service == SQS, rpc.method == SendMessage to not lose that information? Is such a span's kind PRODUCER or CLIENT?

  • Are FaaSInvoke and Receive separate, or should they be combined? In reality, if the runtime is a cloud provider, there may be no control over this, and FaaSInvoke is created automatically.

  • APIReceive is a CLIENT span, so I wouldn't expect it to be a parent of the rest - is it basically an orphan span? Anyway, while I might be able to model it as a parent of something if I were implementing all this myself, that wouldn't be possible for Lambda + SQS.

Appreciate any guidance :)

Add agent resource type

What are you trying to achieve?

I'd like to define semantic conventions for agent resources.

Additional context.

Agents are a key part of the software stack, and need to be monitored just as any other component. Several vendors already offer self-monitoring capabilities, for instance:

The OpenTelemetry Collector also offers a set of best practices for monitoring.

While agents can be considered services, we might want to add additional attributes to describe them more specifically. Possible examples include:

  • agent.type: com.dynatrace.one_agent, com.newrelic.infra_agent, io.opentelemetry.collector
  • agent.version
  • agent.distro: github.com/signalfx/splunk-otel-collector

Note that this was first discussed in the context of OpAMP in this issue. However, since agent self-monitoring happens outside the context of OpAMP, I think it makes sense to define semantic conventions in this repo.

Consider `http.status_code_class` attribute

Inspired by open-telemetry/opentelemetry-specification#2943 (comment).

Because metrics are sensitive to cardinality, I've seen instrumentations using strings like 4xx, 5xx for the status code.

Proposal is to add a new attribute for grouping status codes by class, i.e., 1xx, 2xx, 3xx, etc. See: https://datatracker.ietf.org/doc/html/rfc9110#section-15.
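
The grouping itself is trivial to compute; a minimal sketch:

```python
def http_status_code_class(status_code: int) -> str:
    """Map an HTTP status code to its class, e.g. 404 -> "4xx" (RFC 9110, section 15)."""
    if not 100 <= status_code <= 599:
        return "invalid"
    return f"{status_code // 100}xx"

assert http_status_code_class(404) == "4xx"
assert http_status_code_class(204) == "2xx"
```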

Open questions:

  • Do we want a new attribute?
  • Should the value of the attribute be 1xx, 2xx, etc OR informational, successful?
  • For metrics, is this attribute conditionally required if status is present?
  • For metrics, would http.status_code become optional?
  • Should traces also have this attribute?

network semconv change: local/remote side no longer attributable

Background

This regression was introduced in open-telemetry/opentelemetry-specification#3402, see comment thread from here: open-telemetry/opentelemetry-specification#3402 (comment)

Description

Previous semantic conventions used the concepts "peer" to indicate the remote end and "host" to indicate the local end of a network connection. It is now no longer generally possible to know which is which, as the new concepts of server/client and source/destination are orthogonal to that.

There is one common case in which the assignment can still be made for server/client: spans, as long as they don't have INTERNAL as their SpanKind, can use the rule "if the kind is server/consumer, then the server is the local side; otherwise it is the client".

There is also the case of metrics where the metric definition makes it clear whether the local side is the client or the server, according to @AlexanderWert in open-telemetry/opentelemetry-specification#3402 (comment). IIUC, this requires case-by-case knowledge to define a mapping metric ID -> "is the client or the server the local side".

Besides these two particular cases, it seems now impossible to tell which side is the local vs. remote. Most commonly, this may affect logs, metrics which are not known to the system wanting to know the local/remote end, and any use of source/destination where the remote/local-ness is completely unclear.

Proposal

Introduce a new semantic attribute network.role which may be one of server, client, sender or receiver (closed set). Then the mapping is as follows:

| network.role | server.* | client.* | source.* | destination.* |
|---|---|---|---|---|
| server | local | remote | not allowed | not allowed |
| client | remote | local | not allowed | not allowed |
| sender | not allowed | not allowed | local | remote |
| receiver | not allowed | not allowed | remote | local |

For peer-to-peer operations, especially those using source/destination, there may be cases where, in a single operation for which a metric/span/log line is recorded, both the sender and the receiver role apply. In this case, the role should be arbitrarily chosen (by default, I suggest "sender") and the source/destination attributes set consistently to allow correct local/remote attribution. It would also be possible to define an additional role "mixed" with the same mapping as "sender", or to define that the role remains unset and "source" is by default the local end and "destination" the remote end.

Bikeshedding alternative: network.position instead of network.role would work as well, with values corresponding 1:1 to the prefixes that are then local, i.e. one of server/client/source/destination.
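
A minimal sketch of the proposed mapping as a lookup table, assuming the closed value set above:

```python
# For each network.role value: (prefix describing the local end,
#                               prefix describing the remote end).
_ROLE_TO_LOCAL_REMOTE = {
    "server": ("server", "client"),
    "client": ("client", "server"),
    "sender": ("source", "destination"),
    "receiver": ("destination", "source"),
}

def local_and_remote_prefixes(network_role: str) -> tuple[str, str]:
    try:
        return _ROLE_TO_LOCAL_REMOTE[network_role]
    except KeyError:
        raise ValueError(f"unknown network.role: {network_role}")

assert local_and_remote_prefixes("receiver") == ("destination", "source")
```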

CC @lmolkova @trask

Proposal to add system.memory.slab

What are you trying to achieve?

At the moment, the assumption that the sum of all memory states should equal the limit (or total) is broken for the Linux system.memory metric provided by the hostmetrics receiver. The issue is that Slab memory is included as a memory state.

Related issues/PRs:
open-telemetry/opentelemetry-collector-contrib#14909
open-telemetry/opentelemetry-collector-contrib#7417
open-telemetry/opentelemetry-collector-contrib#19149

The proposed metric, "system.memory.slab", would track the amount of memory used by the kernel for Slab caching. This metric would be helpful for monitoring system memory usage on Linux-based systems, particularly in environments where Slab memory usage may be a significant contributor to overall memory usage.

Why not include it in system.memory? Because Slab memory is already included in the used state provided by the receiver.

What did you expect to see?

system.memory.slab.usage
system.memory.slab.utilization

Attributes: reclaimable + unreclaimable
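
A minimal sketch of where these values would come from on Linux, reading /proc/meminfo:

```python
def read_slab_meminfo():
    """Read Slab usage from /proc/meminfo on Linux, in bytes (sketch)."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            parts = value.split()
            # Most values are reported in kB; a few counters are unitless.
            fields[key] = int(parts[0]) * (1024 if "kB" in parts else 1)
    return {
        "usage": fields["Slab"],
        "reclaimable": fields["SReclaimable"],    # proposed attribute value
        "unreclaimable": fields["SUnreclaim"],    # proposed attribute value
    }
```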

Additional context.

FreeBSD uses a similar memory management technique called "uma" (Unified Memory Architecture), but I could not find the value in the gopsutil library, nor a generic name that covers all systems. On Windows, the kernel manages kernel memory with allocation techniques called the "non-paged pool" and the "paged pool".
I am open to other naming proposals; the most generic name I could think of to refer to Slab memory on any operating system is "object caching".

cc @rmfitzpatrick @dmitryax

Add semantic conventions for "Process" spans for messaging scenarios

Currently, the messaging workgroup is working on open-telemetry/oteps#220, which covers span structures in messaging scenarios.

The group decided to focus on a consistent set of conventions that can be applied across all messaging scenarios, which resulted in a proposal of conventions for "Publish", "Create", "Deliver", "Receive", and "Settle" operations, as those share common characteristics across all messaging scenarios. The same can't be said of "Process" operations, which can vary considerably depending on the individual use case.

However, as interest was expressed from many sides in also achieving some consistency for the instrumentation of "Process" operations, it is necessary to provide conventions for "Process" operations as an addition to what's already in open-telemetry/oteps#220.

This work doesn't block merging open-telemetry/oteps#220.

Semantic conventions for SIP protocol

Please define semantic conventions for the SIP protocol. It is similar to HTTP, so for the most part it would be enough to take the existing conventions for HTTP and replace http with sip.

One tricky part which needs special attention is how to create spans for the 3-way handshake: INVITE -> 200 OK -> ACK. Additionally, a client can send CANCEL before it receives the 200 OK for the INVITE in order to cancel it; this scenario also needs some explanation.

Change peer.service to service.origin.name and service.target.name?

@AlexanderWert to confirm that ECS doesn't have an equivalent to peer.service.

ECS service.* fields can be self-nested under service.origin.* and service.target.*, which serve a similar purpose to peer.service in OTel. Similar to our previous discussions on client/server and source/destination, service.origin.* and service.target.* make the relationship more explicit, without the need to know the context of the field.

Originally posted by @AlexanderWert in #1012

Define metric semantic conventions for database operations

What are you trying to achieve?
Define metric semantic conventions for database operations.

What did you expect to see?
Database metric semantic convention documentation in markdown generated from YAML definitions as per open-telemetry/opentelemetry-specification#547.

Additional context.
These should be roughly analogous to the trace semantic conventions for database client calls defined here. For reference, metric semantic conventions for HTTP operations are already defined here.

Database semantic conventions may violate namespacing guidelines

Problem Description

We have database conventions like this: db.name, db.statement, etc. Here db is the namespace, and we have attributes under this namespace that are common to all databases. We have a large number of these.

We also have conventions like this: db.mssql.instance_name, where db.mssql is the namespace. The implied idea is that database-specific attributes are placed in the db.<database-name> namespace, although this is not explicitly called out anywhere.

This is a problem. The enumeration is not bounded and can contain any value in the future. We may eventually need to add database-specific convention support for a database whose name matches one of the numerous attributes under the db namespace.

However, it will be impossible because it will be a violation of namespacing guidelines, which say:

Names SHOULD NOT coincide with namespaces. For example if service.instance.id is an attribute name then it is no longer valid to have an attribute named service.instance because service.instance is already a namespace. Because of this rule be careful when choosing names: every existing name prohibits existence of an equally named namespace in the future, and vice versa: any existing namespace prohibits existence of an equally named attribute key in the future.

We have a situation when future evolution of semantic conventions may be impossible because of the current design.

Possible Solutions

I list a couple solutions below. If you can think of another way please comment so that we can discuss that too.

Solution 1

Move all database-specific conventions to a properly isolated namespace, e.g. instead of using db.<database-name> as the namespace, use db.special.<database-name> or some other namespace that is guaranteed not to clash with any other attributes in other namespaces.

The downside is that we need to change existing conventions and also that database-specific conventions will use somewhat longer attribute names (db.special.cassandra.page_size is longer and less readable than db.cassandra.page_size that we use currently).

Solution 2

Explicitly call out that certain database names are disallowed. This list would contain everything that is already an attribute under the db namespace. We can probably also reserve some names for use either as attributes under the db namespace (and thus disallow them as database names) or as database names (and thus disallow them as attribute names).

Any future database that has a name that clashes with existing attribute under db namespace will need to have its name transformed such that it no longer conflicts with an attribute name.

For example if a hypothetical future database called "system" needs to have some specific attributes in the conventions then we can place such attribute under db.systemdb namespace to make sure it does not conflict with db.system generic attribute.

The benefit of this solution is that we don't need to change existing conventions.

Semantic Conventions for MongoDB

What are you trying to achieve?

Reach a conclusion on what fields to expose for MongoDB, on top of the existing ones.

Standardizing on the right keys would be great, but I don't have a specific proposal in mind. Just raising the thread in case anyone has a perspective.

Additional context.

Here are a couple of examples of Mongo instrumentation:

renaming changes from 1.17.0 to 1.18.0 are not listed

Release notes

Rename google openshift platform attribute from google_cloud_openshift to gcp_openshift to match the existing cloud.provider prefix. (open-telemetry/opentelemetry-specification#3095)

@mx-psi pointed out that this change should be reflected in the 1.18.0 schema.

Can we fix this in 1.18.0 or do we need another release?

Another question: is this schema file maintained by hand? I searched for some documentation, but I was unable to find it 🤕.

Set up github ISSUE templates

We should create templates for the following:

  • Bug/Issues with current semantic conventions and/or tooling
  • Project proposal for starting new working group in a semantic convention area
  • Feature request for an existing semantic convention area (maybe one for each area that has approvers)
  • Feature request for tooling

Add system uptime metric

What are you trying to achieve?

I want to add a metric to the semantic conventions that will describe the system uptime. How about system.uptime?

Additional context.

This is reported by Telegraf as the uptime field of the system metric (in seconds).

Here's a related proposal on the hostmetrics receiver to add this metric: open-telemetry/opentelemetry-collector-contrib#14130.
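
For reference, a minimal sketch of how this value can be read on Linux:

```python
def system_uptime_seconds() -> float:
    """Read system uptime in seconds from /proc/uptime (Linux-only sketch)."""
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])
```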

Remove technical committee from CODEOWNERS

One of the main reasons for separating SemConv into a different repo was to have a more flexible approvers model. But CODEOWNERS still lists the TC, meaning the TC will be tagged as approver for every PR - why do we want that?

Add sustainability metrics and attributes to semantic conventions for hardware metrics

Use Case

We need to be able to report the carbon footprint of servers, network, storage, applications, and services. To allow that across the entire infrastructure, semantic conventions are required, starting with the underlying physical infrastructure.

Specifications

In addition to hw.power and hw.energy, add metrics to semantic conventions for hardware metrics, like:

  • hw.abiotic_depletion_potential
  • hw.product_carbon_footprint

Define a Site entity, i.e. a physical location with specific properties that can be measured with metrics like:

  • hw.site.pue
  • hw.site.itue, hw.site.tue
  • hw.site.cue
  • hw.site.wue
  • hw.site.ere, hw.site.erf
  • hw.site.ref, hw.site.oef
  • hw.site.cer
  • hw.site.electricity_cost
  • hw.site.carbon_intensity

Additional context

More examples and links can be found in various places, but Green Software Foundation's Awesome Green Software is a good reference to start with, as well as its Hardware Efficiency page.

Semantic convention for span link names

What are you trying to achieve?
Spans can be created with one or more links leading to other spans (besides the parent span link). When such links are present, Jaeger displays all of them in a small menu next to the trace. Every line in this menu is some ID (probably a span ID) and a link type (parent/external). This is not very user friendly. When only one external link is present (in the case of a 1-to-1 relation between traces) this is not a big deal. However, when many links are added, users have to open every one of them to find the one they are looking for.

What did you expect to see?
Please create a semantic convention for links. Initially it should contain only one attribute, specifying a user-friendly link name, e.g. link.name. This attribute could later be used by tools like Jaeger to present more user-friendly names in the link list.

Add `process.cpu.count` metric to semantic conventions for OS process metrics.

What are you trying to achieve?

Add a metric that exposes the number of processors available to the current process to the semantic conventions for OS process metrics.

Proposed instrument name, type, unit and description:

| Name | Instrument Type (*) | Unit | Description | Labels |
|---|---|---|---|---|
| process.cpu.count | UpDownCounter | {processors} | Number of processors (CPUs) available to the current process. | |

Currently, the definition of process.cpu.utilization is "Difference in process.cpu.time since the last measurement, divided by the elapsed time and number of CPUs available to the process", which requires maintaining state (in this case, the time of the last collection) in the instrument.

The challenge encountered during implementation in .NET is:
open-telemetry/opentelemetry-dotnet-contrib#831

Potential workarounds:
open-telemetry/opentelemetry-dotnet-contrib#948

What did you expect to see?

Add the process.cpu.count metric to the semantic conventions and let the backend do the computation.
Given the instrument values of process.cpu.time and process.cpu.count, the backend will have sufficient data to calculate the CPU utilization metric.
open-telemetry/opentelemetry-dotnet-contrib#981
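
A minimal sketch of the proposed instrument using the OpenTelemetry Python metrics API (the name and unit follow the table above; os.sched_getaffinity is Linux-specific):

```python
import os

from opentelemetry import metrics

meter = metrics.get_meter("process-metrics-sketch")

def _observe_cpu_count(options):
    # Number of processors available to *this* process (its affinity mask).
    yield metrics.Observation(len(os.sched_getaffinity(0)))

meter.create_observable_up_down_counter(
    "process.cpu.count",
    callbacks=[_observe_cpu_count],
    unit="{processors}",
    description="Number of processors (CPUs) available to the current process.",
)
```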

Additional context.

Previous discussion related to this topic:
open-telemetry/opentelemetry-specification#2392

Capture kafka cluster.id

What are you trying to achieve?

Kafka clusters have an identifier called cluster_id that is useful in environments with multiple clusters. For example, you may have clusters for different domains, or you may use something like MirrorMaker to copy data between two clusters.

In these situations, it's useful for cluster_id to be included in telemetry data. I propose that we add cluster_id as a Kafka-specific attribute in the messaging semantic conventions.

Additional context.

Unclear if this has already been considered.

http.route for client & server both as opposed to only server?

What did you expect to see?

http.route is only present for servers. I expect a similar tag for clients as well, to capture the request route.


hw.host.power/energy versus hw.power/energy metrics

What are you trying to achieve?

Improve the spec to provide guidelines for:

  1. hw.host.power versus hw.power metric.
  2. hw.host.energy versus hw.energy metric.
  3. hardware components that can report both IN/OUT power/energy utilization.

What did you expect to see?

Better guidelines for hw.host.power and hw.power metrics.

Additional context.

Use hw.power instead of hw.gpu.power

I see hw.gpu.power has specifically been defined for GPUs. Shouldn't GPU power be standardized to use hw.power?

Device with In/Out power

Some devices transfer some of the energy they receive, i.e., they have IN and OUT power. What metrics and attributes should be used to report power/energy utilization? In particular, how does one report output power?

A possible solution is to add metrics to report both input and output power:

| Metric Name | Description |
|---|---|
| hw.power | The power drawn by the component (but not necessarily fully consumed by the component) |
| hw.power_out | The output power delivered by the component |
| hw.energy | The energy drawn by the component (but not necessarily fully consumed by the component) |
| hw.energy_out | The energy delivered by the component (energy delivered externally) |

hw.power could potentially be renamed to hw.power_in.

For example:

  1. A network device that supports power over Ethernet. The device may consume 500W and some of that power is transferred over Ethernet to connected devices, which themselves may report their own power utilization. In this case, the switch is the host resource and reports power usage. An appliance connected to the switch may be a separate host resource that also reports power usage.
  2. A PSU draws 148W in and its output power is 122W. The PSU provides power to an attached component.
    1. The PSU reports hw.power = 148W and hw.power_out = 122W.
    2. The attached component reports hw.power = 121W.
    3. The SUM of hw.power minus the SUM of hw.power_out across components indicates how much power is drawn without double counting the power (see the worked example after this list).
    4. A limitation of this approach is that it would not be able to account for loss over the power medium (e.g. a wireless charge incurs significant power loss as heat).
  3. A smart PDU can report the input/output power, energy, voltage, current. The PDU itself consumes very little energy, most of the power is transferred to the connected devices. Suppose the PDU has 10 connected devices, each consuming 500 Watts. The PDU may consume 20 Watts, so overall the PDU "consumes" 5,020 Watts. How should the PDU report its power? If hw.host.power reports 5,020 Watts, and each connected device reports hw.host.power with 500 Watts, then in aggregate the power is double counted.
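
To make the accounting concrete, here is the arithmetic for the PSU example above: SUM(hw.power) = 148 W + 121 W = 269 W and SUM(hw.power_out) = 122 W, so the attributed draw is 269 W - 122 W = 147 W. That total decomposes into the 26 W lost inside the PSU (148 W in, 122 W out) plus the 121 W consumed by the attached component; the remaining 1 W (122 W delivered vs. 121 W drawn) is loss over the power medium, which is exactly the limitation noted in point 4.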

Smart meters

Smart meters can report the power utilization. For example, what metrics should a house smart meter report?

Reporting both hw.host.power and hw.power metrics

Suppose a physical system has:

  1. Multiple Power Supply Units (PSUs).
  2. Sub-components that each consume power, e.g., memory, disks, CPUs, GPUs.
  3. Power usage of each of the sub-components can be measured and needs to be reported.

For example, a physical server has power supply units (PSUs), CPUs, DIMMs, disks, GPU, PCI components, etc. Each of these consume energy and typically have sensors that can report power utilization. The power supply units can report the total energy consumed by the host, and each sub-component can have an instrument that reports the power utilization of that component.

Questions/Issues:

  1. Should the physical system calculate the sum of power usage across all PSUs and report the hw.host.power metric for the whole system as the sum of all PSUs' power usage?
  2. Is the hw.power metric used to report power usage for each of the sub-components?
  3. Since the PSUs are themselves hardware components, how should they report power utilization? Using the hw.power metric?
  4. I would expect hw.host.power to be greater than or equal to the sum of hw.power across sub-components. But if the PSUs report hw.power, that may double count power usage for the sub-components unless we can somehow distinguish between input and output PSU power.

Should metrics description be a full sentence?

OpenMetrics provides an example of full-sentence descriptions: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#overall-structure

The current metrics semantic conventions are different (and inconsistent).

Ask:

  1. Make it consistent at least for the semantic conventions that are in scope for the initial stable release.
  2. Align with OpenMetrics: descriptions should start with an uppercase letter and end with a period (.).

Additional Ask (not blocking):

  1. The build tooling should fail/block CI if any description does not follow the convention/rule.
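
A minimal sketch of such a check (how descriptions are extracted from the semconv YAML model is out of scope here):

```python
import sys

def check_description(name: str, description: str) -> list[str]:
    """Flag descriptions that don't start uppercase or don't end with a period."""
    errors = []
    if not description:
        errors.append(f"{name}: description is empty")
        return errors
    if not description[0].isupper():
        errors.append(f"{name}: description should start with an uppercase letter")
    if not description.endswith("."):
        errors.append(f"{name}: description should end with '.'")
    return errors

if __name__ == "__main__":
    problems = check_description("system.uptime", "the time the system has been running")
    for problem in problems:
        print(problem)
    sys.exit(1 if problems else 0)
```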

The protocol attribute should be removed for `system.network.connections` metrics

What are you trying to achieve?

Since UDP has no connection state, I propose that UDP and TCP connection metrics be split into a protocol-specific metric for each supported protocol.

If folks agree, I will open a PR to mark system.network.connections as deprecated in favour of:

system.network.tcp.connections
system.network.udp.connections

Additional context.

The work for this was previously done in open-telemetry/opentelemetry-specification#2675 and reverted in open-telemetry/opentelemetry-specification#2748. See open-telemetry/opentelemetry-specification#2726 for additional context

Analyze the overlap between OpenTelemetry tracing attributes and ECS attributes

HTTP

Exceptions

General remote service attributes

@AlexanderWert to confirm that ECS doesn't have an equivalent to peer.service.

General identity attributes

| OpenTelemetry | ECS |
|---|---|
| enduser.id | user.id |
| enduser.role, enduser.scope | user.roles (?) |

General thread attributes

| OpenTelemetry | ECS |
|---|---|
| thread.id | process.thread.id |
| thread.name | process.thread.name |

Source code attributes

| OpenTelemetry | ECS |
|---|---|
| code.function | log.origin.function |
| code.namespace | |
| code.filepath | log.origin.file.name |
| code.lineno | log.origin.file.line |
| code.column | |

Messaging

This is also discussed/tracked at open-telemetry/opentelemetry-specification#3196.

Only impacted by network attribute changes above (#3199 and open-telemetry/opentelemetry-specification#3371).

(ECS doesn't have any messaging-specific attributes, @AlexanderWert to confirm)

Could also be impacted by conflict with source.* and destination.* namespace, see open-telemetry/opentelemetry-specification#3407

Database

Only impacted by network attribute changes above (#3199 and open-telemetry/opentelemetry-specification#3371).

(ECS doesn't have any database-specific attributes, @AlexanderWert to confirm).

RPC

Only impacted by network attribute changes above (#3199 and open-telemetry/opentelemetry-specification#3371).

(ECS doesn't have any RPC-specific attributes, @AlexanderWert to confirm).

FaaS tracing and resources

| OpenTelemetry | ECS |
|---|---|
| faas.coldstart | faas.coldstart |
| faas.invocation_id | faas.execution |
| cloud.resource_id | faas.id |
| faas.name | faas.name |
| | faas.trigger.request_id |
| faas.trigger | faas.trigger.type |
| faas.version | faas.version |

CloudEvents

ECS doesn't have any cloud event specific attributes (@AlexanderWert to confirm)

Feature Flags

ECS doesn't have any feature flag specific attributes (@AlexanderWert to confirm)
