
profefe / kube-profefe

continuous profiling made easy in Kubernetes with profefe

Home Page: https://kubernetes.profefe.dev/

License: MIT License

Languages: Go 99.90%, Shell 0.10%
Topics: golang, kubernetes, performance, pprof

kube-profefe's Introduction

profefe


profefe, a continuous profiling system, collects profiling data from a fleet of running applications and provides an API for querying profiling samples for postmortem performance analysis.

Why Continuous Profiling?

"Continuous Profiling and Go" describes the motivation behind profefe:

With the increase in momentum around the term “observability” over the last few years, there is a common misconception amongst the developers, that observability is exclusively about metrics, logs and tracing (a.k.a. “three pillars of observability”) [..] With metrics and tracing, we can see the system on a macro-level. Logs only cover the known parts of the system. Performance profiling is another signal that uncovers the micro-level of a system; continuous profiling allows observing how the components of the application and the infrastructure it runs in, influence the overall system.

How does it work?

See Design Docs documentation.

Quickstart

To build and start the profefe collector, run:

$ make
$ ./BUILD/profefe -addr=localhost:10100 -storage-type=badger -badger.dir=/tmp/profefe-data

2019-06-06T00:07:58.499+0200    info    profefe/main.go:86    server is running    {"addr": ":10100"}

The command above starts the profefe collector backed by BadgerDB as the profile storage. profefe supports other storage types: S3, Google Cloud Storage, and ClickHouse.

Run ./BUILD/profefe -help to show the list of all available options.

Example application

profefe ships with a fork of Google Stackdriver Profiler's example application, modified to use the profefe agent, which sends profiling data to the profefe collector.

To start the example application run the following command in a separate terminal window:

$ go run ./examples/hotapp/main.go

After a brief period, the application will start sending CPU profiles to the profefe collector.

send profile: http://localhost:10100/api/0/profiles?service=hotapp-service&labels=version=1.0.0&type=cpu
send profile: http://localhost:10100/api/0/profiles?service=hotapp-service&labels=version=1.0.0&type=cpu
send profile: http://localhost:10100/api/0/profiles?service=hotapp-service&labels=version=1.0.0&type=cpu

With the profiling data persisted, query the profiles from the collector using its HTTP API (documented below). As an example, request all profiling data matching the given meta-information (service name and time frame) as a single merged profile:

$ go tool pprof 'http://localhost:10100/api/0/profiles/merge?service=hotapp-service&type=cpu&from=2019-05-30T11:49:00&to=2019-05-30T12:49:00&labels=version=1.0.0'

Fetching profile over HTTP from http://localhost:10100/api/0/profiles...
Saved profile in /Users/varankinv/pprof/pprof.samples.cpu.001.pb.gz
Type: cpu

(pprof) top
Showing nodes accounting for 43080ms, 99.15% of 43450ms total
Dropped 53 nodes (cum <= 217.25ms)
Showing top 10 nodes out of 12
      flat  flat%   sum%        cum   cum%
   42220ms 97.17% 97.17%    42220ms 97.17%  main.load
     860ms  1.98% 99.15%      860ms  1.98%  runtime.nanotime
         0     0% 99.15%    21050ms 48.45%  main.bar
         0     0% 99.15%    21170ms 48.72%  main.baz
         0     0% 99.15%    42250ms 97.24%  main.busyloop
         0     0% 99.15%    21010ms 48.35%  main.foo1
         0     0% 99.15%    21240ms 48.88%  main.foo2
         0     0% 99.15%    42250ms 97.24%  main.main
         0     0% 99.15%    42250ms 97.24%  runtime.main
         0     0% 99.15%     1020ms  2.35%  runtime.mstart

profefe includes a tool that imports existing pprof data into the collector. While the profefe collector is running, run the following:

$ ./scripts/pprof_import.sh --service service1 --label region=europe-west3 --label host=backend1 --type cpu -- path/to/cpu.prof

uploading service1-cpu-backend1-20190313-0948Z.prof...OK

Using Docker

You can build a Docker image with the profefe collector by running:

$ make docker-image

Documentation on running profefe in Docker is in contrib/docker/README.md.

HTTP API

Store pprof-formatted profile

POST /api/0/profiles?service=<service>&type=[cpu|heap|...]&labels=<key=value,key=value>
body pprof.pb.gz

< HTTP/1.1 200 OK
< Content-Type: application/json
<
{
  "code": 200,
  "body": {
    "id": <id>,
    "type": <type>,
    ···
  }
}
  • service — service name (string)
  • type — profile type ("cpu", "heap", "block", "mutex", "goroutine", "threadcreate", or "other")
  • labels — a set of key-value pairs, e.g. "region=europe-west3,dc=fra,ip=1.2.3.4,version=1.0" (Optional)

Example

$ curl -XPOST \
  "http://<profefe>/api/0/profiles?service=api-backend&type=cpu&labels=region=europe-west3,dc=fra" \
  --data-binary "@$HOME/pprof/api-backend-cpu.prof"
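
The same upload can be done programmatically from Go. A minimal sketch, assuming a local collector at localhost:10100 and a pprof-formatted file cpu.prof on disk (both hypothetical):

package main

import (
    "bytes"
    "fmt"
    "net/http"
    "os"
)

func main() {
    // Read a pprof-formatted profile from disk; the path is hypothetical.
    data, err := os.ReadFile("cpu.prof")
    if err != nil {
        panic(err)
    }

    // Same request as the curl example above, pointed at a local collector.
    url := "http://localhost:10100/api/0/profiles?service=api-backend&type=cpu&labels=region=europe-west3,dc=fra"
    resp, err := http.Post(url, "application/octet-stream", bytes.NewReader(data))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("collector replied with", resp.Status)
}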

Store runtime execution traces (experimental)

Go's runtime traces are a special case of profiling data that can be stored and queried with profefe.

Currently, profefe doesn't support extracting the timestamp of when the trace was created. The client may provide this information via the created_at parameter, see below.

POST /api/0/profiles?service=<service>&type=trace&created_at=<created_at>&labels=<key=value,key=value>
body trace.out

< HTTP/1.1 200 OK
< Content-Type: application/json
<
{
  "code": 200,
  "body": {
    "id": <id>,
    "type": "trace",
    ···
  }
}
  • service — service name (string)
  • type — profile type ("trace")
  • created_at — trace profile creation time, e.g. "2006-01-02T15:04:05" (defaults to server's current time)
  • labels — a set of key-value pairs, e.g. "region=europe-west3,dc=fra,ip=1.2.3.4,version=1.0" (Optional)

Example

$ curl -XPOST \
  "http://<profefe>/api/0/profiles?service=api-backend&type=trace&created_at=2019-05-01T18:45:00&labels=region=europe-west3,dc=fra" \
  --data-binary "@$HOME/pprof/api-backend-trace.out"

Query meta information about stored profiles

GET /api/0/profiles?service=<service>&type=<type>&from=<created_from>&to=<created_to>&labels=<key=value,key=value>

< HTTP/1.1 200 OK
< Content-Type: application/json
<
{
  "code": 200,
  "body": [
    {
      "id": <id>,
      "type": <type>
    },
    ···
  ]
}
  • service — service name
  • from, to — a time frame in which profiling data was collected, e.g. "from=2006-01-02T15:04:05"
  • type — profile type ("cpu", "heap", "block", "mutex", "goroutine", "threadcreate", "trace", "other") (Optional)
  • labels — a set of key-value pairs, e.g. "region=europe-west3,dc=fra,ip=1.2.3.4,version=1.0" (Optional)

Example

$ curl "http://<profefe>/api/0/profiles?service=api-backend&type=cpu&from=2019-05-01T17:00:00&to=2019-05-25T00:00:00"

Query saved profiling data returning it as a single merged profile

GET /api/0/profiles/merge?service=<service>&type=<type>&from=<created_from>&to=<created_to>&labels=<key=value,key=value>

< HTTP/1.1 200 OK
< Content-Type: application/octet-stream
< Content-Disposition: attachment; filename="pprof.pb.gz"
<
pprof.pb.gz

Request parameters are the same as for querying meta information.

Note, "type" parameter is required; merging runtime traces is not supported.

Return individual profile as pprof-formatted data

GET /api/0/profiles/<id>

< HTTP/1.1 200 OK
< Content-Type: application/octet-stream
< Content-Disposition: attachment; filename="pprof.pb.gz"
<
pprof.pb.gz
  • id — the id of a stored profile, returned by the meta-information query above

Merge a set of individual profiles into a single profile

GET /api/0/profiles/<id1>+<id2>+...

< HTTP/1.1 200 OK
< Content-Type: application/octet-stream
< Content-Disposition: attachment; filename="pprof.pb.gz"
<
pprof.pb.gz
  • id1, id2 — ids of stored profiles

Note that merging is possible only for profiles of the same type; merging runtime traces is not supported.

Get services for which profiling data is stored

GET /api/0/services

< HTTP/1.1 200 OK
< Content-Type: application/json
<
{
  "code": 200,
  "body": [
    <service1>,
    ···
  ]
}

Get profefe server version

GET /api/0/version

< HTTP/1.1 200 OK
< Content-Type: application/json
<
{
  "code": 200,
  "body": {
    "version": <version>,
    "commit": <git revision>,
    "build_time": <build timestamp>"
  }
}

FAQ

Does continuous profiling affect the performance of production services?

Profiling always comes with some cost. Go collects sampling-based profiling data, and for most applications the real overhead is small enough (refer to "Can I profile my production services?" in Go's Diagnostics documentation).

To reduce the costs, users can adjust the frequency of collection rounds, e.g. collect 10 seconds of CPU profiles every 5 minutes.

profefe-agent reduces the overhead further by adding a small jitter between profile collection rounds, as sketched below. This spreads the total profiling cost and ensures that not all instances of an application's cluster are profiled at the same time.
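
A sketch of what such a collection loop with jitter could look like. This illustrates the idea, not profefe-agent's actual implementation; the collector address and service name are placeholders:

package main

import (
    "bytes"
    "math/rand"
    "net/http"
    "runtime/pprof"
    "time"
)

func main() {
    const interval = 5 * time.Minute

    for {
        // Jitter: delay each round by a random fraction of the interval
        // so that instances started together drift apart over time.
        time.Sleep(interval + time.Duration(rand.Int63n(int64(interval/10))))

        var buf bytes.Buffer
        if err := pprof.StartCPUProfile(&buf); err != nil {
            continue
        }
        time.Sleep(10 * time.Second) // collect 10 seconds of CPU samples
        pprof.StopCPUProfile()

        // Hypothetical collector address and service name.
        resp, err := http.Post(
            "http://localhost:10100/api/0/profiles?service=my-service&type=cpu",
            "application/octet-stream", &buf)
        if err == nil {
            resp.Body.Close()
        }
    }
}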

Can I use profefe with non-Go projects?

profefe collects pprof-formatted profiling data. The format is used by the Go profiler, but third-party profilers for other programming languages support the format too. For example, google/pprof-nodejs for Node.js, tikv/pprof-rs for Rust, arnaud-lb/php-memory-profiler for PHP, etc.

Integrating them is a matter of building a transport layer between the profiler and profefe.

Further reading

While the topic of continuous profiling in production is still underrepresented on the public internet, some research and commercial projects already exist.

profefe is still in an early state. Feedback and contributions are very welcome.

License

MIT

kube-profefe's People

Contributors

gianarb, gtb3nw, henry-jackson, omissis


kube-profefe's Issues

No Auth Provider found for name "gcp"

Adding the following import should allow for GCP clients:

_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"

I imagine other managed Kubernetes services would appreciate having the auth packages included for them too :)
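
A sketch of the suggested change; blank-importing the parent auth package (which exists in client-go and pulls in the gcp, azure, oidc, and openstack providers) would cover the other managed services too:

package main

import (
    // Blank-importing the parent auth package registers every bundled
    // client-go credential provider (gcp, azure, oidc, openstack), so
    // other managed Kubernetes services are covered as well.
    _ "k8s.io/client-go/plugin/pkg/client/auth"
)

func main() {
    // ... the rest of the plugin's entrypoint is unchanged
}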

wrapper around `go tool pprof` from the kubectl plugin

I would like to have a command that runs go tool pprof with the right values for me.

So I am thinking about two things:

kubectl profefe pprof <profile_id>
Fetching profile over HTTP from http://localhost:10100/api/0/profiles/<profile_id>
Saved profile in /home/gianarb/pprof/pprof.queryd.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
File: queryd
Build ID: 44dcff18a83acb2297e8fb8c209662cd7c3b27a5
type=heap
url=http://10.154.237.244:8093/debug/pprof
Type: inuse_space
Time: Jan 22, 2020 at 2:06am (CET)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) 

This gets you into the interactive pprof console from a profile_id.

But I would also like to have a utility to get merged profiles:

kubectl profefe pprof --service web --from 20h --to 12h --profile_type heap

It brings you into an interactive pprof console, but with the merged profile for the specified time range.

Return labels from get profiles

We use labels a lot, and it would be useful to return them with the list of profiles.
They give more context and make it possible to adjust the label filtering to select only what we really need.

[Security] Workflow test.yaml is using vulnerable action actions/checkout

The workflow test.yaml references the action actions/checkout at v1. However, this reference is missing commit a6747255bd19d7a757dbdda8c654a9f84db19839, which may contain a fix for a vulnerability.
The fix missing from this action version could be related to:
(1) a CVE fix
(2) an upgrade of a vulnerable dependency
(3) a fix to a secret leak, among others.
Please consider updating the reference to the action.

Make the parallelization factor dynamic

At the moment the number of goroutines kprofefe spins up is fixed at 10. We can do better.

First, we should treat GOMAXPROCS as a limit: no more goroutines than what it specifies.

The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.

Second, based on the number of pods to gather from, we can calculate the right number of goroutines, as sketched below.
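
A sketch of what the dynamic factor could look like: bound the number of workers by GOMAXPROCS and by the number of pods to scrape. Names here are illustrative, not kprofefe's actual code:

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// workers picks a parallelization factor: at most GOMAXPROCS, and no
// more workers than there are pods to profile.
func workers(pods int) int {
    n := runtime.GOMAXPROCS(0) // query the current limit without changing it
    if pods < n {
        n = pods
    }
    return n
}

func main() {
    pods := []string{"pod-a", "pod-b", "pod-c"} // hypothetical targets

    jobs := make(chan string)
    var wg sync.WaitGroup
    for i := 0; i < workers(len(pods)); i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for pod := range jobs {
                fmt.Println("profiling", pod) // placeholder for the actual scrape
            }
        }()
    }
    for _, p := range pods {
        jobs <- p
    }
    close(jobs)
    wg.Wait()
}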

continuous delivery to krew

krew is a plugin manager for kubectl. We support it, and kubectl-profefe can be installed via:

krew install profefe

They have an action to keep the index up to date with new releases.

We should integrate with it.

It does not exist! Write it!

kprofefe is a daemon that runs inside a Kubernetes cluster. It accepts query selectors and namespaces just as kubectl does, but it is designed to be run from cronjobs rather than by humans.

It does not need to do port forwarding because it is already inside the cluster; it will use the same annotations as the kubectl plugin:

  • profefe.com/enable=true to identify if the pod wants to be profiled
  • profefe.com/port and profefe.com/path to configure where the pprof handler is

Instead of using localhost as we do in the kubectl plugin, the address will be the pod id.

We do not need to write to disk; we just need a profefe URL.
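
A sketch of how the pod discovery could work with client-go from inside the cluster, using the annotations listed above. Treating "the pod id" as the pod's IP address is an assumption, and the rest is illustrative:

package main

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
)

func main() {
    // kprofefe runs inside the cluster, so no kubeconfig or port
    // forwarding is needed.
    config, err := rest.InClusterConfig()
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }

    // List pods in all namespaces and keep the ones that opted in.
    pods, err := client.CoreV1().Pods("").List(context.Background(), metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for _, pod := range pods.Items {
        ann := pod.Annotations
        if ann["profefe.com/enable"] != "true" {
            continue
        }
        // Assumes the pod is reached by its IP; port and path come from
        // the annotations described above.
        fmt.Printf("profiling target: http://%s:%s%s\n",
            pod.Status.PodIP, ann["profefe.com/port"], ann["profefe.com/path"])
    }
}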

Other things to do:

  • improve goreleaser to handle the new binary
  • improve goreleaser to build a docker image
  • write documentation about how to deploy it in kubernetes as a cronjob

get profiles by id

The command:

kubectl get profiles

should support a list of arguments; those arguments are profefe profile IDs.

Convert from and to in UTC

Currently, we do not convert the times passed as from and to into UTC. This is a bug; we need to convert them to UTC, as sketched below.
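
A minimal sketch of the fix: parse the user-supplied value and normalize it to UTC before querying profefe. The layout and input are illustrative:

package main

import (
    "fmt"
    "time"
)

func main() {
    from, err := time.Parse(time.RFC3339, "2020-01-22T02:06:00+01:00")
    if err != nil {
        panic(err)
    }
    // Normalize to UTC before building the query string.
    fmt.Println(from.UTC().Format("2006-01-02T15:04:05")) // 2020-01-22T01:06:00
}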

Update krew index

If I'm understanding krew correctly, we need to update the index to make the newer versions of the client available.

Off the back of the latest release, which includes auth updates, I think it would be a good idea to update. I did notice the version on krew is not in step with the version of kprofefe, so it might be worth updating that too.

Use profefe.com/service to set service name

At the moment profefe uses the pod name as the service name, but that is not always the right behavior: the pod name may change so frequently that filtering by service becomes unreliable.

We can use another annotation, profefe.com/service, to specify the service name for the profile. If not present, kube-profefe will keep using the pod name, as in the sketch below.
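
A sketch of the proposed fallback, assuming the pod object comes from client-go; the function name is illustrative:

package kubeprofefe

import corev1 "k8s.io/api/core/v1"

// serviceName prefers the profefe.com/service annotation and falls back
// to the pod name when the annotation is absent.
func serviceName(pod corev1.Pod) string {
    if s := pod.Annotations["profefe.com/service"]; s != "" {
        return s
    }
    return pod.Name
}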

Monitor kprofefe with opentelemetry

We should figure out whether it is useful to get aggregated metrics from kprofefe (probably yes; I would like to be notified if we stop collecting profiles for some reason).

OpenTelemetry is probably the way to go these days.

kubectl get profiles timerange is buggy

You can get the list of profiles for a particular service in a time range with the command:

kubectl profefe get profiles --service <service-name>

It supports --from and --to; each can be a duration (-30m, -2h) or a fixed time formatted in RFC3339.

This feature does not work as expected.
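
A sketch of how the flag parsing could work, accepting either a relative duration or an absolute RFC3339 time; names are illustrative:

package main

import (
    "fmt"
    "time"
)

// parseTimeFlag interprets a --from/--to value: first as a duration
// relative to now (e.g. -30m, -2h), then as an absolute RFC3339 time.
func parseTimeFlag(s string, now time.Time) (time.Time, error) {
    if d, err := time.ParseDuration(s); err == nil {
        return now.Add(d), nil // negative durations point into the past
    }
    return time.Parse(time.RFC3339, s)
}

func main() {
    now := time.Now()
    from, _ := parseTimeFlag("-30m", now)
    to, _ := parseTimeFlag(now.Format(time.RFC3339), now)
    fmt.Println(from.UTC(), to.UTC())
}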

Add support to query profiles

In short, we need a command get that accepts -l (label), --profile-type, --service, --from, and --to to get a list of profile links from profefe:

kubectl profefe get profiles -l app=gateway --profile-type=cpu --from 10m

--to defaults to now. I would like --to and --from to support both ranges and times, but I don't know how hard that is.

--service and --from are required.
