felixge / fgprof

🚀 fgprof is a sampling Go profiler that allows you to analyze On-CPU as well as Off-CPU (e.g. I/O) time together.

License: MIT License

Go 100.00%
golang go performance performance-analysis profiling profiling-library

fgprof's Introduction


🚀 fgprof - The Full Go Profiler

fgprof is a sampling Go profiler that allows you to analyze On-CPU as well as Off-CPU (e.g. I/O) time together.

Go's builtin sampling CPU profiler can only show On-CPU time, but it's better than fgprof at that. Go also includes tracing profilers that can analyze I/O, but they can't be combined with the CPU profiler.

fgprof is designed for analyzing applications with mixed I/O and CPU workloads. This kind of profiling is also known as wall-clock profiling.

⚠️ Please upgrade to Go 1.19 or newer. In older versions of Go, fgprof can cause significant stop-the-world (STW) latencies in applications with a lot of goroutines (> 1-10k). See CL 387415 for more details.

Quick Start

If this is the first time you've heard of fgprof, you should start by reading about The Problem & How it Works.

There is no need to choose between fgprof and the builtin profiler. Here is how to add both to your application:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof"

	"github.com/felixge/fgprof"
)

func main() {
	http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler())
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// <code to profile>
}

fgprof is compatible with the go tool pprof visualizer, so taking and analyzing a 3s profile is as simple as:

go tool pprof --http=:6061 http://localhost:6060/debug/fgprof?seconds=3

Additionally fgprof supports the plain text format used by Brendan Gregg's FlameGraph utility:

git clone https://github.com/brendangregg/FlameGraph
cd FlameGraph
curl -s 'localhost:6060/debug/fgprof?seconds=3&format=folded' > fgprof.folded
./flamegraph.pl fgprof.folded > fgprof.svg

Which tool you prefer is up to you, but one thing I like about Gregg's tool is that you can filter the plaintext files using grep, which can be very useful when analyzing large programs.

If you don't have a program to profile right now, you can go run ./example which should allow you to reproduce the graphs you see above. If you've never seen such graphs before, and are unsure how to read them, head over to Brendan Gregg's Flame Graph page.

The Problem

Let's say you've been tasked to optimize a simple program that has a loop calling out to three functions:

func main() {
	for {
		// Http request to a web service that might be slow.
		slowNetworkRequest()
		// Some heavy CPU computation.
		cpuIntensiveTask()
		// Poorly named function that you don't understand yet.
		weirdFunction()
	}
}

One way to decide which of these three functions you should focus your attention on would be to wrap each function call like this:

start := time.Now()
slowNetworkRequest()
fmt.Printf("slowNetworkRequest: %s\n", time.Since(start))
// ...

However, this can be very tedious for large programs. You'll also have to figure out how to average the numbers in case they fluctuate. And once you've done that, you'll have to repeat the process for the functions called by the function you decide to focus on.
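Concretely, you end up writing little helpers like the hypothetical one below just to get stable numbers for a single call site (this is illustrative code, not part of fgprof or the example program; the stand-in workload is made up):

package main

import (
	"fmt"
	"time"
)

// measure runs fn n times and prints the average duration, so you don't
// have to eyeball fluctuating one-off numbers.
func measure(name string, n int, fn func()) {
	var total time.Duration
	for i := 0; i < n; i++ {
		start := time.Now()
		fn()
		total += time.Since(start)
	}
	fmt.Printf("%s: avg %s over %d runs\n", name, total/time.Duration(n), n)
}

func main() {
	// Stand-in for slowNetworkRequest() etc. from the example program.
	measure("slowNetworkRequest", 5, func() { time.Sleep(60 * time.Millisecond) })
}

And you would still have to repeat this for every function one level deeper once you pick a target.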

/debug/pprof/profile

So, this seems like a perfect use case for a profiler. Let's try the /debug/pprof/profile endpoint of the builtin net/http/pprof pkg to analyze our program for 10s:

import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// <code to profile>
}

go tool pprof -http=:6061 http://localhost:6060/debug/pprof/profile?seconds=10

That was easy! Looks like we're spending all our time in cpuIntensiveTask(), so let's focus on that?

But before we get carried away, let's quickly double check this assumption by manually timing our function calls with time.Since() as described above:

slowNetworkRequest: 66.815041ms
cpuIntensiveTask: 30.000672ms
weirdFunction: 10.64764ms
slowNetworkRequest: 67.194516ms
cpuIntensiveTask: 30.000912ms
weirdFunction: 10.105371ms
// ...

Oh no, the builtin CPU profiler is misleading us! How is that possible? Well, it turns out the builtin profiler only shows On-CPU time. Time spent waiting on I/O is completely hidden from us.

/debug/pprof/trace

Let's try something else. The /debug/pprof/trace endpoint includes a "synchronization blocking profile", maybe that's what we need?

curl -so pprof.trace http://localhost:6060/debug/pprof/trace?seconds=10
go tool trace --pprof=sync pprof.trace > sync.pprof
go tool pprof --http=:6061 sync.pprof

Oh no, we're being misled again. This profiler thinks all our time is spent on slowNetworkRequest(). It's completely missing cpuIntensiveTask(). And what about weirdFunction()? It seems like no builtin profiler can see it?

/debug/fgprof

So what can we do? Let's try fgprof, which is designed to analyze mixed I/O and CPU workloads like the one we're dealing with here. We can easily add it alongside the builtin profilers.

import (
	"log"
	"net/http"
	_ "net/http/pprof"

	"github.com/felixge/fgprof"
)

func main() {
	http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler())
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// <code to profile>
}

go tool pprof --http=:6061 http://localhost:6060/debug/fgprof?seconds=10

Finally, a profile that shows all three of our functions and how much time we're spending on them. It also turns out our weirdFunction() was simply calling time.Sleep(), how weird indeed!

How it Works

fgprof

fgprof is implemented as a background goroutine that wakes up 99 times per second and calls runtime.GoroutineProfile. This returns the call stacks of all goroutines, regardless of their current On/Off-CPU scheduling status.

This data is used to maintain an in-memory stack counter which can be converted to the pprof or folded output format. The meat of the implementation is super simple, less than 100 lines of code, and you should check it out.
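To make that concrete, here is a heavily simplified, self-contained approximation of the idea. This is not fgprof's actual code; the real implementation handles buffer growth, pprof encoding, and start/stop differently:

package main

import (
	"fmt"
	"runtime"
	"strings"
	"time"
)

func main() {
	counts := map[string]int{}
	stacks := make([]runtime.StackRecord, 128)

	ticker := time.NewTicker(time.Second / 99)
	defer ticker.Stop()

	for i := 0; i < 99; i++ { // sample for roughly one second
		<-ticker.C
		n, ok := runtime.GoroutineProfile(stacks)
		if !ok {
			// Buffer too small: grow it and retry on the next tick.
			stacks = make([]runtime.StackRecord, n*2)
			continue
		}
		for _, rec := range stacks[:n] {
			frames := runtime.CallersFrames(rec.Stack())
			var fns []string
			for {
				f, more := frames.Next()
				fns = append(fns, f.Function)
				if !more {
					break
				}
			}
			// Reverse to root->leaf order, matching the folded stack format.
			for l, r := 0, len(fns)-1; l < r; l, r = l+1, r-1 {
				fns[l], fns[r] = fns[r], fns[l]
			}
			counts[strings.Join(fns, ";")]++
		}
	}

	// Print in Brendan Gregg's folded format: "stack count" per line.
	for stack, n := range counts {
		fmt.Println(stack, n)
	}
}

Since there are 99 samples per second, the per-stack counts translate directly back into wall-clock time, for goroutines that are On-CPU and Off-CPU alike.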

The overhead of fgprof increases with the number of active goroutines (including those waiting on I/O, Channels, Locks, etc.) executed by your program. If your program typically has less than 1000 active goroutines, you shouldn't have much to worry about. However, at 10k or more goroutines fgprof might start to cause some noticeable overhead.

Go's builtin CPU Profiler

The builtin Go CPU profiler uses the setitimer(2) system call to ask the operating system to send it a SIGPROF signal 100 times a second. Each signal stops the Go process and gets delivered to a random thread's sigtrampgo() function. This function then proceeds to call sigprof() or sigprofNonGo() to record the thread's current stack.

Since Go uses non-blocking I/O, Goroutines that wait on I/O are parked and not running on any threads. Therefore they end up being largely invisible to Go's builtin CPU profiler.

Known Issues

There is no perfect approach to profiling, and fgprof is no exception. Below is a list of known issues that will hopefully not be of practical concern for most users, but are important to highlight.

  • Internal C functions are not showing up in the stack traces, e.g. runtime.nanotime which is called by time.Since in the example program.
  • The current implementation is relying on the Go scheduler to schedule the internal goroutine at a fixed sample rate. Scheduler delays, especially biased ones, might cause inaccuracies.

Credits

The following articles helped me learn more about how profilers in general, and the Go profiler in particular, work.

License

fgprof is licensed under the MIT License.

fgprof's People

Contributors

brancz, cristaloleg, cuishuang, felixge, jaffee, nikolaydubina, sabandi


fgprof's Issues

fgprof skipping my functions, just showing useless graph

I used to use fgprof in the past, and it showed me wall-clock time well.

However, it is now "skipping" my functions, and I just see this:

[Screenshot from 2023-04-11 omitted]

This is useless. I know that there is a slow HTTP request, but I can't see which one because my functions are skipped.

pprof works on the same codebase.

I am using Go 1.19.7 and cannot move to 1.20. I also tried updating the indirect pprof dependency here, with no result.

A small performance optimization

Hey Felix,

I was playing with the runtime package/profilers a bit for learning, and while trying out different things I happened to see that the following code is used to retrieve all goroutines:

for {
	n, ok := runtime.GoroutineProfile(p.stacks)
	if !ok {
		p.stacks = make([]runtime.StackRecord, int(float64(n)*1.1))
	} else {
		return p.stacks[0:n]
	}
}

However, it turns out the following is a bit faster, as runtime.NumGoroutine() essentially just returns gcount() whereas GoroutineProfile does a bunch of other things.

p.stacks = make([]runtime.StackRecord, int(float64(runtime.NumGoroutine())*1.1))
for {
   ...
}

I benchmarked the following function:

func GoroutineProfile(useNumG bool) {
	var stacks []runtime.StackRecord

	if useNumG {
		nGoroutines := runtime.NumGoroutine()
		stacks = make([]runtime.StackRecord, int(float64(nGoroutines)*1.1))
	}

	for {
		n, ok := runtime.GoroutineProfile(stacks)
		if !ok {
			// this can happen regardless of our approach
			stacks = make([]runtime.StackRecord, int(float64(n)*1.1))
		} else {
			break
		}
	}
}

And the results are:

go test -bench=. -benchmem
goos: linux
goarch: amd64
BenchmarkGoroutineProfile/Using_runtime.NumGoroutine()-12                 111543              9944 ns/op             768 B/op          1 allocs/op
BenchmarkGoroutineProfile/Not_using_runtime.NumGoroutine()-12              66451             16140 ns/op             768 B/op          1 allocs/op

While not too significant, it is not bad for a single line of change.

Would you be interested in a PR for this? Is there any other problem with the approach?

PeriodType, PeriodUnit and Duration unset in resulting pprof

I just noticed that the period type, period unit, and duration are not set in the profiles created by fgprof. While not strictly required, I feel it would be good to set them. Duration can be measured, and I assume the period type should be cpu and the period unit should be nanoseconds?
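For context, these are fields of the pprof profile format that can be set with the github.com/google/pprof/profile package. The sketch below only illustrates which fields are meant; the concrete values are assumptions (following the suggestion above), not what fgprof currently writes:

package main

import (
	"os"
	"time"

	"github.com/google/pprof/profile"
)

func main() {
	p := &profile.Profile{
		SampleType: []*profile.ValueType{
			{Type: "samples", Unit: "count"},
			{Type: "time", Unit: "nanoseconds"},
		},
		// Period type/unit as suggested in the issue; illustrative only.
		PeriodType:    &profile.ValueType{Type: "cpu", Unit: "nanoseconds"},
		Period:        int64(time.Second / 99),  // one sample every ~10ms
		TimeNanos:     time.Now().UnixNano(),    // when profiling started
		DurationNanos: int64(3 * time.Second),   // length of the profiled window
	}
	// ... Sample, Location and Function entries would be filled in here ...

	_ = p.Write(os.Stdout) // emits the gzip-compressed protobuf encoding
}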

Please mention format GET parameter in the quick start guide

In my first test of fgprof I profiled my app by curling a remote host from another remote host, where neither has the Go tools installed, and then copied the artifact locally for analysis. Based on my reading of the quick start, I was expecting pprof format as the default. It took a bit of digging through the source to find guessFormat in handler.go to figure out why I was getting text rather than binary output. Your quick start guide is excellent, but a quick mention of the extra GET parameter somewhere on that first page would have saved me significant time.
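For anyone else who hits this: the handler picks an output format based on the request, and the format GET parameter (format=folded, as already shown in the Quick Start, or format=pprof) selects one explicitly. If you would rather skip HTTP entirely, fgprof's Start function can write a profile in a chosen format directly; here is a minimal sketch based on that API (the file name and duration are arbitrary choices for illustration):

package main

import (
	"log"
	"os"
	"time"

	"github.com/felixge/fgprof"
)

func main() {
	// Write a pprof-format profile to a file instead of serving it over HTTP.
	f, err := os.Create("fgprof.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	stop := fgprof.Start(f, fgprof.FormatPprof)
	time.Sleep(3 * time.Second) // <code to profile>
	if err := stop(); err != nil {
		log.Fatal(err)
	}
}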

"No source information" in pprof format

I added github.com/felixge/fgprof to one of the Prometheus unit tests, which gives me good charts but no source line info:

$ go tool pprof -list processWALSamples ./tsdb.test cpu2
Total: 59.55s
No source information for github.com/prometheus/prometheus/tsdb.(*walSubsetProcessor).processWALSamples

I can work on a repro if it's interesting; didn't know whether I should expect it to work.

I'm running go version go1.17.8 linux/amd64.

Don't support profile labels?

Thanks for providing such a great tool. It would be helpful if profile labels were supported, to help troubleshoot I/O problems.
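For context, "profile labels" refers to the runtime/pprof label API sketched below; whether and how fgprof could pick these labels up is exactly what this issue is asking about:

package main

import (
	"context"
	"log"
	"net/http"
	"runtime/pprof"
)

// Sketch of runtime/pprof labels: pprof.Do tags all work done inside the
// closure with key/value pairs, which the builtin CPU profiler records.
func handler(w http.ResponseWriter, r *http.Request) {
	labels := pprof.Labels("path", r.URL.Path)
	pprof.Do(r.Context(), labels, func(ctx context.Context) {
		doWork(ctx) // work attributed to the labels above
	})
}

func doWork(ctx context.Context) {
	// ... the I/O-heavy code you would like to see broken down by label ...
}

func main() {
	http.HandleFunc("/", handler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}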

Automatically include fgprof in /debug/index

There is an open issue on conprof (a continuous profiling project), discussing how there could be support for fgprof. It more generally started a discussion around custom/non-built-in profiles.

I haven't had a chance to dig into the fgprof code a lot yet, but looking at the bits and pieces involved, it appears that if profiles are created using the pprof.NewProfile function, then they are automatically included in profiles returned by pprof.Profiles(), which is what the /debug/pprof index page calls to retrieve the list of available profiles.

I'll try to understand the code more and see if this is possible, but maybe you can tell right off the top of your head, and, if it is possible, whether this is something you would accept a patch for? :)
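For reference, here is the registration mechanism in question in a minimal sketch. This is the standard runtime/pprof API, not an fgprof feature, and note that profiles created this way are count-based (Add/Remove), not sampled like fgprof:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // serves the /debug/pprof index via pprof.Profiles()
	"runtime/pprof"
)

// A custom profile registered under a made-up name for illustration.
var connProfile = pprof.NewProfile("example.connections")

type conn struct{}

func open() *conn {
	c := &conn{}
	connProfile.Add(c, 1) // records the caller's stack for this object
	return c
}

func (c *conn) close() {
	connProfile.Remove(c)
}

func main() {
	c := open()
	defer c.close()
	// The index at http://localhost:6060/debug/pprof/ now lists
	// "example.connections" next to goroutine, heap, etc.
	log.Println(http.ListenAndServe(":6060", nil))
}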

@felixge

Profile was collected within 120s, but Total samples is 29838s

Hello!

Could you help me understand this, please?

What could be the reason for it? Does this commonly happen with fgprof, or is it just my weird case?

What did I do:

  1. I added the module and code from README to my app source code:
	// Start profiling http endpoint if enabled
	if profileEnabled {
		http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler()) // I've added this line to the existing part of code
		go func() {
			logger.Infof("Starting profiling server with listenAddress = %s", profileListenAddress)
			if profileErr := http.ListenAndServe(profileListenAddress, nil); profileErr != nil {
				logger.Errorf("Error starting profiler: %s", profileErr)
			}
		}()
	}
  2. I ran my application, which consists of 5 different Docker containers.
  3. I ran the command go tool pprof http://localhost:6060/debug/pprof/goroutine?seconds=120
  4. I downloaded the profiling files from the host to my client machine.
  5. I ran the command go tool pprof -http :9098 pprof.samples.time.005.pb.gz and saw a weird "Total samples" time for this profile:
Type: time
Time: Sep 29, 2022 at 10:16am (MSK)
Duration: 120s, Total samples = 29838.05s (24864.86%)
Showing nodes accounting for 29838.05s, 100% of 29838.05s total
See https://git.io/JfYMW for how to read the graph

lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 57 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel Xeon Processor (Icelake)
Stepping: 0
CPU MHz: 1999.990
BogoMIPS: 3999.98
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64 KiB
L1i cache: 64 KiB
L2 cache: 8 MiB
L3 cache: 16 MiB
NUMA node0 CPU(s): 0-3
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear arch_capabilities

go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/ivanl/.cache/go-build"
GOENV="/home/ivanl/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/ivanl/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/ivanl/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18.4"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build2800466746=/tmp/go-build -gno-record-gcc-switches"

docker images' go-version: 1.16.7

what about memory?

Hi there! First of all, thank you for this good package; it helps me profile CPU better than having to run go tool pprof multiple times. I was just wondering if this will also support memory/heap profiling?

problem on AWS EC2

This package doesn't work on AWS EC2 for me. I use the gin web framework; when I send a request to my API, it just can't capture the call stack for that API.

Reduce risk of running fgprof in production

The goal for this issue is to figure out how to reduce the risk of running fgprof in production. People should feel safe turning it on, knowing that in the worst case it will just bail out or reduce the amount of data being gathered rather than having a negative impact on the application.

There has been some initial discussion here: #11 (comment)
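One direction this could take, purely as an illustration and not something fgprof implements today, is to check the goroutine count before each sample and bail out (or back off) when it crosses a threshold, since the cost of each runtime.GoroutineProfile call grows with the number of goroutines:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// Hypothetical safety valve, sketched for discussion only. The threshold
// and behavior (bail out vs. reduce the sample rate) are open questions.
const maxGoroutines = 10_000

func sampleLoop(d time.Duration) {
	ticker := time.NewTicker(time.Second / 99)
	defer ticker.Stop()
	deadline := time.Now().Add(d)

	for time.Now().Before(deadline) {
		<-ticker.C
		if n := runtime.NumGoroutine(); n > maxGoroutines {
			fmt.Printf("bailing out: %d goroutines exceeds limit\n", n)
			return
		}
		// ... take one runtime.GoroutineProfile sample as usual ...
	}
}

func main() {
	sampleLoop(time.Second)
}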
