Problem Definition
There are two potential issues related to GOMAXPROCS.
1. BadgerDB can be faster with higher GOMAXPROCS
BadgerDB and possibly other 3rd-party packages may benefit from increased GOMAXPROCS. BadgerDB random reads on NVMe storage on a 4-CPU system (BadgerDB TPS ≠ Flow TPS):
- 57501.74 TPS using GOMAXPROCS=4
- 104680.17 TPS using GOMAXPROCS=64 <-- nice speedup, but how does it compare to 8, 16, 24, or 32?
- 105601.16 TPS using GOMAXPROCS=128 <-- likely not worth the extra overhead on other parts of the system
We don't want to maximize BadgerDB TPS at the cost of Flow TPS, so the best GOMAXPROCS for us is probably going to be lower than what is suggested by BadgerDB in their docs (currently 128 in their Quickstart Guide).
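To answer the "8, 16, 24, or 32?" question cheaply before touching any node config, a rough sweep along these lines could be run against a Badger directory restored from real data. This is only a sketch, not the real Flow benchmark: the directory path, key scheme, key count, and Badger version are assumptions, and it is still a micro-benchmark, so it only complements (not replaces) the Flow TPS comparison proposed below.

```go
// gomaxprocs_sweep.go -- rough sketch, not the real Flow benchmark.
// Assumes a Badger directory with keys of the form "key-%08d" already
// exists (e.g. populated from a mainnet snapshot); adjust the import
// version, path, and key scheme to whatever we actually use.
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"sync/atomic"
	"time"

	badger "github.com/dgraph-io/badger/v3"
)

const (
	dir      = "/mnt/data/badger" // assumption: path to a real data directory
	numKeys  = 1_000_000          // assumption: number of keys in the dataset
	duration = 30 * time.Second
	workers  = 256 // plenty of goroutines; GOMAXPROCS is the variable under test
)

// measure runs random reads for a fixed duration at the given GOMAXPROCS
// and returns reads/sec.
func measure(db *badger.DB, maxprocs int) float64 {
	runtime.GOMAXPROCS(maxprocs)

	var ops int64
	var wg sync.WaitGroup
	stop := time.Now().Add(duration)

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed))
			for time.Now().Before(stop) {
				key := []byte(fmt.Sprintf("key-%08d", r.Intn(numKeys)))
				_ = db.View(func(txn *badger.Txn) error {
					item, err := txn.Get(key)
					if err != nil {
						return err
					}
					return item.Value(func([]byte) error { return nil })
				})
				atomic.AddInt64(&ops, 1)
			}
		}(int64(w))
	}
	wg.Wait()
	return float64(ops) / duration.Seconds()
}

func main() {
	db, err := badger.Open(badger.DefaultOptions(dir))
	if err != nil {
		panic(err)
	}
	defer db.Close()

	for _, n := range []int{4, 8, 16, 24, 32, 64, 128} {
		fmt.Printf("GOMAXPROCS=%-3d  %.2f reads/sec\n", n, measure(db, n))
	}
}
```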
2. 🪲 Entire Go program potentially stalling on mmap
Some articles claim an entire Go program can stall if enough goroutines get stuck accessing cold data regions of mmaped files:
What happens if GOMAXPROCS goroutines concurrently access cold data regions in mmaped file? Complete stall of the whole program until the OS resolves major page faults caused by these goroutines!
I haven't verified the article's claim about program stalls, and the Go runtime documentation says:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
Note that the quoted docs talk about system calls; a major page fault on an mmaped region is not a system call, so the faulting thread presumably still counts against GOMAXPROCS while the OS services the fault, which would be how the stall the article describes could happen. There are 26 go.sum files in onflow (across various projects) with "mmap-go". I think BadgerDB uses mmap without depending on edsrzf/mmap-go, so the mmap-go entries probably come from Prometheus or other dependencies.
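I have not tried to reproduce the claim, but a probe along these lines could tell us whether it holds on our kernels and Go version. The file path, GOMAXPROCS value, goroutine counts, and thresholds are arbitrary assumptions for illustration, and the page cache needs to be cold (e.g. `echo 3 | sudo tee /proc/sys/vm/drop_caches` beforehand) for the reads to cause major faults:

```go
// mmap_stall_probe.go -- Linux-only sketch to probe the article's claim; unverified.
// Idea: a small GOMAXPROCS, several goroutines touching cold mmaped pages,
// and one pure-Go "heartbeat" goroutine. If the whole program stalls on
// major page faults, the heartbeat gaps should grow far beyond 10ms.
package main

import (
	"fmt"
	"math/rand"
	"os"
	"runtime"
	"syscall"
	"time"
)

func main() {
	runtime.GOMAXPROCS(2) // kept small on purpose, to make starvation easy to see

	f, err := os.Open("/mnt/data/large-file") // assumption: any large file on NVMe
	if err != nil {
		panic(err)
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		panic(err)
	}
	size := int(fi.Size())

	data, err := syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	// Heartbeat: no I/O at all. Prints only when a scheduling gap looks suspicious.
	go func() {
		prev := time.Now()
		for {
			time.Sleep(10 * time.Millisecond)
			now := time.Now()
			if gap := now.Sub(prev); gap > 100*time.Millisecond {
				fmt.Printf("heartbeat gap: %v\n", gap)
			}
			prev = now
		}
	}()

	// Fault generators: touch random (hopefully cold) pages of the mapping.
	for i := 0; i < 8; i++ {
		go func(seed int64) {
			r := rand.New(rand.NewSource(seed))
			var sink byte
			for {
				sink ^= data[r.Intn(size)]
			}
		}(int64(i))
	}

	time.Sleep(60 * time.Second)
}
```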
Proposed Solution
- Determine if increasing GOMAXPROCS improves Flow TPS and other high-level metrics. By using high-level metrics, we can avoid the pitfalls of relying solely on micro-benchmarks and save time: we can compare Flow TPS after bumping up a single number before we explore more time-consuming changes related to this issue.
Rather than using a hard-coded GOMAXPROCS such as the 128 proposed by BadgerDB, we can use benchmark comparisons with real data from a mainnet snapshot to measure Flow TPS and other metrics.
Maybe on a 16-CPU server with non-shared NVMe storage, we can try GOMAXPROCS=32 and then adjust up or down depending on Flow TPS and other relevant metrics (see the startup sketch after this list). Unfortunately, BadgerDB ran their benchmarks on a 4-CPU system, and since the default on a 16-CPU server is already 16, we won't see the part of their gain that came from simply raising GOMAXPROCS up to 16.
- Determine if mmap-related stalls of entire Go programs are possible and whether that could occur in Flow (possibly starting from the probe sketch above). If the article's claims are confirmed to be valid with Go 1.16, then open a separate ticket to mitigate the risks.
In Go 1.9, GOMAXPROCS was capped at 1024; Go 1.10 removed that limit. If we look back far enough, we'll find ancient posts arguing about the risks of GOMAXPROCS being higher than 1 due to overhead. Let's keep an open mind and measure Flow TPS before debating overhead or proposing design changes to avoid that overhead.
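For the "bump a single number and re-measure" step, the change itself could be as small as the sketch below. The FLOW_GOMAXPROCS name and the logging are assumptions, not an existing flag; the Go runtime already honors a plain GOMAXPROCS environment variable, so a wrapper like this would mainly buy us logging and a place to clamp the value later.

```go
package main

import (
	"log"
	"os"
	"runtime"
	"strconv"
)

// applyGOMAXPROCS overrides GOMAXPROCS from an (assumed) FLOW_GOMAXPROCS
// environment variable, falling back to the Go default (runtime.NumCPU)
// when the variable is unset or invalid.
func applyGOMAXPROCS() {
	prev := runtime.GOMAXPROCS(0) // query current value without changing it
	if s := os.Getenv("FLOW_GOMAXPROCS"); s != "" {
		if n, err := strconv.Atoi(s); err == nil && n > 0 {
			runtime.GOMAXPROCS(n)
			log.Printf("GOMAXPROCS changed from %d to %d", prev, n)
			return
		}
		log.Printf("ignoring invalid FLOW_GOMAXPROCS=%q, keeping %d", s, prev)
	}
	log.Printf("GOMAXPROCS left at default %d (NumCPU=%d)", prev, runtime.NumCPU())
}

func main() {
	applyGOMAXPROCS()
	// ... start the node ...
}
```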
Definition of Done
More info
BadgerDB and GOMAXPROCS benchmarks on a 4-CPU machine
From: https://github.com/dgraph-io/badger-bench/blob/master/randread/maxprocs.txt
With fio
$ fio --name=randread --ioengine=libaio --iodepth=32 --rw=randread --bs=4k --direct=0 --size=2G --numjobs=16 --runtime=240 --group_reporting
Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
Average: xvda 0.09 4.57 0.71 59.20 0.00 0.00 0.00 0.00
Average: nvme0n1 118063.07 944503.71 0.86 8.00 12.75 0.11 0.01 100.36
With Go (default GOMAXPROCS, should be 4 because 4 core machine)
Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
Average: xvda 1.27 12.00 4.76 13.19 0.00 0.21 0.21 0.03
Average: nvme0n1 57501.74 548921.95 0.43 9.55 6.43 0.11 0.02 99.76
With Go, GOMAXPROCS=64
Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
Average: xvda 0.11 0.17 3.30 32.00 0.00 0.80 0.80 0.01
Average: nvme0n1 104680.17 981817.39 0.00 9.38 12.82 0.12 0.01 100.04
With Go, GOMAXPROCS=128
Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
Average: xvda 0.40 0.32 5.92 15.60 0.00 0.20 0.20 0.01
Average: nvme0n1 105601.16 989440.35 0.00 9.37 12.79 0.12 0.01 100.04
With GOMAXPROCS=32
Command being timed: "./randread --dir /mnt/data/fio --mode 1 --seconds 60"
User time (seconds): 23.34
System time (seconds): 100.91
Percent of CPU this job got: 207%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3820
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 9
Minor (reclaiming a frame) page faults: 416
Voluntary context switches: 2958129
Involuntary context switches: 2525
Swaps: 0
File system inputs: 59343840
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
With GOMAXPROCS=128
Command being timed: "./randread --dir /mnt/data/fio --mode 1 --seconds 60"
User time (seconds): 21.59
System time (seconds): 104.34
Percent of CPU this job got: 209%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:00.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3968
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 11
Minor (reclaiming a frame) page faults: 590
Voluntary context switches: 2956871
Involuntary context switches: 2591
Swaps: 0
File system inputs: 59264616
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
Caveats
I stumbled across this while helping with the design of the next storage architecture and haven't confirmed the data presented by third parties (such as the claim about mmap in Go programs).
I'm not the best person (at this time) to tackle this. If you're interested in tackling this issue, please feel free to assign yourself.
My apologies if the optimal settings for GOMAXPROCS on various nodes for common CPU counts were already investigated.
Updates
June 6, 2021 -- Update content to make it clearer that there are potentially two separate issues. Make it clearer that our goal is to optimize for Flow TPS (not just BadgerDB TPS). Mention Prometheus using mmap. Update title to replace "performance" with "Flow TPS".