As I mentioned on HN, I can run this on SKL, SKX and CNL (CannonLake) for you.

Results for other systems (asmjit/cult)

Comments (29)

kobalicek commented on July 19, 2024 1

It's kind of a pity I don't have Intel hardware anymore at the moment. I would even experiment with this a bit, but it's impossible to get it right the first time. Still, I will research this a bit.

Thinking about it, this will probably never be a 100% reliable tool, but if I can get it close enough I will be happy.

from cult.

kobalicek commented on July 19, 2024

Thanks! No arguments are necessary, but --output=somefile.json is a handy option.

BTW I have never tested this with AVX-512, I have no idea whether it would all work flawlessly, so fingers crossed :)

travisdowns commented on July 19, 2024

OK, I will run 2x with: --output=rounded.json and --output=raw.json --no-rounding and provide you those files.

FWIW on my SKL system I get some wrong results, like:

  bt r16, r16                           : Lat:  0.50 Rcp:  0.50
  bt r16, i8                            : Lat:  0.50 Rcp:  0.50
  bt r32, r32                           : Lat:  0.50 Rcp:  0.50
  bt r32, i8                            : Lat:  0.50 Rcp:  0.50
  bt r64, r64                           : Lat:  0.50 Rcp:  0.50
  bt r64, i8                            : Lat:  0.50 Rcp:  0.50

Where 0.5 lat is ... unlikely. I guess the problem is maybe CULT doesn't know that the first argument to bt is write only? That is, if you do bt eax, ecx you aren't testing latency (I don't know what asm is actually generated, it's just a guess).

kobalicek commented on July 19, 2024

Yeah, I think it's the opposite - bt reg, reg is read-only for registers, it only modifies the carry flag, so it's hard to make the asm that has dependencies without introducing other instructions in there. This is something I would like to fix in a future version.

BTW: you don't have to use --no-rounding, there is always small error that gets corrected by the rounding.

travisdowns commented on July 19, 2024

@kobalicek - oops, good point I forgot that bt is totally read only.

Here's another one I noticed:

 blendvpd xmm, xmm, xmm0               : Lat:  0.50 Rcp:  0.50

I also got a lot of 0.2 recip throughput results which should be wrong (max 4 ops/cycle), but it seemed to go away after I turned off turbo. Do I need to turn off turbo to get good results?

kobalicek commented on July 19, 2024

Yeah, I'm also getting 0.2 reciprocal throughput on some instructions on Ryzen, but apparently Ryzen is capable of executing 5 instructions per cycle if they are in the uop cache. If it says 0.2, it's probably true even on Intel, although it's possible that I miscalculate the cycles wasted in each loop iteration, which are currently set to 1 cycle. It's hard to say whether that could cause it to report 0.2 instead of 0.25 in such cases.
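
Whether a fixed overhead correction could turn 0.25 into 0.2 can be checked with a little arithmetic. The sketch below is only an illustration: the function name, the 20 instructions per iteration, and the formula are assumptions, not cult's actual code, and it assumes the loop counter and branch overlap completely with the measured instructions.

```python
def rcp_throughput(measured_cycles_per_iter, instrs_per_iter, loop_overhead_cycles):
    """Per-instruction reciprocal throughput after subtracting a fixed loop overhead."""
    return (measured_cycles_per_iter - loop_overhead_cycles) / instrs_per_iter


# Hypothetical loop body: 20 independent instructions on a 4-wide core,
# so one iteration really takes 20 / 4 = 5 cycles.
true_cycles_per_iter = 20 / 4

# No correction: the true figure.
print(rcp_throughput(true_cycles_per_iter, 20, loop_overhead_cycles=0))  # 0.25

# Subtracting 1 cycle of overhead that was never actually spent.
print(rcp_throughput(true_cycles_per_iter, 20, loop_overhead_cycles=1))  # 0.2
```

Under these assumptions an over-correction of one cycle is enough to report 0.2, so the overhead constant is worth ruling out alongside the timer.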

kobalicek commented on July 19, 2024

I have opened #7 to track the latency issue

travisdowns commented on July 19, 2024

> Yeah I'm also getting 0.2 reciprocal throughput on some instructions on Ryzen, but apparently Ryzen is capable of executing 5 instructions per cycle if they are in uop cache. However, if it says 0.2 it's probably true even on Intel although it's possible that I miscalculate the cycles wasted for each loop iteration, which is currently set to 1 cycle - hard to say whether that could cause reporting 0.2 instead of 0.25 in such cases.

Yes, on Ryzen that is expected.

It's definitely not 0.2 on Intel though. I've tested this stuff exhaustively down to the cycle using lots of different calibration and cycle measurement techniques, and I have never seen any case where you can do 5 ops/cycle.

As I mentioned it could be turbo effects - how are you doing the timing? Do you use a clock-based timing and then convert to cycles using a calibration based on a well-known timing, say a loop of dependent instructions?

travisdowns commented on July 19, 2024

My first CNL results look all wrong:

  add r8, r8                            : Lat:  0.66 Rcp:  0.20
  add r8, i8                            : Lat:  0.66 Rcp:  0.20
  add r16, r16                          : Lat:  0.66 Rcp:  0.20
  add r16, i16                          : Lat:  2.25 Rcp:  2.25
  add r32, r32                          : Lat:  0.66 Rcp:  0.20
  add r32, i32                          : Lat:  0.66 Rcp:  0.20
  add r64, r64                          : Lat:  0.66 Rcp:  0.20

I will try to turn off turbo.

Update: Looks OK with turbo off.

kobalicek commented on July 19, 2024

Hmm, I don't know how to fix this though. It seems the readings are incorrect in that case. It uses rdtscp when available; I followed the Intel manual here.

travisdowns commented on July 19, 2024

Yes, but rdtscp measures wall-clock time, not cycles. So it will always be wrong (in cycles) if the chip has turbo.
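
A small worked example of the effect, with made-up frequencies (3.2 GHz TSC, 4.0 GHz turbo) and a hypothetical helper name. It only shows the arithmetic; it is not how cult computes anything.

```python
def tsc_ticks_to_core_cycles(tsc_ticks, tsc_hz, core_hz):
    """Convert a TSC tick count into core clock cycles for a known core frequency."""
    seconds = tsc_ticks / tsc_hz
    return seconds * core_hz


TSC_HZ = 3.2e9   # assumed constant TSC rate (nominal frequency)
CORE_HZ = 4.0e9  # assumed turbo frequency while the benchmark runs

# An instruction whose true reciprocal throughput is 0.25 core cycles.
true_cycles = 0.25

# What a tool reports if it treats TSC ticks as cycles.
apparent = true_cycles * TSC_HZ / CORE_HZ
print(round(apparent, 3))  # 0.2

# Converting back with the real core frequency recovers the true figure.
print(round(tsc_ticks_to_core_cycles(apparent, TSC_HZ, CORE_HZ), 3))  # 0.25
```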

kobalicek commented on July 19, 2024

I think the manual I followed was written when the turbo didn't exist :) Do you have any suggestion about improving it? The logic is in basebench.cpp if you wanna see the current code.

travisdowns commented on July 19, 2024

The "fix" is either to force the user to turn off turbo; you can see how I do this programmatically here:

https://github.com/travisdowns/uarch-bench/blob/master/uarch-bench.sh#L66

Or to do a calibration that lets you convert from "nominal cycles" as read by rdtsc into CPU cycles; one way is shown here:

https://github.com/travisdowns/avx-turbo/blob/master/tsc-support.cpp

travisdowns commented on July 19, 2024

Yeah many moons ago, there was no frequency scaling (neither turbo nor anti-turbo, i.e., scaling below the nominal freq) so rdtsc and real cycles were always the same.

Then there was a brief period after Intel added frequency scaling where rdtsc still measured true CPU cycles, and thus no longer wall-clock time (that's the easy way to implement this counter in hardware, after all). But everyone hated that because rdtsc is mostly used for efficient gettimeofday or QueryPerformanceCounter and other calls which want real time, not some non-constant "cycles", so it was quickly changed to run in wall-clock time, and that's where we are today (that was like a decade ago though).

Turning off turbo is good because you get much more stable results, since you avoid the forced frequency switch when another core spins up (the current core then has to slow down because modern chips have turbo multipliers that depend on how many cores are running). But there are also a lot of problems: even figuring out how to turn off turbo on all systems is hard, the user has to be root, etc.

travisdowns commented on July 19, 2024

@kobalicek - my experience with uarch-bench indicates that the calibration approach is fairly robust. At most you sometimes get a wrong calibration due to a wrong assumption: e.g., when I ran on POWER9 I found out that dependent instructions always have a latency of at least 2, so the calculated frequency was half of the real frequency, but at least the error was obvious and you can correct it once you notice it.
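
The calibration idea, and the way a wrong latency assumption skews it, can be sketched as follows. The function names and numbers are hypothetical and the timing is simulated rather than measured; this is not the uarch-bench code.

```python
def calibrate_hz(chain_length, elapsed_seconds, assumed_latency_cycles=1):
    """Estimate core frequency from a timed chain of dependent instructions."""
    return chain_length * assumed_latency_cycles / elapsed_seconds


def seconds_to_cycles(elapsed_seconds, core_hz):
    """Convert a wall-clock benchmark time into core cycles."""
    return elapsed_seconds * core_hz


# Simulated machine: 3 GHz, where each dependent op really takes 2 cycles.
REAL_HZ = 3.0e9
CHAIN = 1_000_000
elapsed = CHAIN * 2 / REAL_HZ  # time the chain would take on that machine

wrong_hz = calibrate_hz(CHAIN, elapsed)     # assumes latency 1
right_hz = calibrate_hz(CHAIN, elapsed, 2)  # assumes latency 2

print(wrong_hz / REAL_HZ)  # about 0.5: the frequency comes out at half
print(right_hz / REAL_HZ)  # about 1.0
```

Every cycle count derived from `wrong_hz` is off by the same factor of two, which is why the error is easy to spot.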

Do you have AMD hardware, or something non-x86? I may be interested in some AMD numbers for some random microbenchmarks since I don't have easy access to AMD to test.

travisdowns commented on July 19, 2024

BTW, running now in parallel on SKX, SKL and CNL, results should be available in a few more minutes. FWIW here's the script I used which might be useful for anyone else who wants to automate this (heavily based on your README):

#!/bin/bash
set -e

ROOT_DIR=$HOME/dev/cult
mkdir -p $ROOT_DIR
cd $ROOT_DIR

# install new CMAKE version privately
CMAKE_INSTALLER=cmake-3.14.5-Linux-x86_64.sh
wget -N https://github.com/Kitware/CMake/releases/download/v3.14.5/$CMAKE_INSTALLER
chmod +x $CMAKE_INSTALLER
mkdir cmake
./$CMAKE_INSTALLER --exclude-subdir --skip-license --prefix=./cmake
CMAKE=$(readlink -e cmake/bin/cmake)

# Clone CULT and AsmJit (next-wip branch)
git clone --depth=1 https://github.com/asmjit/asmjit --branch next-wip
git clone --depth=1 https://github.com/asmjit/cult

# Create Build Directory
mkdir cult/build
cd cult/build

# Configure and Make
$CMAKE .. -DCMAKE_BUILD_TYPE=Release
make -j4

# Run CULT!
./cult --output=$1_rounded.json
./cult --output=$1_raw.json --no-rounding

echo "DONE"

You run it like ./do-cult.sh SKX and the output is SKX_rounded.json and SKX_raw.json.

kobalicek commented on July 19, 2024

Nice thanks!

I have reduced all my machines to only one, which is a Ryzen 1700 at the moment (but I'm planning an upgrade to 16c/32t at the end of the year). Other than that, I only have ARM devices like a Raspberry Pi for testing, though I'm interested in RISC-V.

kobalicek commented on July 19, 2024

BTW don't wanna waste more of your time on this. I would have to fix the timing issues if I want better numbers, as I really didn't know it could go that off initially.

travisdowns commented on July 19, 2024

> BTW don't wanna waste more of your time on this. I would have to fix the timing issues if I want better numbers, as I really didn't know it could go that off initially.

Don't worry, I turned off turbo and the numbers seem good.

travisdowns commented on July 19, 2024

SKL_rounded.txt

travisdowns commented on July 19, 2024

SKX_rounded.txt

travisdowns commented on July 19, 2024

CNL_rounded.txt

kobalicek commented on July 19, 2024

Thanks a lot! I have updated the web-app with the new data here: https://asmjit.com/asmgrid/ - The architectures look pretty similar to me. Selecting a few architectures and enabling "Hide equal cols" shows only the rows that differ, which is useful when looking at differences between microarchitectures.

I think I have some work to do here as I can see that AVX-512 instructions that use k and zmm registers are not executed, but that would take me some time as it's not that high priority to me at the moment. In addition, I would really want to have the timings calibrated so they are precise, so there is a lot to do now :)

travisdowns commented on July 19, 2024

I'll take a look at why they don't run.

kobalicek commented on July 19, 2024

No need, I have to iterate over instruction signatures instead of doing what I do at the moment, asmjit has now all the information I need in cult to do this properly.

travisdowns commented on July 19, 2024

Ah, OK! Ping me if it gets fixed and I can redo the runs.

kobalicek commented on July 19, 2024

@travisdowns

* I have fixed some issues regarding AVX-512 (now it properly tests all supported instructions with ZMM and K registers)
* I have fixed incorrect latency in some instructions that have different kinds of destination and source registers (like cvtsi2ss and friends)
* Also other issues I guess

There are still some things that aren't right. For example, it's hard to test the latency of cmp, test, bt, and similar instructions because the result is just flags. I will think of something, but it's a minority of instructions, so I don't think it's that severe.

I have also added get_tsc_freq(), heavily inspired by your implementation, but I still don't know how to properly use the value to calculate correct clock cycles in case of active turbo.

travisdowns commented on July 19, 2024

> * I have fixed some issues regarding AVX-512 (now it properly tests all supported instructions with ZMM and K registers)

Cool! Would you like me to run it on any systems? In addition to the ones above I now have access to Zen 2 and Ice Lake.

> (for example it's hard to test latency of cmp, test, bt, and such instructions as the result is just flags.

Right. Have you seen what uops.info does? They consider each instruction to have a matrix of latencies, one for each combination of input and output. For a typical instruction like add reg, reg there are 2 inputs and 2 outputs (the destination register and the flag output), so there are actually 2x2=4 different possible latencies.

Here's cmp, and they show the latency to the flag output (which is 1 from either input in this case, but other cases are more interesting).

This is how I think of instruction latency now, although admittedly it often does simplify to the "single figure" for many instructions with N register inputs and 1 register output and where the latency is the same for each input. Not all instructions fit that pattern though, particularly instructions with more than 1 uop.
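
One way to picture the latency matrix is a mapping from (input, output) pairs to cycles. This is a minimal sketch: the operand labels, the add values, and the `MULTI_UOP_EXAMPLE` entry are illustrative placeholders rather than uops.info data, and only the cmp values come from the comment above.

```python
# Latency in cycles for each (input operand, output operand) pair.
ADD_R_R = {
    ("dst", "dst"): 1,
    ("dst", "flags"): 1,
    ("src", "dst"): 1,
    ("src", "flags"): 1,
}

# cmp writes only flags: 1 cycle from either input, as described above.
CMP_R_R = {
    ("op1", "flags"): 1,
    ("op2", "flags"): 1,
}

# Hypothetical multi-uop instruction whose inputs have different latencies.
MULTI_UOP_EXAMPLE = {
    ("src1", "dst"): 1,
    ("src2", "dst"): 3,
}


def single_figure(matrix):
    """Return the one latency if every input/output pair agrees, else None."""
    values = set(matrix.values())
    return values.pop() if len(values) == 1 else None


print(single_figure(ADD_R_R))            # 1
print(single_figure(CMP_R_R))            # 1
print(single_figure(MULTI_UOP_EXAMPLE))  # None
```

A tool that reports one latency per instruction is effectively publishing `single_figure`, which has no good answer for the last case.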

> I have also added get_tsc_freq(), heavily inspired in your implementation, but I still don't know how to properly use the value to calculate correct clock cycles in case of active turbo.

The TSC frequency alone doesn't do that; it just lets you convert rdtsc values into time units. To measure true CPU cycles, there are several approaches. A reasonable one is a calibration like this one, which measures how long a loop with a known cycle count takes and uses that to convert between real time and CPU cycles. (Actually, that particular loop breaks on Ice Lake and Zen 2/3 because they can do 2 stores per cycle; an addition chain would be better.)

Then you run your benchmark and measure realtime and use the conversion factor to get cycles. Of course, this only works if the CPU frequency is the same during the calibration and the benchmark. That's not always the case. Approaches that are robust against that problem include:

* Use the cycles performance counter (I have some examples)
* Use the APERF and MPERF MSRs (I think you can only read these directly as root, but they are also available via the perf subsystem)
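
As a rough sketch of the APERF/MPERF option: over an interval in which the core stays busy, the ratio of the APERF delta to the MPERF delta approximates the ratio of the actual clock to the nominal clock, so it can scale a TSC measurement into core cycles. The function name and counter values below are made up for illustration; reading the real MSRs is not shown, and the sketch assumes the TSC ticks at the same nominal rate as MPERF.

```python
def core_cycles_from_tsc(tsc_delta, aperf_delta, mperf_delta):
    """Scale a TSC tick count by the APERF/MPERF ratio to estimate core cycles."""
    if mperf_delta <= 0:
        raise ValueError("MPERF delta must be positive")
    return tsc_delta * aperf_delta / mperf_delta


# Hypothetical counter deltas sampled before and after a benchmark:
# the core ran 25% faster than nominal (for example 4.0 GHz vs 3.2 GHz).
tsc_delta = 3_200_000
aperf_delta = 4_000_000
mperf_delta = 3_200_000

print(core_cycles_from_tsc(tsc_delta, aperf_delta, mperf_delta))  # 4000000.0
```

Because the ratio is sampled over the same interval as the benchmark, it stays valid even if the frequency changed after any earlier calibration.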

kobalicek commented on July 19, 2024

@travisdowns If you have time to run cult on any Intel hardware I would be interested in results. I have updated cult to test more stuff, also memory ops, etc... There are still instructions where latency is wrong (as write-only memory ops don't create a dependency, for example), but these are things I would fix in the future and that don't bother me much as you can clearly see in the results that the timings are impossible.

I have at the moment only Zen4 desktop and Tigerlake laptop, so any other arch would help me to improve asmgrid as I would have to delete all previous tables.
