rrze-hpc / likwid

Performance monitoring and benchmarking suite

Home Page: https://hpc.fau.de/research/tools/likwid/

License: GNU General Public License v3.0

Makefile 1.18% C 79.32% C++ 0.22% Perl 11.27% Fortran 0.18% Lua 4.22% Assembly 2.09% Python 1.14% Shell 0.11% CMake 0.11% Cuda 0.14%

likwid hardware-performance-counters threading x86 c lua hwloc linux pin assembly

likwid's Issues

Remove registers from group files

Why do I have to worry about registers in group files? Likwid could automatically allocate registers to measured events. It is also cumbersome to write the metrics with the register names instead of the events.

Idea:
Build a list of interesting metrics and let the user select metrics instead of groups. If too few registers are available, offer to multiplex the measurement.
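
A minimal sketch of what such automatic allocation could look like, assuming events are simply handed out to counter registers in round-robin order and any overflow goes into additional multiplex sets. The type `EventSlot` and the function `assign_events` are hypothetical and not part of LIKWID; real hardware also restricts which event can run on which counter, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slot for one event: which multiplex set it runs in and
 * which counter register it occupies within that set. */
typedef struct {
    size_t set;      /* index of the multiplex set */
    size_t counter;  /* index of the counter register inside the set */
} EventSlot;

/* Assign n_events events to n_counters registers in round-robin order.
 * Returns the number of multiplex sets needed (1 means no multiplexing). */
static size_t assign_events(size_t n_events, size_t n_counters, EventSlot *slots)
{
    assert(n_counters > 0);
    for (size_t i = 0; i < n_events; i++) {
        slots[i].set = i / n_counters;
        slots[i].counter = i % n_counters;
    }
    return (n_events + n_counters - 1) / n_counters;
}
```

With six events and four counters this yields two sets, so the tool would have to ask the user whether multiplexing is acceptable.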

likwid-topology -g crashes for cache sizes not divisible by 2^20

Thanks for the nice work on Likwid, I really like the tool.

On line 337, likwid-topology implicitly assumes that the LLC cache size can be divided by 2^20.

On Broadwell E5-2680 v4, the LLC is 35 MB, which doesn't cause an issue, unless the node is configured to have 4, rather than 2 NUMA domains. In that case, the division no longer results in an integer, and the format function fails since the format code is '%d'.

It is of course not a big deal, and few people will notice.
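
One possible way around the integer-format problem, sketched in C rather than in the Lua of likwid-topology: print whole megabytes only when the size divides evenly and fall back to a smaller unit otherwise. The function name `format_cache_size` is an assumption made for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Write a human-readable cache size into buf. The largest unit that
 * divides the size evenly is used, so an integer format is always valid.
 * Returns the value of snprintf (number of characters needed). */
static int format_cache_size(unsigned long long bytes, char *buf, size_t len)
{
    const unsigned long long mib = 1ULL << 20;
    assert(buf != NULL && len > 0);
    if (bytes % mib == 0)
        return snprintf(buf, len, "%llu MB", bytes / mib);
    if (bytes % 1024 == 0)
        return snprintf(buf, len, "%llu kB", bytes / 1024);
    return snprintf(buf, len, "%llu B", bytes);
}
```

A 35 MB cache (36700160 bytes) is printed in MB, while half of it (18350080 bytes, i.e. 17.5 MB) is printed as 17920 kB instead of failing.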

likwid-perfctr -e should separate events and counters clearly

At first, I didn't even notice that both are being output. (I only read that on the wikipage and then checked on my system.)

So, I just think that having no blank lines above/below the descriptions, such as

This architecture has 97 counters.
Counter tags(name, type<, options>)

has the effect of "hiding" these lines within the flood of events/counters.

In my case, some of the output looks like this:

[lots of lines here]
PBOX3, Physical Layer box, EDGEDETECT|THRESHOLD|INVERT
This architecture has 695 events.
Event tags (tag, id, umask, counters<, options>):
TEMP_CORE, 0x0, 0x0, TMP0
[lots of lines here]

Here, the first line is still a counter, then two lines of description and then the events are listed. It is almost impossible to see lines 3 and 4 if one doesn't know they are there.

Allow definitions of events during runtime

Allow users to define new events during runtime (either through arguments or in a config file), rather than recompiling likwid.
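
As a rough sketch of the config-file variant, assuming user-defined events were written in the same "tag, id, umask, counters" form that likwid-perfctr -e prints (for example "TEMP_CORE, 0x0, 0x0, TMP0"). The struct `EventDef` and the parser `parse_event_line` are hypothetical; LIKWID has no such interface according to this issue.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one user-defined event. */
typedef struct {
    char tag[64];        /* event name, e.g. "TEMP_CORE" */
    unsigned int id;     /* event ID */
    unsigned int umask;  /* unit mask */
    char counters[64];   /* counters the event may run on, e.g. "TMP0" */
} EventDef;

/* Parse one line of the form "TAG, 0xID, 0xUMASK, COUNTERS".
 * Returns 0 on success and -1 if the line is malformed. */
static int parse_event_line(const char *line, EventDef *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, " %63[^,], %x , %x , %63s",
                   out->tag, &out->id, &out->umask, out->counters);
    return n == 4 ? 0 : -1;
}
```

A line that lacks any of the four fields is rejected, so a broken config entry can be reported instead of silently ignored.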

Low power readings on Haswell-EP

I'm getting low power measurements -- 14W idle power, which is much lower than the minimum package power of 37W. Active power and DRAM power are similarly low.
Is likwid supported on Haswell-EP? What could be the reason for this behavior?

$ likwid-powermeter -s 0.5s 
--------------------------------------------------------------------------------
CPU name:   Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
CPU type:   Intel Xeon Haswell EN/EP/EX processor
CPU clock:  3.49 GHz
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Runtime: 0.500555 s
Measure for socket 0 on CPU 0
Domain PKG:
Energy consumed: 7.19739 Joules
Power consumed: 14.3788 Watt
Domain PP0:
Energy consumed: 0 Joules
Power consumed: 0 Watt
Domain DRAM:
Energy consumed: 0.364752 Joules
Power consumed: 0.728695 Watt
--------------------------------------------------------------------------------

$ likwid-powermeter -i
--------------------------------------------------------------------------------
CPU name:   Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz
CPU type:   Intel Xeon Haswell EN/EP/EX processor
CPU clock:  3.49 GHz
--------------------------------------------------------------------------------
Base clock: 3500.00 MHz
Minimal clock:  1200.00 MHz
Turbo Boost Steps:
C0 3600.00 MHz
C1 3600.00 MHz
C2 3600.00 MHz
C3 3600.00 MHz
--------------------------------------------------------------------------------
Info for RAPL domain PKG:
Thermal Spec Power: 140 Watt
Minimum Power: 37 Watt
Maximum Power: 140 Watt
Maximum Time Window: 39040 micro sec

Info for RAPL domain DRAM:
Thermal Spec Power: 8 Watt
Minimum Power: 2.375 Watt
Maximum Power: 8 Watt
Maximum Time Window: 18544 micro sec

Info about Uncore:
Minimal Uncore frequency: 1200 MHz
Maximal Uncore frequency: 3000 MHz

Performance energy bias: 7 (0=highest performance, 15 = lowest energy)

Permissions after make install not correct

Hi,

I run make install as root since I use the accessDaemon. After that, the permissions for any folder in $PREFIX/share/likwid/perfgroups are 0700 as well as for all the files in any subfolder.

This causes likwid-perfctr -a to return nothing at all (no error, no output, no error code), and it took me quite a while to figure that out.

I also propose here to add an error message if likwid-perfctr doesn't find anything.
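
The proposed error message could be as simple as the following sketch, which tests whether the calling user may list and enter the group directory before scanning it. The function name `check_group_dir` and the message text are assumptions; only the POSIX call `access()` is real.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return 0 if the directory can be listed and entered by the calling
 * user; otherwise print a diagnostic to stderr and return -1. */
static int check_group_dir(const char *path)
{
    assert(path != NULL);
    if (access(path, R_OK | X_OK) != 0) {
        fprintf(stderr, "Cannot read performance groups in %s: %s\n",
                path, strerror(errno));
        return -1;
    }
    return 0;
}
```

For a directory installed with mode 0700 by root, a normal user would then see "Permission denied" instead of empty output.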

How to handle realloc when it fails and returns NULL?

[../likwid-master/ext/hwloc/hwloc/topology-linux.c:1593]: (error) Common realloc mistake: 'maps' nulled but not freed upon failure
[../likwid-master/ext/hwloc/hwloc/topology-linux.c:2466]: (error) Common realloc mistake: 'ret' nulled but not freed upon failure
[../likwid-master/ext/hwloc/hwloc/topology-linux.c:3504]: (error) Common realloc mistake: 'Lprocs' nulled but not freed upon failure

If realloc cannot find enough space, it returns a null pointer, and leaves the previous region allocated. This solution from Stack Overflow may be helpful in fixing the error:

tmp = realloc(orig, newsize);
if (tmp == NULL)
{
    // could not realloc, but orig still valid
}
else
{
    orig = tmp;
}

NOTE: Similar errors exist in the following three files:

[../likwid-master/ext/hwloc/hwloc/topology-synthetic.c:911]: (error) Common realloc mistake: 'loops' nulled but not freed upon failure
[../likwid-master/ext/hwloc/hwloc/topology.c:290]: (error) Common realloc mistake: 'infos' nulled but not freed upon failure
[../likwid-master/ext/hwloc/hwloc/topology.c:320]: (error) Common realloc mistake: 'dst_infos' nulled but not freed upon failure

Found by https://github.com/bryongloden/cppcheck
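
The temporary-pointer pattern shown above can be wrapped in a small helper so that every call site handles failure the same way. This is only a sketch; `resize_buffer` is a hypothetical name and not a function from hwloc or LIKWID.

```c
#include <assert.h>
#include <stdlib.h>

/* Resize the block *ptr to new_size bytes.
 * On success *ptr points to the resized block and 0 is returned.
 * On failure *ptr is left untouched (the old block is still valid and
 * still owned by the caller) and -1 is returned. */
static int resize_buffer(void **ptr, size_t new_size)
{
    assert(ptr != NULL && new_size > 0);
    void *tmp = realloc(*ptr, new_size);
    if (tmp == NULL)
        return -1;
    *ptr = tmp;
    return 0;
}
```

The caller can then decide whether to free the old block and bail out or to continue with the smaller buffer.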

Validate Phi events

sursri wrote:
I was surprised to see L2_DATA_READ_MISS_MEM_FILL and L2_DATA_READ_MISS_CACHE_FILL reported as 0 for a few programs when measured in PMC1, so I made a comparison with VTune and suspect that there is a problem with likwid-perfctr.

Later, when I ran an experiment with only one event on PMC0, I got some small values. I don't see any document mentioning that it can be measured only with PMC0. There is also a huge difference in L2_DATA_WRITE_MISS_MEM_FILL and L2_DATA_READ_MISS_CACHE_FILL compared to VTune. Can you help me with this?

What steps will reproduce the problem?

I ran an array copying program using both VTUNE and likwid.

What is the expected output? What do you see instead?

Event                           likwid        VTune
DATA_READ                       200292452     216400000
DATA_WRITE                      130125076     380400000
BANK_CONFLICTS                  5197          100000
BRANCHES                        110225835     123100000
INSTRUCTIONS_EXECUTED           1291182962    1370940000
DATA_READ_OR_WRITE              330451293     609500000
DATA_READ_MISS_OR_WRITE_MISS    42962603      74130000
L2_DATA_READ_MISS_CACHE_FILL    4389          120000
L2_DATA_WRITE_MISS_CACHE_FILL   129949990     129870000
L2_DATA_READ_MISS_MEM_FILL      200038669     199920000
L2_DATA_WRITE_MISS_MEM_FILL     3876          30140000

What version of the product are you using?
likwid-perfctr 3.0

Please provide any additional information below.
I ran it for the add program of the STREAM benchmark with just 1 thread.

[email protected]:

Hi, the event ID and umask I use for these events are according to the documentation. In the
documentation it says that FUB is CRI :-). No idea what this means, they do not introduce those
terms. The only way to say who is right is to compare against a microbenchmark where you know
the result. I plan to do this for Phi also. I find it suspicious that the vtune results are all flat to the
fifth digit. Are those end to end measurements?

sursri:

I am running the same program pinned to different cores on the Xeon Phi and
measuring the same event, and the values differ between
cores. Please check out the results of multiple runs.

~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 58 -O  /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
Event,core 58
L2_READ_HIT_M,10761.000000

~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 58 -O  /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
Event,core 58
L2_READ_HIT_M,11010.000000

~/perf_anal $ ./work_1.sh
~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 40 -O  /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
Event,core 40
L2_READ_HIT_M,0.000000

~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 10 -O  /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
Event,core 10
L2_READ_HIT_M,10768.000000

~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 40 -O  /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
Event,core 40
L2_READ_HIT_M,0.000000

~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 54 -O -m /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
=====================
Region: Compute
=====================
Region Info,core 54
RDTSC Runtime [s],0.021591
call count,1.000000
Event,core 54
L2_READ_HIT_M,158.000000

~/perf_anal $ /home/snataraj/perf_anal/likwid/likwid-perfctr -g L2_READ_HIT_M:PMC0 -C 4 -O  /home/snataraj/perf_anal/copy
-------------------------------------------------------------
-------------------------------------------------------------
CPU type:       Intel Xeon Phi Coprocessor
CPU clock:      1.05 GHz
-------------------------------------------------------------
/home/snataraj/perf_anal/copy
K=1048577
Status: 0x0
Event,core 4
L2_READ_HIT_M,524291.000000

sursri:
Reply from intel forum:

FUB must stand for something like "Functional Unit Block" because P54C
refers to the processor core (P54C is a specific version of the Pentium
core, though the actual core in the Xeon Phi has been heavily upgraded
from the original P54C), CRI refers to the "Cache-Ring-Interface", and
VPU refers to the "Vector-Processing-Unit". In the Xeon Phi performance
counters, the UMASK field actually specifies the functional unit for
which the event is requested, with 0x00 referring to the core (P54C),
0x10 referring to the CRI, and 0x20 referring to the VPU. This was
confusing to me at first because on other processors the UMASK
is almost always used to modify the specific details of what is measured
by an Event Select code, rather than actually specifying the Unit on
which the measurements should occur. On Xeon Phi the UMASK field is
used in a way that is much closer to what you would expect a "Unit Mask"
to mean -- it specifies the "Unit" for which you want the measurements
to be taken.
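
The mapping described in the forum reply can be written down as a small lookup. The values 0x00, 0x10 and 0x20 are taken from the text above; the function name `phi_unit_from_umask` and the returned strings are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Map a Xeon Phi UMASK value to the functional unit it selects,
 * following the description quoted above. Other values are reported
 * as "unknown" because the text does not cover them. */
static const char *phi_unit_from_umask(uint8_t umask)
{
    switch (umask) {
    case 0x00: return "P54C core";
    case 0x10: return "CRI";
    case 0x20: return "VPU";
    default:   return "unknown";
    }
}
```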

API: Declare parameters const where applicable

Hi Thomas,

I'm currently working on embedding Likwid into a PDE framework (written in C++) to enable profiling of the individual compute kernels. Following common design guidelines most of the output functions of the generic profiler interface provided by the framework are declared "const".

In the implementation of these const methods in my Likwid-wrapping profiler I need to call methods from the Likwid API, e.g. double power_printEnergy(PowerData* data). The problem is that the PowerData instance in my case is a member of the profiler, and inside a const method I can only provide a const PowerData*, not a PowerData* (leaving aside const_cast as a dirty solution).

Considering the implementation of power_printEnergy which clearly doesn't modify its arguments (one could even say it would be unexpected if it did) and some other similar examples in the Likwid API, I wanted to ask if it would be OK to explicitly declare these arguments constant in some future version of Likwid.

Please let me know what you think. Thanks.

(I can send a PR for the obvious examples that I encountered if that helps, but wanted to check with you first.)

Correctly support sharing uncore among multiple processes in generic read functions

When running multiple single-thread MPI processes on a single node, all processes get non-zero uncore event values (for example, memory read/write events). However, when running a single MPI process with multiple threads, only NSOCKETS threads get non-zero uncore event values. Is it possible to support the same behavior in the multiple-process case? This is in high demand for single-threaded MPI applications.

‘dev’ undeclared

just compiled master to check on new features, but received this upon a call to "make":

$ make
===>  COMPILE  GCC/perfmon.o
In file included from ./src/perfmon.c:60:0:
./src/includes/perfmon_haswell.h: In function ‘perfmon_setupCounterThread_haswell’:
./src/includes/perfmon_haswell.h:1002:43: error: ‘dev’ undeclared (first use in this function)
                  if (haveLock && HPMcheck(dev, cpu_id))
                                           ^
./src/includes/perfmon_haswell.h:1002:43: note: each undeclared identifier is reported only once for each function it appears in
Makefile:174: recipe for target 'GCC/perfmon.o' failed
make: *** [GCC/perfmon.o] Error 1

I am using 12dd580 on fc21 with the default gcc. It looks to me as if something along the lines of

PciDeviceIndex dev = counter_map[index].device;

is missing here: https://github.com/rrze-likwid/likwid/blob/master/src/includes/perfmon_haswell.h#L917

Honor proposed Intel Hardware performance counter usage policy

Problem:

Likwid overwrites anything in the counter registers without asking. Intel proposed to allow using multiple tools without interference.

Solution:

Honor the solution of the Intel whitepaper. Provide an overwrite mode.
Make init and setup of counters specific to the thread group and performance group. (Only overwrite registers that are needed.)

[email protected]:

Intel's Performance Monitoring Unit Sharing Guide is attached.
http://software.intel.com/file/30388

[email protected]:

The paper is outdated. Further counters such as Uncore (PCI-based) and RAPL are not addressed.
Therefore this issue is delayed.

[email protected]:

I really would like to see some method that checks whether the counters are already in use - Intel
PCM reports within its init phase if the counters are not zeroed and ask to enforce a clearing.
But there is also a conflict with tools like PAPI that rely on the Linux kernel perf_event interface:
these tools do not set the MSRs directly and do not clean up at the end. So monitoring a node as
root using likwid (or Intel PCM) might always fail if an application used the perf_event interface.

[email protected]:

The current trunk version checks the counter at the beginning. It skips counters where the
control register is not empty. In general we wanted to avoid additional register reads at the
beginning as we have seen long access times for the MSRs in particular situations. But these
checks are needed so that we can use common control files for e.g. SandyBridge and
SandyBridge EP.
In the end, LIKWID zeros the control registers but leaves the counter registers untouched.

[email protected]:

This is not so easy to realize. Currently Likwid checks if registers are readable and writable but
does not fail when config registers are already in use. There is no flag like "--force".
We will consider it for Likwid 4.1.
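
A sketch of the policy discussed in this thread: a counter is treated as free only if its control register reads zero, unless the user explicitly forces an overwrite. The function `select_free_counters` and the `force` parameter are hypothetical (the thread notes that no such flag exists), and the register values are passed in as a plain array instead of being read from MSRs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ctrl[i] holds the current value of the control register of counter i.
 * usable[i] is set to true if the counter may be programmed: either its
 * control register is empty or the caller asked to overwrite it.
 * Returns the number of usable counters. */
static size_t select_free_counters(const uint64_t *ctrl, size_t n,
                                   bool force, bool *usable)
{
    assert(ctrl != NULL && usable != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        usable[i] = force || ctrl[i] == 0;
        if (usable[i])
            count++;
    }
    return count;
}
```

Without `force`, a counter left configured by another tool is skipped; with it, every counter is reported as usable and the caller takes responsibility for the conflict.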

Cannot build with --jobs set to more than 1

Trying to build likwid using more than one "make" job will cause it to fail.

Example (replace 6 with any number higher than 1):

$ make -j6
===>  GENERATE HEADER GCC/perfmon_westmere_events.h
===>  GENERATE HEADER GCC/perfmon_k8_events.h
===>  GENERATE HEADER GCC/perfmon_sandybridge_events.h
===>  GENERATE HEADER GCC/perfmon_nehalem_events.h
===>  GENERATE HEADER GCC/perfmon_silvermont_events.h
===>  GENERATE HEADER GCC/perfmon_core2_events.h
===>  GENERATE HEADER GCC/perfmon_ivybridge_events.h
===>  GENERATE HEADER GCC/perfmon_p6_events.h
===>  GENERATE HEADER GCC/perfmon_haswellEP_events.h
===>  GENERATE HEADER GCC/perfmon_pm_events.h
===>  GENERATE HEADER GCC/perfmon_kabini_events.h
===>  GENERATE HEADER GCC/perfmon_interlagos_events.h
===>  GENERATE HEADER GCC/perfmon_broadwell_events.h
===>  GENERATE HEADER GCC/perfmon_atom_events.h
===>  GENERATE HEADER GCC/perfmon_sandybridgeEP_events.h
===>  GENERATE HEADER GCC/perfmon_westmereEX_events.h
===>  GENERATE HEADER GCC/perfmon_phi_events.h
===>  GENERATE HEADER GCC/perfmon_ivybridgeEP_events.h
===>  GENERATE HEADER GCC/perfmon_k10_events.h
===>  GENERATE HEADER GCC/perfmon_nehalemEX_events.h
===>  GENERATE HEADER GCC/perfmon_haswell_events.h
===>  COMPILE  GCC/configuration.o
===>  COMPILE  GCC/numa.o
===>  COMPILE  GCC/pci_proc.o
===>  COMPILE  GCC/perfmon.o
===>  COMPILE  GCC/numa_hwloc.o
===>  COMPILE  GCC/topology_proc.o
===>  COMPILE  GCC/hashTable.o
===>  COMPILE  GCC/pci.o
===>  COMPILE  GCC/luawid.o
===>  COMPILE  GCC/cpuFeatures.o
===>  COMPILE  GCC/pci_hwloc.o
===>  COMPILE  GCC/power.o
===>  COMPILE  GCC/topology.o
===>  COMPILE  GCC/thermal.o
===>  COMPILE  GCC/memsweep.o
===>  COMPILE  GCC/topology_hwloc.o
===>  COMPILE  GCC/bitUtil.o
===>  COMPILE  GCC/topology_cpuid.o
===>  COMPILE  GCC/ghash.o
===>  COMPILE  GCC/affinity.o
===>  COMPILE  GCC/accessClient.o
===>  COMPILE  GCC/libperfctr.o
===>  COMPILE  GCC/timer.o
===>  COMPILE  GCC/access.o
===>  COMPILE  GCC/msr.o
===>  COMPILE  GCC/bstrlib.o
===>  COMPILE  GCC/numa_proc.o
===>  COMPILE  GCC/tree.o
===>  COMPILE  GCC/loadData.o
===>  ENTER  ext/hwloc
===>  ENTER  ext/lua
gcc: error: ./GCC/lvm.o: No such file or directory
gcc: error: ./GCC/lapi.o: No such file or directory
gcc: error: ./GCC/lgc.o: No such file or directory
gcc: error: ./GCC/lcode.o: No such file or directory
gcc: error: ./GCC/ldump.o: No such file or directory
Makefile:49: recipe for target 'liblikwid-lua.so' failed
make[1]: *** [liblikwid-lua.so] Error 1
make[1]: *** Waiting for unfinished jobs....
Makefile:161: recipe for target 'ext/lua/liblikwid-lua.so' failed
make: *** [ext/lua/liblikwid-lua.so] Error 2
$

Now try again with only one job:

$ make -j1
===>  ENTER  ext/hwloc
===>  ENTER  ext/lua
===>  CREATE SHARED LIB  liblikwid.so
===>  CREATE LIB  liblikwidpin.so
===>  ADJUSTING  likwid-perfctr
===>  ADJUSTING  likwid-pin
===>  ADJUSTING  likwid-powermeter
===>  ADJUSTING  likwid-topology
===>  ADJUSTING  likwid-memsweeper
===>  ADJUSTING  likwid-agent
===>  ADJUSTING  likwid-mpirun
===>  ADJUSTING  likwid-perfscope
===>  ADJUSTING  likwid-genTopoCfg
===>  ADJUSTING  likwid-setFrequencies
===>  ADJUSTING  likwid.lua
===>  BUILD access daemon likwid-accessD
===>  BUILD frequency daemon likwid-setFreq
===>  ENTER  bench
===>  COMPILE C GCC/strUtil.o
===>  COMPILE C GCC/threads.o
===>  COMPILE C GCC/barrier.o
===>  COMPILE C GCC/allocator.o
===>  COMPILE C GCC/bstrlib.o
===>  COMPILE C GCC/bench.o
===>  GENERATE BENCHMARKS
===>  ASSEMBLE  GCC/triad_avx.o
===>  ASSEMBLE  GCC/store_sse.o
===>  ASSEMBLE  GCC/striad_plain.o
===>  ASSEMBLE  GCC/sum_plain.o
===>  ASSEMBLE  GCC/vtriad_avx.o
===>  ASSEMBLE  GCC/copy_mem.o
===>  ASSEMBLE  GCC/copy_mem_avx.o
===>  ASSEMBLE  GCC/clstore.o
===>  ASSEMBLE  GCC/stream.o
===>  ASSEMBLE  GCC/stream_avx.o
===>  ASSEMBLE  GCC/update_plain.o
===>  ASSEMBLE  GCC/vtriad_sse.o
===>  ASSEMBLE  GCC/striad_mem_avx.o
===>  ASSEMBLE  GCC/copy_mem_sse.o
===>  ASSEMBLE  GCC/striad_avx.o
===>  ASSEMBLE  GCC/striad_mem_sse.o
===>  ASSEMBLE  GCC/copy.o
===>  ASSEMBLE  GCC/load.o
===>  ASSEMBLE  GCC/copy_avx.o
===>  ASSEMBLE  GCC/load_avx.o
===>  ASSEMBLE  GCC/store_plain.o
===>  ASSEMBLE  GCC/sum.o
===>  ASSEMBLE  GCC/load_sse.o
===>  ASSEMBLE  GCC/sum_avx.o
===>  ASSEMBLE  GCC/triad_mem.o
===>  ASSEMBLE  GCC/sum_sse.o
===>  ASSEMBLE  GCC/clcopy.o
===>  ASSEMBLE  GCC/update.o
===>  ASSEMBLE  GCC/update_avx.o
===>  ASSEMBLE  GCC/store_mem_avx.o
===>  ASSEMBLE  GCC/vtriad_plain.o
===>  ASSEMBLE  GCC/copy_sse.o
===>  ASSEMBLE  GCC/update_sse.o
===>  ASSEMBLE  GCC/store_mem_sse.o
===>  ASSEMBLE  GCC/store_mem.o
===>  ASSEMBLE  GCC/store_avx.o
===>  ASSEMBLE  GCC/triad.o
===>  ASSEMBLE  GCC/vtriad_mem_avx.o
===>  ASSEMBLE  GCC/clload.o
===>  ASSEMBLE  GCC/triad_split.o
===>  ASSEMBLE  GCC/store.o
===>  ASSEMBLE  GCC/striad_sse.o
===>  ASSEMBLE  GCC/stream_mem.o
===>  ASSEMBLE  GCC/copy_plain.o
===>  ASSEMBLE  GCC/load_plain.o
===>  ASSEMBLE  GCC/vtriad_mem_sse.o
===>  LINKING  likwid-bench
$

Provide Java API for markers

Actually, I did it with JNI and it looks like it is working fine. Would you guys consider including it if I created a pull request?

[Question] Monitoring the disk activities

Hello,

First of all, I would like to thank you for developing this tool, it's very helpful for my work.
I have one question regarding the hardware counters available in the Haswell EP architecture.

Are there any counters available to monitor disk activity?
I have some insights by using several counters related to uncore or memory but is it the right way to achieve such monitoring?

Thanks for your help.

Review overflow handling for RAPL counters

In some cases the overflow recognition for the RAPL counters does not work. See https://groups.google.com/forum/#!topic/likwid-users/z2CYJGmqOv8
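
For reference, a common way to handle a wrapping energy counter is to take the difference in unsigned 32-bit arithmetic. The sketch below assumes the 32-bit energy status field and the 1/2^ESU joule unit that Intel documents for RAPL; the function names are hypothetical, and the approach only works if at most one wrap occurs between two readings.

```c
#include <assert.h>
#include <stdint.h>

/* Difference between two raw readings of a 32-bit wrapping counter.
 * Unsigned arithmetic is modulo 2^32, so a single wrap between the
 * readings is handled without a special case. */
static uint32_t counter_delta32(uint32_t before, uint32_t after)
{
    return after - before;
}

/* Convert raw energy units to joules; one unit is 1 / 2^energy_unit_bits J. */
static double raw_to_joules(uint32_t raw, unsigned energy_unit_bits)
{
    assert(energy_unit_bits < 32);
    return (double)raw / (double)(1u << energy_unit_bits);
}
```

If the measurement interval is long enough for the counter to wrap more than once, the result is silently too small, so the counter has to be polled often enough.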

Using all 8 Performance Counters When Hyperthreading is Off?

Intel's docs say that for Sandy Bridge onward, all 8 PMC's are available for the single active thread if hyperthreading is disabled. See Tables 18-30, 18-42, and 18-53 all entitled "Core PMU Comparison": http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

Does Likwid allow for use of more than 4 PMC's at a time?
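
How many general-purpose counters a logical processor exposes can be read from CPUID leaf 0AH. The sketch below only decodes an EAX value according to the field layout in Intel's manual (bits 7:0 version, 15:8 counters per logical processor, 23:16 counter width); the struct and function names are hypothetical, and the sample value in the note is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of CPUID.0AH:EAX relevant to the core PMU. */
typedef struct {
    unsigned version;        /* architectural perfmon version */
    unsigned num_counters;   /* general-purpose counters per logical CPU */
    unsigned counter_width;  /* bit width of those counters */
} PmuInfo;

/* Decode an EAX value obtained from CPUID leaf 0AH. */
static PmuInfo decode_cpuid_0a_eax(uint32_t eax)
{
    PmuInfo info;
    info.version       = eax & 0xFFu;
    info.num_counters  = (eax >> 8) & 0xFFu;
    info.counter_width = (eax >> 16) & 0xFFu;
    return info;
}
```

An EAX value of 0x07300803 would decode to version 3 with 8 counters of 48 bits each.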

Compilation fails with ICC on KNL (patch)

This might be a local configuration issue, but I had trouble compiling Likwid on KNL. The Intel specific libraries weren't being found when compiling the lua library. The solution was adding an RPATH so that liblikwid-lua.so was able to find libimf.so and friends.

I also removed the '-vec-report=0' flag from the command line since it produced lots of "deprecated" messages. I switched from -O1 to -Ofast since there didn't seem to be any reason not to. I dropped -Wno-format as it didn't seem to make a difference. These changes are optional, while the RPATH fix is required.

I have not yet tested this on non-KNL using ICC, but don't see much chance of it causing breakage.

diff --git a/make/include_ICC.mk b/make/include_ICC.mk
index 9dfe66b..06dd39b 100644
--- a/make/include_ICC.mk
+++ b/make/include_ICC.mk
@@ -9,7 +9,7 @@ GEN_PMHEADER = ./perl/gen_events.pl

 ANSI_CFLAGS  = -std=c99 #-strict-ansi

-CFLAGS   =  -O1 -Wno-format -vec-report=0 -fPIC -pthread
+CFLAGS   =  -Ofast -fPIC -pthread
 FCFLAGS  = -module ./
 ASFLAGS  = -gdwarf-2
 PASFLAGS  = x86-64
@@ -25,4 +25,8 @@ DEFINES  += -DPAGE_ALIGNMENT=4096
 INCLUDES =
 LIBS     = -lrt

-
+# colon separated list of paths to search for libs
+ICC_LIB_RPATHS = /opt/intel/compilers_and_libraries/linux/lib/intel64_lin
+ifneq ($(strip $(ICC_LIB_RPATHS)),)
+RPATHS += -Wl,-rpath=$(ICC_LIB_RPATHS)
+endif

I've barely tested, but with this tiny fix, it seems to be working for me on KNL!

Integer Benchmarks using likwid-bench

According to the likwid-bench wiki page, TYPE INT is supported, but the following bench code does not compile for me:

STREAMS 1
TYPE INT
FLOPS 0
BYTES 4
DESC Integer load, only 32-bit scalar operations
LOADS 1
STORES 0
LOOP 1
movl ISCALAR, [STR0 + GPR1 * 4]

with this error message:

===> ASSEMBLE GCC/load_int32.o
./GCC/load_int32.s: Assembler messages:
./GCC/load_int32.s:32: Error: no such instruction: `type INT'
./GCC/load_int32.s:36: Error: no such instruction: `movl ISCALAR,[rsi+rax * 4]'
make[1]: *** [GCC/load_int32.o] Error 1

Thanks and cheers from ATX.

Failed Build for MIC

Hello,
I was trying to build likwid for the Intel Xeon Phi, but although I followed the indicated configuration, I got the following error:

===> COMPILE MIC/timer.o
/tmp/icckuOoB2as_.s: Assembler messages:
/tmp/icckuOoB2as_.s:109: Error: `rdtscp' is not supported on `k1om'
make: *** [MIC/timer.o] Error 1

I looked in config_defines.mk for the flag DHAS_RDTSCP but couldn't find it. I also added it manually with the default value 0, but the problem still occurs.
Is there any way around this?

Thanks in advance

Issue of likwid-perfctr Command with '-t' Option

Hi,

I would like to reopen this issue; with the current correction in the source code, the command still shows weird outputs.

Here is the command I've always tried:
likwid-perfctr -f -c 0-3 -g BRANCH -t 2s

And the outputs:
--------------------------------------------------------------------------------
# CORES: 0|1|2|3

1 8 4 2.0013075179619 2.0013102503959 2.0013102503959 2.0013102503959 2.0013102503959 6.1923223535304e-06 1.158459678048e-06 1.1490437761027e-06 9.5708087192693e-07 3389.9079014806 3397.4141617506 3369.8001504725 3417.0112020124 2.7854898210138 1.5682565789474 1.5612876599257 1.5499262174127 0.19770460445416 0.19366776315789 0.19397441188609 0.19281849483522 0.003825659243066 0.0024671052631579 0.0028889806025588 0.0029513034923758 0.019350380096752 0.012738853503185 0.014893617021277 0.01530612244898 5.0580511402903 5.1634819532909 5.1553191489362 5.1862244897959
1 8 4 4.004496399955 2.0009526654214 2.0009526654214 2.0009526654214 2.0009526654214 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 8 4 6.00749528636 2.0007557606874 2.0007557606874 2.0007557606874 2.0007557606874 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 8 4 8.0101798471776 2.0007722017634 2.0007722017634 2.0007722017634 2.0007722017634 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
--------------------------------------------------------------------------------

Only the first line shows some values; the following lines show only zeros.

Best,
S.

Collecting pmu values at regular interval

sursri wrote:
Is it possible to collect the event stats at a regular interval of 10 million instructions or so? I don't see the timeline function working properly for multithreaded programs using likwid-perfctr.
Is it possible to make modifications to the existing likwid-perfctr?

[email protected]:

Please provide more details on why timeline mode does not work for threaded case.

[email protected]:

Since no additional information was posted by 'sursri', we cannot work on the problem with the
multithreaded timeline measurements.

Regarding your first question: yes, it is possible, but there is currently no implementation of this behavior.
The steps are the following:

1. Initialize the counters as normal but enable interrupt on overflow (polling would also be possible but
does not provide accurate results).

2. Preset the desired counter register (something like PMC0 with event INSTRUCTIONS_RETIRED) to
2^(register width) - (instruction interval).

3. Start the counters.

4. On interrupt, read all configured counters and reinitialize the counter registers.
And so on.
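
The preset value from the steps above can be computed as follows. This is a sketch of the arithmetic only; `overflow_preset` is a hypothetical name, and programming the register and the interrupt is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Value to load into a counter of width_bits bits so that it overflows
 * after exactly `interval` further events: 2^width_bits - interval. */
static uint64_t overflow_preset(unsigned width_bits, uint64_t interval)
{
    assert(width_bits > 0 && width_bits < 64);
    const uint64_t range = 1ULL << width_bits;
    assert(interval > 0 && interval < range);
    return range - interval;
}
```

For a 48-bit counter and an interval of 10 million instructions, the preset is 2^48 - 10^7 = 281474966710656.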

[email protected]:

Not implemented in Likwid 4.0.
Likwid does not use interrupts, so only polling is possible.

likwid-topology confused by non-standard core assignments?

This is about likwid 4.1.1 (release), built to use hwloc that comes with it (config.mk included for completeness).

For some obscure reason the assignment of processors to physical address/core-id is not what one would expect on Intel hardware. Normally, one expects on a dual socket, 12-core machine (haswell E5-2680 v3), hyperthreading disabled:
0 -> 0:0
1 -> 0:1
..
11 -> 0:11
12 -> 1:0
13 -> 1:1
...
23 -> 1:11
The left-hand number is the processor, the first right-hand number the physical address, the second the core-id according to /proc/cpuinfo.
On some machines however, we get:
0 -> 0:0
1 -> 0:2
2 -> 0:4
..
5 -> 0:10
6 -> 1:0
7 -> 1:2
...
11 -> 1:10
12 -> 0:1
13 -> 0:3
..
17 -> 0:11
18 -> 1:1
19 -> 1:3
...
23 -> 1:11
Obviously, it is not what we want, but that is our problem.

However, when likwid-topology is run on such a node, it seems to get confused. It reports:
Sockets: 2
Cores per socket: 6
Threads per core: 2
Apparently, the weird round-robin assignment tricks likwid-topology into assuming that hyperthreading is enabled. The complete output of likwid-topology is in attachment.

The lscpu and lstopo (version 1.10.1) reports are consistent with /proc/cpuinfo, though (output of both is attached as well). So it would seem that the information coming from hwloc is somehow misinterpreted.
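
One robust way to derive the thread count, independent of how processors are numbered, is to count distinct (socket, core-id) pairs and divide the number of processors by that count. The sketch below is an illustration of that idea, not a description of what likwid-topology or hwloc actually do; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count distinct (socket, core-id) pairs among n logical processors.
 * socket[i] and core[i] are the physical id and core id of processor i,
 * as listed in /proc/cpuinfo. Quadratic, which is fine for small n. */
static size_t count_physical_cores(const int *socket, const int *core, size_t n)
{
    assert(socket != NULL && core != NULL);
    size_t unique = 0;
    for (size_t i = 0; i < n; i++) {
        int seen = 0;
        for (size_t j = 0; j < i; j++) {
            if (socket[j] == socket[i] && core[j] == core[i]) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            unique++;
    }
    return unique;
}

/* Logical processors per physical core; 0 if there are no processors. */
static size_t threads_per_core(const int *socket, const int *core, size_t n)
{
    size_t cores = count_physical_cores(socket, core, n);
    return cores ? n / cores : 0;
}
```

Applied to the round-robin mapping listed above, all 24 pairs are distinct, so the result is one thread per core regardless of the ordering.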

Thanks, best regards, Geert Jan Bex

lscpu_out.txt
cpuinfo_out.txt
likwid_topology_out.txt
lstopo_out.txt
config.txt

Shared Library

Hello again,
I am using the latest version of likwid, and when I run likwid-perfctr with an executable that uses OpenMP, I get the following error:
./helloflops3_xphi: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

I looked around and found that because likwid-lua has the setuid bit set, it ignores the LD_LIBRARY_PATH environment variable, thus the exec cannot link with libiomp5.so
Moreover the exec is dynamically linked with:
linux-vdso.so.1 => (0x00007fff67bff000)
libm.so.6 => /lib64/libm.so.6 (0x00007f28b3d64000)
libiomp5.so => /home/echristof/libraries/libiomp5.so (0x00007f28b3a4e000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f28b383c000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f28b361f000)
libc.so.6 => /lib64/libc.so.6 (0x00007f28b32c7000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007f28b30c3000)
/lib64/ld-linux-k1om.so.2 (0x00007f28b3f93000)

and likwid-lua:
linux-vdso.so.1 => (0x00007fff279dd000)
liblikwid-lua.so.4 => /home/echristof/likwid_mic/lib/liblikwid-lua.so.4 (0x00007ff84f2c9000)
libm.so.6 => /lib64/libm.so.6 (0x00007ff84f099000)
libdl.so.2 => /lib64/libdl.so.2 (0x00007ff84ee95000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007ff84ec83000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00007ff84ea65000)
libc.so.6 => /lib64/libc.so.6 (0x00007ff84e70d000)
libimf.so => /home/echristof/libraries/libimf.so (0x00007ff84e2b4000)
libsvml.so => /home/echristof/libraries/libsvml.so (0x00007ff84dade000)
libirng.so => /home/echristof/libraries/libirng.so (0x00007ff84d8cb000)
libintlc.so.5 => /home/echristof/libraries/libintlc.so.5 (0x00007ff84d6aa000)
/lib64/ld-linux-k1om.so.2 (0x00007ff84f55d000)

Only libiomp5.so is missing.
One solution would be to link them statically. Is there another solution through likwid-lua maybe?

Thanks in advance!

Use as much machine readable documentation as possible

Build event definitions for Intel architectures from https://download.01.org/perfmon/, with the optional introduction of patches to fix known issues with the provided JSON definitions.

Do other vendors provide similar machine readable documentation?

Bug when building in direct mode with the access daemon disabled

When setting ACCESSMODE to direct in config.mk and BUILDDAEMON to false, LIKWID returns "Unable to get path to access daemon".
When the daemon is built although direct access is selected, the error does not occur.

Specify what the short name for the processor microarchitecture is

The wiki states on page likwid-perfctr:

in the directory $HOME/.likwid/groups/ARCH/, where ARCH is a short name for the processor
microarchitecture. You can get the short name by running likwid-perfctr -i.

I did run that command and my output looks like this:

Output

CPU name: Intel(R) Xeon(R) CPU E5-2630 0 @ 2.30GHz
CPU type: Intel Xeon SandyBridge EN/EP processor
CPU clock: 2.30 GHz
CPU family: 6
CPU model: 45
CPU stepping: 7

CPU features: ACPI MMX SSE SSE2 HTT TM RDTSCP MONITOR VMX EIST TM2 SSSE3 SSE4.1 SSE4.2 AES AVX SSE3

PERFMON version: 3
PERFMON number of counters: 8
PERFMON width of counters: 48

PERFMON number of fixed counters: 3

Supported Intel processors:
Intel Core 2 65nm processor
Intel Core 2 45nm processor
Intel Xeon MP processor
Intel Atom 45nm processor
Intel Atom 32nm processor
Intel Atom 22nm processor
Intel Core Bloomfield processor
Intel Core Lynnfield processor
Intel Core Westmere processor
Intel Nehalem EX processor
Intel Westmere EX processor
Intel Core SandyBridge processor
Intel Xeon SandyBridge EN/EP processor
Intel Core IvyBridge processor
Intel Xeon IvyBridge EN/EP/EX processor
Intel Core Haswell processor
Intel Xeon Haswell EN/EP/EX processor
Intel Atom (Silvermont) processor
Intel Atom (Airmont) processor
Intel Xeon Phi Coprocessor
Intel Core Broadwell processor

Supported AMD processors:
AMD Opteron single core 130nm processor
AMD Opteron Dual Core Rev E 90nm processor
AMD Opteron Dual Core Rev F 90nm processor
AMD Barcelona processor
AMD Shanghai processor
AMD Istanbul processor
AMD Magny Cours processor
AMD Interlagos processor
AMD Family 16 model - Kabini processor

Output End

I have, admittedly, as a novice user no idea what I should use for "ARCH" and this should be clarified (also in the documentation); for me, architecture is something like "x86_64" but that does not seem reasonable here (it does not even appear in the output of said command).

likwid-perfctr -E is case sensitive

% likwid-perfctr -E cache
Found 0 event(s) with search key cache:

but

% likwid-perfctr -E CACHE
Found 13 event(s) with search key CACHE:
L3_LAT_CACHE_REFERENCE, 0x2E, 0x4F, PMC
L3_LAT_CACHE_MISS, 0x2E, 0x41, PMC

I think case-insensitivity would be quite nice here, or at least a warning when 0 matches are found. It may also be useful to add a hint about the case sensitivity to the help text (-h switch) and to the wiki text (https://github.com/RRZE-HPC/likwid/wiki/likwid-perfctr).
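The requested behaviour can be sketched in a few lines of Python. This is only an illustration of the matching rule, not the likwid-perfctr implementation; search_events and the event list are assumptions, with the two cache events copied from the output above and a placeholder third entry.

```python
def search_events(events, key):
    """Return the events whose name contains key, ignoring case."""
    needle = key.casefold()
    return [event for event in events if needle in event[0].casefold()]

# (name, event id, umask, counter); the last entry is a made-up placeholder.
EVENTS = [
    ("L3_LAT_CACHE_REFERENCE", 0x2E, 0x4F, "PMC"),
    ("L3_LAT_CACHE_MISS", 0x2E, 0x41, "PMC"),
    ("PLACEHOLDER_OTHER_EVENT", 0x00, 0x00, "PMC"),
]

for key in ("cache", "CACHE"):
    matches = search_events(EVENTS, key)
    print(f"Found {len(matches)} event(s) with search key {key}:")
    for name, event_id, umask, counter in matches:
        print(f"{name}, 0x{event_id:02X}, 0x{umask:02X}, {counter}")
```

Both search keys return the same two events, which is the behaviour asked for in this issue.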

UOPs per cycle filtered events

Add filter events to all supporting architectures. Candidates are:

UOPS_ISSUED*
UOPS_EXECUTED*
UOPS_RETIRED*
LSD_UOPS*
IDQ_MITE*
IDQ_DSB*
IDQ_MS_DSB*
IDQ_MS_MITE*

As recent architectures support up to 6 UOPs per cycle, the corresponding groups should be only available for systems with deactivated HyperThreading.

likwid-mpirun hardcodes the ssh connector

The line:

print(string.format("EXEC: %s/mpdboot -r ssh -n %d -f %s", path, nrNodes, hostfile))

makes ssh the default connector between nodes. It should not be hardcoded; it should be configurable via an environment variable, because ssh is not allowed on all clusters. That was my case: on our cluster we use oarsh to connect between nodes, and ssh is prohibited for security reasons.

So if someone tries to use likwid-mpirun on such a cluster, an error occurs:

mpdboot_moonshot1-40 (handle_mpd_output 932): mpdboot: can not get anything from the mpd daemon; please check connection to moonshot2-14
mpiexec_moonshot1-40: cannot connect to local mpd (/tmp/mpd2.console_dlagunes); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpdallexit: cannot connect to local mpd (/tmp/mpd2.console_dlagunes); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)

RAPL Energy Unit not correct on Haswell EP DRAM domain

We have a user who reports over-estimated power readings for the Haswell-EP DRAM domain (by a factor of 4). I had a look at the source code and found the bug: the energy unit is determined only by reading MSR_RAPL_POWER_UNIT.
In [1, Section 5.3.3] Intel reports that the "ENERGY UNIT for DRAM domain is 15.3 μJ". This should be used for uncore-PCI-register RAPL readings. Hackenberg et al. describe in [2, Section IV] that the statement about the energy unit is also correct for the DRAM RAPL readings from MSRs.

Best,
Robert

[1] Intel® Xeon® Processor E5-1600 and E5-2600 v3 Product Families, Volume 2 of 2, Registers Datasheet, Intel Corp.
[2] An Energy Efficiency Feature Survey of the Intel Haswell Processor, Hackenberg et al., Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015
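The factor of 4 can be reproduced with a short calculation. The Python sketch below decodes the energy status unit from bits 12:8 of MSR_RAPL_POWER_UNIT (energy unit = 1/2^ESU joules, per the Intel SDM) and compares it with the fixed 15.3 μJ DRAM unit quoted from [1]. The register value is an assumed example with ESU = 14, not a reading from the affected machine, and the function name is hypothetical.

```python
def energy_unit_joules(rapl_power_unit):
    """Decode the energy status unit (bits 12:8) of MSR_RAPL_POWER_UNIT."""
    esu = (rapl_power_unit >> 8) & 0x1F
    return 1.0 / (1 << esu)

# Fixed DRAM energy unit for Haswell-EP as quoted from [1], in joules.
DRAM_UNIT_HASWELL_EP = 15.3e-6

# Assumed example register content with ESU = 14 (illustration only).
example_msr = 0x000A0E03

generic = energy_unit_joules(example_msr)
factor = generic / DRAM_UNIT_HASWELL_EP
print(f"unit from MSR_RAPL_POWER_UNIT: {generic * 1e6:.1f} uJ")
print(f"DRAM readings scaled with it are too high by a factor of {factor:.1f}")
```

Under that assumption the generic unit is 2^-14 J, roughly four times the DRAM unit, which matches the over-estimation the user observed.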

Implement multiplexing support for using more events than physical counters in groups

Problem:

The physical counters are not enough to measure all interesting metrics in one run. Some architectures do not even have enough counters to compute a single metric in one run.

Solution:

Implement multiplexing, which allows the use of virtual counters. In a multiplexing group, the physical counters are mapped, across multiple event sets, onto an unlimited number of virtual counters. The virtual counters can be used in computing derived metrics.

Tasks:

Extend group syntax to enable multiplexing groups
Create a module managing the multiplexing of multiple event sets
Plan timing issues (accumulate or extrapolate)

Thomas Roehl

Not implemented in Likwid 4.0

Since the results of multiplexed events are calculated rather than measured, this could introduce a high error into the results.
Moreover, getting reliable results would introduce a high overhead because of the constant switching of event configurations.
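The "extrapolate" option from the task list, and the error it can introduce, can be illustrated with a small Python sketch. It scales a count taken while an event set was scheduled to the full runtime; the function name and all numbers are hypothetical, and this is not how LIKWID handles event sets.

```python
def extrapolate(raw_count, active_time, total_time):
    """Scale a count measured during active_time to the whole run."""
    if active_time <= 0:
        raise ValueError("event set was never scheduled")
    return raw_count * total_time / active_time

# Hypothetical run: two event sets alternate, each active for 5 s of a 10 s run.
TOTAL_TIME = 10.0
measured = {
    "EVENT_A": (1_000_000, 5.0),  # (raw count, seconds the set was active)
    "EVENT_B": (250_000, 5.0),
}

estimates = {
    name: extrapolate(count, active, TOTAL_TIME)
    for name, (count, active) in measured.items()
}
print(estimates)
```

The estimate is only exact if the event rate is the same while the set is inactive; any phase behaviour in the application shows up directly as error, which is the concern raised in the comment above.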

likwid-mpirun hostfile parsing

Hello,

There is a problem when likwid-mpirun reads the hostfile the user provides: if a host name contains, for example, a dash, the program ignores everything after the dash, turning:

host1-10
host1-11
host1-12
host2-10

into:

host1
host1
host1
host2

As a result, the MPI execution fails because it cannot find the correct host names.

The solution is to replace the line

hostname = line:match("^([%.%a%d]+)")

in all 4 readHostfile* methods with:

hostname = line:match("^([%.%a%d-]+)")

However, this only solves this specific problem. I think the pattern should take the whole line from the hostfile without modifications.
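A more permissive parsing rule might look like the following Python sketch, which keeps the first whitespace-delimited token of each line instead of whitelisting characters. It is an illustration only (likwid-mpirun itself is written in Lua); read_hostfile is a hypothetical name, and treating '#' as a comment marker is an assumption about the hostfile format.

```python
def read_hostfile(text):
    """Return the host name from each non-empty line of a hostfile."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
        if line:
            hosts.append(line.split()[0])  # keep the first token unmodified
    return hosts

HOSTFILE = """host1-10
host1-11
host1-12
host2-10
"""
print(read_hostfile(HOSTFILE))
```

Host names with dashes, underscores or other characters pass through unchanged. Launcher-specific suffixes such as slot counts would need extra handling that is not shown here.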

BOF-Slides link does not work

The wiki (https://github.com/RRZE-HPC/likwid/wiki) links to slides from the BOF session at ISC '13. That link does not work and should be updated.

likwid-bench uses mismatching array sizes to compute results

There's a mismatch between the dataset size used in benchmarks and the dataset size used to compute the benchmark results which can lead to significantly off results.

Example:
./likwid-bench -t my_benchmark -w N:10kB:1
My loop stride is 512. Data type is SINGLE.
The loop limit is set to 2048 instead of 2500 (10000/sizeof(float) = 2500).
The results are calculated as follows:
cycPerUp = ((double) maxCycles / (double) (threads_data[0].data.iter * realSize));
where realSize is the user-supplied 10000B.
The reported measurement in this case is off by about 20%!
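The size of the discrepancy follows from the numbers in the example. The Python sketch below assumes that the loop limit is the element count rounded down to a multiple of the stride, which is consistent with the 2048 reported above but is not taken from the likwid-bench source; the function name is hypothetical.

```python
def effective_elements(size_bytes, element_bytes, stride):
    """Return (requested elements, elements actually covered by the loop)."""
    requested = size_bytes // element_bytes
    processed = (requested // stride) * stride  # assumed rounding rule
    return requested, processed

requested, processed = effective_elements(10_000, 4, 512)
overcount = requested / processed - 1.0
print(f"requested elements: {requested}")
print(f"processed elements: {processed}")
print(f"work is overcounted by {overcount:.1%}")
```

Dividing the measured cycles by the requested size instead of the processed size therefore understates the cycles per update by roughly a fifth in this configuration.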

attempt to call a nil value (field 'getGroups')

I've pulled the latest version that builds successfully. When I run I get the following error.

$ sudo likwid-perfctr -a
/usr/local/bin/likwid-lua: /usr/local/bin/likwid-perfctr:363: attempt to call a nil value (field 'getGroups')

RAPL Energy Counters for Knight's Landing Architecture

Hi,

We're trying to measure energy usage of some parallel codes on a KNL machine and I wanted to check if support to profile energy usage on KNL architecture is already available on a devel-branch or if it's in the pipeline. Let me know when time permits.

Thanks in advance.

Integrate latency benchmark facilities in likwid-bench

[email protected] wrote:
Currently likwid-bench focuses on streaming and instruction throughput limited benchmarks.
Task:
Extend likwid-bench to also allow for latency bound data access benchmarks.

[email protected]:

Basic application ready but until now no integration in likwid-bench

QPI counters report 0 values on Intel Xeon SandyBridge EN/EP Processor

Hi,

I am trying to profile a multi-threaded application and its behaviour on NUMA systems.
When using likwid-perfctr -g QPI [...] the following metrics report zeroes:

DIRECT2CORE_SUCCESS SBOX0C0
RXL_FLITS_G1_DRS_DATA SBOX0C1
RXL_FLITS_G2_NCB_DATA SBOX0C2
DIRECT2CORE_SUCCESS SBOX1C0
RXL_FLITS_G1_DRS_DATA SBOX1C1
RXL_FLITS_G2_NCB_DATA SBOX1C2

My platform is:
4x Intel(R) Xeon(R) CPU E5-4650 0 @ 2.70GHz
CPU type:,Intel Xeon SandyBridge EN/EP processor

What am I missing?
Thanks

Using the additional counters (PMC{4..7}) fails on BroadwellEP

Hi,

I tried to use the additional PMCs on my Broadwell machine, but the additional counters only show zeros.

The CPU is an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz with Hyperthreading turned off. I used the PORT_USAGE group as a test and this is the result:

+--------------------------------+---------+--------------+
|              Event             | Counter |      Sum     |
+--------------------------------+---------+--------------+
|     INSTR_RETIRED_ANY STAT     |  FIXC0  | 179574190000 |
|   CPU_CLK_UNHALTED_CORE STAT   |  FIXC1  | 175468080000 |
|    CPU_CLK_UNHALTED_REF STAT   |  FIXC2  | 175469240000 |
| UOPS_EXECUTED_PORT_PORT_0 STAT |   PMC0  |  43542586000 |
| UOPS_EXECUTED_PORT_PORT_1 STAT |   PMC1  |  61975496000 |
| UOPS_EXECUTED_PORT_PORT_2 STAT |   PMC2  |  51583122000 |
| UOPS_EXECUTED_PORT_PORT_3 STAT |   PMC3  |  51799443000 |
| UOPS_EXECUTED_PORT_PORT_4 STAT |   PMC4  |       0      |
| UOPS_EXECUTED_PORT_PORT_5 STAT |   PMC5  |       0      |
| UOPS_EXECUTED_PORT_PORT_6 STAT |   PMC6  |       0      |
| UOPS_EXECUTED_PORT_PORT_7 STAT |   PMC7  |       0      |
+--------------------------------+---------+--------------+

I have also tried assigning the events to different counters, but it is always PMC{4..7} that refuse to count. Attached is a log of a run with -V 3. Maybe someone with insider information can figure out what's going wrong.

MfG
K-Wic

Uploading likwid-perfctr_broadwellEP_counter_failure.txt…

Pass return value of wrapped executable through

likwid-pin -c S0:0 false has a successful return value, which makes it inconvenient to check if the pinned command has failed. A command line setting to change this behavior would also be fine.
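The desired pass-through behaviour can be sketched in Python with the standard subprocess module. This is a generic wrapper illustration, not likwid-pin code; run_wrapped is a hypothetical name, and the child command is a Python one-liner so that the example does not depend on which shell utilities are installed.

```python
import subprocess
import sys

def run_wrapped(argv):
    """Run argv and return its exit status (negative if killed by a signal)."""
    return subprocess.run(argv).returncode

# Child process that exits with status 3, standing in for a failing command.
status = run_wrapped([sys.executable, "-c", "import sys; sys.exit(3)"])
print(f"wrapped command exited with status {status}")
```

A wrapper built this way would finish with sys.exit(status), so that callers checking the exit code see the failure of the pinned command rather than the success of the wrapper.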

Machine readable (semantic) output

The current CSV output is not very useful, since the original table was meant to be interpreted by humans. It would be helpful to have a data format that can be parsed and then processed directly.

For example, consider the following (regular) output:

phinally:~$ likwid-perfctr -g DATA -C S0:0 du -csh .
--------------------------------------------------------------------------------
CPU name:       Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
CPU type:       Intel Xeon SandyBridge EN/EP processor
CPU clock:      2.70 GHz
--------------------------------------------------------------------------------
[...]
--------------------------------------------------------------------------------
Group 1: DATA
+-------------------------+---------+-----------+
|          Event          | Counter |   Core 0  |
+-------------------------+---------+-----------+
|    INSTR_RETIRED_ANY    |  FIXC0  |  80729480 |
|  CPU_CLK_UNHALTED_CORE  |  FIXC1  | 203480258 |
|   CPU_CLK_UNHALTED_REF  |  FIXC2  | 457818723 |
|  MEM_UOPS_RETIRED_LOADS |   PMC0  |  37244141 |
| MEM_UOPS_RETIRED_STORES |   PMC1  |  13267422 |
+-------------------------+---------+-----------+

+----------------------+-----------+
|        Metric        |   Core 0  |
+----------------------+-----------+
|  Runtime (RDTSC) [s] |  20.4862  |
| Runtime unhalted [s] |   0.0754  |
|      Clock [MHz]     | 1200.0812 |
|          CPI         |   2.5205  |
|  Load to store ratio |   2.8072  |
+----------------------+-----------+
phinally:~$

and a usable JSON representation could look as follows:

{
    "node": {
        "hostname": "phinally",
        "CPU name": "Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz",
        "CPU type": "Intel Xeon SandyBridge EN/EP processor",
        "CPU clock [GHz]": 2.7
    },
    "groups": {
        "DATA": {
            "events": {
                "INSTR_RETIRED_ANY": [80729480],
                [...]
            },
            "metrics": {
                "Runtime (RDTSC) [s]": [20.4862],
                [...]
            }
        }
    }
}

or something along those lines.

Personally, I would like to include the return code, stdout and stderr as well.
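Until such an output format exists, the human-readable tables can be converted after the fact. The Python sketch below parses a '|'-delimited table like the metric table above and emits JSON in the proposed layout. The function name parse_table is hypothetical, the table is a shortened copy of the example, and only the metrics part of the proposed structure is produced.

```python
import json

def parse_table(text):
    """Turn a '|'-delimited ASCII table into a list of rows of cell strings."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("|"):  # skip the +----+ separator lines
            rows.append([cell.strip() for cell in line.strip("|").split("|")])
    return rows

TABLE = """
+----------------------+-----------+
|        Metric        |   Core 0  |
+----------------------+-----------+
|  Runtime (RDTSC) [s] |  20.4862  |
|          CPI         |   2.5205  |
+----------------------+-----------+
"""

header, *body = parse_table(TABLE)
metrics = {name: [float(value) for value in values] for name, *values in body}
print(json.dumps({"groups": {"DATA": {"metrics": metrics}}}, indent=4))
```

Each metric maps to a list with one value per core column, matching the list-valued entries in the JSON proposal.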

likwid-mpirun passing a float to bash

In the script likwid-mpirun in the function: writeWrapperScript

There are cases where a float is written into the bash script, which causes errors in bash and stops the script from working:

likwid-mpirun -hostfile $OAR_NODEFILE -np 8 -d echo $HOSTNAME results in:

/home/users/dlagunes/.likwidscript_22531.txt: line 8: [: 8.0: integer expression expected
/home/users/dlagunes/.likwidscript_22531.txt: line 11: 0 - (8.0 - 1): syntax error: invalid arithmetic operator (error token is ".0 - 1)")
The error appears to occur when the operation local full = tostring(np - (np % ppn)) returns an integer with an appended ".0".

Solution: use the Lua function math.floor to force an integer in all instances of the variable "full":

local full = tostring(math.floor(np - (np % ppn)))

Failure in building likwid-4.0.0 on Sabayon linux

likwid-4.0.0 fails to build on linux. The failure happens in likwid-bench, see the end of this post.
This happens with gcc-4.8.4 and gcc-4.9.2. I use binutils-2.23.1
I did not modify the likwid configuration config.mk.
Please feel free to ask if you need any further information.

===> COMPILE C GCC/strUtil.o
===> COMPILE C GCC/threads.o
===> COMPILE C GCC/barrier.o
===> COMPILE C GCC/allocator.o
===> COMPILE C GCC/bench.o
===> GENERATE BENCHMARKS
===> ASSEMBLE GCC/vtriad_mem_avx.o
===> ASSEMBLE GCC/store_sse.o
===> ASSEMBLE GCC/striad_plain.o
===> ASSEMBLE GCC/sum_plain.o
===> ASSEMBLE GCC/vtriad_avx.o
===> ASSEMBLE GCC/copy_mem.o
===> ASSEMBLE GCC/copy_mem_avx.o
===> ASSEMBLE GCC/clstore.o
===> ASSEMBLE GCC/stream.o
===> ASSEMBLE GCC/stream_avx.o
===> ASSEMBLE GCC/update_plain.o
===> ASSEMBLE GCC/sum.o
===> ASSEMBLE GCC/vtriad_sse.o
===> ASSEMBLE GCC/striad_mem_avx.o
===> ASSEMBLE GCC/copy_mem_sse.o
===> ASSEMBLE GCC/striad_mem_sse.o
===> ASSEMBLE GCC/stream_mem.o
===> ASSEMBLE GCC/copy.o
===> ASSEMBLE GCC/load.o
===> ASSEMBLE GCC/copy_avx.o
===> ASSEMBLE GCC/load_avx.o
===> ASSEMBLE GCC/store_plain.o
===> ASSEMBLE GCC/load_sse.o
===> ASSEMBLE GCC/striad_avx.o
===> ASSEMBLE GCC/copy_sse.o
===> ASSEMBLE GCC/sum_avx.o
===> ASSEMBLE GCC/triad_mem.o
===> ASSEMBLE GCC/sum_sse.o
===> ASSEMBLE GCC/clcopy.o
===> ASSEMBLE GCC/update.o
===> ASSEMBLE GCC/store_mem.o
===> ASSEMBLE GCC/update_avx.o
===> ASSEMBLE GCC/store_mem_avx.o
===> ASSEMBLE GCC/vtriad_plain.o
===> ASSEMBLE GCC/update_sse.o
===> ASSEMBLE GCC/store_mem_sse.o
===> ASSEMBLE GCC/triad.o
===> ASSEMBLE GCC/triad_avx.o
===> ASSEMBLE GCC/clload.o
===> ASSEMBLE GCC/triad_split.o
===> ASSEMBLE GCC/store.o
===> ASSEMBLE GCC/striad_sse.o
===> ASSEMBLE GCC/store_avx.o
===> ASSEMBLE GCC/copy_plain.o
===> ASSEMBLE GCC/load_plain.o
===> ASSEMBLE GCC/vtriad_mem_sse.o
===> LINKING likwid-bench
/usr/lib/gcc/x86_64-pc-linux-gnu/4.8.4/../../../../x86_64-pc-linux-gnu/bin/ld: ./GCC/striad_sse.o: relocation R_X86_64_32S against `.data' can not be used when making a shared object; recompile with -fPIC
./GCC/striad_sse.o: could not read symbols: Bad value
collect2: error: ld returned 1 exit status
Makefile:95: recipe for target 'likwid-bench_target' failed
make: *** [likwid-bench_target] Error 1

INSTALL_PREFIX vs INSTALLED_PREFIX in configuration.c

I don't know if this is really an issue, and I haven't figured out exactly what it is doing, but it looks suspicious. In configuration.c, it checks for likwid.cfg and likwid_topo.cfg in INSTALL_PREFIX rather than in INSTALLED_PREFIX. Is this correct? I'm compiling for MIC and have these set differently.

Marker results are difficult to read at a glance

I'm finding the fancy formatting of the marker results to make the numbers difficult to interpret. The automatic centering makes the table look pretty, but it makes it more difficult to compare the magnitude of the numbers at a glance. I'm frequently finding myself counting the digits trying to figure out whether one number is 1/10 of another, or 1/100.

For example, rather than this with the centering:

+---------------------------+---------+--------------+
|           Event           | Counter |    Core 16   |
+---------------------------+---------+--------------+
|    Runtime (RDTSC) [s]    |   TSC   | 8.833629e-01 |
|     RS_FULL_STALL_ALL     |   PMC0  |    2132399   |
| NO_ALLOC_CYCLES_RAT_STALL |   PMC1  |   136345931  |
|     INSTR_RETIRED_ANY     |  FIXC0  |  1920997053  |
|   CPU_CLK_UNHALTED_CORE   |  FIXC1  |  1141587342  |
|    CPU_CLK_UNHALTED_REF   |  FIXC2  |  1141587369  |
+---------------------------+---------+--------------+

I would find something like this to be faster to compare with similar regions:

+---------------------------+---------+--------------+
|           Event           | Counter |    Core 16   |
+---------------------------+---------+--------------+
| Runtime (RDTSC) [s]       |    TSC  |    0.8833629 |
| RS_FULL_STALL_ALL         |   PMC0  |      2132399 |
| NO_ALLOC_CYCLES_RAT_STALL |   PMC1  |    136345931 |
| INSTR_RETIRED_ANY         |  FIXC0  |   1920997053 |
| CPU_CLK_UNHALTED_CORE     |  FIXC1  |   1141587342 |
| CPU_CLK_UNHALTED_REF      |  FIXC2  |   1141587369 |
+---------------------------+---------+--------------+

This is probably personal preference, and I don't expect to convince anyone else that one way is better, but where in the code would I find the formatting code to try out some variations?
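For trying out variations outside of LIKWID first, the proposed layout can be reproduced with a few lines of Python: left-aligned event names and right-aligned counters and values. This is a standalone sketch with a hypothetical function name, using rows taken from the table above; it does not touch LIKWID's own table code.

```python
def format_rows(rows):
    """Format (event, counter, value) rows with right-aligned numbers."""
    name_w = max(len(name) for name, _, _ in rows)
    ctr_w = max(len(counter) for _, counter, _ in rows)
    val_w = max(len(str(value)) for _, _, value in rows)
    return [
        f"| {name:<{name_w}} | {counter:>{ctr_w}} | {value:>{val_w}} |"
        for name, counter, value in rows
    ]

ROWS = [
    ("RS_FULL_STALL_ALL", "PMC0", 2132399),
    ("NO_ALLOC_CYCLES_RAT_STALL", "PMC1", 136345931),
    ("INSTR_RETIRED_ANY", "FIXC0", 1920997053),
]

lines = format_rows(ROWS)
print("\n".join(lines))
```

With the values right-aligned, the number of digits is visible from the left edge of each value, so orders of magnitude can be compared without counting.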

Strange (error?) results for likwid-topology in examples.

On page https://code.google.com/p/likwid/wiki/LikwidTopology#Intel_Dunnington

Intel Dunnington

CPU name: Intel Xeon MP processor
The system has 4 sockets and 6 cores per socket:
Sockets: 4
Cores per socket: 6
But for the Level 3 cache, 3 groups of cores (not 4) are shown, with 8 cores per group (not 6):
Cache groups: ( 0 12 1 13 2 14 3 15 ) ( 4 16 5 17 6 18 7 19 ) ( 8 20 9 21 10 22 11 23 )

This result looks like an error.

likwid-perfctr fails to fork when using the marker API and large amounts of memory

Hi,

I ran into the problem that I could run a large simulation when using plain likwid-perfctr, but using the marker API would fail.
By now I have figured out that it is due to the call to fork() made when the marker is called for the first time, which fails with an errno of ENOMEM. This is caused by the fact that the forked process inherits the resources of the parent process, including its memory consumption. Since the markers are called well into the simulation, I have already allocated a lot of memory, and fork now tries to reserve the same amount for each child process, which naturally fails.
I tried to work around this problem by calling LIKWID_MARKER_REGISTER early on, but unfortunately this does not cause forking. As another workaround I called MARKER_START and MARKER_STOP in quick succession and it does lead to the expected/hoped for behavior and hopefully does not mess with the results too much.

This is the solution I am currently going with, but maybe there is a better way?
Perhaps the REGISTER call could be changed to handle the forking and maybe give the user a hint when the forking fails due to ENOMEM?

Also I came across two other problems while debugging this.

Turning on the debug mode in config.mk does not disable the gcc optimizations, which makes the use of gdb impossible. I have submitted a pull request which addresses this (#52).
(Additionally, calling gdb on all the lua stuff is not exactly trivial, so I give an example here in the hopes that no one else ever needs it: "gdb --args likwid-lua /usr/bin/likwid-perfctr -g xxx -C xxx ./executable" and creating the breakpoints depending on future loads of libraries)
There is an error message when fork fails, but likwid does not abort, but instead fails when it cannot open the socket in /tmp/likwid--1. The -1 is the pid of the forked process...
I therefore propose the following change:

diff --git a/src/access_client.c b/src/access_client.c
--- a/src/access_client.c
+++ b/src/access_client.c
@@ -151,7 +151,8 @@ access_client_startDaemon(int cpu_id)
     }
     else if (pid < 0)
     {
-        ERROR_PLAIN_PRINT(Failed to fork);
+        ERROR_PRINT(Failed to fork while starting client for cpu %d, cpu_id);
+        exit(EXIT_FAILURE);
     }

     EXIT_IF_ERROR(socket_fd = socket(AF_LOCAL, SOCK_STREAM, 0), socket() failed);

This ensures that likwid aborts right away and it also prints the reason for the fork failure.

MfG
K-Wic

Issue of likwid-perfctr Command with '-t' Option

Hi,

I'm collecting performance counter values using likwid-perfctr command with -t option. Here is the command that I tried (since my computer has four CPUs; N:0-3):
likwid-perfctr -c N:0-3 -g BRANCH -t 2s

However, the output I get is very different from the output shown on the Wiki page. First, it just stops after two seconds, and second, nothing else appears on the command line (the following is all I get):
--------------------------------------------------------------------------------
# CORES: 0|1|2|3

--------------------------------------------------------------------------------

Could you tell me what's wrong with the likwid-perfctr?

Best,
SH.

Automatically instrument code based on DWARF information

Let user give a function name (or regex) and automatically instrument all matching functions based on DWARF information.
