
Quartz: A DRAM-based performance emulator for NVM

Quartz leverages features available in commodity hardware to emulate different latency and bandwidth characteristics of future byte-addressable NVM technologies.

Quartz's design, implementation details, evaluation, and overheads are described in the following research paper:

  • H. Volos, G. Magalhaes, L. Cherkasova, J. Li: Quartz: A Lightweight Performance Emulator for Persistent Memory Software. In Proc. of the 16th ACM/IFIP/USENIX International Middleware Conference (Middleware 2015), Vancouver, Canada, December 8-11, 2015. Available at: http://www.jahrhundert.net/papers/middleware2015.pdf

The emulator is designed to cover three processor families -- Sandy Bridge, Ivy Bridge, and Haswell -- but we have had the best results on the Ivy Bridge platform. The Haswell processor's Turbo Boost feature causes higher variance and larger deviations when emulating latencies in the higher range (above 600 ns).

Contributors

For a list of contributors see AUTHORS.

Extended documentation

Extended documentation is available in Doxygen form. To build and view it:

doxygen
xdg-open doc/html/index.html

Dependencies

This is the list of libraries and tools used by Quartz:

On RPM based distributions:

  • cmake 2.8
  • libconfig and libconfig-devel
  • numactl-devel
  • uthash-devel
  • kernel-devel

On Debian based distributions:

  • cmake 2.8
  • libconfig-dev
  • libnuma-dev
  • uthash-dev
  • linux-headers

You can run 'sudo scripts/install.sh' to install these dependencies automatically.

Supported environment

Currently the latency emulator can be used on Linux with Sandy Bridge, Ivy Bridge, and Haswell Intel processors. Bandwidth emulation additionally requires the Intel memory controller's thermal control registers. No specific Linux distribution or kernel version is required.

Source code tree overview

bench             Benchmarks
doc               Documentation, including Doxygen generated documentation (doc/html)
src/lib           Emulator main library code
src/dev           Kernel-module for accessing performance counters and 
                  memory-controller PCI registers
scripts           Helper scripts to run a program using the emulator and install 
                  dependencies
test              Several tests and application code examples
benchmark-tests   Several automated tests with benchmark runs and output analysis 
                  for testing the correctness of configured emulation environment and 
                  the accuracy of expected results

For more details, please see the extended documentation generated using Doxygen.

Building

After installing the dependencies, go to the emulator's source code root folder and execute the following steps:

mkdir build
cd build
cmake ..
make clean all

In order to disable statistics support, replace the third step above with:

cmake .. -DSTATISTICS=OFF

See the Statistics section below for more details. The emulator library and the benchmark and test binaries produced by the build will be available in the corresponding subfolders inside the 'build' folder.

Usage

First, load the emulator's kernel module. From the emulator's source code root folder, execute:

sudo scripts/setupdev.sh load

Set your processor to run at maximum frequency to ensure a fixed cycle rate (the cycle counter is used to project delay time). You can use the scaling governor:

echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

Set the LD_PRELOAD and NVMEMUL_INI environment variables to point, respectively, to the emulator's library and the configuration file to be used. LD_PRELOAD automatically loads the emulator's library when the user application is executed, so there is no need to link the library into the user application statically. Details about the configuration file are given in the respective section below.

Rather than configuring the scaling governor and the environment variables manually as indicated above, you can use the scripts/runenv.sh script. See below.
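For reference, the manual setup that runenv.sh automates looks roughly like the following sketch (the library path assumes the default build tree, and your_app is a placeholder):

```shell
# Fix the CPU frequency so the cycle counter ticks at a constant rate
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Point LD_PRELOAD at the emulator library and NVMEMUL_INI at the config file
export LD_PRELOAD=$PWD/build/src/lib/libnvmemul.so
export NVMEMUL_INI=$PWD/nvmemul.ini

./your_app
```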

An additional configuration step may be required depending on the Linux kernel version. The emulator uses the rdpmc x86 instruction to read CPU counters. Before kernel 4.0, when rdpmc support was enabled, any process (not just those with an active perf event) could use the rdpmc instruction to access the counters. Starting with Linux 4.0, rdpmc is only allowed if an event is currently enabled in the process's context. To restore the old behavior on kernel 4.0 or later, write the value 2 to /sys/devices/cpu/rdpmc:

echo 2 | sudo tee /sys/devices/cpu/rdpmc

Run your application:

scripts/runenv.sh <your_app>

The runenv.sh script runs an application in a new shell environment that sets LD_PRELOAD to the library available in the build folder; we do not modify the current shell environment, to avoid other applications unexpectedly being interposed by the emulator. The script also sets the NVMEMUL_INI environment variable to point to the nvmemul.ini configuration file available in the emulator's source code root folder.

Alternatively, you may link the library directly into your application, but the nvmemul library must come first in the linking order to ensure it properly interposes on the necessary functions.

Configuration file

Emulator runtime parameters can be defined in a configuration file.

The default path is ./nvmemul.ini but you may change the path through the environment variable $NVMEMUL_INI (see scripts/runenv.sh).

The main available parameters are:

- Latency:
  enable                  True enables latency emulation; false disables it.
  inject_delay            True enables delay injection; false makes the
                          emulator skip delay injection.
  read                    The target read latency in nanoseconds. It must
                          be greater than the hardware latency; the value
                          is validated by the emulator.
  write                   The target write latency in nanoseconds. It must
                          be greater than the hardware latency; the value
                          is validated by the emulator.
  max_epoch_duration_us   The epoch duration in microseconds. An epoch may
                          occasionally exceed this value depending on signal
                          delivery managed by the kernel.
  min_epoch_duration_us   The minimum epoch duration.
- Bandwidth:
  enable                  True enables bandwidth emulation; false disables it.
  model                   File path used by the emulator to cache the
                          detected hardware bandwidth characteristics.
  read                    Target read bandwidth in MB/s.
  write                   Target write bandwidth in MB/s.
- Topology:
  mc_pci                  File path used by the emulator to cache the PCI 
                          bus topology. It is not required if bandwidth 
                          emulation is disabled.
  physical_nodes          Lists all CPU socket ids to be added to the known
                          topology. With an odd number of CPU sockets it is
                          not possible to configure all CPUs in pairs, so a
                          single CPU socket will be used as NVM only. See
                          the Latency emulation modes section below.
- Statistics:
  enable                  True enables statistics collection and reporting;
                          false disables it. See the Statistics section
                          below.
  file                    File path used by the emulator to write the
                          statistics report. If not provided, the emulator
                          uses stdout.
- Debug:
  level                   Shows debugging messages with level up to this
                          value; the greater the value, the more verbose
                          the debug log.
                          0: off; 1: critical; 2: error; 3: warning; 4: info;
                          5: debugging.
  verbose                 If greater than zero, shows source code information
                          along with each debugging message.

Latency emulation modes

The emulator may run application threads in NVM-only mode or DRAM+NVM mode, depending on whether the system has more than one CPU socket and whether the topology configuration enables multiple sockets.

In NVM-only mode, the emulator uses a CPU socket with no sibling node and uses the DRAM available on that socket to emulate NVM. Any DRAM access on this socket triggers delay injection to emulate the target latency.

In DRAM+NVM mode, the emulator differentiates DRAM from virtual NVM latencies. This mode is supported only on Ivy Bridge and Haswell (and later) Intel processor systems with 2 or more CPU sockets. A proper configuration as mentioned above and explicit NVM memory allocation calls in the application's source code are required.

  • The emulator binds application threads to the CPU and DRAM of node 0. The other CPU socket is not used for application threads, and the DRAM of that second socket is used as virtual NVM.
  • The application must explicitly allocate virtual NVM memory using the pmalloc(size) and pfree(pointer, size) API provided by the emulator.

See the NVM programming section below.

NVM programming

The emulator provides an API for allocating and deallocating memory from the NVM space. The API can be used in both NVM-only and DRAM+NVM modes; however, it is mandatory in DRAM+NVM mode so that the emulator can clearly differentiate DRAM from NVM memory access latencies. This is the API available to user applications:

void *pmalloc(size_t size);
void pfree(void *start, size_t size);

The application can include the NVM_EMUL/src/lib/pmalloc.h header file to properly declare these functions. See test/test_nvm.c and test/test_nvm_remote_dram.c for examples of how to allocate memory on local DRAM and on virtual NVM, respectively, in DRAM+NVM emulation mode.

Statistics

The emulator collects statistical data to help validate emulation accuracy. If enabled, the emulator by default prints the statistics report to standard output when the user application terminates. Some applications suppress output to stdout; you can still see the reports by defining a target file for the report in the configuration file. When using a file as output, the emulator appends each result to the file, so previous reports are not overwritten. Statistics support can also be removed statically at compile time; see the Building section.

These are the reported statistics:

- initialization duration   Time in microseconds taken by the emulator to
                            initialize.
- running threads           The number of threads still running. If the report
                            was triggered automatically by the emulator, all
                            user threads have already terminated.
- terminated threads        Number of terminated threads, including the main
                            thread.
For each application thread:
- thread id                 Thread id.
- cpu id                    CPU id the user thread was bound to.
- spawn timestamp           Thread spawn timestamp as reported by the
                            monotonic clock.
- termination timestamp     Thread termination timestamp as reported by the
                            monotonic clock.
- execution time            Thread execution time in microseconds.
- stall cycles              Total number of CPU stalls caused by memory
                            accesses made by this thread.
- NVM accesses              Number of effective NVM accesses performed by
                            the application.
- latency calculation overhead cycles     Overhead cycles caused by the 
                                          emulator and that could not be
                                          amortized. Zero is expected.
                                          Otherwise, consider increasing
                                          the epoch duration.
- injected delay cycles     Total number of cycles injected by the emulator
                            to emulate the target latency.
- injected delay in usec    Same value as above, but shown in micro seconds.
- longest epoch duration    The effective longest epoch duration ever 
                            performed for this thread.
- shortest epoch duration   The effective shortest epoch duration ever 
                            performed for this thread.
- average epoch duration    The average epoch duration for this thread.
- number of epochs          Total number of epochs performed for this 
                            thread.
- epochs which didn't reach min duration   Number of epochs requested by
                                           either the Thread Monitor or thread
                                           synchronizations that were not
                                           opened because the epoch duration
                                           had not reached the minimum epoch
                                           duration.
- static epochs requested   Number of epochs requested by the Thread Monitor.

Support to PAPI

The Performance API (PAPI) library may be used with the emulator, and there are hooks to switch the CPU counter reading method to PAPI. As of this writing, however, PAPI CPU counter reads did not perform at the level required by the emulation. If switching to PAPI is desired in the future, follow these steps:

  • The pmc_ioctl_setcounter() calls in the device code (dev/pmc.c) and the set_counter() calls in the emulator library can be deleted.
  • Define PAPI_SUPPORT for src/lib/* source code.
  • Compile with lib/cpu/pmc-papi.c rather than lib/cpu/pmc.c.
  • Link code with PAPI and add PAPI include directory.
  • Some extra tweaks may be required, check TODOs in the code.

Multiple emulated processes and MPI programs

The emulator needs to bind user threads to specific CPU cores in order to optimize emulation results. You must export the EMUL_LOCAL_PROCESSES environment variable with the number of emulated processes on the host. The emulator then coordinates the emulated processes to partition the available CPUs. It is recommended to set EMUL_LOCAL_PROCESSES to at most half the number of available CPU cores (note that DRAM+NVM mode already reserves half of the available CPU cores).

If EMUL_LOCAL_PROCESSES is not set, or is set to a value lower than 2, the emulator will not partition CPU cores per process.
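For example, running a hypothetical MPI application with four ranks on one host might look like this (my_mpi_app is a placeholder):

```shell
# One emulated process per MPI rank on this host
export EMUL_LOCAL_PROCESSES=4

# Launch every rank through runenv.sh so each process gets LD_PRELOAD/NVMEMUL_INI
mpirun -np 4 scripts/runenv.sh ./my_mpi_app
```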

If a process crashes, the emulator may not have cleaned up the environment, and the process rank ids will no longer be managed correctly. In this case, close all emulated processes and delete the files /tmp/emul_lock_file and /tmp/emul_process_local_rank if they exist.

Bandwidth emulation

Quartz supports an emulation mode with "throttled" memory bandwidth.

Memory bandwidth emulation uses the copy kernel from the STREAM benchmark (OpenMP version). When bandwidth emulation is enabled for the first time, Quartz builds a memory bandwidth model by programming the available thermal registers in the memory controller and measuring the corresponding memory bandwidth. This initial model-building step may take several minutes (~10 min).

For memory bandwidth emulation, turn off latency modeling in the configuration file and select all available NUMA nodes there, so that the model is prepared for any combination of NUMA node selections.

Modeling data will be cached to these files:

/tmp/bandwidth_model
/tmp/mc_pci_bus

As a first step, the emulator detects the PCI addresses of the memory controller's thermal register controls and caches them to /tmp/mc_pci_bus. After this step, the emulator terminates the current execution to safely clear NUMA bindings; rerun the process to resume.

Quartz then creates the file /tmp/bandwidth_model, which reflects the relationship between thermal register values and achievable memory bandwidth (on a single socket). The line format in this file is:

read <thermal register value> <memory bandwidth MB/s>

This file should contain ascending values of memory bandwidth, ranging from hundreds of MiB/s to tens of GiB/s. These values (or their approximations) can be used for experiments with memory bandwidth throttling. Note that the model is built once: it is cached and then reused for all later experiments. (You can also run the automated script bandwidth-model-building.sh in the benchmark-tests directory; for details see [README-BENCHMARKS-TESTING.md](https://github.hpe.com/labs/quartz/blob/master/README-BENCHMARKS-TESTING.md).)

For example, to enable memory bandwidth throttling at 2 GB/s, you should change the emulator configuration file "nvmemul.ini" using the following settings:

bandwidth:
{
    enable = true;
    model = "/tmp/bandwidth_model";
    read = 2000;
    write = 2000;
};

Both read and write bandwidth values must be set to the same value, since the current version of the emulator does not model reads and writes independently. See the Limitations section.

The pmalloc() family is not intended to be used with bandwidth modeling. Instead, use numactl, for instance, to bind the application's CPU and memory to the intended NUMA node. The bandwidth emulator considers only the virtual NVM node (in a two-socket configuration), so the application must keep its processes/threads and data on the same NUMA node for bandwidth experiments.
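For instance, assuming node 1 is the throttled virtual NVM node (the node id is illustrative), an experiment could keep both threads and data there:

```shell
# Bind CPUs and memory to the virtual NVM node so all traffic is throttled
numactl --cpunodebind=1 --membind=1 scripts/runenv.sh ./my_app
```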

Automated Benchmark Runs

We have created several automated tests with benchmark runs and output analysis for verifying the correctness of the configured emulation environment and the accuracy of the expected results. For details see [README-BENCHMARKS-TESTING.md](https://github.hpe.com/labs/quartz/blob/master/README-BENCHMARKS-TESTING.md).

Limitations

The emulator functionality may be affected by certain conditions in user applications:

  • The application sets thread CPU or memory affinity.
  • The application creates many more concurrent threads than there are cores per socket. Note that in DRAM+NVM emulation mode, half of the available CPU cores are not used for user threads.
  • The application installs a handler for SIGUSR1.

Other limitations:

  • Write memory latency emulation is not yet implemented.
  • Read and write memory bandwidth emulation cannot be set independently.
  • The signal handler may cause syscalls in the application to fail. Implementing syscall retries at the application level is recommended as a good practice.
  • Child processes created by fork() are not tracked by the emulator. As a workaround, the emulator could expose its library initialization function in the external API; applications would then call this function at the beginning of the child process.
  • OpenMP applications may use synchronization primitives not based on pthreads, which are currently not supported.
  • See the Todo section for details.

Todo list

Please see the accompanying TODO.dox or the extended documentation for an extensive list.

License

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version. This program is distributed in the
hope that it will be useful, but WITHOUT ANY WARRANTY; without even
the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
PURPOSE. See the GNU General Public License for more details. You
should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation,
Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

Copyright

    (c) Copyright 2016 Hewlett Packard Enterprise Development LP

NOTE: This software depends on other packages that may be licensed under different open source licenses.


quartz's Issues

Failed in Multiple emulated processes and MPI programs

I exported the EMUL_LOCAL_PROCESSES environment variable with the number of emulated processes on the host. I also tried both the NVM-only mode and the DRAM+NVM mode to run MPI programs and the PARSEC benchmark with multiple programs, but I can't get results.

There is a fatal error in MPI_Finalize, and I don't know how to use this setup to test multiple programs. How can I deal with this problem?

Only print Debug messages when I run my app!

I set latency=true and debug level=5.
The program run produces no output messages, only the debug messages.
Would you have any suggestions on how to resolve this?
Thank you!

Running an application in Quartz

scripts/runenv.sh <your_app>
This is the command mentioned for running an application.
I am trying to run the dhrystone-2.1 benchmark but am not sure how to run it with the Quartz tool.
Please let me know how to run the dhrystone-2.1 benchmark with Quartz.

a question about emulation of DRAM+NVM mode

Hello, I have a question about emulation in DRAM+NVM mode. In the nvmemul.ini file, which type of memory (DRAM or NVM) is affected by the latency settings? When I use NVM-only mode, I find that performance changes with different latencies (set in ./nvmemul.ini) even if I don't use pmalloc and pfree. Can you describe how DRAM+NVM mode is emulated? Thank you!

Unable to run Bandwidth Model

I am having difficulty running bandwidth-model-building.sh: I am getting a segfault. I have checked the configuration files to make sure everything is set as instructed, and with debugging I find that the segfault occurs when intel_xeon_ex_get_throttle_register's regs is set to 0x00.
Would you have any suggestions on how to resolve this?
Would you have any suggestions on how to resolve this?

Additional Info:

  • I am able to run memlat-orig-lat-test.sh and memlat-bench-test-10M.sh without any issues.
  • I have run this on an i7-3740QM (Ivy Bridge) and an i7-4700MQ (Haswell).


CPUs seem inactive

I tried to use Quartz on a machine with 12 CPUs. However, when I run htop after the emulation, only 2 CPUs seem active. How can I restore my CPUs to their initial state?

How to modify xeon-ex.h to support Core CPU?

Hi, I'm trying to get Quartz to work on my Core Skylake CPUs.
According to the paper, the bandwidth model utilizes thermal control registers. On Xeon, the corresponding registers are THRT_PWR_DIMM[0:2]. I looked up the register documentation for Core, and there is no register named THRT_PWR_DIMM. Also, no Core register can set the max number of transactions during the 1 usec throttling time frame per power-throttling event. Is it possible for the bandwidth model to work on Core CPUs?

What is pflush()?

Hi,

I found a pflush() function in the code. Do we need to call it in our user programs in order to inject the PM latency we want?

Some issues in pure PM mode...

I set Quartz to pure PM mode by setting physical_nodes = "0" in nvmemul.ini, and set the read/write latencies both to 1000. Then I ran a test program, which calls malloc() more than 100000 times, using runenv.sh; the runtime is about 0.13 seconds. If I run it without runenv.sh, the runtime is about 0.12 s. If I increase the read/write latencies to 10000 and run with runenv.sh, the runtime is about 0.22 s.

However, if I replace malloc()/free() with pmalloc()/pfree() in the program, the runtime is about 2.2 s. That means that in pure PM mode, pmalloc() and malloc() have an obvious performance gap. But based on my understanding of the README file, pmalloc() and malloc() should have similar performance in a pure PM environment. Am I missing something?

Unexplained Building Error

I've been trying to set up Quartz for a few days. After struggling far more than I should have, I've reached a dead end. When I attempt to run the make clean all command from within quartz-master/build, I run into an error. (This is my second time running the instruction, which is why it begins at 77%.)
I'm assuming the error is that nvmemul.ko is undefined. The only potential cause I can identify is that when I ran scripts/install.sh, the report said that 13 packages were not upgraded. I have configured my CMakeLists.txt file so that there were no errors during the cmake .. command. I haven't been able to find anything by searching, and a friend who is well-versed in Unix did not understand why this error occurred.

I have also tried running the build instructions from within quartz-master/src. The instructions are not clear about what is meant by "the emulator's source code root folder". However, this causes an error 4% into the cmake .. command, so I'm guessing quartz-master/src is not the solution.

Computer Information:
Intel® Core™ i7-2600S CPU @ 2.80GHz × 8 (Sandy Bridge, I believe)
AMD® Turks / AMD® Turks
Ubuntu 21.04, 64-bit
The Ubuntu installation is a partition running natively on a ~2013 iMac.

Statistics showing 0 NVM accesses for a simple linked list code using pmalloc

I used this sample code, a linked list that allocates its nodes with pmalloc:

#include<stdio.h>
#include<stdlib.h>
#include "pmalloc.h"

typedef struct node
{
	int data;
	struct node *next;
}NODE;

void insertAtFront(NODE **head,int x)
{
	NODE *new_node = (NODE*)pmalloc(sizeof(NODE));
	new_node->data = x;
	new_node->next = *head;
	*head = new_node;
}


void insertAfter(NODE *prev,int x)
{
	if(prev==NULL)
	{
		printf("prev can't be NULL\n");
		return;
	}
	NODE *new_node = (NODE*)pmalloc(sizeof(NODE));
	new_node->data = x;
	new_node->next = prev->next;
	prev->next = new_node;
}

void append(NODE **head,int x)
{
	NODE *new_node = (NODE*)pmalloc(sizeof(NODE));
	new_node->data = x;
	new_node->next = NULL;

	NODE *last = *head;
	if(*head==NULL)
	{
		*head = new_node;
		return;
	}
	while(last->next != NULL)
		last = last->next;
	last->next = new_node;
}

void printList(NODE *p)
{
	while(p)
	{
		printf("%d->",p->data);
		p = p->next;
	}
	printf("\n");
}

void deleteElement(NODE **p,int elem)
{
	NODE *temp=*p;
	NODE *prev;
	if(temp != NULL && temp->data == elem) // if elem is at first node
	{
		*p = temp->next;
		free(temp);
	}
	while(temp!=NULL && temp->data!=elem)
	{
		prev=temp;
		temp=temp->next;
	}
	if(temp==NULL) return; // no such element
	prev->next = temp->next;
	free(temp);
}

void deleteAtPosition(NODE **p,int pos)
{
	if(*p==NULL) return;
	NODE *temp = *p;
	if(pos==0)
	{
		*p = temp->next;
		free(temp);
		return;
	}
	int i;
	for(i=0;temp!=NULL && i<pos-1;i++)
		temp = temp->next;   // ultimately gets previous node of the node to be deleted
	if(temp==NULL || temp->next==NULL)
		return;
	NODE *next = temp->next->next;
	free(temp->next);
	temp->next = next;
}

int getLength(NODE *p)
{
	int count = 0;
	while(p)
	{
		count++;
		p = p->next;
	}
	return count;
}

int getLengthRecursive(NODE *p)
{
	if(p==NULL)
		return 0;
	return 1 + getLengthRecursive(p->next);

}

void swapNodes(NODE **p,int x, int y)  
{
	if(x==y) 
		return;

	NODE *prevX=NULL, *prevY=NULL,*X=*p,*Y=*p;

	while(X!=NULL && X->data != x)
	{
		prevX = X;
		X = X->next;
	}

	while(Y!=NULL && Y->data !=	y)
	{
		prevY = Y;
		Y = Y->next;
	}
	
	if(X==NULL || Y == NULL)
		return;

	if(prevX==NULL)
		*p = Y;
	else
		prevX->next = Y;

	if(prevY==NULL)
		*p = X;
	else
		prevY->next = X;

	NODE *temp = X->next;
	X->next = Y->next;
	Y->next = temp;
	
}

void reverse(NODE **p)
{
	NODE *prev=NULL,*curr=*p,*next;
	while(curr!=NULL)
	{
		next = curr->next;
		curr->next=prev;
		prev = curr;
		curr = next;
	}
	*p = prev;	
}

void reverseRecursive(NODE **p)
{
	NODE *node = *p;
	if(node == NULL)
		return;
	NODE *rest = (*p)->next;
	if(rest==NULL)
		return;
	reverseRecursive(&rest);
	node->next->next = node;
	node->next = NULL;
	*p = rest;
}

int main()
{
	NODE *head = NULL;
	append(&head,1);
	insertAtFront(&head,2);
	append(&head,3);
	insertAfter(head->next,10);
	printList(head);
	printf("Length: %d \n",getLength(head));
	printf("Length Recursive: %d \n", getLengthRecursive(head));
	//deleteElement(&head,1);
	printList(head);
	//deleteAtPosition(&head,1);
	printList(head);
	printf("Length: %d \n",getLength(head));
	printf("Length Recursive: %d \n", getLengthRecursive(head));
	swapNodes(&head,2,1);
	printList(head);
	reverse(&head);
	printList(head);
	reverseRecursive(&head);
	printList(head);
	return 0;
}

My current directory contents look like this:

  1. plinkedlist.c
  2. src <src directory of Quartz>
  3. scripts <scripts directory of Quartz>
  4. build <build directory of Quartz>
  5. nvmemul.ini
  6. nvmemul.dox
  7. nvmemul-orig.ini
  8. a.out <the program executable>

I compiled and ran the program with the following commands:
gcc -I src/lib/ plinkedlist.c -L build/src/lib/ -lnvmemul
sudo scripts/setupdev.sh load
scripts/runenv.sh ./a.out

I get the correct program output, but the statistics report 0 NVM accesses, even though that is untrue.

Statistics Output:


===== STATISTICS (Thu Nov 23 22:22:17 2017) =====

PID: 18718
Initialization duration: 2136458 usec
Running threads: 0
Terminated threads: 1

== Running threads == 

== Terminated threads == 
	Thread id [18718]
		: cpu id: 0
		: spawn timestamp: 632629839714
		: termination timestamp: 632629839811
		: execution time: 97 usecs
		: stall cycles: 0
		: NVM accesses: 0
		: latency calculation overhead cycles: 0
		: injected delay cycles: 0
		: injected delay in usec: 0
		: longest epoch duration: 0 usec
		: shortest epoch duration: 0 usec
		: average epoch duration: 0 usec
		: number of epochs: 0
		: epochs which didn't reach min duration: 0
		: static epochs requested: 0

Is there any reason for this, or a mistake I'm making?

Some questions about bandwidth emulator in DRAM+NVM mode

I have read the README.md file and I am confused by the bandwidth emulation.
Consider a dual-socket NUMA environment in which node 1 is configured as a virtual NVM node. Does that mean all memory requests to node 1's local memory are affected by the bandwidth emulation, even if the process is running on node 0?
And what if a process running on node 1 accesses the local memory of node 0 -- will it be affected by the bandwidth emulation?
Sorry if these questions seem naive; I'm not familiar with memory access in a NUMA environment.

Need help to get quartz to work on skylake cpus

Hi, I'm trying to get quartz to work on Skylake cpus.
According to the paper, LDM_STALL is derived from L2stalls, L3 hits L3 miss.. which in turn are derived from the performance counter events on different cpu micro-architecture. I looked up(with papi_native_avail command) the events used on Haswell and found that most of the events still exist on Skylake except CYCLE_ACTIVITY:STALLS_L2_PENDING. The closest event I know is CYCLE_ACTIVITY:STALLS_L2_MISS which counts the Execution stalls while at least one L2 demand load is outstanding . But I'm not sure. So any idea on Skylake which event is equivalent?

By the way, I'm trying to access the native event counters directly instead of going through PAPI, for performance reasons. So I have to assemble the integer encoding of an event id, similar to the number 0x55305a3 here. Any useful references for how this event id is represented?

NVM delay does not work in the middle of program execution

Hello

I have some problem with NVM read delay.

In my case, as the size of the data increases, the NVM read delay seems to stop working partway through program execution. If the data size is small, it works well.

I attached a picture, captured in debug mode, showing the part where the delay did not work.

[screenshot]

What should I do?

I look forward to your reply

My Experiment setup

  • Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz (2 socket)
  • Linux Kernel : 4.4.0-31-generic
  • Ubuntu 14.04.5 LTS
  • RAM : 256GB

Not able to run Scala programs

ERROR: ld.so: object 'scripts/../build/src/lib/libnvmemul.so' from LD_PRELOAD cannot be preloaded (cannot open shared object file): ignored.
/usr/bin/scala: line 19: cd: /usr/share/scala/bin

Hi, I am getting the above error when I try to run Scala programs. There are no issues when I run Java and C applications.

This is the code I am trying to run:

object ForLoop {
  def main(args: Array[String]) {
    var a = 0;
    for (a <- 1 to 100) {
      println("Value of a: " + a);
    }
  }
}

The above code works as usual with plain Scala but, when run with Quartz, it returns an error. The following is the command I used to run the code:
$ scripts/runenv.sh scala ForLoop

Note:

  1. I have extended Quartz to support Broadwell-based processors and ran benchmark tests to verify the configuration. The results looked fine.
  2. C and Java applications have no issues with Quartz.

Error in loading kernel module

Hey, I am getting the error "Unable to load kernel module" when I execute the command below:

sudo scripts/setupdev.sh load

How do I fix this?

Can't run QEMU with Quartz

When I run this:
./scripts/runenv.sh qemu-system --enable-kvm -cpu host -m 8192 -smp 2 -vcpu 0,affinity=0 -vcpu 1,affinity=1 -numa node,mem=4096,cpus=0 -numa node,mem=4096,cpus=1 -drive file=/home/temp/Dyang/centos7-200.qcow2,if=none,id=drive-virtio-disk,format=qcow2 -device virtio-blk-pci,bus=pci.0,drive=drive-virtio-disk,id=virtio-disk -net nic,model=virtio -net tap,script=no -monitor telnet:10.192.168.118:4444,server,nowait -balloon virtio

I get an unexpected error:
qemu-system: ……qemu-gfn/qemu/accel/kvm/kvm-all.c:2380: kvm_ipi_signal: Assertion `kvm_immediate_exit' failed.

I set the debug level to 5 and found nothing in Quartz's output.
But when I run qemu-system without Quartz, it works.

In kvm_ipi_signal, it calls kvm_cpu_kick, which does atomic_set(&cpu->kvm_run->immediate_exit, 1).
From this reference (https://patchwork.ozlabs.org/patch/732808/?tdsourcetag=s_pctim_aiomsg):

The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick" a VCPU out of KVM_RUN through a POSIX signal. A signal is attached to a dummy signal handler; by blocking the signal outside KVM_RUN and unblocking it inside, this possible race is closed:

      VCPU thread                     service thread

    check flag
                                                            set flag
                                                            raise signal
    (signal handler does nothing)
    KVM_RUN

However, one issue with KVM_SET_SIGNAL_MASK is that it has to take tsk->sighand->siglock on every KVM_RUN. This lock is often on a remote NUMA node, because it is on the node of a thread's creator. Taking this lock can be very expensive if there are many userspace exits (as is the case for SMP Windows VMs without Hyper-V reference time counter).

Since Quartz generates the injected delay through IPI interrupts and remote NUMA node memory accesses, will this affect KVM? Does Quartz support QEMU? Does Quartz have any influence on KVM?

Make issue

Hi,
I met the following error while compiling the Quartz code with make clean all, following your steps:

[ 69%] Building C object src/lib/CMakeFiles/nvmemul.dir/stat.c.o
/home/ZHduan/quartz/src/lib/stat.c:19:20: fatal error: utlist.h: No such file or directory
#include "utlist.h"
^
compilation terminated.
make[2]: *** [src/lib/CMakeFiles/nvmemul.dir/stat.c.o] Error 1
make[1]: *** [src/lib/CMakeFiles/nvmemul.dir/all] Error 2
make: *** [all] Error 2

You said no specific Linux distribution or kernel version is required, so what's wrong?
The environment I use is an Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz, CentOS with Linux 3.11.0, and gcc 4.8.5.

How to run with numactl?

In DRAM+NVM mode, Quartz emulates the NVM on one (remote) node and injects the latency (read latency, perhaps?).

So can I assume that accessing the remote node's DRAM behaves like an NVM access?

If so, can I use numactl/mbind on that node to run the app in NVM? What should I change in nvmemul.ini?

Kernel module loading failed when running setupdev.sh load

Hi guys!
I had a problem when I ran the command "sudo scripts/setupdev.sh load": it reports that kernel module loading failed. I don't know how to fix it. I was following the README step by step.
My OS is 4.15.0-46-generic #49-Ubuntu.
And I guess the prerequisites have been installed successfully, because when I use apt-get install xxx it says:

cmake is already the newest version (3.10.2-1ubuntu2).
libconfig-dev is already the newest version (1.5-0.4).
libnuma-dev is already the newest version (2.0.11-2.1).
uthash-dev is already the newest version (2.0.2-1).

I used "apt-get install linux-headers-$(uname -r)" to install the Linux headers; it says:
linux-headers-4.15.0-46-generic is already the newest version (4.15.0-46.49).

I don't know if there is a version incompatibility problem. Could anybody do me a favour?
Big thanks!

Compile errors when running with a C++ program

I run Quartz with my own C++ file, with the command:
g++ -I [Emulator_Path]/quartz/src/lib/ myprogram.cpp -L [Emulator_Path]/quartz/build/src/lib/ -lnvmemul
(it works well with a .c file and the gcc compiler)
But it returns errors:
/usr/include/c++/6/ext/string_conversions.h: In constructor ‘__gnu_cxx::__stoa(_TRet ()(const _CharT, _CharT**, _Base ...), const char*, const _CharT*, std::size_t*, _Base ...)::_Save_errno::_Save_errno()’:
/usr/include/c++/6/ext/string_conversions.h:63:27: error: ‘errno’ was not declared in this scope
_Save_errno() : _M_errno(errno) { errno = 0; }
^
/usr/include/c++/6/ext/string_conversions.h: In destructor ‘__gnu_cxx::__stoa(_TRet ()(const _CharT, _CharT**, _Base ...), const char*, const _CharT*, std::size_t*, _Base ...)::_Save_errno::~_Save_errno()’:
/usr/include/c++/6/ext/string_conversions.h:64:23: error: ‘errno’ was not declared in this scope
~_Save_errno() { if (errno == 0) errno = _M_errno; }
^
/usr/include/c++/6/ext/string_conversions.h: In function ‘_Ret __gnu_cxx::__stoa(_TRet ()(const _CharT, _CharT**, _Base ...), const char*, const _CharT*, std::size_t*, _Base ...)’:
/usr/include/c++/6/ext/string_conversions.h:72:16: error: ‘errno’ was not declared in this scope
else if (errno == ERANGE
^
In file included from /usr/include/c++/6/bits/basic_string.h:5420:0,
from /usr/include/c++/6/string:52,
from /usr/include/c++/6/bits/locale_classes.h:40,
from /usr/include/c++/6/bits/ios_base.h:41,
from /usr/include/c++/6/ios:42,
from /usr/include/c++/6/ostream:38,
from /usr/include/c++/6/iostream:39,
from /home/lishuai/fwang/quartz/reram_test.cpp:2:
/usr/include/c++/6/ext/string_conversions.h:72:25: error: ‘ERANGE’ was not declared in this scope
else if (errno == ERANGE

The original output has a lot of "XXX was not declared in this scope" errors, and I have already fixed some, but I need help with the rest.
Has anybody met the same problem, or can you give some suggestions? Thank you.

Can Quartz support other CPUs?

My CPU is a 7th-gen Core i5, which is Kaby Lake rather than one of the three CPU families mentioned in the article. Can I build and run Quartz successfully?
Besides, can I run Quartz in a virtual machine with a Linux OS?

No rule to make target pmc.o

Hi,
I met the following error while compiling the Quartz code:

[root@localhost build]# make
[ 8%] Built target cpu
[ 82%] Built target nvmemul
[ 86%] Device]
make[5]: *** No rule to make target `/home/sbl/Quartz/quartz-master/build/src/dev/pmc.o', needed by `/home/sbl/Quartz/quartz-master/build/src/dev/nvmemul.o'. Stop.
make[4]: *** [module/home/sbl/Quartz/quartz-master/build/src/dev] Error 2
gmake[3]: *** [all] Error 2
make[2]: *** [src/dev/nvmemul.ko] Error 2
make[1]: *** [src/dev/CMakeFiles/dev_build.dir/all] Error 2
make: *** [all] Error 2

The environment I use is a 2-socket Xeon 5600, CentOS 7, Linux 4.10, gcc 4.8.5.
I have installed all the required packages listed in README.md and compiled the code with the following steps:

mkdir build
cd build
cmake ..
make

and the aforementioned error occurs...
Any suggestions?

Thank you very much.

Computer configuration

My computer is a very common personal computer: Intel Core i3, Ubuntu 14.04. Can I install Quartz successfully?

The number of physical nodes is greater than the number of memory-controller pci buses

When I first ran the bandwidth benchmark test in the benchmark test directory, it showed "The number of physical nodes is greater than the number of memory-controller pci buses". The resulting output is shown below:
[screenshot 1]
It shows that the memory-controller PCI topology file was saved, but there is no data in /tmp/mc_pci_bus.
When I then ran the bandwidth benchmark test again, it reported that no complete memory-controller PCI topology could be found, followed by a segmentation fault. The resulting output is shown below:
[screenshot 2]

Thanks in advance for any help! My CPU model is Haswell.

tee: /sys/bus/event_source/devices/cpu/rdpmc: No such file or directory?

If you meet this issue:
tee: /sys/bus/event_source/devices/cpu/rdpmc: No such file or directory

I think the problem might be:

  1. Your Quartz is built on a virtual machine. Referring to https://stackoverflow.com/questions/19763070/ubuntu-12-10-perf-stat-not-supported-cycles/44253130#44253130, I guess RDPMC is still unavailable on most virtual machines (at least I tried Ubuntu 14.04, 16.04, and 18.04 and CentOS 7.0, with Linux kernels 4.4 and 4.11 respectively).

  2. I'm still exploring other solutions to support virtual machines with Quartz.

Xeon E5-2620 v4 @ 2.10GHz: No supported processor found

Hello, I would like to ask a question. My server's CPU model is a Xeon E5-2620 v4 @ 2.10GHz. When executing the runenv.sh script, it prints "[16811] ERROR: No supported processor found". I want to determine whether this processor meets the requirements.

Releasing a VideoCapture object in OpenCV takes more and more time.

I'm running an OpenCV program with Quartz. The program reads a lot of videos from a dataset and extracts some frames from each video. But as the loop goes on, the VideoCapture object's release function takes more and more time. At the beginning, release() takes a few milliseconds, then hundreds of milliseconds, and finally the program has to wait seconds for release().

Here is my program:

#include <fstream>
#include <iostream>
#include <string>
#include <cstdio>
#include <random>
#include <algorithm>

#include <opencv2/core/core.hpp>
#include <opencv2/core/version.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/highgui/highgui_c.h>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/opencv.hpp>

#include <sys/time.h>

#include "/home/liupai/hme-workspace/hme-opencv-test/quartz/src/lib/pmalloc.h"

using namespace std;

void ImageChannelToBuffer(const cv::Mat* img, char* buffer, int c)
{
    int idx = 0;
    for (int h = 0; h < img->rows; ++h) {
        for (int w = 0; w < img->cols; ++w) {
            buffer[idx++] = img->at<cv::Vec3b>(h, w)[c];
        }
    }
}
int data_size = 0;
int read_video_to_volume_datum(const char* filename, const int start_frm,
    const int label, const int length, const int height, const int width,
    const int sampling_rate, char** datum)
{
    cv::VideoCapture cap;
    cv::Mat img, img_origin;

    int offset = 0;
    int channel_size = 0;
    int image_size = 0;
    
    int use_start_frm = start_frm;

    cout << "\n#######Start!!!! cap.open file" << endl;
    cap.open(filename);
    if (!cap.isOpened()) {
        cout << "Cannot open " << filename << endl;
        return false;
    }

    int num_of_frames = cap.get(CV_CAP_PROP_FRAME_COUNT) + 1;
    if (num_of_frames < length * sampling_rate) {
        cerr << filename << " does not have enough frames; having "
             << num_of_frames << endl;
        return false;
    }

    offset = 0;
    if (use_start_frm < 0) {
        cerr << "start frame must be greater or equal to 0" << endl;
    }

    int end_frm = use_start_frm + length * sampling_rate - 1;
    if (end_frm > num_of_frames) {
        cerr << "end frame must be less or equal to num of frames, "
             << "filename: " << filename << endl;
    }

    if (use_start_frm) {
        cout << "\033[31m"
             << "use_start_frm: " << use_start_frm
             << ", end_frame: " << end_frm
             << ", num_of_frames: " << num_of_frames
             << ", filename: " << filename
             << "\033[0m" << endl;
        cap.set(CV_CAP_PROP_POS_FRAMES, use_start_frm - 1);
    }

    for (int i = use_start_frm; i <= end_frm; i += sampling_rate) {
        if (sampling_rate > 1) {
            cap.set(CV_CAP_PROP_POS_FRAMES, i);
        }

        if (height > 0 && width > 0) {
            cap.read(img_origin);
            if (!img_origin.data) {
                cerr << filename << " has no data at frame " << i << endl;
                if (*datum != NULL) {
                    pfree(datum, data_size);
                }
                cap.release();
                return false;
            }
            cout << "resize img_origin" << endl;
            cv::resize(img_origin, img, cv::Size(width, height));
        } else {
            cap.read(img);
        }

        if (!img.data) {
            cerr << "Could not open or find file " << filename << endl;
            if (*datum != NULL) {
                pfree(datum, data_size);
            }
            cap.release();
            return false;
        }

        if (i == use_start_frm) {
            image_size = img.rows * img.cols;
            channel_size = image_size * length;
            data_size = channel_size * 3;
            *datum = (char*)pmalloc(data_size*sizeof(char));
        }

        for (int c = 0; c < 3; c++) {
            ImageChannelToBuffer(&img, *datum + c * channel_size + offset, c);
        }
        cout << "offset = " << offset << endl;
        offset += image_size;
        img_origin.release();
    }
    cout << "\033[32mstart cap.release()\033[0m" << endl;
    struct timeval tv_begin, tv_end;
    gettimeofday(&tv_begin, NULL);
    cap.release();
    gettimeofday(&tv_end, NULL);
    cout << "cap.release(): " << 1000.0*(tv_end.tv_sec - tv_begin.tv_sec)
        + (tv_end.tv_usec - tv_begin.tv_usec)/1000.0 << " ms." << endl;
    cout << "\033[32mend cap.release()\033[0m" << endl;
    return true;
}

void shuffle_clips(vector<int>& shuffle_index){
    std::random_device rd;
    std::mt19937 g(rd());
    std::shuffle(shuffle_index.begin(), shuffle_index.end(), g);
}

int main()
{
    const string root_folder = "/home/liupai/hme-workspace/train-data/UCF-101/";
    const string list_file = "/home/liupai/hme-workspace/workspace/C3D/C3D-nvram/examples/c3d_ucf101_finetuning/train_02.lst";

    cout << "opening file: " << list_file << endl;
    std::ifstream list(list_file.c_str());

    vector<string> file_list_;
    vector<int> start_frm_list_;
    vector<int> label_list_;
    vector<int> shuffle_index_;

    int count = 0;
    string filename;
    int start_frm, label;
    while (list >> filename >> start_frm >> label) {
        file_list_.push_back(filename);
        start_frm_list_.push_back(start_frm);
        label_list_.push_back(label);
        shuffle_index_.push_back(count);
        count++;
    }
    shuffle_clips(shuffle_index_);

    const int dataset_size = shuffle_index_.size();
    const int batch_size = 30;
    const int new_length = 8;
    const int new_height = 128;
    const int new_width = 171;
    const int sampling_rate = 1;
    char* datum = NULL;
    int lines_id_ = 0;

    const int max_iter = 20000;
    for (int iter = 0; iter < max_iter; ++iter) {
        
        for (int item_id = 0; item_id < batch_size; ++item_id) {
            cout << "------> iter: " << iter << endl;
            bool read_status;
            int id = shuffle_index_[lines_id_];
            read_status = read_video_to_volume_datum((root_folder + file_list_[id]).c_str(), start_frm_list_[id],
                label_list_[id], new_length, new_height, new_width, sampling_rate, &datum);
            if (read_status) {
                pfree(datum, data_size);
            }

            lines_id_++;
            if (lines_id_ >= dataset_size) {
                // We have reached the end. Restart from the first.
                cout << "Restarting data prefetching from start." << endl;
                lines_id_ = 0;
            }
        }
    }
    cout << "$$$$$$$$$$$$$$ read file finish!!!!!!!!!!!!" << endl;
}

Here is the output:

# At the beginning

------> iter: 0
#######Start!!!! cap.open file
use_start_frm: 65, end_frame: 72, num_of_frames: 179, filename: /home/liupai/hme-workspace/train-data/UCF-101/PlayingViolin/v_PlayingViolin_g24_c02.avi
...
start cap.release()
cap.release(): 3.018 ms.
end cap.release()

------> iter: 0
#######Start!!!! cap.open file
use_start_frm: 1, end_frame: 8, num_of_frames: 202, filename: /home/liupai/hme-workspace/train-data/UCF-101/TrampolineJumping/v_TrampolineJumping_g18_c01.avi
...
start cap.release()
cap.release(): 3.062 ms.
end cap.release()

------> iter: 0
#######Start!!!! cap.open file
use_start_frm: 81, end_frame: 88, num_of_frames: 296, filename: /home/liupai/hme-workspace/train-data/UCF-101/PommelHorse/v_PommelHorse_g12_c03.avi
...
start cap.release()
cap.release(): 2.453 ms.
end cap.release()

------> iter: 0
#######Start!!!! cap.open file
use_start_frm: 49, end_frame: 56, num_of_frames: 272, filename: /home/liupai/hme-workspace/train-data/UCF-101/StillRings/v_StillRings_g22_c04.avi
...
start cap.release()
cap.release(): 2.146 ms.
end cap.release()

------> iter: 0
#######Start!!!! cap.open file
use_start_frm: 225, end_frame: 232, num_of_frames: 252, filename: /home/liupai/hme-workspace/train-data/UCF-101/HeadMassage/v_HeadMassage_g08_c03.avi
...
start cap.release()
cap.release(): 2.136 ms.
end cap.release()

------> iter: 0
#######Start!!!! cap.open file
use_start_frm: 49, end_frame: 56, num_of_frames: 106, filename: /home/liupai/hme-workspace/train-data/UCF-101/Bowling/v_Bowling_g19_c07.avi
...
start cap.release()
cap.release(): 3.315 ms.
end cap.release()

# After about 400 iterations

------> iter: 437
#######Start!!!! cap.open file
use_start_frm: 113, end_frame: 120, num_of_frames: 376, filename: /home/liupai/hme-workspace/train-data/UCF-101/Kayaking/v_Kayaking_g13_c04.avi
...
start cap.release()
cap.release(): 301.021 ms.
end cap.release()

------> iter: 437
#######Start!!!! cap.open file
use_start_frm: 49, end_frame: 56, num_of_frames: 141, filename: /home/liupai/hme-workspace/train-data/UCF-101/ApplyLipstick/v_ApplyLipstick_g20_c04.avi
...
start cap.release()
cap.release(): 301.74 ms.
end cap.release()

------> iter: 437
#######Start!!!! cap.open file
use_start_frm: 209, end_frame: 216, num_of_frames: 230, filename: /home/liupai/hme-workspace/train-data/UCF-101/BlowDryHair/v_BlowDryHair_g18_c03.avi
...
start cap.release()
cap.release(): 302.311 ms.
end cap.release()

------> iter: 438
#######Start!!!! cap.open file
use_start_frm: 177, end_frame: 184, num_of_frames: 307, filename: /home/liupai/hme-workspace/train-data/UCF-101/BoxingPunchingBag/v_BoxingPunchingBag_g08_c01.avi
....
start cap.release()
cap.release(): 351.546 ms.
end cap.release()

------> iter: 438
#######Start!!!! cap.open file
use_start_frm: 49, end_frame: 56, num_of_frames: 113, filename: /home/liupai/hme-workspace/train-data/UCF-101/FrontCrawl/v_FrontCrawl_g21_c06.avi
...
start cap.release()
cap.release(): 292.598 ms.
end cap.release()
------> iter: 438

0 NVM accesses

When I try to run a program I don't get the correct output, especially for NVM accesses. Is it due to this:
"tee: /sys/bus/event_source/devices/cpu/rdpmc: No such file or directory"?

Errors regarding copy_from_user and libelf-dev are emitted during build

When I compile Quartz on Ubuntu 16.04 with kernel 4.15.0.29, I get three errors:

Makefile:976: "Cannot use CONFIG_STACK_VALIDATION=y, please install libelf-dev, libelf-devel or elfutils-libelf-devel"

/home/hadi/code/quartz/build/src/dev/pmc.c: In function ‘pmc_ioctl_setcounter’:
/home/hadi/code/quartz/build/src/dev/pmc.c:171:9: error: implicit declaration of function ‘copy_from_user’ [-Werror=implicit-function-declaration]
if (copy_from_user(&q, (ioctl_query_setcounter_t*) arg, sizeof(ioctl_query_

/home/hadi/code/quartz/build/src/dev/pmc.c: In function ‘pmc_ioctl_getpci’:
/home/hadi/code/quartz/build/src/dev/pmc.c:224:17: error: implicit declaration of function ‘copy_to_user’ [-Werror=implicit-function-declaration]
if (copy_to_user((ioctl_query_setgetpci_t*) arg, &q, sizeof(ioctl_q

I resolved the first error by installing libelf-dev. Note that this library is not included in the script scripts/install.sh. I resolved the other two errors by modifying pmc.c so that it includes linux/uaccess.h instead of asm/uaccess.h.

After making these changes, the build completes successfully.

CPU support issue

It looks like, by default, only three CPU families are supported:

In /src/lib/cpu/known_cpus.h line 21:

cpu_model_t* known_cpus[] = {
    &cpu_model_intel_xeon_ex_v3,
    &cpu_model_intel_xeon_ex_v2,
    &cpu_model_intel_xeon_ex,
    0
};

My question is: can we add our own CPU model names to this list without causing any trouble, as long as the CPU belongs to one of the three supported processor families (Sandy Bridge, Ivy Bridge, and Haswell)?

Wondering about statistics

Hello,
during experiments in pure PM mode,
I found that the number of NVM accesses differs widely between trials, as shown below.
I only changed the read and write latencies in nvmemul.ini.

[screenshot]

Are there any other configuration steps I should take to get correct emulation results?

The program uses malloc() and free(), and I run the script after loading the nvmemul module.

scripts/runenv.sh prog.exe args

The CPU information is as follows:

  • 2-socket, Haswell, 2-way E5-4650v3

Usage of undocumented performance events on Haswell

Quartz uses the two encodings 0x530cd3 and 0x5303d3 for the events MEM_LOAD_UOPS_L3_MISS_RETIRED.REMOTE_DRAM and MEM_LOAD_UOPS_L3_MISS_RETIRED.LOCAL_DRAM, respectively. However, these encodings are only documented in the Intel manual for Ivy Bridge and not Haswell. Instead, on Haswell, the encodings to be used should be 0x5304d3 and 0x5301d3, respectively.

Can DRAM+NVM mode run on Sandy Bridge?

My CPU is an Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz with two sockets.

When I load the module, it detects Sandy Bridge.

Does Quartz support D+N mode on Sandy Bridge?

Segmentation Fault Error

Hello, I am trying to execute an application through the emulator. My application executes successfully on the native machine. I link it with the emulator by adding the following flags to the compilation:

-I/NVMemul/quartz/src/lib/ -L/NVMemul/quartz/build/src/lib/ -lnvmemul

However, when I try to execute the app through the runenv.sh script I receive the following error:

../quartz/scripts/../build/src/lib/libnvmemul.so
../quartz/scripts/../nvmemul.ini
../quartz/scripts/runenv.sh: line 57: 25128 Segmentation fault (core dumped) $@

I have executed applications in the past successfully with these flags. Is there anything else that I am missing?

Do Scala programs run on Quartz?

I just wanted to know if anyone was able to run Scala programs on Quartz. If so, what changes did you make to be able to run them?

Quartz affects the application's thread sleep time in hybrid (DRAM+NVM) mode

I wrote a sample program that randomly allocates memory to DRAM (using malloc) and NVM (using pmalloc), plus a background thread that is supposed to print the total bytes allocated to NVM and DRAM every second.

#include <iostream>
#include <cstdlib>
#include <chrono>
#include <thread>
#include <pthread.h>

using namespace std::chrono;

size_t nvm_size = 0;
size_t dram_size = 0;
high_resolution_clock::time_point start;
high_resolution_clock::time_point stop;
bool status = true;

void print_all() {
    stop = high_resolution_clock::now();
    milliseconds time = duration_cast<milliseconds>(stop-start);
    std::cout << time.count() << "\t" << nvm_size << "\t" << dram_size << std::endl;
}

void start_time() {
    start = high_resolution_clock::now();
    while (status) {
        print_all();
        std::this_thread::sleep_for(seconds(1));
    }
}

void stop_time() {
    status = false;
}

void add_nvm_size(size_t size) {
    nvm_size += size;
}

void remove_nvm_size(size_t size) {
    nvm_size -= size;
}

void add_dram_size(size_t size) {
    dram_size += size;
}

void remove_dram_size(size_t size) {
    dram_size -= size;
}

// void *allocate_nvm(size_t size) {
//     return pmalloc(size);
// }

void *allocate_dram (size_t size) {
    return malloc(size);
}

int main(int argc, char *argv[]) {
    std::thread (start_time).detach();

    int count=1;
    
    while(count<=10000000) {
        int random = rand() % 4;

        if (random==0) {

            allocate_dram (67108864);
            add_dram_size(67108864);
            // std::cout<<count<<"- Allocated in DRAM"<<"\tDRAM SIZE: "<<dram_size<<std::endl;


        }
        else if(random==1){

            allocate_dram (67108864);
            add_nvm_size(67108864);
            // std::cout<<count<<"- Allocated in NVRAM"<<"\tNVRAM SIZE: "<<nvm_size<<std::endl;

        }
        else if(random==2){

            if(dram_size>=67108864) {
                remove_dram_size(67108864);
                // std::cout<<count<<"- Freed from DRAM"<<"\tDRAM SIZE: "<<dram_size<<std::endl;

            }
            // else
                // std::cout<<count<<"- Not Enough Memory Allocated in DRAM to be freed"<<"\tDRAM SIZE: "<<dram_size<<std::endl;



        }
        else if(random==3){

            if(nvm_size>=67108864) {
                remove_nvm_size(67108864);
                // std::cout<<count<<"- Freed from NVRAM"<<"\tNVRAM SIZE: "<<nvm_size<<std::endl;

            }
            // else
                // std::cout<<count<<"- Not Enough Memory Allocated in NVRAM to be freed"<<"\tNVRAM SIZE: "<<nvm_size<<std::endl;


        }

        count++;

    }
    stop_time();
    return 0;
}

The program above outputs correctly outside Quartz, displaying a line every second. On the left is the time in milliseconds, followed by the bytes allocated on NVM and the bytes allocated on DRAM.

time    NVM    DRAM
0	201326592	0
1000	2885681152	1275068416
2000	30735859712	3288334336
3000	16911433728	138512695296
4040	37983617024	191797133312
5042	14159970304	129654325248
6361	38453379072	189918085120
7363	33554432000	108045271040
8365	15099494400	109521666048
9366	24763170816	117306294272

When I run this program with Quartz in hybrid mode, it prints output every 10 milliseconds instead.

0	268435456	7650410496
10	1879048192	10401873920
20	2415919104	6845104128
30	2483027968	12616466432
40	3556769792	11811160064
50	4496293888	11408506880
60	536870912	17783848960
70	11072962560	16575889408
80	8120172544	9663676416
90	8657043456	7583301632
100	4966055936	268435456
110	939524096	1006632960
120	1476395008	2617245696
130	1946157056	10066329600
140	1073741824	14898167808
150	2281701376	15502147584
160	1744830464	17448304640
...

So Quartz is not affecting the functionality of the thread, but it is affecting the thread's sleep time.
I have not set EMUL_LOCAL_PROCESSES. Do I need to? Also, why would Quartz affect only the sleep time of an application thread?

Broadwell Intel processors not supported

Hello, my server's CPU is a Xeon E5-2630 v4 @ 2.20GHz, a Broadwell processor, and it's not supported.
Could you please modify the program to support this newer processor?

How to set DRAM+NVM or single NVM?

I don't understand from the documentation how we can select the mode we want to use. As I understand it, we define the parameters of the NVM in the nvmemul.ini file; however, how do we select which mode of the emulator to use? Thanks.

Setting read throttling register?

Hello,

Bandwidth throttling worked for me only after also setting the THROTTLE_DDR_READ register
in the __set_read_bw function, specifically for runs after the training phase, when the bandwidth model file is already present. Is this correct?

__set_read_bw() {
    ...
    node->cpu_model->set_throttle_register(regs, THROTTLE_DDR_ACT,
        read_bw_model.throttle_reg_val[point]);
    // Added statement
    node->cpu_model->set_throttle_register(regs, THROTTLE_DDR_READ,
        read_bw_model.throttle_reg_val[point]);
    ...
}
