ooibc88 / gam

Globally Addressable Memory management (efficient distributed memory management via RDMA and caching)

Makefile 0.61% C++ 94.18% Shell 1.05% C 2.64% M4 1.52%

gam's Introduction

Overview

GAM (Globally Addressable Memory) is a distributed memory management platform which provides a global, unified memory space over a cluster of nodes connected via RDMA (Remote Direct Memory Access). GAM allows nodes to employ a cache to exploit the locality in global memory accesses, and uses an RDMA-based, distributed cache coherency protocol to keep cached data consistent. Unlike existing distributed memory management systems which typically employ Release Consistency and require synchronization primitives to be explicitly called for data consistency, GAM enforces the PSO (Partial Store Order) memory model which ensures data consistency automatically and relaxes the Read-After-Write and Write-After-Write ordering to remove costly writes from critical program execution paths. For more information, please refer to our VLDB'18 paper.

Build & Usage

Prerequisite

libibverbs
boost thread
boost system
gcc 4.8.4+

GAM Core

First build libcuckoo in the lib/libcuckoo directory by following the README.md file in that directory, and then go to the src directory and run make therein.

  cd src;
  make -j;

Test and Micro Benchmark

We provide an extensive set of tools to test and benchmark GAM. These tools are contained in the test directory, and also serve the purpose of demonstrating the usage of the APIs provided in GAM. To build them, simply run make -j in the test directory.

A script, benchmark-all.sh, is provided in the scripts directory to facilitate benchmarking GAM. This script was also used to generate the micro benchmark results in the GAM paper. To run it, a slaves file must be provided in the same directory. Each line of the slaves file contains the IP address and port (separated by a space) of a node involved in the benchmark, and the file must contain at least as many lines as there are nodes being benchmarked. Multiple parameters can be varied for thorough benchmarking; please refer to our paper for details.
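The slaves file format described above can be checked before launching a benchmark. The following is a minimal sketch, not part of GAM: the helper name check_slaves, the file name slaves.example, and the addresses and ports are all placeholders, and the check assumes IPv4 dotted-quad addresses (hostnames would be rejected).

```shell
#!/bin/sh
# Hypothetical helper (not part of GAM): verify that a slaves file has at
# least as many well-formed "<ip> <port>" lines as there are nodes.
check_slaves() {
    file=$1
    nodes=$2
    # Count lines with exactly two fields: an IPv4-looking address and a numeric port.
    valid=$(awk 'NF == 2 && $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $2 ~ /^[0-9]+$/ { n++ }
                 END { print n + 0 }' "$file")
    if [ "$valid" -ge "$nodes" ]; then
        echo "ok: $valid usable entries for $nodes nodes"
    else
        echo "error: only $valid usable entries for $nodes nodes"
        return 1
    fi
}

# Example file with placeholder addresses and ports (assumed values).
example="${TMPDIR:-/tmp}/slaves.example"
cat > "$example" <<'EOF'
192.168.0.1 12345
192.168.0.2 12345
192.168.0.3 12345
EOF

check_slaves "$example" 3
```

The real file must be named slaves and placed next to benchmark-all.sh, as described above; the example writes to a temporary path only to avoid overwriting an existing configuration.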

Applications

We build two distributed applications on top of GAM using the APIs GAM provides: a distributed key-value store and a distributed transaction processing engine. To build them, simply run the commands below:

  cd dht
  make -j
  cd ../database
  make -j

Macro Benchmark

A script, kv-benchmark.sh, is provided in the dht directory to benchmark the key-value store. To run it, please change the variables in the script according to the experimental setting. Several parameters can also be varied for benchmarking, such as the number of threads, the get ratio, and the number of nodes. Please refer to the GAM paper and the script for details.

To run the TPCC benchmark, please follow the instructions of the README file in the database directory.

FaRM

We implement the FaRM system as a baseline for the macro benchmark. To build the FaRM codebase, please run the commands below:

  git checkout farm 
  cd src
  make -j

We also provide several tools to test and benchmark our FaRM implementation. Please go to the test directory and run make -j therein to build those tools. All tools except farm-cluster-test can be run directly. For farm-cluster-test, a script run_farm_cluster.sh is provided in the scripts directory. Please change the variables in that script according to the deployment environment.

References

[1] Qingchao Cai, Wentian Guo, Hao Zhang, Gang Chen, Beng Chin Ooi, Kian-Lee Tan, Yong Meng Teo, and Sheng Wang. Efficient Distributed Memory Management with RDMA and Caching. PVLDB, 11(11): 1604-1617, 2018. DOI: https://doi.org/10.14778/3236187.3236209.

[2] Aleksandar Dragojević, Dushyanth Narayanan, Orion Hodson, and Miguel Castro. FaRM: Fast remote memory. Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation. 2014.

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Notice

The TPCC benchmark code in the database directory is adapted from an open source project Cavalia, which can be found at

  https://github.com/Cavalia/Cavalia

In addition, this project uses the event loop implementation of Redis, which can be found at

  https://redis.io/

gam's Issues

Adding MFence to enforce SC consistency doesn't work as expected

Hi @ooibc88 @cac2003 @guowentian

I have been trying to understand the impact of stronger consistency guarantees on application performance in GAM. To this end, I tried to enforce SC consistency by adding an MFence operation after each write (as suggested in Section 4 of the paper: “For example, sequential consistency can be easily achieved by inserting MFence following each Write operation.”). Below are details on the experimental setup, methodology and results.

Experiment setup:

Two servers VM1 and VM2 with 512MB of local memory, and all memory used as cache.
One server VM3 with all available DRAM used as local memory (~10GB), and no cache.

Therefore VM1 and VM2 fetch data from VM3 and keep it in their local cache.

Method:

I replayed several memory traces captured from different applications against GAM, under two scenarios (listed below), and recorded the execution time for both of them. The memory footprint of the application (~1GB) is larger than local cache size (512MB), so there are evictions along with invalidations. All memory accesses are 1 byte.

Scenario 1: Run an application with 10 threads on VM1, PSO consistency.
Scenario 2: Run an application with 10 threads on VM1, enforce SC with memory fences.

Result:

I expected Scenario 2 to be slower since writes can no longer be asynchronous. However, Scenario 2 was actually faster than Scenario 1 (by 5%-10%).

Questions:

Is the MFence operation completely supported in the current code base?
Are there any benchmarks that compare SC and PSO consistency in the repo?

Thank you for taking the time to read this issue --- I would really appreciate any help!

Abnormal memory access latency when using multiple servers

Hi @ooibc88 @cac2003 @guowentian

I ran some performance benchmarks on GAM that yield unexpected latency numbers when I increase the number of servers, and I was hoping to get some insights from you regarding them. Below are details on the experimental setup, methodology and results.

Experiment setup:

Two servers VM1 and VM2 with 512MB of local memory, and all memory used as cache.
One server VM3 with all available DRAM used as local memory (~10GB), and no cache.

Therefore VM1 and VM2 fetch data from VM3, and keep it in their local cache.

Method:

Scenario 1: Replay the memory traces for 10 threads on VM1, keep VM2 idle.
Scenario 2: Replay the memory traces for 10 threads on VM1 and 10 threads on VM2; this means that there are invalidations between the VMs due to shared memory accesses.

Results:

I expected Scenario 2 to be slower due to more invalidations between VM1 and VM2, but found Scenario 2 was actually faster than Scenario 1.

To understand the results better, I profiled the memory access latency in GAM, separating the latency for local and remote memory accesses (as shown in the table below; only measured for read operations, since write operations are always asynchronous under the PSO model).

              Local access latency (us)   Remote access latency (us)
Scenario 1    2.2                         299
Scenario 2    1.4                         84

Even though there are invalidations in Scenario 2, the remote access latency is smaller for Scenario 2 compared to Scenario 1. Also there is a slight speed up in local memory accesses in Scenario 2.
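For scale, the gap between the two scenarios can be expressed as ratios. This is a minimal sketch that only redoes the arithmetic on the four latency figures reported above; the helper name ratio is illustrative.

```shell
#!/bin/sh
# ratio A B: print A divided by B, rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Latencies in microseconds, copied from the measurements above
# (Scenario 1 divided by Scenario 2).
echo "remote access: $(ratio 299 84)x lower in Scenario 2"
echo "local access:  $(ratio 2.2 1.4)x lower in Scenario 2"
```

This works out to roughly 3.56x for remote accesses and 1.57x for local accesses.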

Despite extensive profiling, I was unable to explain this strange behavior; is this expected? If so, why? Thank you for taking the time to read this issue --- I would really appreciate any help!

Failed to run benchmark-all.sh

Hi @cac2003 @guowentian @ooibc88

Since an IB network is not available to us, I adapted GAM to run on RoCE (thanks @charles-typ).
However, we ran into problems when running ./scripts/benchmark-all.sh.
Experiment setup (3 VMs):

GCC: 10.3.1
Kernel: 5.10.0
remote_ratio > 0

Here is the output (including some customized logs):

[10744] 03 Feb 16:46:12.129 - [benchmark.cc:658-main()] #Node ID = 1
[6344] 03 Feb 16:46:13.088 - [benchmark.cc:658-main()] #Node ID = 2
[5008] 03 Feb 16:46:14.101 - [benchmark.cc:658-main()] #Node ID = 3
cannot find the key(2) for hash table widCliMap (key not found in table)cannot find the key(1) for hash table widCliMap (key not found in table)[5008] 03 Feb 16:46:17.101 - [benchmark.cc:668-main()] Get 1 on node 3
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:668-main()] Get 2 on node 3
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:668-main()] Get 3 on node 3
[5008] 03 Feb 16:46:17.101 - [benchmark.cc:671-main()] ###All workers started, reported by node 3###
[5012] 03 Feb 16:46:17.102 - [benchmark.cc:144-Init()] start init
cannot find the key(2) for hash table widCliMap (key not found in table)[10746] 03 Feb 16:46:17.101 - [master.cc:226-ProcessRequest()] unrecognized work request 1

Does anyone have some experience with this?
Thanks!!!

Unable to allocate hash table

Thanks for providing your source code online for reproduction!
I tried to run the benchmark script (scripts/benchmark-all.sh) but unfortunately it claims that it is unable to allocate the hash table: [worker.cc:92-Worker()] Unable to allocate hash table!!!!
It looks like htable = sb.sb_aligned_malloc(NBKT * BKT_SIZE, BKT_SIZE); is not executed, but I do not know why... I tried some dirty fixes but finally, I gave up.
Do you know what I should change? Thanks in advance!
