argodsm's Issues

Using argodsm in a server client style application

I was wondering if argodsm can be used for SHM-style communication between server and client applications that don't share code. For example, I want to create a complex graph in one application, pass the pointer to another application, and traverse that graph there. With node-local SHM I would just memory-map the same region in both applications; can I do this in argodsm?

One way I see of doing this is to allocate the memory I want to share and send the pointer over, but I was a little confused by the tutorial.

The tutorial explains that conew_() returns the same pointer on every instance of a parallel program:

This allocation function argo::conew_array is run on all the nodes, returning the same pointer to all of them, thus initializing the data variable in all of them.
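For reference, here is a minimal sketch of the pattern I understand the tutorial to describe (the sizes and values here are arbitrary assumptions):

#include "argo/argo.hpp"

int main() {
	argo::init(1024*1024*1024UL);             // 1 GiB of global memory
	int* data = argo::conew_array<int>(100);  // same pointer returned on every node
	if (argo::node_id() == 0) {
		data[0] = 42;                         // initialize on one node
	}
	argo::barrier();                          // make the write visible everywhere
	// every node can now read data[0] through its own copy of the pointer
	argo::codelete_array(data);
	argo::finalize();
}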

But I don't understand how the implementation knows which call corresponds to which pointer?

Appreciate any help!

Support for more than 64 nodes

The current implementation of the Pyxis directory uses two unsigned long values (assumed to be 64 bits) to represent the sharers and writers of each page, with each node represented by one bit. This puts a hard limit of 64 nodes on the system. Any fixed-width integer type imposes the same kind of limit on the number of nodes possible.

Ideally I believe this should be handled inside a Pyxis class with a proper interface, or at least a wrapper class that does not expose the internals of the storage. A quicker solution might be to use vector<bool> as each bool is represented by one bit. vector<bool> has some drawbacks and is not guaranteed to be contiguous for all sizes, so this should be investigated first.
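A hypothetical sketch of the kind of wrapper interface I have in mind (the name and methods are assumptions, not existing ArgoDSM code); it hides the storage and removes the fixed 64-node limit:

#include <cstddef>
#include <vector>

class node_set {
	std::vector<unsigned long> bits;  // one bit per node, any node count
public:
	explicit node_set(std::size_t num_nodes)
		: bits((num_nodes + 63) / 64, 0UL) {}
	void add(std::size_t node)    { bits[node / 64] |=  (1UL << (node % 64)); }
	void remove(std::size_t node) { bits[node / 64] &= ~(1UL << (node % 64)); }
	bool contains(std::size_t node) const {
		return (bits[node / 64] >> (node % 64)) & 1UL;
	}
};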

Documentation for behaviour when shared memory smaller than cache

The behaviour proposed in PRs #19 and #23 is to reduce the cache size so that it fits the shared memory exactly once, but this is not properly documented.

The reasoning is that initializing a larger cache takes more time and provides no benefit, since there is no additional shared memory that could take advantage of the larger cache.

Under normal operations this behaviour is not expected to be encountered, but our tests notoriously use small examples to test for edge cases. In these tests, there is a significant performance penalty for creating oversized caches.

The behaviour and reasoning should be documented somewhere.

Re-write or remove the use of CACHELINE

CACHELINE is a hard-coded parameter in the MPI backend that roughly translates to the number of pages (of size PAGE_SIZE) per ArgoDSM page, for both cache-related operations and Pyxis (coherence) operations. Because it is hard-coded, values other than the default are untested, and I am certain that it does not work with all of our code, allocation policies in particular.

#ifndef CACHELINE
/** @brief Size of a ArgoDSM cacheline in number of pages */
#define CACHELINE 1L
#endif

I think the intended effect of CACHELINE is desirable as a tuning knob that trades false sharing for a reduction in remote operations. For workloads that access large, contiguous data (several pages or more) appropriately divided between nodes, there is no such "false sharing", and using a larger PAGE_SIZE would reduce the number of remote operations without any significant drawbacks.

A proper implementation of this should not rely on repeatedly incorporating PAGE_SIZE*CACHELINE in the backend for all cache or coherence operations. One solution is to instead use an environment variable to set the page size in a controlled way (for example, in power-of-two multiples of the hardware page size). Most, if not all, of the calculations based on PAGE_SIZE should then be easily adaptable to a larger page size, even in the current backend implementation.
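A minimal sketch of such an environment-based setting, assuming a hypothetical ARGO_PAGE_SIZE variable (not an existing ArgoDSM option):

#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <unistd.h>

std::size_t argo_page_size() {
	const std::size_t hw_page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
	const char* env = std::getenv("ARGO_PAGE_SIZE");
	if (env == nullptr) {
		return hw_page;  // default: one hardware page
	}
	const std::size_t requested = std::stoul(env);
	const bool power_of_two = requested && !(requested & (requested - 1));
	if (!power_of_two || requested % hw_page != 0) {
		throw std::invalid_argument(
			"ARGO_PAGE_SIZE must be a power-of-two multiple of the hardware page size");
	}
	return requested;
}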

Another alternative is to simply remove the CACHELINE parameter as it is currently of no use.

prefetch_cache_entry is redundant

The prefetch_cache_entry function is largely redundant.

Considering that @lundgren87 is working on a better algorithm for this functionality, no action is needed until the algorithm has been replaced, but this issue tracks prefetch_cache_entry in case that update gets delayed.

The Argo atomics do not install the data in the local cache

In the MPI backend, when an Argo atomic operation reads or modifies some data, it performs an RDMA operation to access the data. Argo coherence is bypassed and the data is never installed in the local cache. One would expect the data to be installed in the cache after an atomic access.

Relevant issue: the loadOne test fails in the second ASSERT_EQ (ASSERT_EQ(i_const, *_i);) even though the while loop before it checks for the exact same condition. The only difference is that the while loop uses an atomic load instead of a normal load.
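A hypothetical illustration of this kind of mismatch (this is not the actual loadOne test, and the atomic load/store calls are assumptions about the backend API):

#include <cstdio>
#include "argo/argo.hpp"

int main() {
	argo::init(1*1024*1024);
	argo::data_distribution::global_ptr<int> value(argo::conew_<int>());
	volatile int warm = *value;  // cache the page holding *value locally
	(void)warm;
	argo::barrier();

	if (argo::node_id() == 0) {
		// RDMA write: bypasses coherence, remote cached copies are untouched
		argo::backend::atomic::store(value, 42);
	}
	if (argo::node_id() == 1) {
		// RDMA read: observes 42 once the store above completes
		while (argo::backend::atomic::load(value) != 42) {}
		// normal load: may still return the stale cached value instead of 42
		printf("atomic load sees 42, normal load sees %d\n", *value);
	}
	argo::finalize();
}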

Error building ArgoDSM in debug mode with gcc-8 or newer

When building ArgoDSM with gcc-8 or newer and the CMake flag -DARGO_DEBUG set, the following build errors are encountered:

In file included from /home/sven/git/argodsm/src/allocators/dynamic_allocator.hpp:31,
                 from /home/sven/git/argodsm/src/allocators/collective_allocator.hpp:32,
                 from /home/sven/git/argodsm/src/allocators/allocators.cpp:7:
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp: In instantiation of ‘T* argo::allocators::generic_allocator<T, MemoryPool, LockType>::allocate(size_t) [with T = char; MemoryPool = argo::mempools::dynamic_memory_pool<argo::allocators::global_allocator, argo::mempools::NODE_ZERO_ONLY>; LockType = argo::allocators::null_lock; size_t = long unsigned int]’:
/home/sven/git/argodsm/src/allocators/allocators.cpp:23:70:   required from here
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:221:8: error: catching polymorphic type ‘using bad_alloc = class std::bad_alloc’ {aka ‘class std::bad_alloc’} by value [-Werror=catch-value=]
  221 |      } catch (typename MemoryPool::bad_alloc) {
      |        ^~~~~
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:230:8: error: catching polymorphic type ‘class std::bad_alloc’ by value [-Werror=catch-value=]
  230 |       }catch(std::bad_alloc){
      |        ^~~~~
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp: In instantiation of ‘T* argo::allocators::generic_allocator<T, MemoryPool, LockType>::allocate(size_t) [with T = char; MemoryPool = argo::mempools::dynamic_memory_pool<argo::allocators::global_allocator, argo::mempools::ALWAYS>; LockType = std::mutex; size_t = long unsigned int]’:
/home/sven/git/argodsm/src/allocators/allocators.cpp:38:67:   required from here
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:221:8: error: catching polymorphic type ‘using bad_alloc = class std::bad_alloc’ {aka ‘class std::bad_alloc’} by value [-Werror=catch-value=]
  221 |      } catch (typename MemoryPool::bad_alloc) {
      |        ^~~~~
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:230:8: error: catching polymorphic type ‘class std::bad_alloc’ by value [-Werror=catch-value=]
  230 |       }catch(std::bad_alloc){
      |        ^~~~~
In file included from /home/sven/git/argodsm/src/allocators/collective_allocator.hpp:28,
                 from /home/sven/git/argodsm/src/allocators/allocators.cpp:7:
/home/sven/git/argodsm/src/allocators/../mempools/dynamic_mempool.hpp: In instantiation of ‘void argo::mempools::dynamic_memory_pool<Allocator, growth_mode, chunk_size>::grow(std::size_t) [with Allocator = argo::allocators::global_allocator; argo::mempools::growth_mode_t growth_mode = argo::mempools::NODE_ZERO_ONLY; long unsigned int chunk_size = 4096; std::size_t = long unsigned int]’:
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:229:8:   required from ‘T* argo::allocators::generic_allocator<T, MemoryPool, LockType>::allocate(size_t) [with T = char; MemoryPool = argo::mempools::dynamic_memory_pool<argo::allocators::global_allocator, argo::mempools::NODE_ZERO_ONLY>; LockType = argo::allocators::null_lock; size_t = long unsigned int]’
/home/sven/git/argodsm/src/allocators/allocators.cpp:23:70:   required from here
/home/sven/git/argodsm/src/allocators/../mempools/dynamic_mempool.hpp:158:7: error: catching polymorphic type ‘class std::bad_alloc’ by value [-Werror=catch-value=]
  158 |      }catch(std::bad_alloc){
      |       ^~~~~
/home/sven/git/argodsm/src/allocators/../mempools/dynamic_mempool.hpp: In instantiation of ‘void argo::mempools::dynamic_memory_pool<Allocator, growth_mode, chunk_size>::grow(std::size_t) [with Allocator = argo::allocators::global_allocator; argo::mempools::growth_mode_t growth_mode = argo::mempools::ALWAYS; long unsigned int chunk_size = 4096; std::size_t = long unsigned int]’:
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:229:8:   required from ‘T* argo::allocators::generic_allocator<T, MemoryPool, LockType>::allocate(size_t) [with T = char; MemoryPool = argo::mempools::dynamic_memory_pool<argo::allocators::global_allocator, argo::mempools::ALWAYS>; LockType = std::mutex; size_t = long unsigned int]’
/home/sven/git/argodsm/src/allocators/allocators.cpp:38:67:   required from here
/home/sven/git/argodsm/src/allocators/../mempools/dynamic_mempool.hpp:158:7: error: catching polymorphic type ‘class std::bad_alloc’ by value [-Werror=catch-value=]
In file included from /home/sven/git/argodsm/src/allocators/dynamic_allocator.hpp:31,
                 from /home/sven/git/argodsm/src/allocators/collective_allocator.hpp:32,
                 from /home/sven/git/argodsm/src/allocators/allocators.cpp:7:
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp: In instantiation of ‘T* argo::allocators::generic_allocator<T, MemoryPool, LockType>::allocate(size_t) [with T = char; MemoryPool = argo::mempools::global_memory_pool<>; LockType = argo::allocators::null_lock; size_t = long unsigned int]’:
/home/sven/git/argodsm/src/allocators/../mempools/dynamic_mempool.hpp:156:15:   required from ‘void argo::mempools::dynamic_memory_pool<Allocator, growth_mode, chunk_size>::grow(std::size_t) [with Allocator = argo::allocators::global_allocator; argo::mempools::growth_mode_t growth_mode = argo::mempools::NODE_ZERO_ONLY; long unsigned int chunk_size = 4096; std::size_t = long unsigned int]’
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:229:8:   required from ‘T* argo::allocators::generic_allocator<T, MemoryPool, LockType>::allocate(size_t) [with T = char; MemoryPool = argo::mempools::dynamic_memory_pool<argo::allocators::global_allocator, argo::mempools::NODE_ZERO_ONLY>; LockType = argo::allocators::null_lock; size_t = long unsigned int]’
/home/sven/git/argodsm/src/allocators/allocators.cpp:23:70:   required from here
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:221:8: error: catching polymorphic type ‘using bad_alloc = class std::bad_alloc’ {aka ‘class std::bad_alloc’} by value [-Werror=catch-value=]
  221 |      } catch (typename MemoryPool::bad_alloc) {
      |        ^~~~~
/home/sven/git/argodsm/src/allocators/generic_allocator.hpp:230:8: error: catching polymorphic type ‘class std::bad_alloc’ by value [-Werror=catch-value=]
  230 |       }catch(std::bad_alloc){
      |        ^~~~~

This does not break without -DARGO_DEBUG set, but needs to be fixed eventually.
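The errors come from gcc catching polymorphic exception types by value (-Werror=catch-value). A minimal sketch of the fix the warnings ask for, shown on a generic example rather than the actual generic_allocator.hpp code, is to catch by const reference:

#include <new>

void example() {
	try {
		char* p = new char[1UL << 40];  // may throw std::bad_alloc
		delete[] p;
	} catch (const std::bad_alloc&) {   // by reference: silences -Wcatch-value
		// handle or rethrow here
	}
}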

C++ allocators cause first-touch on first page with or without initialization

The C++ allocator interface under first-touch allocation erroneously first-touches the first page of each allocation even if argo::allocation::no_initialize is specified.

void* ptr = collective_alloc(sizeof(T) * size);
using namespace data_distribution;
global_ptr<void> gptr(ptr);
// The home node of ptr handles initialization
if (initialize && argo::backend::node_id() == gptr.node()) {
	new (ptr) T[size]();
}

This occurs in a few locations and should be fixed (by someone, probably myself, but I am logging this issue so I don't forget). The fix is most likely to change the creation of global pointers from:

global_ptr<void> gptr(ptr);

to:

global_ptr<void> gptr(ptr, false, false);

in order to prevent the home node and offset from being calculated (read: first-touched under first-touch allocation) unless initialization is desired.

ArgoDSM reserved VM possibly too large under MEMFD

It appears that when using MEMFD, the default size of the ArgoDSM VM is too large for "some" systems.

When inspecting the memory mappings of a running ArgoDSM process using SHM, the default value of ARGO_END appears to be too large and intrudes on, among other things, the mapping of the heap.

5562b64b2000-5562b64b3000 rw-p 0009c000 103:08 8528772                   /home/sven/git/argodsm/build/bin/mpi/barrierTests
5562b6d06000-5562b7108000 rw-p 00000000 00:00 0                          [heap]
7fe8a0000000-7fe8a0021000 rw-p 00000000 00:00 0 

As a result, the initial anonymous mapping below fails when using MEMFD:

start_addr = ::mmap(static_cast<void*>(ARGO_START), ARGO_SIZE, PROT_NONE, flags, -1, 0);

This mapping succeeds on my system if ARGO_END is set to, for example, 0x500000000000l (or 0x550000000000l, while 0x560000000000l occasionally fails). I have observed the same behavior (the heap mapped at a similar location, and failure to map using MEMFD) on other Debian-based machines.

confusingly named variables in MPI backend

As @lundgren87 noticed in #35, there are still many places where variables have confusing names. This issue tracks such instances as people find them while reading the code, so that the information is easy to find again later.

Fixing these variable names is a low priority, and it is not expected to be done all in one giant pull request, so feel free to

  • add links to code where you think this issue applies
  • create pull requests for fixing any of the things mentioned in this issue

Pyxis directory size scaling with cache_size instead of argo_size

In the current implementation, the Pyxis directory size is set to:

classificationSize = 2*cachesize; // Could be smaller ?

and the classification index corresponding to an address is computed as:

inline unsigned long get_classification_index(uint64_t addr){
	return (2*(addr/(pagesize*CACHELINE))) % classificationSize;
}

When using a cache_size significantly smaller than argo_size, this means that multiple pages will share the same classification index (and possibly home node) and will all be considered downgraded if one of them is, resulting in possible performance degradation from unnecessary invalidations.

Is this intended behavior in order to minimize the Pyxis directory size, or is this simply a relic from the time when cache_size was not set by user input?

The number (or size) of remote operations performed by ArgoDSM does not scale with the directory size as far as I can tell, and allowing the directory size to scale with argo_size (at least up to some limit) should be positive in terms of performance. However, for very large memory allocations the directory size could become an issue, as on most Linux platforms it will be argo_size/256.

Replace unsigned long with appropriate types

The MPI backend uses unsigned long as its primary integer type. These uses should be replaced with more descriptive and suitable types, such as std::size_t for sizes and counters, std::uintptr_t for pointers and addresses, and std::uint64_t for integers that are required to be at least 64 bits wide.

Remote load does not add self as sharer in local Pyxis directory

When loading or prefetching an ArgoDSM page from another ArgoDSM node for the first time, the remote Pyxis directory is updated with the fetching node as sharer but the local Pyxis directory is not.

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, workrank, 0, sharerWindow);
globalSharers[classidx] |= tempsharer;
globalSharers[classidx+1] |= tempwriter;
MPI_Win_unlock(workrank, sharerWindow);

tempsharer contains the remote Pyxis directory content before adding the fetching node as sharer.

This causes the update of the local Pyxis directory to be deferred until the next read miss on the same page (a write miss will only mark the node as writer, not sharer), which will then recognize itself as sharer due to reading the remote Pyxis directory.

Any synchronization point (performing self-invalidation) between the first and second read miss on a page (classification index, to be exact) will cause the page to erroneously fail the "No writer and assert that the node is a sharer" check below, as the node is not yet locally recognized as a sharer.

MPI_Win_lock(MPI_LOCK_SHARED, workrank, 0, sharerWindow);
if(
	// node is single writer
	(globalSharers[classidx+1]==id)
	||
	// No writer and assert that the node is a sharer
	((globalSharers[classidx+1]==0) && ((globalSharers[classidx]&id)==id))
	){
	MPI_Win_unlock(workrank, sharerWindow);
	touchedcache[i] =1;
	/*nothing - we keep the pages, SD is done in flushWB*/
}
else{ //multiple writer or SO

The page will therefore go on to be invalidated from the cache.

Fixing this is simple (just add globalSharers[classidx] |= id; at the appropriate location), but it needs testing for any unexpected side effects.
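As a sketch, the directory update quoted above could merge the local node's own bit at the same time (this is only the suggested one-line addition shown in context, untested):

MPI_Win_lock(MPI_LOCK_EXCLUSIVE, workrank, 0, sharerWindow);
globalSharers[classidx] |= tempsharer;
globalSharers[classidx] |= id;          // also record this node as sharer locally
globalSharers[classidx+1] |= tempwriter;
MPI_Win_unlock(workrank, sharerWindow);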

Unify the definition of page size

Related to the discussion in:
#45 (comment)

The ArgoDSM page size should be defined in one module instead of appearing as separate static const definitions in multiple modules, and retrieved from there wherever it is needed throughout the system.
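A minimal sketch of such a single definition point (the header name, namespace, and 4 KiB value are assumptions):

// include/argo/page_size.hpp
#pragma once
#include <cstddef>

namespace argo {
/** @brief The ArgoDSM page size in bytes, defined once for the whole system */
constexpr std::size_t page_size = 4096UL;
}  // namespace argo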

Insufficient linking with librt

Issue

Compiling and linking external projects with ArgoDSM produces the following error.
undefined reference to symbol 'shm_unlink@@GLIBC_2.2.5'
//lib/x86_64-linux-gnu/librt.so.1: error adding symbols: DSO missing from command line

shm_open/shm_unlink require linking with librt (-lrt). This is currently done in src/CMakeLists.txt line 37 through set(vm_libs rt), but this does not appear to be sufficient when the ArgoDSM libraries are linked from an external project.

Steps to reproduce

  1. Follow the ArgoDSM quickstart guide to build ArgoDSM.
  2. Run make install.
  3. Follow the ArgoDSM tutorial to compile and link argo_example.cpp with the ArgoDSM libraries, reproducing the error.
  4. Link with librt by adding -lrt to the final line in order to successfully compile. (-largo -largobackend-mpi -lrt)

Tested with gcc/g++ (mpicc/mpic++) versions 6.5.0 and 7.3.0.

Enabling hugepage with ArgoDSM and MPI, UCX

Hi, I'm working on a client-server style application with ArgoDSM and trying to optimize performance issues.

I'm wondering (1) whether it's possible to use ArgoDSM with hugepages and (2) how I can adjust the cache line size.

It seems like the cache line size is currently hard-coded; when I tried to give it different sizes (like 2 pages or 4 pages), I got an error during argo::init at the start of the program.

I would appreciate any help! Thanks!

Missing UCX RMA support for 8/16bit MPI atomics

The ArgoDSM atomics currently make use of atomic MPI operations of size 8, 16, 32 and 64 bits. The choice of MPI datatype is made through the function below or its unsigned and floating-point equivalents.

/**
 * @brief Returns an MPI integer type that exactly matches in size the argument given
 *
 * @param size The size of the datatype to be returned
 * @return An MPI datatype with MPI_Type_size == size
 */
static MPI_Datatype fitting_mpi_int(std::size_t size) {
	MPI_Datatype t_type;
	using namespace argo;
	switch (size) {
		case 1:
			t_type = MPI_INT8_T;
			break;
		case 2:
			t_type = MPI_INT16_T;
			break;
		case 4:
			t_type = MPI_INT32_T;
			break;
		case 8:
			t_type = MPI_INT64_T;
			break;
		default:
			throw std::invalid_argument(
				"Invalid size (must be either 1, 2, 4 or 8)");
			break;
	}
	return t_type;
}

OpenMPI (and most definitely MPICH, as UCX is bundled with it) nowadays pushes InfiniBand users towards UCX. However, UCX does not provide full RMA support for 8- and 16-bit atomics, instead falling back to active messaging for these (if they are supported at all).
  • UCX Documentation
  • Related issue

Some of the ArgoDSM backend tests (atomicXchgAll, atomicXchgOne) currently fail with "unsupported datatype" when forcing the selection of the UCX osc module or when the other alternatives are disabled. For both performance (avoiding active messaging) and compatibility reasons, perhaps it would be better to perform at least a properly aligned 32-bit atomic operation instead of an 8- or 16-bit one?
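As a sketch of that widening (not ArgoDSM's existing API; it assumes a byte-addressed window with disp_unit == 1 and matching little-endian byte order on all nodes), an 8-bit atomic exchange can be emulated with an aligned 32-bit MPI_Compare_and_swap loop that only modifies the target byte:

#include <cstdint>
#include <mpi.h>

static std::uint8_t exchange_byte(MPI_Win win, int rank, MPI_Aint offset,
                                  std::uint8_t desired) {
	const MPI_Aint word_off = offset & ~static_cast<MPI_Aint>(3);  // aligned 32-bit word
	const unsigned shift = static_cast<unsigned>(offset & 3) * 8;  // byte position in word
	std::uint32_t expected, replacement, observed;

	MPI_Win_lock(MPI_LOCK_SHARED, rank, 0, win);
	MPI_Get(&expected, 1, MPI_UINT32_T, rank, word_off, 1, MPI_UINT32_T, win);
	MPI_Win_flush(rank, win);
	while (true) {
		replacement = (expected & ~(0xFFu << shift)) |
		              (static_cast<std::uint32_t>(desired) << shift);
		MPI_Compare_and_swap(&replacement, &expected, &observed,
		                     MPI_UINT32_T, rank, word_off, win);
		MPI_Win_flush(rank, win);
		if (observed == expected) break;  // swap succeeded
		expected = observed;              // another byte changed concurrently: retry
	}
	MPI_Win_unlock(rank, win);
	return static_cast<std::uint8_t>((expected >> shift) & 0xFFu);  // previous byte value
}

Keeping the operation as a 32-bit hardware atomic on the target should avoid the active-message fallback under UCX, at the cost of the retry loop.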

argo_init crashes with large memory requests

argo_init((size_t)64 * 1024 * 1024 * 1024) crashes but argo_init((size_t)32 * 1024 * 1024 * 1024) works fine. I guess it's easy to reproduce because argo_init is the first statement in the program. Tested using the master branch, with the MPI backend, using 2 nodes with 1 process each.

submodule code downloads full repository

When a submodule contains multiple branches (e.g. qd_library), the current CMake code downloads all branches instead of limiting itself to the tracked branch.

Is this something we can easily fix?

parameters used for loading cache entries

Currently, there are two functions for loading entries into the cache:

load_cache_entry(unsigned long tag, unsigned long line);
void prefetch_cache_entry(unsigned long tag, unsigned long line);

The parameters of these functions are somewhat redundant, and they should be improved in naming, types, and/or number.

unneeded memsets in initialization

Commit d6db615 in PR #14 removes memsets which are deemed unnecessary.
As that PR is not working properly, this commit should be lifted out of it, applied to the current version, and performance tested. The performance testing is important, so this should not happen before we have proper performance tests: without forced initialization, runtime may actually be negatively impacted, contrary to expectation, or it might improve because of less TLB clutter.

Atomic store operations are not detected on cached private pages

Issue

When an ArgoDSM page is cached by a remote node and the page is in a private state (the caching node is the single writer, or there is no writer and the caching node is a sharer), subsequent atomic writes to the page are not detected on this node by regular reads, even after self-invalidation.

The question is whether this is a bug (I believe so), or if the semantics of the ArgoDSM atomic functions simply do not define such behavior in the first place. This is related to #20, but is further exposed by Ioannis' work on memory allocation policies, which fails some of the atomic tests when the allocated data is owned by a node other than 0.

Reproduction

The following test exposes the bug on 2 or more nodes with the naive allocation policy. In order to trigger the second case (no writer, caching node is a sharer), simply substitute *counter = 0; with volatile int temp = *counter;. The important point is that counter is allocated on node 0, and that exactly one other node performs a read or write to it.

include "argo/argo.hpp"

int main(){
	argo::init(1*1024*1024);
	argo::data_distribution::global_ptr<int> counter(argo::conew_<int>());
	
	if(argo::node_id()==1) {
		*counter = 0; // Node 0 owns the data, node 1 becomes single writer
	}
	argo::barrier(); // Barrier makes every node aware of the initialization
	
	// Atomically increment counter on each node
	for(int i=0; i<10; i++){
		argo::backend::atomic::fetch_add(counter, 1);
	}

	argo::barrier(); // Make sure every node has completed execution
	if(*counter == argo::number_of_nodes()*10) {
		printf("Node %d successful (counter: %d).\n", argo::node_id(), *counter);
	}else{
		printf("Node %d failed (counter: %d).\n", argo::node_id(), *counter);
	}
	
	argo::finalize();
}

Detail

This issue is courtesy of the following optimization:

if(
	// node is single writer
	(globalSharers[classidx+1]==id)
	||
	// No writer and assert that the node is a sharer
	((globalSharers[classidx+1]==0) && ((globalSharers[classidx]&id)==id))
	){
	MPI_Win_unlock(workrank, sharerWindow);
	touchedcache[i] =1;
	/*nothing - we keep the pages, SD is done in flushWB*/
}

The reason is that ArgoDSM atomics do not alter the Pyxis directory (globalSharers) state; therefore, cached remote pages in the "single writer" or "no writer, shared" states are not invalidated upon self-invalidation, causing the node to miss updates until the state of the page changes.

Solution?

The fact that cached private pages are not downgraded to shared on atomic writes means that it is never completely safe to mix atomic writes and regular reads/writes. I believe that the correct solution would be to write atomic changes to the cache and to update local (and remote when needed as a result) Pyxis directories to the correct state.

Rare coherence failure when prefetching a page unused until after synchronization

Issue

A rare coherence failure can occur when a remote ArgoDSM page is prefetched to a node at one point, but not accessed until after another node has released changes made to the page. During synchronization, ArgoDSM may falsely consider the prefetched page to be in the "single writer" state and keep it on the node (without performing a self-downgrade on it) instead of correctly self-invalidating it. This error does not correct itself unless the page is evicted from the node cache, and subsequent reads from this page will not see changes made by remote nodes even after global synchronization.
Note that this issue only occurs if prefetch (DUAL_LOAD) is enabled.

How to reproduce

It is possible to reliably reproduce this error by initializing ArgoDSM with specific argo_size and cache_size parameters along with specific code to trigger the offending prefetch (to ensure that a page is prefetched but not used). I have been able to reproduce this with a cache_size smaller than argo_size/2 in the following code:
https://github.com/lundgren87/paratest/blob/prefetch_fail/simpletest.cpp

This specific code must be executed on exactly two ArgoDSM nodes (the same result is obtained on HW or SW nodes) in order to reproduce this failure (other values of argo_size and cache_size may reproduce the issue with other node counts), as such:
mpirun -n 2 ./simpletest

ArgoDSM details

  • In the example above, 20 ArgoDSM pages are allocated during initialization, pages 0-9 on "node 0" and 10-19 on "node 1". Node 0 initializes pages 9-12, which triggers remote loads of pages 10 and 12, and prefetch of pages 11 and 13. Page 13 is not accessed during initialization by node 0. Node 1 meanwhile initializes pages 13-16.
  • At the first synchronization point, page 13 is considered to be in "single writer" state by node 0 even though it has not accessed the page and is as such kept on the node (without being self-downgraded, since it is in state CLEAN).
  • After the first synchronization point, node 1 writes additional changes to page 13.
  • At the second synchronization point, the changes made by node 1 should be made visible to node 0. However, node 0 does not recognize that any writer has been added to page 13 and again keeps it in the cache instead of self-invalidating it.
  • After the second synchronization point, node 0 attempts to check writes made to pages 9-16, but upon accessing page 13 it accesses the page present in the local node cache instead of remotely loading page 13 containing the latest updates from node 1, resulting in missed writes.

Ensure writeBufferLoad test is fully deterministic

Related to the discussion in:
#45 (comment)

The goal of this test is to ensure that, regardless of the number of ArgoDSM nodes and the future allocation policy used, pages appear in a non-sorted order in the write buffer. This allows testing that we protect ourselves against deadlocks caused by interleaved locking of remote MPI Windows (as opposed to locking the windows on nodes 0, 1, 2, ..., n in order).

The current implementation relies on producing enough random writes to ensure that this scenario is likely to always trigger, but a sequence of numbers or a seed that guarantees such a result would be a better solution.
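A minimal sketch of the seeded alternative (the function name and seed value are assumptions): shuffle the page order with a fixed seed and reshuffle until the order is provably non-sorted.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Assumes num_pages >= 2, otherwise a non-sorted order cannot exist.
std::vector<std::size_t> unsorted_page_order(std::size_t num_pages) {
	std::vector<std::size_t> order(num_pages);
	std::iota(order.begin(), order.end(), 0);
	std::mt19937 gen(12345);  // fixed seed: deterministic across runs
	do {
		std::shuffle(order.begin(), order.end(), gen);
	} while (std::is_sorted(order.begin(), order.end()));  // guarantee a non-sorted order
	return order;
}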

standardise functions for rounding etc.

See the conversations in PR #23 for suggested places where rounding / ceiling could be factored into helper functions.

This could improve code quality and ease code review.
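A hypothetical sketch of the kind of helpers that could be factored out (names and placement are assumptions, not existing ArgoDSM functions):

#include <cstddef>

/** @brief Round size up to the nearest multiple of alignment (a power of two) */
constexpr std::size_t round_up(std::size_t size, std::size_t alignment) {
	return (size + alignment - 1) & ~(alignment - 1);
}

/** @brief Number of alignment-sized chunks needed to hold size bytes */
constexpr std::size_t div_ceil(std::size_t size, std::size_t alignment) {
	return (size + alignment - 1) / alignment;
}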

Call to `exit()` after `throw`

In many places in the code, I see the following pattern:

   ...
   throw std::system_error(std::make_error_code(static_cast<std::errc>(errno)), ...);
   exit(EXIT_FAILURE);

What exactly is the purpose of writing this? Don't we trust that the throw call will throw an exception?

Any suggestions on how to re-write such code?
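One option, sketched under the assumption that a caller can handle or report the error (the helper name is hypothetical): drop the unreachable exit() and let the exception propagate.

#include <cerrno>
#include <system_error>

void fail_with_errno(const char* what) {
	throw std::system_error(
		std::make_error_code(static_cast<std::errc>(errno)), what);
	// no exit(EXIT_FAILURE) needed: throw never returns control here
}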
