Supported platforms: OS X and Linux; Node.js 6.x, 5.x, 4.x, 0.10, and io.js; Python 2 and 3.
EMS makes shared-memory parallelism possible between Node.js, Python, and C/C++.
Extended Memory Semantics (EMS) is a unified programming and execution model that addresses several challenges of parallel programming:
- Allows any number or kind of processes to share objects
- Manages synchronization and object coherency
- Implements persistence to NVM and secondary storage
- Provides dynamic load-balancing between processes
- May substitute or complement other forms of parallelism
- [Parallel Execution Models Supported](#types-of-concurrency)
- [Atomic Operations](#built-in-atomic-operations)
- Examples
- [Benchmarks](#examples-and-benchmarks)
- [Synchronization as a Property of the Data, Not a Duty for Tasks](#synchronization-as-a-property-of-the-data-not-a-duty-for-tasks)
- Installation
- Roadmap
A modern multicore server has 16-32 cores and over 200GB of memory, equivalent to an entire rack of systems from a few years ago. As a consequence, jobs formerly requiring a Map-Reduce cluster can now be performed entirely in shared memory on a single server without using distributed programming.
The inter-language example in interlanguage.{js,py} walks through the following steps:
- Start Node.js REPL, create an EMS memory
- Store "Hello"
- Open a second session, begin the Python REPL
- Connect to the EMS shared memory from Python
- Show the object created by JS is present
- Modify the object, and show the modification can be seen in JS
- Exit both REPLs so no programs are running to "own" the EMS memory
- Restart Python, show the memory is still present
- Initialize a counter from Python
- Demonstrate atomic Fetch and Add in JS
- Start a loop in Python incrementing the counter
- Simultaneously print and modify the value from JS
- Try to read "empty" data from Python; the process blocks
- Write the empty memory, marking it full; Python resumes execution
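The Node.js side of this walkthrough can be sketched as below. The call forms and option names (`ems.new`, `writeXF`, `dimensions`, `useMap`, `persist`, and so on) follow the ems package's documented API, but treat this as a sketch rather than a verbatim transcript; the filename is a placeholder.

```javascript
// Node.js REPL side of the inter-language demo (sketch).
// Assumes the ems npm package is installed; option names follow its docs.
const ems = require('ems')(1);            // 1 process: this REPL

// Create a persistent, key-mapped EMS region backed by a file
const shared = ems.new({
  dimensions: [1000],                     // maximum number of elements
  heapSize: 100000,                       // bytes for keys and JSON values
  useMap: true,                           // index by key, not just integer
  useExisting: false,                     // create a fresh region
  persist: true,                          // survives after both REPLs exit
  filename: '/tmp/interlanguage.ems'      // placeholder path shared with Python
});

shared.writeXF('greeting', 'Hello');      // store "Hello", mark the tag Full
// The Python REPL attaches to the same file with useExisting=True and sees
// the same object; shared.faa('counter', 1) from either language atomically
// increments a counter the other language created.
```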
EMS operations may be performed on any JSON data type, and read-modify-write operations may use any combination of JSON data types, producing results identical to the same operations on ordinary data.
All basic and atomic read-modify-write operations are available in all concurrency modes; however, collectives are not currently available in user-defined modes.
- Basic Operations: Read, write, readers-writer lock, read full/empty, write empty/full
- Primitives: Stacks, queues, transactions
- Atomic Read-Modify-Write: Fetch-and-Add, Compare-and-Swap
- Collective Operations: All basic OpenMP collective operations are implemented in EMS: dynamic, block, guided, and static loop scheduling; barriers; and master and single execution regions
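EMS applies these atomics to any JSON type in shared memory. As a runnable stand-in, Node's built-in `Atomics` object exposes the same two read-modify-write primitives on shared integer arrays, which is enough to show their semantics:

```javascript
// Fetch-and-Add and Compare-and-Swap semantics, shown with Node's built-in
// Atomics on a SharedArrayBuffer (EMS applies the same primitives to any
// JSON type in its shared memory regions).
const counters = new Int32Array(new SharedArrayBuffer(4));

const old = Atomics.add(counters, 0, 5);                  // Fetch-and-Add: returns prior value 0
const seen = Atomics.compareExchange(counters, 0, 5, 42); // CAS: expected 5, so 5 -> 42
console.log(old, seen, Atomics.load(counters, 0));        // 0 5 42
```

Both calls return the value held before the operation, which is what lets concurrent processes detect whether their update won.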
Map-Reduce is often demonstrated with word counting because each document can be processed in parallel and the per-document dictionaries reduced into a single dictionary. This EMS implementation also iterates over documents in parallel, but it maintains a single dictionary shared across processes, atomically incrementing the count of each word found. The final word counts are sorted, and the most frequently appearing words are printed with their counts.
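The kernel of the word counter can be sketched serially. The `countWords` helper below is hypothetical and uses a plain `Map`; the EMS version replaces the `Map` with a shared, key-mapped EMS array and the increment with an atomic fetch-and-add, so many processes can update one dictionary safely:

```javascript
// Serial sketch of the word-count kernel (countWords is a hypothetical
// helper; the real EMS version keeps the dictionary in shared memory and
// increments counts with an atomic fetch-and-add).
function countWords(text) {
  const counts = new Map();
  for (const word of text.toLowerCase().match(/[a-z']+/g) || []) {
    counts.set(word, (counts.get(word) || 0) + 1);  // EMS: dict.faa(word, 1)
  }
  // Sort so the most frequent words come first
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countWords('the quick brown fox jumps over the lazy dog the'));
// 'the' appears 3 times, so it sorts first
```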
The performance of this program was measured on an Amazon EC2 c4.8xlarge instance (132 ECUs, 36 vCPUs, 2.9 GHz Intel Xeon E5-2666v3, 60 GiB memory).
The leveling off of scaling around 16 cores, despite the presence of ample work, may be related to the use of non-dedicated hardware: half of the 36 vCPUs are presumably HyperThreads or otherwise shared resources, and AWS instances are bandwidth-limited to EBS storage, where the Gutenberg corpus is stored.
A benchmark similar to STREAMS gives the maximum rate at which EMS double-precision floating-point operations can be performed on a c4.8xlarge instance (132 ECUs, 36 vCPUs, 2.9 GHz Intel Xeon E5-2666v3, 60 GiB memory).
The micro-benchmarked raw transactional performance and the performance in the context of a workload are measured separately. The experiments were run on an Amazon EC2 c4.8xlarge instance (132 ECUs, 36 vCPUs, 2.9 GHz Intel Xeon E5-2666v3, 60 GiB memory).
Six EMS arrays are created, each holding 1,000,000 numbers. During the benchmark, 1,000,000 transactions are performed; each transaction involves 1-5 randomly selected elements of randomly selected EMS arrays. The transaction reads all of the elements and performs a read-modify-write operation on at least 80% of them. After all the transactions are complete, the array elements are checked to confirm that all of the operations occurred.
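A scaled-down, single-process sketch of one such transaction is below. Sizes are reduced from 6 x 1,000,000, and in the real benchmark the arrays are EMS arrays and the whole read-modify-write executes as one atomic transaction:

```javascript
// Scaled-down, single-process sketch of one benchmark transaction
// (sizes reduced; in the benchmark the arrays are EMS arrays and the
// read-modify-write runs as one atomic transaction).
const NARRAYS = 6, LEN = 1000, NTRANS = 100000;
const arrays = Array.from({ length: NARRAYS }, () => new Float64Array(LEN));

function doTransaction() {
  const n = 1 + Math.floor(Math.random() * 5);    // 1-5 elements
  const picks = Array.from({ length: n }, () => ({
    arr: Math.floor(Math.random() * NARRAYS),     // random array
    idx: Math.floor(Math.random() * LEN)          // random element
  }));
  picks.forEach(p => arrays[p.arr][p.idx]);       // read every element
  const nUpdate = Math.ceil(n * 0.8);             // modify at least 80%
  for (let i = 0; i < nUpdate; i++) {
    arrays[picks[i].arr][picks[i].idx] += 1;      // read-modify-write
  }
  return nUpdate;
}

let updates = 0;
for (let t = 0; t < NTRANS; t++) updates += doTransaction();

// Post-run check, as in the benchmark: every update must be accounted for
let sum = 0;
for (const a of arrays) for (const v of a) sum += v;
console.log(sum === updates); // true
```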
The parallel process scheduling model used is block dynamic (the default), where each process is responsible for successively smaller blocks of iterations. The execution model is bulk synchronous parallel: each process enters the program at the same main entry point and executes all of the statements in the program. `forEach` loops have their normal semantics of performing all iterations, while `parForEach` loops are distributed across processes, each process executing only a portion of the total iteration space.
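The "successively smaller blocks" policy can be sketched as follows; the divisor `2 * nProcs` is an assumption chosen for illustration, not EMS's exact block-size formula:

```javascript
// Sketch of block-dynamic scheduling: each grab takes a fraction of the
// remaining iterations, so blocks shrink as the loop nears completion
// (the 2 * nProcs divisor is illustrative, not EMS's exact policy).
function* blockDynamic(nIters, nProcs) {
  let next = 0;
  while (next < nIters) {
    const size = Math.max(1, Math.floor((nIters - next) / (2 * nProcs)));
    yield [next, next + size];   // half-open block [start, end)
    next += size;
  }
}

const blocks = [...blockDynamic(100, 4)];
// First blocks for 100 iterations on 4 processes: [0,12], [12,23], [23,32]
```

Shrinking blocks give good load balance near the end of the loop without paying per-iteration synchronization cost at the start.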
EMS internally stores tags that are used for synchronization of user data, allowing synchronization to happen independently of the number or kind of processes accessing the data. The tags can be thought of as being in one of three states, Empty, Full, or Read-Only, and the EMS intrinsic functions enforce atomic access through automatic state transitions.
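A single-process model of the tag discipline is sketched below. The `TaggedCell` class is hypothetical and throws where real EMS would block the calling process until the state transition becomes possible:

```javascript
// Single-process model of EMS Full/Empty tags (TaggedCell is hypothetical;
// real EMS tags live in shared memory, also have a Read-Only state, and
// block the calling process instead of throwing).
class TaggedCell {
  constructor() { this.tag = 'empty'; this.value = undefined; }
  writeEF(v) {   // write when Empty, leave tag Full
    if (this.tag !== 'empty') throw new Error('would block: not Empty');
    this.value = v; this.tag = 'full';
  }
  readFE() {     // read when Full, leave tag Empty
    if (this.tag !== 'full') throw new Error('would block: not Full');
    this.tag = 'empty'; return this.value;
  }
  readFF() {     // read when Full, leave tag Full
    if (this.tag !== 'full') throw new Error('would block: not Full');
    return this.value;
  }
}

const cell = new TaggedCell();
cell.writeEF(42);
console.log(cell.readFF()); // 42, tag stays Full
console.log(cell.readFE()); // 42, tag is now Empty
```

Because each transition is atomic, a consumer that calls `readFE` on an Empty cell simply waits until some producer's `writeEF` fills it, which is how the Python REPL in the demo above blocks and then resumes.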
The EMS array may be indexed directly using an integer, or using a key-index mapping from any primitive type. When a map is used, the key and data itself are updated atomically.
For a more complete description of the principles of operation, visit the EMS web site.
Because all systems are already multicore, parallel programs require no additional equipment, system permissions, or application services, making it easy to get started. The reduced complexity of lightweight threads communicating through shared memory is reflected in a rapid code-debug cycle for ad-hoc application development.
To build and test all C, Python 2 and 3, and Node.js targets, a Makefile automates most build and test tasks:
dunlin> make help
Extended Memory Semantics -- Build Targets
===========================================================
all Build all targets, run all tests
node Build only Node.js
py Build both Python 2 and 3
py[2|3] Build only Python2 or 3
test Run both Node.js and Py tests
test[_js|_py|_py2|_py3] Run only Node.js, or only Py tests, respectively
clean Remove all files that can be regenerated
clean[_js|_py|_py2|_py3] Remove Node.js or Py files that can be regenerated
EMS is available as an NPM package. EMS itself has no external dependencies, but it does require compiling native C++ functions using node-gyp, which is also available as an NPM package (sudo npm install -g node-gyp).
The native C parts of EMS depend on other NPM packages to compile and load. Specifically, the Foreign Function Interface (ffi), C-to-V8 symbol renaming (bindings), and the native addon abstraction layer (nan) are also required to compile EMS.
npm install ems
Download the source code, then compile the native code:
git clone https://github.com/SyntheticSemantics/ems.git
cd ems
npm install
To use this EMS development build to run the examples or tests, set up a global npm link to the current build:
sudo npm link ../ems
On a Mac and most Linux distributions EMS will "just work", but some Linux distributions restrict access to shared memory. The quick workaround is to run jobs as root; a long-term solution will vary with the Linux distribution.
Run the work queue driven transaction processing example on 8 processes:
npm run <example>
Or manually via:
cd Examples
node concurrent_Q_and_TM.js 8
Running all the tests with 8 processes:
npm run test # Alternatively: npm test
cd Tests
rm -f EMSthreadStub.js # Do not run the machine generated script used by EMS
for test in `ls *js`; do node $test 8; done
As of 2016-05-01, Mac/Darwin and Linux are supported. A Windows port pull request is welcome!
EMS 1.0 uses Nan for long-term Node.js support; we continue to develop on OS X and Linux via Vagrant.
EMS 1.3 introduces a C API.
EMS 1.4 introduces the Python API.
EMS 1.5 [Planned] Support for persistent main system memory.
EMS 2.0 [Planned] New API that integrates more tightly with ES6, Python, and other dynamically typed languages, making atomic operations on persistent memory more transparent.
Licensed under BSD; other commercial and open source licenses are available.
Jace A Mogill specializes in FPGA/Software Co-Design, recently embedding an FPGA emulation of an ASIC into Python and designing a hardware accelerator for Python, JavaScript, and other languages. He has over 20 years of experience optimizing software for distributed, multi-core, and hybrid computer architectures. He regularly responds to [email protected].