google / marl

A hybrid thread / fiber task scheduler written in C++ 11

License: Apache License 2.0


marl's Introduction

Marl

Marl is a hybrid thread / fiber task scheduler written in C++ 11.

About

Marl is a C++ 11 library that provides a fluent interface for running tasks across a number of threads.

Marl uses a combination of fibers and threads to allow efficient execution of tasks that can block, while keeping a fixed number of hardware threads.

Marl supports Windows, macOS, Linux, FreeBSD, Fuchsia, Emscripten, Android and iOS (arm, aarch64, loongarch64, mips64, ppc64, rv64, x86 and x64).

Marl has no dependencies on other libraries (with the exception of googletest, which is required only for building the optional unit tests).

Example:

#include "marl/defer.h"
#include "marl/event.h"
#include "marl/scheduler.h"
#include "marl/waitgroup.h"

#include <cstdio>

int main() {
  // Create a marl scheduler using all the logical processors available to the process.
  // Bind this scheduler to the main thread so we can call marl::schedule()
  marl::Scheduler scheduler(marl::Scheduler::Config::allCores());
  scheduler.bind();
  defer(scheduler.unbind());  // Automatically unbind before returning.

  constexpr int numTasks = 10;

  // Create an event that is manually reset.
  marl::Event sayHello(marl::Event::Mode::Manual);

  // Create a WaitGroup with an initial count of numTasks.
  marl::WaitGroup saidHello(numTasks);

  // Schedule some tasks to run asynchronously.
  for (int i = 0; i < numTasks; i++) {
    // Each task will run on one of the scheduler's worker threads.
    marl::schedule([=] {  // All marl primitives are capture-by-value.
      // Decrement the WaitGroup counter when the task has finished.
      defer(saidHello.done());

      printf("Task %d waiting to say hello...\n", i);

      // Blocking in a task?
      // The scheduler will find something else for this thread to do.
      sayHello.wait();

      printf("Hello from task %d!\n", i);
    });
  }

  sayHello.signal();  // Unblock all the tasks.

  saidHello.wait();  // Wait for all tasks to complete.

  printf("All tasks said hello.\n");

  // All tasks are guaranteed to complete before the scheduler is destructed.
}

Benchmarks

Graphs of several microbenchmarks can be found here.

Building

Marl contains many unit tests and examples that can be built using CMake.

Unit tests require fetching the googletest external project, which can be done by typing the following in your terminal:

cd <path-to-marl>
git submodule update --init

Linux and macOS

To build the unit tests and examples, type the following in your terminal:

cd <path-to-marl>
mkdir build
cd build
cmake .. -DMARL_BUILD_EXAMPLES=1 -DMARL_BUILD_TESTS=1
make

The resulting binaries will be found in <path-to-marl>/build

Emscripten

  1. Install and activate the Emscripten SDK, following the standard instructions for your platform.
  2. Build an example from the examples folder using Emscripten, say hello_task:
cd <path-to-marl>
mkdir build
cd build
emcmake cmake .. -DMARL_BUILD_EXAMPLES=1
make hello_task -j 8

NOTE: you may want to change the value of the -sPTHREAD_POOL_SIZE linker flag; it must be at least as large as the number of threads used by your application.
  3. Test the Emscripten output. You can use the provided Python script to create a local web server:

../run_webserver

In your browser, navigate to the example URL: http://127.0.0.1:8080/hello_task.html.
Voilà - you should see the log output appear on the web page.

Installing Marl (vcpkg)

Alternatively, you can build and install Marl using vcpkg dependency manager:

git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg integrate install
./vcpkg install marl

The Marl port in vcpkg is kept up to date by Microsoft team members and community contributors. If the version is out of date, please create an issue or pull request on the vcpkg repository.

Windows

Marl can be built using Visual Studio 2019's CMake integration.

Using Marl in your CMake project

You can build and link Marl using add_subdirectory() in your project's CMakeLists.txt file:

set(MARL_DIR <path-to-marl>) # example <path-to-marl>: "${CMAKE_CURRENT_SOURCE_DIR}/third_party/marl"
add_subdirectory(${MARL_DIR})

This will define the marl library target, which you can pass to target_link_libraries():

target_link_libraries(<target> marl) # replace <target> with the name of your project's target

You may also wish to specify your own paths to the third party libraries used by marl. You can do this by setting any of the following variables before the call to add_subdirectory():

set(MARL_THIRD_PARTY_DIR <third-party-root-directory>) # defaults to ${MARL_DIR}/third_party
set(MARL_GOOGLETEST_DIR  <path-to-googletest>)         # defaults to ${MARL_THIRD_PARTY_DIR}/googletest
add_subdirectory(${MARL_DIR})
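Putting those pieces together, a minimal consuming project might look like this (the project and target names here are placeholders):

```cmake
cmake_minimum_required(VERSION 3.6)
project(my_app CXX)

# Placeholder path; point this at your marl checkout.
set(MARL_DIR "${CMAKE_CURRENT_SOURCE_DIR}/third_party/marl")
add_subdirectory(${MARL_DIR})

add_executable(my_app main.cpp)
target_link_libraries(my_app marl)
```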

Usage Recommendations

Capture marl synchronization primitives by value

All marl synchronization primitives aside from marl::ConditionVariable should be lambda-captured by value:

marl::Event event;
marl::schedule([=] {  // [=] Good, [&] Bad.
  event.signal();
});

Internally, these primitives hold a shared pointer to the primitive state. By capturing by value we avoid common issues where the primitive may be destructed before the last reference is used.

Create one instance of marl::Scheduler, use it for the lifetime of the process

The marl::Scheduler constructor can be expensive as it may spawn a number of hardware threads.
Destructing the marl::Scheduler requires waiting on all tasks to complete.

Multiple marl::Schedulers may fight each other for hardware thread utilization.

For these reasons, it is recommended to create a single marl::Scheduler for the lifetime of your process.

For example:

int main() {
  marl::Scheduler scheduler(marl::Scheduler::Config::allCores());
  scheduler.bind();
  defer(scheduler.unbind());

  return do_program_stuff();
}

Bind the scheduler to externally created threads

In order to call marl::schedule() the scheduler must be bound to the calling thread. Failure to bind the scheduler to the thread before calling marl::schedule() will result in undefined behavior.

marl::Scheduler may be simultaneously bound to any number of threads, and the scheduler can be retrieved from a bound thread with marl::Scheduler::get().

A typical way to pass the scheduler from one thread to another would be:

std::thread spawn_new_thread() {
  // Grab the scheduler from the currently running thread.
  marl::Scheduler* scheduler = marl::Scheduler::get();

  // Spawn the new thread.
  return std::thread([=] {
    // Bind the scheduler to the new thread.
    scheduler->bind();
    defer(scheduler->unbind());

    // You can now safely call `marl::schedule()`
    run_thread_logic();
  });
}

Always remember to unbind the scheduler before terminating the thread. Forgetting to unbind will result in the marl::Scheduler destructor blocking indefinitely.

Don't use externally blocking calls in marl tasks

The marl::Scheduler internally holds a number of worker threads which will execute the scheduled tasks. If a marl task becomes blocked on a marl synchronization primitive, marl can yield from the blocked task and continue execution of other scheduled tasks.

Calling a non-marl blocking function on a marl worker thread will prevent that worker thread from being able to switch to execute other tasks until the blocking function has returned. Examples of these non-marl blocking functions include: std::mutex::lock(), std::condition_variable::wait(), accept().

Short blocking calls are acceptable, such as a mutex lock to access a data structure. However be careful that you do not use a marl blocking call with a std::mutex lock held - the marl task may yield with the lock held, and block other tasks from re-locking the mutex. This sort of situation may end up with a deadlock.

If you need to make a blocking call from a marl worker thread, you may wish to use marl::blocking_call(), which will spawn a new thread for performing the call, allowing the marl worker to continue processing other scheduled tasks.


Note: This is not an officially supported Google product

marl's People

Contributors

a952135763, amaiorano, andre-kempe-arm, ben-clayton, c0d1f1ed, digit-android, dimhotepus, flygoat, fmayer, frankxie05, guillaume227, jmacnak, kangz, kedixa, kushalsingh-00, myd7349, natepaynefb, njlr, shawnanastasio, shlomif, sugoi1, superbiebel, tanderson-google, troibe, vettoreldaniele, walbourn, whalbawi, wjh-la, xen0n, zeldin


marl's Issues

Protect against fiber stack overflow

It appears that the stack space allocated for fibers is not checked for stack overflow.

It seems prudent both from a testing predictability and security POV to add guard pages around the stack allocation.

ASAN errors on MSVC Win32 builds

Using the new MSVC support for ASAN, running marl-unittests will report different address sanitizer errors on each run, but always from SchedulerParams/WithBoundScheduler.BlockingCallVoidReturn/*. It's possible skipping this suite would lead to ASAN failures in tests that run afterwards, but I haven't tried.

Repro steps

  1. Run the Visual Studio installer and install ASAN support

  2. Generate Win32 sln with tests enabled:

mkdir build32 && cd build32
cmake -A Win32 -DMARL_BUILD_TESTS=1 ..

  3. Open Marl.sln with VS2019, select both marl and marl-unittests in Solution Explorer, right-click Properties, set Configurations to All Configurations, then set C/C++ --> General --> Enable Address Sanitizer (Experimental) --> Yes

  4. Set configuration to RelWithDebInfo, build and run marl-unittests.

Result

I've gotten different outputs with multiple runs. Here's one:

[ RUN      ] SchedulerParams/WithBoundScheduler.BlockingCallVoidReturn/3
=================================================================
==20236==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x057ff7b8 at pc 0x0061bbdc bp 0x057ff6fc sp 0x057ff6f0
READ of size 4 at 0x057ff7b8 thread T16777215
==20236==WARNING: Failed to use and restart external symbolizer!
    #0 0x61bbdb in std::_Construct_in_place<marl::WaitGroup::Data,marl::Allocator * &> C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\xmemory:229
    #1 0x622f58 in marl::WaitGroup::WaitGroup C:\src\marl\include\marl\waitgroup.h:82
    #2 0x626643 in <lambda_805be53bdef90e2d0ff16255d07f9da5>::operator() C:\src\marl\src\blockingcall_test.cpp:31
    #3 0x628c0d in std::_Func_impl_no_alloc<<lambda_805be53bdef90e2d0ff16255d07f9da5>,void>::_Do_call C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\functional:903
    #4 0x76fa1a in std::_Func_class<void>::operator() C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\functional:951
    #5 0x77eb72 in marl::Scheduler::Worker::runUntilIdle C:\src\marl\src\scheduler.cpp:691
    #6 0x77ee16 in marl::Scheduler::Worker::runUntilShutdown C:\src\marl\src\scheduler.cpp:576
    #7 0x77e5a1 in marl::Scheduler::Worker::run C:\src\marl\src\scheduler.cpp:569
    #8 0x770cdc in std::_Func_impl_no_alloc<<lambda_f08fac5c22a42aa758c925a9c8a41778>,void>::_Do_call C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\functional:903
    #9 0x6a45f6 in marl::OSFiber::run C:\src\marl\src\osfiber_windows.h:97
    #10 0x770587f0 in CreateProcessW+0x90 (C:\WINDOWS\System32\KERNELBASE.dll+0x101087f0)
    #11 0x770587a5 in CreateProcessW+0x45 (C:\WINDOWS\System32\KERNELBASE.dll+0x101087a5)
    #12 0x77b31b96 in RtlUserFiberStart+0x16 (C:\WINDOWS\SYSTEM32\ntdll.dll+0x4b2f1b96)

Address 0x057ff7b8 is located in stack of thread T153 at offset 104 in frame
    #0 0x603800 in ILT+10235(??1?$shared_ptrVmutexstdstdQAEXZ)+0x0 (C:\src\marl\build32\RelWithDebInfo\marl-unittests.exe+0x403800)

  This frame has 1 object(s):
    [16, 20) '_Value' <== Memory access at offset 104 overflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork
      (longjmp, SEH and C++ exceptions *are* supported)
Thread T153 created by unknown thread
SUMMARY: AddressSanitizer: stack-buffer-overflow C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.27.29110\include\xmemory:229 in std::_Construct_in_place<marl::WaitGroup::Data,marl::Allocator * &>
Shadow bytes around the buggy address:
  0x30affea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x30affeb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x30affec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x30affed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x30affee0: 00 00 00 00 00 00 00 00 00 00 f1 f1 04 f3 f3 f3
=>0x30affef0: f3 00 00 00 00 00 00[f2]f2 f2 f2 f8 f1 f1 00 04
  0x30afff00: f2 f2 f2 f2 04 f2 04 f2 00 f2 04 f2 00 f3 f3 f3
  0x30afff10: f3 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x30afff20: 00 00 00 00 f1 f1 04 f3 f3 f3 f3 f1 f1 00 00 00
  0x30afff30: 00 00 00 f2 f2 f2 f2 f8 f3 f3 f3 f3 f1 f1 00 00
  0x30afff40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc

Ensure that all heap allocations use the marl::Allocator

marl::Allocator is the user-implementable interface that should be used for all marl heap allocations.

At the time of writing there are many places where we use std containers with the default allocator, which is clearly bypassing the marl::Allocator.

We should ensure that all allocations go through the provided marl::Allocator.

Warning treated as error in 32 bit Windows builds with MSVC in ANGLE

In ANGLE, trying to build a 32 bit Windows version including SwiftShader.

gn args:
is_debug=true
is_clang=false
target_cpu="x86"

Warning message:
...\angle\third_party\SwiftShader\third_party\marl\include\marl/pool.h(241): note: see reference to function template instantiation 'std::shared_ptr<marl::BoundedPool<sw::DrawCall::BatchData,16,marl::PoolPolicy::Preserve>::Storage> std::make_shared<marl::BoundedPool<sw::DrawCall::BatchData,16,marl::PoolPolicy::Preserve>::Storage,>(void)' being compiled
...\VC\Tools\MSVC\14.16.27023\include\memory(1866): warning C4316: 'std::_Ref_count_obj<_Ty>': object allocated on the heap may not be aligned 16
with
[
_Ty=marl::BoundedPool<sw::DrawCall::BatchData,16,marl::PoolPolicy::Preserve>::Storage
]

Add presubmits of Android NDK

Involves adding the Android NDK to the docker image, and then using something like:

cmake .. -GNinja -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK_HOME/build/cmake/android.toolchain.cmake -DANDROID_ABI=arm64-v8a -DANDROID_NATIVE_API_LEVEL=18

Scheduling Strategies

Are there any customization points where you can implement a scheduling strategy? E.g. most systems schedule I/O differently than "work" (FIFO vs LIFO).

Not sure whether the scheduler is the perfect place for this, but in our current system, a specialized scheduler also prevents race-conditions by not executing two fibers that want to access the same resource (i.e. read+write/write+write) and re-schedules them when the resource is available again. Thus, the application code can be kept clean (no mutex locking, etc.). This requires specific information about the fiber to be executed. Is there something like a fiber_specific_ptr?

Maybe I should wait with those questions until there's a documentation/reference available that may already answer them?

Question: is the bind/unbind and implicit single threaded worker optional?

How load-bearing is bind/unbind? Looking at the marl::schedule implementation it seems like if I have a Scheduler* I should be able to enqueue work, and if I was always enqueuing work for asynchronous execution (that I never want to block the enqueuing thread) the logic in bind/unbind seems unneeded.

Is it safe to just use scheduler->enqueue() if I always have a >0 worker count and don't need the single threaded work queue adoption behavior?

(we are ourselves a framework that can make no assumptions about the calling thread that enqueues work as it may be anything from the UI thread in an Android app calling through JNI to an external executor calling in expecting non-blocking behavior just as marl itself does, etc)

Schedule tasks from within tasks

Hi, I've come to experiment with marl after having watched the GDC Talk "Parallelizing the Naughty Dog Engine".
I am wondering if it is intended by the marl API & scheduler that one can schedule tasks from within tasks.

In the talk (at 14:25) Christian Gyrling says that their engine / application heavily relies on this feature, to the point where their entire main-loop is just a graph of tasks which is kicked off by one task at the start of each frame.

Is there something to be aware of when doing this with marl ? Or maybe is it generally to be avoided ?
Thanks

CMake version

Hi,

I was trying to compile with CMake 3.5.2, but I got this error:

CMake Error at CMakeLists.txt:69 (list):
  list does not recognize sub-command FILTER

Apparently FILTER was introduced in version 3.6, but according to the CMakeLists.txt file, only version 2.8 is required.

https://github.com/google/marl/blob/master/CMakeLists.txt#L15

Can we remove FILTER to make this work with older versions of CMake?
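For what it's worth, list(FILTER) can be emulated on older CMake versions with a foreach loop (a sketch; the regex here is a placeholder, not marl's actual filter):

```cmake
# Emulate list(FILTER MARL_LIST EXCLUDE REGEX <re>) for CMake < 3.6.
set(filtered "")
foreach(item IN LISTS MARL_LIST)
  if(NOT item MATCHES "_test\\.cpp$")  # placeholder pattern
    list(APPEND filtered "${item}")
  endif()
endforeach()
set(MARL_LIST "${filtered}")
```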

Shared by default

add_library(marl ${MARL_LIST})

I think it creates a shared library by default on Windows 10/MSVS 2019. Adding STATIC explicitly made it static for me. Not sure if that's consistent behaviour.

Add support for wait_for() / wait_until()

Currently the scheduler has no way to deal with wait timeouts. This makes implementing marl::condition_variable::wait_for and marl::condition_variable::wait_until impossible.

Add marl::Scheduler::Fiber::yield_until to automatically resume the fiber after a deadline.
Use this to implement wait_for() and wait_until().

The version on the vcpkg is out of date

Hello, I got marl from vcpkg. It seems that the version on vcpkg has not been updated for a long time. When I wanted to include event.h, I found that there was no such file.

Single-threaded workers should probably not steal work

As raised in #17, it is somewhat unexpected to see non-dedicated-worker threads executing tasks when there are dedicated workers available.

After consideration, I think that this behavior is probably a Bad Idea™.
Tasks that are stolen by the single-threaded worker may yield and sit on its queue indefinitely, potentially leading to deadlock.

A simple solution here is to prevent work stealing for the single-threaded workers.

marl::blocking_call does not keep the scheduler alive

Given the following:

1: marl::schedule([] {
2:  marl::blocking_call([] {
3:    // scheduler is destructed here.
4:  });
5: });

Before marl::blocking_call returns at (4), the task created at (1) is rescheduled.
If the scheduler is deleted during (3) (likely due to app teardown) then the rescheduling will explode.

This is in a funny territory where I'm not sure if this should be marl's responsibility to avoid this problem, or whether it is reasonable to expect the application to ensure there are no in-flight blocking calls before the scheduler is destructed.
If we go with the latter, we should at least document this case.

Non-performant use of fibers on Windows

out->fiber = CreateFiber(stackSize, &OSFiber::run, out);

The Win32 CreateFiber function reserves an unexpectedly large amount of virtual memory by default. When using CreateFiberEx, the default behavior is to round the stack size up to the next multiple of 1 MiB. With CreateFiber, for some reason it rounds to 2 MiB, which is even worse.

Now even if you use CreateFiberEx, there's still some weirdness. If dwStackCommitSize and dwStackReserveSize are equal, the stack size will still round to 1MiB. The only way is to use it like this:
CreateFiberEx(size - 1, size, FIBER_FLAG_FLOAT_SWITCH, cb, NULL);

Note that even then, the stack can't be less than dwAllocationGranularity, which is typically 64KiB. If this minimum stack size is OK for performance, then that's how fiber creation should be done on Windows.

If we need smaller values (16KiB is common in schedulers), then asm fibers should be used instead IMO.

Replace marl::Scheduler setters with immutable constructor config

Deprecated marl::Scheduler APIs

This issue documents the deprecation of the following APIs:

These are all replaced with marl::Scheduler::Config.

There are 4 stages to this deprecation:


Original Proposal

marl::Scheduler::setWorkerThreadCount() can only safely be called before scheduling work.
Attempting to make this dynamically adjustable while work is scheduled is extremely fiddly, and previous attempts reduced the performance of marl::schedule to an unacceptable level.

Given that the worker thread count cannot be adjusted after scheduling work, let's deprecate the following methods from marl::Scheduler:

// setThreadInitializer() sets the worker thread initializer function which
// will be called for each new worker thread spawned.
// The initializer will only be called on newly created threads (call
// setThreadInitializer() before setWorkerThreadCount()).
void setThreadInitializer(const std::function<void()>& init);
// getThreadInitializer() returns the thread initializer function set by
// setThreadInitializer().
const std::function<void()>& getThreadInitializer();
// setWorkerThreadCount() adjusts the number of dedicated worker threads.
// A count of 0 puts the scheduler into single-threaded mode.
// Note: Currently the number of threads cannot be adjusted once tasks
// have been enqueued. This restriction may be lifted at a later time.
void setWorkerThreadCount(int count);
// getWorkerThreadCount() returns the number of worker threads.
int getWorkerThreadCount();

We can replace these with a new marl::Scheduler::Config structure that holds the immutable number of worker threads, the worker thread initializer, the allocator, and other future settings. The Config should be passed to a new Scheduler constructor, and all the getters can be replaced with a single getter for the Config passed to the constructor.

Related topic: #136 (comment)

DestructWithPendingFibers test fails on windows-msvc-14.14-x86-cmake

This test run failed with a seg fault:

[ RUN      ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/0
[       OK ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/0 (4 ms)
[ RUN      ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/1
[       OK ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/1 (6 ms)
[ RUN      ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/2
[       OK ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/2 (8 ms)
[ RUN      ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/3
[       OK ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/3 (12 ms)
[ RUN      ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/4
[       OK ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/4 (37 ms)
[ RUN      ] SchedulerParams/WithBoundScheduler.DestructWithPendingFibers/5
bash: line 1:  4312 Segmentation fault      cmd /c 'github\marl\kokoro\windows\presubmit.bat'

Investigate, fix.

thread affinity support for *nix systems

do you have plans to support thread pinning of the worker threads like in the windows implementation?

it would be nice if a "user" of marl could define some sort of affinity-policy to be able to adjust to requirements ( e.g. single thread pinning vs pinning to group of threads together ).

thinking of pseudo++ code:

// we might have 4 workers pinned to 2 compute cores
// in any case, we have to use cores `1` and `2`
auto allowedCpusFromSomewhere( int workerId ) { return marl::CpuSet{1, 2 }; };

main() {
  ...
  marl::Scheduler scheduler;
  scheduler.bind();
  //scheduler.setWorkerThreadCount(4);
  scheduler.setWorkerThreadCountWithAffinity(
    4, []( auto const workerId, auto const& nativeHandle ) noexcept {
      marl::AffinityPolicy.pin( nativeHandle, allowedCpusFromSomewhere(workerId) );
      // Or
      nativeHandle.pin(CpuSet{1,2,3,4});
      // Or
  }
  // Or
  scheduler.setWorkerThreadCountWithAffinity(marl::ForAll{marl::CpuSet{1,2,3,4}});
  scheduler.setWorkerThreadCountWithAffinity(marl::OneOf{marl::CpuSet{1,2,3,4}});
  ...
}

furthermore, a user might just want to specify exactly which cpu cores are allowed to be used due to "external" requirements (thinking of network card capture into worker queue. the capture threads a pinned to single threads and these threads must be different to marls native worker threads).

would that make any sense?

( yes i looked into src/thread.cc and read the comment claiming that pinning to a group is favourable instead of pinning to single thread. however, i guess it could be
application dependent, why not let the user decide instead of hardcoding? )

anyhow, on *nix there is currently no pinning at all.

Fractal example: poor scalability and unexpected number of CPUs

Hi,

I have been playing around with the only example and I was quite surprised by its poor performance.

First of all I modified the code to run in a purely sequential way (no marl). This is the time required by a single CPU:

real	0m0.025s
user	0m0.025s
sys	0m0.000s

Afterward, I modified the code as follows:

int main(int argc, const char** argv) {
  marl::Scheduler scheduler;
  uint32_t num_threads = atoi(argv[1]);
  scheduler.setWorkerThreadCount(num_threads);

Running with argument "1" I expected one CPU to be used, but actually 2 CPUs are used (100% each).

real	0m18.354s
user	0m32.423s
sys	0m3.972s

Argument "2" uses 3 CPUs apparently

real	0m16.790s
user	0m39.521s
sys	0m10.168s

Argument "4" uses 5 CPUs apparently (I have 8)

real	0m15.852s
user	0m49.900s
sys	0m25.362s

So, basically, the only example provided so far seems to suggest that marl kind of... disappoints, with most of its time spent "somewhere" in a blocking operation.

After some profiling, I have the feeling that the fault is a mutex in the function rand(). See attached flamegraph.


My suggestion is to either fix this (I don't know how) or provide a different example where we can actually say: "hey, look how scalable it is with the number of CPUs"!

I am also puzzled by the fact that the number of CPUs used is always equal to (num_threads + 1).

Optimize scheduler queue data structures to reduce cache misses and fragmentation

From the TODO here:

// TODO: Implement a queue that recycles elements to reduce number of
// heap allocations.
using TaskQueue = std::deque<Task>;
using FiberQueue = std::deque<Fiber*>;
using FiberSet = std::unordered_set<Fiber*>;

and this:

std::vector<Allocator::unique_ptr<Fiber>>

unordered_map and deque aren't great for memory locality and heap fragmentation, and currently these are not using the marl Allocator so they are coming right from the system allocator. It'd be good to evaluate some alternative data structures (adding to or reusing those already in https://github.com/google/marl/blob/master/include/marl/containers.h) and pooling.

TSAN error reported in WithBoundScheduler.EventWaitUntilTimeTaken/0

Reproduction steps:

git clone https://github.com/google/marl.git
cd marl
git submodule update --init
mkdir build
cd build
cmake .. -GNinja -DMARL_TSAN=1 -DMARL_BUILD_TESTS=1
ninja
./marl-unittests --gtest_filter=SchedulerParams/WithBoundScheduler.EventWaitUntilTimeTaken/0 --gtest_repeat=-1

EventWaitUntilTimeTaken for a single threaded worker appears to fail after repeated runs of the same test:

[ RUN      ] SchedulerParams/WithBoundScheduler.EventWaitUntilTimeTaken/0
ThreadSanitizer:DEADLYSIGNAL
==79999==ERROR: ThreadSanitizer: SEGV on unknown address 0x0000000007f0 (pc 0x7f7a2bfb4f0e bp 0x600001000000 sp 0x7f75ff043610 T79999)
==79999==The signal is caused by a READ memory access.
==79999==Hint: address points to the zero page.
    #0 <null> <null> (libtsan.so.0+0x8cf0e)
    #1 <null> <null> (libtsan.so.0+0xa3b74)
    #2 __tsan_read8 <null> (libtsan.so.0+0x918f7)
    #3 marl::Scheduler::WaitingFibers::Timeout::operator<(marl::Scheduler::WaitingFibers::Timeout const&) const ../src/scheduler.cpp:354 (marl-unittests+0x13eadd)
    #4 std::less<marl::Scheduler::WaitingFibers::Timeout>::operator()(marl::Scheduler::WaitingFibers::Timeout const&, marl::Scheduler::WaitingFibers::Timeout const&) const /usr/include/c++/8/bits/stl_function.h:386 (marl-unittests+0x13eadd)
    #5 std::_Rb_tree<marl::Scheduler::WaitingFibers::Timeout, marl::Scheduler::WaitingFibers::Timeout, std::_Identity<marl::Scheduler::WaitingFibers::Timeout>, std::less<marl::Scheduler::WaitingFibers::Timeout>, std::allocator<marl::Scheduler::WaitingFibers::Timeout> >::_M_get_insert_unique_pos(marl::Scheduler::WaitingFibers::Timeout const&) /usr/include/c++/8/bits/stl_tree.h:2061 (marl-unittests+0x13eadd)
    #6 std::pair<std::_Rb_tree_iterator<marl::Scheduler::WaitingFibers::Timeout>, bool> std::_Rb_tree<marl::Scheduler::WaitingFibers::Timeout, marl::Scheduler::WaitingFibers::Timeout, std::_Identity<marl::Scheduler::WaitingFibers::Timeout>, std::less<marl::Scheduler::WaitingFibers::Timeout>, std::allocator<marl::Scheduler::WaitingFibers::Timeout> >::_M_emplace_unique<marl::Scheduler::WaitingFibers::Timeout>(marl::Scheduler::WaitingFibers::Timeout&&) /usr/include/c++/8/bits/stl_tree.h:2379 (marl-unittests+0x13eadd)
    #7 std::pair<std::_Rb_tree_const_iterator<marl::Scheduler::WaitingFibers::Timeout>, bool> std::set<marl::Scheduler::WaitingFibers::Timeout, std::less<marl::Scheduler::WaitingFibers::Timeout>, std::allocator<marl::Scheduler::WaitingFibers::Timeout> >::emplace<marl::Scheduler::WaitingFibers::Timeout>(marl::Scheduler::WaitingFibers::Timeout&&) /usr/include/c++/8/bits/stl_set.h:463 (marl-unittests+0x13cc4d)
    #8 marl::Scheduler::WaitingFibers::add(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, marl::Scheduler::Fiber*) ../src/scheduler.cpp:332 (marl-unittests+0x13cc4d)
    #9 marl::Scheduler::Worker::suspend(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*) ../src/scheduler.cpp:470 (marl-unittests+0x13cc4d)
    #10 marl::Scheduler::Worker::wait(marl::lock&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::function<bool ()> const&) ../src/scheduler.cpp:446 (marl-unittests+0x13d1e9)
    #11 bool marl::Scheduler::Fiber::wait<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(marl::lock&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, std::function<bool ()> const&) ../include/marl/scheduler.h:486 (marl-unittests+0xb0600)
    #12 bool marl::ConditionVariable::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> >, marl::Event::Shared::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}>(marl::lock&, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&, marl::Event::Shared::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&)::{lambda()#1}&&) ../include/marl/conditionvariable.h:171 (marl-unittests+0xb0600)
    #13 bool marl::Event::Shared::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) ../include/marl/event.h:170 (marl-unittests+0xa85d6)
    #14 bool marl::Event::wait_until<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >(std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const&) const ../include/marl/event.h:205 (marl-unittests+0xa85d6)
    #15 operator() ../src/event_test.cpp:186 (marl-unittests+0xa85d6)
    #16 _M_invoke /usr/include/c++/8/bits/std_function.h:297 (marl-unittests+0xa85d6)
    #17 std::function<void ()>::operator()() const /usr/include/c++/8/bits/std_function.h:687 (marl-unittests+0x13d36c)
    #18 marl::Task::operator()() const ../include/marl/task.h:94 (marl-unittests+0x132715)
    #19 marl::Scheduler::Worker::runUntilIdle() ../src/scheduler.cpp:702 (marl-unittests+0x132715)
    #20 marl::Scheduler::Worker::runUntilShutdown() ../src/scheduler.cpp:587 (marl-unittests+0x13886f)
    #21 marl::Scheduler::Worker::run() ../src/scheduler.cpp:580 (marl-unittests+0x13bc5a)
    #22 operator() ../src/scheduler.cpp:717 (marl-unittests+0x13c06d)
    #23 _M_invoke /usr/include/c++/8/bits/std_function.h:297 (marl-unittests+0x13c06d)
    #24 std::function<void ()>::operator()() const /usr/include/c++/8/bits/std_function.h:687 (marl-unittests+0xc2e25)
    #25 marl::OSFiber::run(marl::OSFiber*) ../src/osfiber_asm.h:129 (marl-unittests+0xc2e25)
    #26 marl_fiber_trampoline ../src/osfiber_x64.c:20 (marl-unittests+0x141c1a)

ThreadSanitizer can not provide additional info.
SUMMARY: ThreadSanitizer: SEGV (/usr/lib/x86_64-linux-gnu/libtsan.so.0+0x8cf0e) 
==79999==ABORTING

When I run marl-unittests on an ARM SoC, a segmentation fault occurs.

Starting program: /mnt/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/build/marl-unittests
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/a53_softfp_neon-vfpv4/libthread_db.so.1".
[==========] Running 316 tests from 5 test suites.
[----------] Global test environment set-up.
[----------] 18 tests from WithoutBoundScheduler
[ RUN ] WithoutBoundScheduler.ConditionVariable
[ OK ] WithoutBoundScheduler.ConditionVariable (1 ms)
[ RUN ] WithoutBoundScheduler.Defer
[ OK ] WithoutBoundScheduler.Defer (0 ms)
[ RUN ] WithoutBoundScheduler.DeferOrder
[ OK ] WithoutBoundScheduler.DeferOrder (0 ms)
[ RUN ] WithoutBoundScheduler.OSFiber

Program received signal SIGSEGV, Segmentation fault.
0xb6fded60 in _dl_fixup () from /lib/ld-linux.so.3
(gdb) bt
#0 0xb6fded60 in _dl_fixup () from /lib/ld-linux.so.3
#1 0xb6fe5d64 in _dl_runtime_resolve () from /lib/ld-linux.so.3
#2 0xb6fe5d64 in _dl_runtime_resolve () from /lib/ld-linux.so.3
#3 0xb6fe5d64 in _dl_runtime_resolve () from /lib/ld-linux.so.3
#4 0xb6fe5d64 in _dl_runtime_resolve () from /lib/ld-linux.so.3
#5 0xb6fe5d64 in _dl_runtime_resolve () from /lib/ld-linux.so.3
#6 0xb6fe5d64 in _dl_runtime_resolve () from /lib/ld-linux.so.3

CPU infos:
processor : 0
model name : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 100.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5

processor : 1
model name : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 100.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5

Hardware : Generic DT based system
Revision : 0000
Serial : 0000000000000000

Toolchain info:
arm-himix200-linux-g++ (HC&C V1R3C00SPC200B005_20190606) 6.3.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

What I want to know is: how can I fix it?

No longer using fiber API on Windows

I think the fiber API and the ucontext API should no longer be used.
You could use the core code of boost.context instead to obtain higher performance and a smaller stack footprint.
A smaller stack on the x32 platform means more tasks can be created.
What do you think?

Rename default branch `master` to `main`

main is more inclusive and is being increasingly adopted as the default branch name. Let's switch.

Tasks:

  • Create new main branch from master
  • Update all CI / package references to point to main (done except for vcpkg)
  • Merge any late landing changes from master to main (#158)
  • Delete master branch

Compile error with VS2015

/FoCMakeFiles\vk_swiftshader.dir\src\Device\Clipper.cpp.obj /FdCMakeFiles\vk_swiftshader.dir\ /FS -c ..\src\Device\Clipper.cpp
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(164): error C2065: 'rhs': undeclared identifier
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(164): warning C4346: 'marl::Pool<T>::Loan': dependent name is not a type
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(164): note: prefix with 'typename' to indicate a type
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(164): error C2350: 'marl::Pool<T>::Loan::operator =' is not a static member
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(164): note: see declaration of 'marl::Pool<T>::Loan::operator ='
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(164): error C2513: 'marl::Pool<T>::Loan::operator =': no variable declared before '='
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(175): error C2065: 'rhs': undeclared identifier
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(175): warning C4346: 'marl::Pool<T>::Loan': dependent name is not a type
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(175): note: prefix with 'typename' to indicate a type
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(175): error C2350: 'marl::Pool<T>::Loan::operator =' is not a static member
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(175): note: see declaration of 'marl::Pool<T>::Loan::operator ='
C:\projects\build-swiftshader\SwiftShader\third_party\marl\include\marl/pool.h(175): error C2513: 'marl::Pool<T>::Loan::operator =': no variable declared before '='
[825/1183] Building CXX object CMakeFiles\vk_swiftshader.dir\src\Device\Blitter.cpp.obj
[826/1183] Building CXX object CMakeFiles\vk_swiftshader.dir\src\Device\Context.cpp.obj
ninja: build stopped: subcommand failed.

Implement an Event synchronization primitive

Likely interface:

// Event is a synchronization primitive used to block until a signal is raised.
class Event {
public:
	enum class ClearMode
	{
		// The event signal will be automatically reset when a call to wait()
		// returns.
		// A single call to signal() will only unblock a single (possibly
		// future) call to wait().
		Auto,

		// While the event is in the signaled state, any calls to wait() will
		// unblock without automatically resetting the signaled state.
		// The signaled state can be reset with a call to clear().
		Manual
	};

	Event(ClearMode mode = ClearMode::Auto, bool initialState = false);

	// signal() signals the event, possibly unblocking a call to wait().
	void signal();

	// clear() clears the signaled state.
	void clear();

	// wait() blocks until the event is signaled.
	// If the event was constructed with the Auto ClearMode, then only one
	// call to wait() will unblock per signal, and the signalled state will
	// be automatically cleared before wait() returns.
	void wait();

	// test() returns true if the event is signaled, otherwise false.
	// If the event is signalled and was constructed with the Auto ClearMode
	// then the signalled state will be automatically cleared upon returning.
	bool test();

	// isSignalled() returns true if the event is signaled, otherwise false.
	// Unlike test(), the signal is not automatically cleared when the event
	// was constructed with the Auto ClearMode.
	// Note: No lock is held after isSignalled() returns, so the event state
	// may immediately change after returning. Use with caution.
	bool isSignalled();
};

You must link with -lpthread on linux.

I noticed that when marl is embedded in another library and that library is loaded, it crashes unless -lpthread is linked on Linux.

Not sure what the right solution is; maybe trying to dynamically load pthreads at runtime?
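
For CMake-based builds, the usual portable fix is to link CMake's Threads package rather than hard-coding -lpthread. A sketch (the target name `my_app` is a placeholder):

```cmake
# Sketch: link pthreads portably via CMake's FindThreads module.
set(THREADS_PREFER_PTHREAD_FLAG ON)
find_package(Threads REQUIRED)
target_link_libraries(my_app PRIVATE marl Threads::Threads)
```

This lets CMake pick -pthread/-lpthread (or nothing) per platform instead of the consumer guessing.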

Feedback on Marl

Hi,
I'm just going to leave some feedback and thoughts on marl, as recommended in this external issue. Note that my experience with marl is limited, involving only the incomplete benchmark repository I made and a point cloud processing library (for robotics), polylidar, which is undergoing a major refactor that also includes marl.

Oh nice! How are you finding marl?

I find marl very simple to build and integrate into my existing project. The CMake build integration is very important to me and seems to be set up correctly, such that it was trivial to add to my project. The Scheduler and WaitGroup primitives also seem simple enough, and it was apparent how to integrate them into polylidar. Overall I enjoy how simple it is to describe work in tasks and describe dependencies with the wait groups.

Given that your benchmarks put marl mostly in last place in terms of performance, can I ask what convinced you to go with marl?

I chose to try out marl before those benchmarks were completed : ) . But the main reasons were the following:

  1. Many of the tasks for polylidar were dynamic and nested, meaning you have no idea how many subtasks must be completed inside of a task (and all those subtasks must finish before their parent task is completed). It seemed like marl was primarily focused on this scenario. Also I do not have thousands of tasks generally, maybe like 100 total. I figured the difference between all these libraries might be small at such a low amount of tasks.
  2. I saw that marl had hand-written assembly for the ARM architecture for its fiber/thread scheduling. I will be running some of my code on an RPI4 and felt more comfortable using a library that had been tested with that architecture. Of course, that's just a way of saying I was too lazy to (initially) benchmark performance across architectures and implementations, and was relying on Google's name and heavy use of Android/ARM to assume good testing and performance. Note I do plan to run my benchmark repo on my RPI4 after it has solidified.

So that led me to just experiment with marl first. And it worked, very well.

However I do have some general comments:

My system:
CPU: Ryzen 3900X (12 Core/24 Thread)
RAM: 32 GB
OS: Ubuntu 18.04
Compiler: C++ 14 GCC 7.something

  1. There seems to be a 300 microsecond penalty when using marl, even if only 1 thread is used. You can see where I create the scheduler here.

I have benchmarks that compare using marl and not using marl for a light workload that only requires about 300 microseconds of work. In this output, Optimized means using marl.

ImagesAndSparseMesh/BM_ExtractPlanesAndPolygonsFromMultipleNormals/1/real_time_mean                   336 us          336 us            3
ImagesAndSparseMesh/BM_ExtractPlanesAndPolygonsFromMultipleNormals/1/real_time_median                 335 us          335 us            3
ImagesAndSparseMesh/BM_ExtractPlanesAndPolygonsFromMultipleNormals/1/real_time_stddev                2.36 us         2.35 us            3
ImagesAndSparseMesh/BM_ExtractPlanesAndPolygonsFromMultipleNormalsOptimized/1/real_time_mean          666 us          355 us            3
ImagesAndSparseMesh/BM_ExtractPlanesAndPolygonsFromMultipleNormalsOptimized/1/real_time_median        664 us          355 us            3
ImagesAndSparseMesh/BM_ExtractPlanesAndPolygonsFromMultipleNormalsOptimized/1/real_time_stddev       8.92 us         1.03 us            3

Some thoughts that I have not experimented with or validated yet: is this a startup cost associated with the instantiation of the Scheduler object? If so, should I store the scheduler in a class member (created only once) instead of creating it on each function call? Or is this just the penalty that marl has, which is not that steep a penalty?

Note that I would not try to reproduce these results with my repository, because you need the input data and I feel it's too complex. If this startup cost is unexpected, it would be better for me to generate an MRE in my benchmark repo.

  2. There was a previous issue brought up by another user here. The point I want to highlight is the use of RAII. I agree that it felt pretty weird/different to use the defer macro. I have read the issue and I understand it is optional, but I felt there were not a lot of examples showing how to accomplish the same thing without defer. At one point I also tried to call two different wait group .done()s in one defer, and that did not work well (this was in the poorly written binary_tree benchmark for marl). I'm really not that amazing at C++; the common RAII idiom I seem to get, but defer leaves me a little puzzled about what's going on and what I need to do to ensure that, no matter what (even exceptions), the wait group gets decremented so I don't wait forever.

  3. The API of cpp-task-flow, even though I haven't used it much either, seems amazing. I'm really impressed by how succinct and simple it is to describe dependencies between tasks and then execute them. It's probably out of scope for marl, but it would be nice to have something similar.

If I think of anything else, I will let you know. Thanks again for releasing this project and putting such effort into improving it.

Efficient queuing of tasks that start waiting?

In my usage pattern I'm creating a possibly large number of tasks, a decent percentage of which may initially be waiting. For example, think of the pathological Ticket::Queue use case where the first (or one of the first) operations would be a wait:

void runTasksConcurrentThenSerially(int N) {
    marl::Ticket::Queue queue;
    auto tickets = queue.take(N);
    for (int i = 0; i < N; i++) {
        marl::schedule([=] {
            tickets[N - 1 - i].wait();  // or some other ordering
            doSerialWork();
            tickets[i].done();
        });
    }
}

Right now each task will get enqueued, switched to, and then immediately switched out when ticket.wait() blocks; for certain workloads this can be the norm instead of a pathological case.

Maybe the better approach for this is to use the tasks-in-tasks technique to only create the tasks when they are runnable? If so, are there other performance gotchas from doing that? For example, in this task graph:
[diagram: task A fans out to tasks B1…Bn, which then fan in to task C]
Would it be more efficient to queue up all of these using tickets ahead of time (so Bn waits for A, C waits for Bn via a waitgroup, etc.), or instead to have each task queue up the following ones (so A only at first; then A runs and enqueues Bn, and so on)?

I'm mostly interested in the fan-out case where A->Bn may be going 1 to 100 and Bn->C would be going 100 to 1. I wouldn't want C to wake 100 times only to go back to waiting 99 of those times, for example. But I also wouldn't want to pessimize the fan-out across threads at Bn by waiting until A was executing to enqueue the tasks.

Thoughts?

RFC: how to integrate Tracy (and other custom tracing) for tracing marl

Tracy is an absolutely fantastic tracing tool that we've recently fallen in love with. To be able to fully understand how marl is behaving with the rest of our application we need to integrate it with marl's trace.h.

Ideally, instead of embedding Tracy events directly into marl, we could make trace.h extensible by the integrator of marl itself. In header-only libraries this is generally done by allowing overrides (such as with the Vulkan Memory Allocator: https://github.com/google/iree/blob/master/iree/hal/vulkan/internal_vk_mem_alloc.cc#L28-L66). However, overriding MARL_SCOPED_EVENT etc. is difficult without making direct changes, as the macros are required during marl compilation.

The approach dear imgui takes is compatible in that it allows either include-path based overriding or completely custom path overriding by way of defines:
https://github.com/ocornut/imgui/blob/master/imgui.h#L40-L46

This would allow an integrator, for example, to have their own trace.h or trace-impl.h that defines the MARL_* tracing macros that will be used when building marl, overriding the current chrome tracing-based implementation.

Thoughts? Would you be open to a PR adding an optional MARL_USER_TRACE_H check?

Build error on RedHat 7.0

marl/src/scheduler.cpp: In constructor ‘std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::unordered_map(std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type, const hasher&, const key_equal&, const allocator_type&) [with _Key = std::thread::id; _Tp = std::unique_ptr<marl::Scheduler::Worker, marl::Allocator::Deleter>; _Hash = std::hash<std::thread::id>; _Pred = std::equal_to<std::thread::id>; _Alloc = marl::StlAllocator<std::pair<const std::thread::id, std::unique_ptr<marl::Scheduler::Worker, marl::Allocator::Deleter> > >; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::size_type = long unsigned int; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::hasher = std::hash<std::thread::id>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::key_equal = std::equal_to<std::thread::id>; std::unordered_map<_Key, _Tp, _Hash, _Pred, _Alloc>::allocator_type = marl::StlAllocator<std::pair<const std::thread::id, std::unique_ptr<marl::Scheduler::Worker, marl::Allocator::Deleter> > >]’:
marl/src/scheduler.cpp:743:22: error: no matching function for call to ‘marl::StlAllocator<std::pair<const std::thread::id, std::unique_ptr<marl::Scheduler::Worker, marl::Allocator::Deleter> > >::StlAllocator()’
     : byTid(allocator) {}

The error "ASSERT: alignment (0xb35245b4) must be less than the page size (0x1000)" appears consistently

QUESTION

I want to know why this happens. Is it a bug, or is my usage wrong?

PLATFORM

processor : 0
model name : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 100.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5

processor : 1
model name : ARMv7 Processor rev 5 (v7l)
BogoMIPS : 100.00
Features : half thumb fastmult vfp edsp neon vfpv3 tls vfpv4 idiva idivt vfpd32 lpae evtstrm
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x0
CPU part : 0xc07
CPU revision : 5

Hardware : Generic DT based system
Revision : 0000
Serial : 0000000000000000

TOOLCHAIN

Using built-in specs.
COLLECT_GCC=arm-himix200-linux-g++
COLLECT_LTO_WRAPPER=/opt/hisi-linux/x86-arm/arm-himix200-linux/host_bin/../libexec/gcc/arm-linux-gnueabi/6.3.0/lto-wrapper
Target: arm-linux-gnueabi
Configured with: /home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/src/gcc-6.3.0/configure --host=i386-redhat-linux --build=i386-redhat-linux --target=arm-linux-gnueabi --prefix=/home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/install --enable-threads --disable-libmudflap --disable-libssp --disable-libstdcxx-pch --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --enable-shared --enable-lto --enable-symvers=gnu --enable-__cxa_atexit --disable-nls --enable-clocale=gnu --enable-extra-hisi-multilibs --with-sysroot=/home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/install/target --with-build-sysroot=/home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/install/target --with-gmp=/home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/obj/host-libs/usr --with-mpfr=/home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/obj/host-libs/usr --with-mpc=/home/sying/SDK_CPU_UNIFIED/build/script/arm-himix200-linux/arm_himix200_build_dir/obj/host-libs/usr --enable-libgomp --disable-libitm --enable-poison-system-directories --with-pkgversion='HC&C V1R3C00SPC200B005_20190606' --disable-bootstrap
Thread model: posix
gcc version 6.3.0 (HC&C V1R3C00SPC200B005_20190606)

USE METHOD

//                                                            /  task B  \
//  preprocess --> ticket.take --> task A --> ticket.wait --> task A: wait --> task A: resume
//                                                            \  task C  /

Modify code:


  struct Request {
    size_t size = 0;                 // The size of the allocation in bytes.
    size_t alignment = 0;            // The minimum alignment of the allocation.
#if MARL_USE_FIBER_STACK_GUARDS // different
    bool useGuards = true;                // different
#else
    bool useGuards = false;          // Whether the allocation is guarded.
#endif
    Usage usage = Usage::Undefined;  // Intended usage of the allocation.
  };

  void* ptr = nullptr;  // The pointer to the allocated memory.
  Request request;      // Request used for the allocation.
};

If I do not modify the source code, it will invoke the alignedMalloc function in:

    if (request.useGuards) {
      ptr = ::pagedMalloc(request.alignment, request.size, true, true);
    } else if (request.alignment > 1U) {
      ptr = ::alignedMalloc(request.alignment, request.size);
    } else {
      ptr = ::malloc(request.size);
    }

GDB INFO

WARNING: pagedMalloc

ASSERT: alignment (0xb35245b4) must be less than the page size (0x1000)

[New Thread 0xaabff410 (LWP 307)]
[New Thread 0xab3ff410 (LWP 306)]
[New Thread 0xabbff410 (LWP 305)]
[New Thread 0xac3ff410 (LWP 304)]
[New Thread 0xacdff410 (LWP 303)]
[New Thread 0xad7fd410 (LWP 302)]
[New Thread 0xadffd410 (LWP 301)]
[New Thread 0xae7fd410 (LWP 300)]
[New Thread 0xb2d25410 (LWP 299)]
[New Thread 0xb3525410 (LWP 298)]
[New Thread 0xb699d410 (LWP 297)]

Program received signal SIGABRT, Aborted.
[Switching to Thread 0xb3525410 (LWP 298)]
0xb69c9948 in raise () from /lib/a53_softfp_neon-vfpv4/libc.so.6
(gdb) bt
#0 0xb69c9948 in raise () from /lib/a53_softfp_neon-vfpv4/libc.so.6
#1 0xb69cada8 in abort () from /lib/a53_softfp_neon-vfpv4/libc.so.6
#2 0x006a79b0 in marl::fatal (msg=) at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/src/debug.cpp:32
#3 0x006a7d60 in pagedMalloc (guardLow=true, guardHigh=true, size=3008513564, alignment=)
at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/src/memory.cpp:145
#4 (anonymous namespace)::DefaultAllocator::allocate (this=, request=...)
at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/src/memory.cpp:210
#5 0x006a9b04 in allocate (n=1, this=0xb6106110)
at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/include/marl/memory.h:415
#6 allocate (__n=1, __a=...) at /opt/hisi-linux/x86-arm/arm-himix200-linux/arm-linux-gnueabi/include/c++/6.3.0/bits/alloc_traits.h:281
#7 _M_allocate_nodemarl::Scheduler::Fiber*& (this=0xb6106110)
at /opt/hisi-linux/x86-arm/arm-himix200-linux/arm-linux-gnueabi/include/c++/6.3.0/bits/hashtable_policy.h:1947
#8 _M_emplacemarl::Scheduler::Fiber*& (this=0xb6106110)
at /opt/hisi-linux/x86-arm/arm-himix200-linux/arm-linux-gnueabi/include/c++/6.3.0/bits/hashtable.h:1513
#9 emplacemarl::Scheduler::Fiber*& (this=0xb6106110)
at /opt/hisi-linux/x86-arm/arm-himix200-linux/arm-linux-gnueabi/include/c++/6.3.0/bits/hashtable.h:728
#10 emplacemarl::Scheduler::Fiber*& (this=0xb6106110)
at /opt/hisi-linux/x86-arm/arm-himix200-linux/arm-linux-gnueabi/include/c++/6.3.0/bits/unordered_set.h:369
#11 marl::Scheduler::Worker::runUntilIdle (this=this@entry=0xb6106000)
at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/src/scheduler.cpp:693
#12 0x006af850 in runUntilShutdown (this=)
at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/src/scheduler.cpp:592
#13 marl::Scheduler::Worker::run (this=0xb6106000)
at /root/Workspace/Baidu/baidu/idl-xteam/hisi-faceID-sdk/cross_3rdparty/opensource/marl/src/scheduler.cpp:585
#14 0x006c07b4 in std::__adjust_heap<marl::Thread::Core*, int, marl::Thread::Core, __gnu_cxx::__ops::_Iter_less_iter> (__first=0x8c5bd0, __holeIndex=0,
__len=, __value=..., __comp=...) at /opt/hisi-linux/x86-arm/arm-himix200-linux/arm-linux-gnueabi/include/c++/6.3.0/bits/stl_heap.h:226
#15 0x00000000 in ?? ()

iOS support?

Marl looks really interesting but my codebase primarily targets mobile, so I'd need support for both iOS and Android. Is iOS support on the cards at all?

A bit of background - my codebase does real-time CPU-side camera processing and many of the high CPU parts could be split across cores relatively easily. The work is non-blocking so I don't think I really need fibers or yielding, I just want to spread it out between workers.

I was thinking of writing a simple dispatch queue type of thing where workers would just all pull the next job from a shared queue, but thought I'd look at some of the more complete third-party library solutions. If you think marl is overkill for this use case then that would be really helpful to hear too!

There might even be standard library solutions that will work - I've been slow to learn what's new in C++ as historically the codebase worked with lots of old / esoteric compilers (old MSVC, Symbian, Android ndk before it had a standard library...). I have started adopting some new features from C++11 but I haven't dedicated much time to learning all the new stuff.

ConditionVariable::notify_one notifies all fibers?

I was wondering if this was intentional as a way to avoid starvation or if this should be only notifying a single fiber? I noticed that waiters are inserted at the front of the list which seems like it may also be intentional to try to keep wait/notify pairs on the same thread (or at least provide some ordering); is that right?

marl::lock lock(mutex);
for (auto fiber : waiting) {
  fiber->notify();
}

(I need to customize ConditionVariable a bit for my use case, so I was trying to understand the load-bearing parts of the implementation :)

Please consider replacing foreign language idioms with proper C++ equivalents

The RAII idiom is the native way in C++ land to handle resource management, while try-with-resources/finally/defer-like idioms are an escape hatch for rare occasions. External destruction makes code prone to resource leakage and, more importantly, memory corruption and misuse.

The other issue is lack of proper copy/move constructors/operators. Please check C++ guidelines on movable/copyable classes.

One such example is the Scheduler class:

  1. It looks like either a non-copyable or even non-movable class, yet it neither deletes its copy constructor nor defines a move constructor. As a result, it implicitly uses the generated ones, depending on its member fields' semantics. This may lead to unexpected copyability and handle duplication, with consequent dangling pointers or double-frees.
  2. The destructor only checks that the scheduler was unbound. The proper approach would be to perform the unbind in the destructor itself and, if movability is desired, to provide an explicit move constructor that leaves the old instance in a "moved-from" state.
