reble / llvm Goto Github PK

View Code? Open in Web Editor NEW

This project forked from intel/llvm

4.0 4.0 4.0 2.51 GB

Intel staging area for llvm.org contribution. Home for Intel LLVM-based projects.

llvm's People

Contributors

Stargazers

Watchers

Forkers

julianmi cheng-hsiang-chiu wee-free-scot tgiladraz

llvm's Issues

Support specialization constants in graph implementation

Implement the use of specialization constants in graph nodes, verified with tests. In the case where we run into technical issues with the implementation, we should discuss whether we need to place restrictions on their use in the spec.

Create tests for record & replay graphs mechanism

Convert existing Codeplay tests for SYCL_EXT_CODEPLAY_GRAPHS to llvm-test-suite tests for testing the record & replay graph interface from the unified oneAPI graphs extension

Explicit graph building function for adding sub-graph

The record & replay graph building API lets a user add a sub-graph as a node to the graph by recording a queue queue::submit(command_graph<executable>) operation. However, there is no equivalent mechanism in the explicit graph
building API for adding a sub-graph.

CUDA defines cudaGraphAddChildGraphNode in it's explicit API which does similar.

We should consider adding a function like command_graph<modifiable>::add(command_graph<executable/modifiable>, const std::vector<node> &dep = {}) to achieve this functionality.

Dot product test fails with missing event message

Summary of the issue:

The dot product test in the sycl-graph-poc-v2 branch currently fails to run correctly, with the following output:

die: The host-visible proxy event missing
terminate called without an active exception

Summary of findings:

Issue is caused by the use of the reduction object in the last graph node. This creates a requirement for an auxiliary resource when the command is scheduled.
When the aux resources are being cleaned up they wait for any command submissions to have finished, however the lazy queue command list is not correctly indicating that it has finished executing and at this point the runtime panics.
It's unclear whether the command list is notifying correctly and this is just not being reflected in the host visible event associated with the command or if it is failing to notify altogether.
Making sure that the graph command list correctly notifies the associated event when it is finished should fix the issue, though exactly where that change needs to happen is unclear to me.
Existing queue mechanisms for resetting and cleaning up command lists may help to avoid code duplication, if they can be used with the separate lazy command list.

Property to control granularity of cyclic dependency checks

We need to define a property that skips checking for cycles in the graph with each make_edge call.

Agree on support for handler::depends_on

Supporting handler::depends_on to track edges would make it easier for users to express USM dependencies in the record &
replay model. See intel#5626 (comment)

This could be implemented by the runtime mapping default constructed sycl::events returned by queue recording to internal nodes. When one of these sycl::events is later passed into handler::depends_on the runtime can check which node in the graph
the event is associated with, and error if it is not an event returned from a queue recording.

Whether we decide to support this mechanism or not, we should update the spec to make the interaction with handler::depends_on clear.

Waiting on a queue after graph submission causes incorrect behaviour.

It appears that calling queue.wait() after the graph has been submitted causes incorrect behaviour/results in the example code, like so:

  // Execute several iterations of the graph
  for (unsigned n = 0; n < iterations; n++) {
    testQueue.submit([&](handler &h) { h.exec_graph(graphExec); });
  }
  // Perform a wait on all graph submissions.
  testQueue.wait();

In reality this should at the very least do nothing if the graph has already been executed (as happens in the current prototype).

Specify errors for add_free()

Decide if command_graph<graph_state::modifiable>::add_free() show throw an exception under any of the following circumstances, or others not listed here.

If data not allocated by add_malloc_device
If data is already freed
If data is not valid address

If so, update the spec to reflect that this behaviour.

Decide on inclusion of executable graph update feature

Being able to update an executable graph with new inputs is a feature that is available in the Codeplay vendor graphs extension, but not the original oneAPI extension.

Such an extension requires support in the underlying API to be performant, e.g cl_khr_command_buffer_mutable_dispatch.

We should decide if this feature is a requirement for making the ML applications we are interested in performant, or if it can be removed from SYCL_EXT_ONEAPI_GRAPH

Define SYCL_EXT_ONEAPI_GRAPH macro

The vendor macro SYCL_EXT_ONEAPI_GRAPH specified by the extension specification is not accessible by the user, which should be defined as 1.

Seeing error: use of undeclared identifier 'SYCL_EXT_ONEAPI_GRAPH' when trying to use it on sycl-graph-develop.

Provide a record & replay mechanism to mark internal graph memory

When a graph is created by recording a queue with record & replay there is no way to tag memory objects internal to the graph, which would enable optimizations on the internal memory. Without this tagging, enabling kernel fusion on a graph created by record & replay may not be possible.

Such an interface must support both buffers and USM allocations.

Implement graphs emulation layer

Implement in the DPC++ graphs PoC at the PI level a lazy queue on top of a regular queue to emulate the deferred execution behaviour of graphs.

Please feel free to add more information here @julianmi, that single sentence probably doesn't do justice to what you've been working on, and may not even be accurate.

Execute graph using handler

Spun off from - #4 (comment)

Rather than the event submit(command_graph<graph_state::executable> graph); function to run the executable graph, an alternative API would be to use the handler to add a new function exec_graph(command_graph<graph_state::executable> graph).

auto ev = q.submit([&](handler& cgh){
  cgh.depends_on(some_other_event);
  // Not sure of the most appropriate name here
  cgh.exec_graph(g);
});

The advantages for this are 1) that it is more in keeping with the existing SYCL API, and 2) allows a graph execution to depend on an arbitrary event. This would make it easier for users to write the following code without have to block on host waits:

auto ev1 = q.submit([&](handler& cgh){
cgh.exec_graph(g);
});

// dest is some input to graph `g`
auto ev2 = q.memcpy(dest, src, numBytes,  ev1);

auto ev3 = q.submit([&](handler& cgh){
  cgh.depends_on(ev2);
  cgh.exec_graph(g);
});

We could also add a queue shortcut function(s)

Consider grouping extension names more strongly together

From feedback:

Should command_graph simply be graph, and should we have a namespace for other objects (node, graph_state, depends_on
property). We already have a verbose extension namespace, but should consider a future where graphs isn't an extension any more but part of the core spec

Specifying the implications of profiling a graph submission.

Motivated by feedback at - intel#5626 (comment)

We should specify what it means to get profiling information from the event returned from the queue submission of a graph. All the wording in the core SYCL spec around profiling is centred on command-groups - https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#table.event.profilinginfo - so we can't defer to that.

Better define graph memory allocation semantics

Based on feedback we should better define how memory allocation/free nodes behave in a graph. There are several aspects to this:

What is the lifetime of the memory? My understanding is that if a memory is both allocated and freed by a graph then it is owned by the graph. However, memory may only be allocated by the graph without an associated free node in which case the memory can be used outside the graph and must be freed by the user with sycl::free(). We should explicitly say this.
Relationship between memory allocations in an executable graph v a modifiable graph. That is, if the same modifiable graph is finalized to two executable graph objects, do they point to the same memory. My understanding is that they do not, but we should say this explicitly in the spec.
Relationship with USM. If we expect the usage of the usage of the pointers returned to be the same as device USM, then we should say that explicitly in the spec.
Invalid usage of free - We should specify that it is undefined behaviour if the pointer passed to add_free is already freed or not valid address

Reevaluate extra support for deferred Memory Allocation

We considered the following explicit graph building interfaces for adding a memory allocation owned by the graph:

Allocation function returning a reference to the raw pointer, i.e. void*&, which will be instantiated on graph finalization with the location of the allocated USM memory.
Allocation function returning a handle to the allocation. Applications use the handle in node command-group functions to access memory when allocated.
Allocation function returning a pointer to a virtual allocation, only backed with an actual allocation when graph is finalized or submitted.

Expand on graphs llvm-test-suite tests

DPC++ runtime testing is done through llvm-test-suite tests, while compilation testing is done in the sycl/test directory

We should have thorough runtime testing to accompany a DPC++ prototype implementation of SYCL_EXT_ONEAPI_GRAPH.

Tests for the record & replay API can already be found in https://github.com/Bensuo/llvm-test-suite however these need to have tests for the explicit API added to them.

Specify behaviour mixing graph building mechanisms

The Unification of vendor extensions PR doesn't introduce any specification wording on what happens when a user mixes the explicit graph building mechanism with the record & replay mechanism on the same modifiable graph

If a modifiable graph is being recorded to by a queue, and then the user uses explicit APIs to add nodes to the same graph before the recording has finished, is that okay? We should specify if that's valid or not.

Another possible scenario is if a modifiable graph contains nodes which have been explicitly added, and then the user goes and records more commands to it from a queue. Is that okay? I would assume so, but again we should specify explicitly in the spec.

Implement creating an edge with handler::depends_on()

Our specification defines that when using handler::depends_on() during node creation, passing an event returned from queue recording, will create a node edge.

This functionality needs to be implemented in the prototype and tests added to https://github.com/Bensuo/llvm-test-suite/tree/intel/SYCL/CommandGraph

Currently handler::depends_on is causing segfaults when using record & replay graph creation with oneDNN

POC can't support subgraphs in record and replay

#55 enables submitting subgraphs for the explicit API but this poses problems for the record and replay implementation due to several factors:

Because the CGFs stored in nodes are essentially black boxes until being run we don't know that we have a subgraph as part of a parent graph, so we can't extract nodes from the subgraph to add to the parent graph without the handler being run.
When recording to the subgraph we don't have information available about parent graph nodes to create proper edges, and at subgraph submission time we do have that information but don't have access to subgraph information without running the handler.
However, when we run the handler for the subgraph submission this is also execution of the graph and we can't effectively block it without also breaking for explicit mode. Since the handler needs to be run to determine graph edges between the subgraph and the parent graph we can't avoid running the handler earlier than submission time.
Even if we could only run the handler at submission time at that point it is too late to determine edges and modify the topology of the graph.
Possibly we could modify the handler to extract the subgraph at subgrpah submission time but since the handler is a public facing class this would require an ABI break which I don't think is feasible here, and may not even solve all the issues.

Ideally we would have information as part of node_impl that lets us know at finalize time that the node is a subgraph submission and be able to merge the nodes and determine edges at that point in time. This approach would likely be needed for the non-POC implementation so shouldn't present any issues at that time.

Consider multi-device support in graphs API

There is a feature request for SYCL_EXT_ONEAPI_GRAPH to support creating an executable graph which contains commands targetting different devices. The complete graph targetting different devices can then be submitted for execution at once.

Creating/submitting such a graph is not possible in the current API, and instead the user must compose executable graph submissions from graphs targetting the individual devices. We should discuss whether multi-device support is a requirement, and if so, iterate on the spec to make it possible.

Enable Host Tasks in graph PoC

The current graphs PoC (#5) does not support host tasks (I believe)

This ticket is to support host tasks in both explicit and record & replay APIs. Tests for this functionality should accompany the change.

Add record & replay graph recording API to PoC

The Proof-of-concept implementation in #5 currently only supports the explicit graph building API.

The record & replay API from the unified proposal needs to be implemented in the DPC++ as well.

Consider templated add_malloc_device API

We should consider whether we'd like a templated equivalent of our API for adding malloc nodes

node* add_malloc_device(void *&data, size_t numBytes, const std::vector<node>& dep = {});`

e.g

template <typename T>
node* add_malloc_device(void *&data, size_t count, const std::vector<node>& dep = {});

This is equivalent to template <typename T> T* sycl::malloc_device(count, ... and void* sycl::malloc_device(size_t numBytes,...

Push common DPC++ changes back to mainline

PR #82 touches shared DPC++ code in sycl/include/sycl/detail/common.hpp to enable correct resolution of namespaces when the macros are used in extensions.

This code modification has implications outwith the scope of our extension, so it would be worth pushing it back to the central DPC++ repo as a standalone PR.

Remove lazy queue property

We should document what the ext::oneapi::property::queue::lazy_execution queue property does, as it is in the PoC but not in the SYCL_EXT_ONEAPI_GRAPH specification.

If we require users to use this property when writing code with graphs, we should advertise somewhere that it exists and what it does. Even if it is accompanied by a note saying that the property is temporary and will only exist until a more complete implementation is developed.

Can't use buffers with graphs because copy back to host fails

Using buffers inside a command graph potentially causes a copy back to host to be triggered when the buffer is destroyed or synchronized in some other way. This presents issues for the current graphs implementation, which causes an error to be thrown similar to #24. The cause is slightly different in this case though:

The copy back operation tries to execute, but is instead added to the command_graph's command list
The runtime then tries to immediately wait on the copy back operation which fails because nothing was enqueued and no host-visible event created.

This is the same problem preventing any other arbirtrary commands from executing on a queue which use the lazy_queue property but likely requires a decent amount of work to support in the current POC. It also may not be feasible in the current implementation to indicate to the queue which submissions are intended for a graph and which are not without making changes to the PI interface by modifying pi_queue, which seems like it may be required to solve this problem.

Agree on simultaneous executable graph submission wording

Decide on wording for submitting an executable graph while another instance of the same executable graph is still executing.

The current SYCL_EXT_ONEAPI_GRAPH wording is

Support for submitting a graph for execution, before a previous execution has been
completed is backend specific. The runtime may throw an error.

While SYCL_EXT_CODEPLAY_GRAPHS wording is

There are no gurantees that more than one instance of graph will execute concurrently. Submitting a graph for execution, before a previous execution has been complete may result in serialized execution depending on the SYCL backend an characteristics of the graph.

With the current specified behaviour it is hard for the user to know if the runtime will throw an error or not when they write application code attempting this.

Improve verification of test results and add sanity checks

Some of our tests lack proper verification of computed results and sanity checks of program behavior (synchronization, ...)

Update prototype explicit API to change vector of node dependencies to properties

Our implementation needs updated to match spec change - #29
In the explicit API rather than passing dependencies as a vector of nodes, they are now passed as properties, however old implementation uses the old model.

Implement executable graph update

The void command_graph<graph_state::executable> update(const command_graph<graph_state::modifiable>& graph); feature included in the spec is not defined in our prototype, we should add this entry-point.

The depth of this implementation is up for discussion, options are:

Stub entry-point which throws an exception saying not implemented
Emulated functionality by recreating the underlying PI commands
Extending PI to introduce argument update of existing commands, which this functionality is then built upon.

Implement error handling for invalid event/queue usage

Implement the erroneous wait behaviour defined by #78

This includes:

Throwing an error on handler::depends_on() when an event is passed that doesn't correspond to a node (related to #89)
Throwing an error when waiting on an event returned from a queue submission recorded to a graph.
calling queue::wait() on a queue in the recording state

Enable multiple executable graphs in PoC

The current graphs PoC(#5) is limited to a single executable graph.

This Issue is to address this limitation so that a user can create and submit multiple executable graphs for execution. A test for this functionality should ideally accompany the change.

Should an exception be thrown if a graph contains a cycle

The free function make_edge() allows users to create a cycle in a graph, which goes against the DAG definition of a graph.

If we do want to error on cycles, checking for them in make_edge would turn it into a costly operation, especially for graphs at large scale. Because you would have to run the analysis each time you connect two nodes. Instead, having this error when the executable graph is created would avoid repeatedly paying the cost of analysing the graph.

It may also be the kind of check that would be useful for users while developing an application, but once they actually want to release the application would be unwanted overhead. So we could specify a mechanism for making this check optional. CUDA has cudaGraphInstantiateWithFlags, we could potentially add a flags parameter to finalize() that would allow a user to pass in configuration information for creating the executable graph? E.g a validate flag that would do this cycle check and possibly other verifications. Other optimization flags could be added here in the future if deemed necessary.

Update implementation to reflect device specific graph changes

Our implementation needs to be updated to match the changes introduced in spec PR #83
Most importantly, the API change to add a device to the graph.add() entry-point, but the error checks also need implemented.

memcpy operations break on queue with lazy_queue property

Trying to do a memcpy using a queue with the lazy_queue property causes program termination with the following error:

die: The host-visible proxy event missing
terminate called without an active exception

This presents an issue for both example code/test having to manually initialize USM memory and also for future users of a prototype implementation who would likely want to use memcpy.

Nodes captured or added to a graph can fail due to captured variables going out of scope

Because of the delayed submission of the lambda functions passed to the sycl::queue in either mode of graph construction it is possible for a situation to arise where the lambda captures variables which are alive at time of submission to the queue but out of scope at time of graph execution, which causes undefined behaviour and errors when executing a graph. For example:

void run_some_kernel(sycl::queue q, int* data){
  // data is captured by ref here but will have gone out of scope when the 
  // CGF is later run when the graph is executed.
  q.submit([&](sycl::handler& h){
    // Call some kernel which uses data here
  });
}
int main(){

sycl::queue q;

int* someUsmAllocatedPtr;
q.begin_recording();
run_some_kernel(q, someUsmAllocatedPtr);
q.end_recording();
}

This scenario is potentially more common with record and replay but not improbable with the explicit API.

Implement command status query from event returned from graph submission.

Implement setting the correct info::event_command_status on an event returned from graph submission, verified with testing for each status.

Define "topologically identical" in the context of executable graph update

This phrase is used to specify when update() can successfully be used, but is currently not well defined.

Actions feedback from - intel#5626 (comment)

Align record & replay stateful queue recording model with kernel fusion

The graph specification should align with the kernel fusion extension where possible. Kernel fusion previously has a stateful queue model in their extension, but got pushback due to exception safety:

As a result, kernel-fusion ended up with a fusion_wrapper class:

ext::codeplay::experimental::fusion_wrapper w{q};
w.start_fusion();
// queue submissions
w.complete_fusion()

To align with kernel fusion, the complementary change to graphs record & replay model would be to change
A)

ext::oneapi::experimental::command_graph graph;
q.begin_recording(graph);
// queue submissions
q.end_recording();

Into B)

ext::oneapi::experimental::command_graph graph{q};  // could pass more than one queue here?
graph.begin_recording(); 
// queue submissions
graph.end_recording();

We did consider approach B internally at Codeplay for our vendor extension, but decided against it because it could be surprising to the user - e.g it looks like something is being done to graph, but following this call to graph.begin_recording(); the behaviour of q changes dramatically as a side effect (commands are no longer submitted for execution). However, aligning with kernel fusion here and addressing the exception safety issue which also applies to graphs I think is a motivating reason to revise our design.

Update PoC graph execution mechanism

The current PoC code from #5 uses command_graph<graph_state::executable>::exec_and_wait() to execute a graph. This is from an outdated version of the graphs spec, and should be replaced by either queue::submit(command_graph<graph_state::executable> graph)
or handler::exec_graph(command_graph<graph_state::executable> graph)

See #4 (comment)

Decide on graph default device

Motivated by PR review comment #83 (comment)

With the change in a modifiable graph from being device-agnostic to device-specific, the explicit API now takes an extra device parameter to graph.add(). It may make users lives more ergonomic for the most common single-device use cases if the graph had a default device that nodes were created for, and there was an overload of graph.add() which didn't need the device and used the default device instead.

The open question with this is whether a user should be able to change the default device, or if it's immutable on graph construction.

If a user can't change the default device, then a property to the existing graph constructor would be fine for setting the
default device, since there is no set_property in SYCL. But it would be useful for a user to change the default device, then maybe a set_default_device()/get_default_device() pair of functions would be also be worth adding, as well as overloading the current constructor with command_graph(device) too.

Implement querying the event returned by a graph submission for profiling information

Implement the info::event_profiling query from an event returned by a graph submission. This implementation should reflect the outcome of #52

Implement error checking for invalid buffer behaviour

Implement errors defined by buffer limitations spec section - https://github.com/reble/llvm/pull/73/files

it is an error to submit a command which has an accessor on a buffer which would cause a write-back to happen
it is an error to create a host accessor to a buffer which is used in commands which are currently being recorded to a command graph

Change should be accompanied by tests which verify the above behaviour.

queue is recording commands to the graph object.
if src or dest are not valid nodes assigned to the graph object.
if src and dest are the same node.
cycle discovered on finalization.

We should implement checks for these in the code, and test the exceptions are thrown.