Description
My application
I am working on a 3D noise algorithm that I am trying to rewrite to enable parallel execution on the GPU using TornadoVM.
A short explanation of the algorithm: each (x, y, z) point is fed into an initial noise function along with an RNG, and that function outputs a double. This may happen multiple times for the same (x, y, z), and the RNG will advance, producing different results.
Then all the noise results from the initial functions are combined and composed using other functions.
For example, a shortened function to calculate a final value:
add(sqrt(add(mul(noise(x,y,z, rng), 2), noise(x,y,z))), 5)
The actual function compositions used in the algorithm are much longer than this, and there are multiple compositions used for different purposes.
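Written as ordinary Java, the shortened composition above is roughly the following (just a restatement of that line; noise and rng are placeholders for the real initial noise function and RNG):

```java
// Same shortened composition as plain CPU code; noise(...) is a placeholder
// for the real initial noise function, rng for the RNG.
static double sample(double x, double y, double z, java.util.SplittableRandom rng) {
    return Math.sqrt(noise(x, y, z, rng) * 2.0 + noise(x, y, z)) + 5.0;
}
```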
My implementation was to create a Task for each of the functions used and compose them into a TaskGraph.
For example, a noise task that outputs a DoubleArray corresponding to a given 3D volume and RNG, and add, mul, etc. tasks that take DoubleArrays and operate on them.
Then I programmatically build a TaskGraph that corresponds to the same composition made by the original algorithm, and execute it when needed.
Note that data transfers with the CPU happen only at the start (x, y, z, RNG) and at the end (the final DoubleArray).
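For reference, here is a minimal sketch of the kind of graph I build, assuming TornadoVM's TaskGraph / TornadoExecutionPlan API and its DoubleArray type. All names are illustrative, the RNG is reduced to a plain seed, and the real composition contains many more tasks:

```java
import uk.ac.manchester.tornado.api.ImmutableTaskGraph;
import uk.ac.manchester.tornado.api.TaskGraph;
import uk.ac.manchester.tornado.api.TornadoExecutionPlan;
import uk.ac.manchester.tornado.api.annotations.Parallel;
import uk.ac.manchester.tornado.api.enums.DataTransferMode;
import uk.ac.manchester.tornado.api.types.arrays.DoubleArray;

public class NoiseGraph {

    // Placeholder initial noise task: fills `out` for the volume encoded in `coords`.
    static void noise(DoubleArray coords, DoubleArray out, long seed) {
        for (@Parallel int i = 0; i < out.getSize(); i++) {
            out.set(i, 0.0); // real noise function goes here
        }
    }

    // Element-wise tasks, one per primitive operation of the composition.
    static void mulScalar(DoubleArray in, double k, DoubleArray out) {
        for (@Parallel int i = 0; i < out.getSize(); i++) {
            out.set(i, in.get(i) * k);
        }
    }

    static void add(DoubleArray a, DoubleArray b, DoubleArray out) {
        for (@Parallel int i = 0; i < out.getSize(); i++) {
            out.set(i, a.get(i) + b.get(i));
        }
    }
    // ... sqrt, addScalar, etc. follow the same pattern.

    static void run(DoubleArray coords, DoubleArray result, long seed) {
        int n = result.getSize();
        DoubleArray n0 = new DoubleArray(n);   // intermediate buffers,
        DoubleArray n1 = new DoubleArray(n);   // only ever touched on the device
        DoubleArray t0 = new DoubleArray(n);

        TaskGraph tg = new TaskGraph("noise")
            .transferToDevice(DataTransferMode.FIRST_EXECUTION, coords)
            .task("noise0", NoiseGraph::noise, coords, n0, seed)
            .task("noise1", NoiseGraph::noise, coords, n1, seed + 1)
            .task("mul",    NoiseGraph::mulScalar, n0, 2.0, t0)
            .task("add",    NoiseGraph::add, t0, n1, result)
            // ... one task per remaining operation (sqrt, add 5, ...)
            .transferToHost(DataTransferMode.EVERY_EXECUTION, result);

        ImmutableTaskGraph itg = tg.snapshot();
        TornadoExecutionPlan plan = new TornadoExecutionPlan(itg);
        plan.execute();
    }
}
```

Each .task(...) above ends up as its own kernel, which is exactly the overhead described in the next section.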
The issue
From my understanding, TornadoVM creates a GPU kernel for each of the Tasks I have defined and manages the execution of each kernel from the CPU.
That means each small add or mul over arrays on the GPU requires loading and launching a separate kernel, and I suspect this per-kernel overhead slows down my implementation.
My suggestion
I suppose most backends support compiling multiple functions that pass arrays from one to the other into a single kernel (though I'm only familiar with OpenCL and PTX).
I think there are two options to enable this:
- Group tasks which depend only on the previous task and do not need new external data, according to the dependency graph, and compile each group to a single kernel.
- Allow the user to annotate tasks to be grouped with other tasks for single-kernel compilation (a hypothetical sketch follows below).
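Purely to illustrate the second option, the user-facing API could look something like this. Nothing below exists in TornadoVM today, the taskInGroup name is made up, and it reuses the illustrative names from the sketch above:

```java
// Hypothetical API sketch for option 2: none of these grouping calls exist today.
// Idea: tasks that share a group id, and whose intermediate arrays never leave
// the device, would be compiled into one kernel instead of one kernel each.
TaskGraph tg = new TaskGraph("noise")
    .transferToDevice(DataTransferMode.FIRST_EXECUTION, coords)
    .task("noise0", NoiseGraph::noise, coords, n0, seed)
    .task("noise1", NoiseGraph::noise, coords, n1, seed + 1)
    .taskInGroup("g0", "mul", NoiseGraph::mulScalar, n0, 2.0, t0)   // hypothetical
    .taskInGroup("g0", "add", NoiseGraph::add, t0, n1, result)      // hypothetical
    .transferToHost(DataTransferMode.EVERY_EXECUTION, result);
```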
Alternatives I have considered
I could write a code generator that takes the function compositions and generates single tasks, each containing an entire composition,
but that would produce a lot of repeated code, and I think it defeats the purpose of the TaskGraph.
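For comparison, this is the kind of hand-fused task such a generator would have to emit, with the whole (shortened) composition inlined into one method. It is a sketch only, with the noise calls left as placeholders:

```java
// One fused task containing the entire shortened composition:
// a single kernel instead of one per operation, but the composition
// is now hard-coded rather than built from the TaskGraph.
static void fusedComposition(DoubleArray coords, DoubleArray result, long seed) {
    for (@Parallel int i = 0; i < result.getSize(); i++) {
        double a = 0.0; // placeholder for noise(x, y, z, rng) at element i
        double b = 0.0; // placeholder for noise(x, y, z) at element i
        result.set(i, Math.sqrt(a * 2.0 + b) + 5.0);
    }
}
```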
When I first found TornadoVM, I was very happy with the idea of being able to dynamically build TaskGraphs.
Now I feel like that idea isn't used to its full potential, because in the end each task is compiled and executed by itself, and the execution of every small task is managed by the CPU.
I know there might be other reasons for this design, but that is my impression, and I would be happy to hear other views.
Additional context
I can't change the algorithm in ways that would affect the result, for example by using a different noise function.
The results have to be consistent with the CPU.
I have made a working implementation, but it is currently slower than the CPU version.
Also, there are other factors, unrelated to this issue, which probably slow my implementation down.