I am implementing the split operation as described here: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html. I am currently trying to implement the scatter step.
Scatter performs an operation similar to OpenCL's shuffle (here: https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/shuffle.html), but on general work-group (__local) arrays instead of vectors.
I looked for support for this operation in the OpenCL docs, but with no luck.
Is there native OpenCL support for this operation, and if not, is there a paper or article where I can find an efficient algorithm to implement it?
One note that might be helpful: I am using AMD's implementation, so an AMD-specific solution would be fine. However, a general solution would be much better.
Here is what I am trying to accomplish in code. This is an OpenCL function that will later be called by another kernel that implements radix sort:
void Split(__local uint local_keys[256], uint cur_bit, int local_id)
{
    __local uint e[256];
    __local uint f[256];
    __local uint t[256];
    __local uint d[256];

    // e: 1 if the current bit of the key is 0 (a "false"), otherwise 0
    e[local_id] = !(local_keys[local_id] & (1u << cur_bit));
    // f: exclusive prefix sum of e, i.e. the destination index of each false
    f[local_id] = work_group_scan_exclusive_add(e[local_id]);
    work_group_barrier(CLK_LOCAL_MEM_FENCE);

    uint total_falses = e[255] + f[255];
    // t: destination index of each true, placed after all the falses
    t[local_id] = local_id - f[local_id] + total_falses;
    // d: final destination index of this work-item's key
    d[local_id] = e[local_id] ? f[local_id] : t[local_id];
    work_group_barrier(CLK_LOCAL_MEM_FENCE);

    // TODO scatter .... (Please see last step of the attached image)
    .......
}
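
To make the intended result of the scatter step concrete, here is a minimal sequential sketch in plain C (host side, not OpenCL) of what one call to Split should leave in local_keys. The name split_reference, the GROUP_SIZE constant, and the temporary out array are assumptions made for illustration only; each loop iteration stands in for one work-item. In a kernel, the last step would presumably correspond to every work-item reading its own key into a private variable, a barrier, and then a write to local_keys[d[local_id]], but that is an untested assumption and not something taken from the OpenCL docs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GROUP_SIZE 256 /* assumed: matches the 256-element arrays in Split */

/* Sequential reference for one split pass on bit `cur_bit`.
 * Each loop iteration plays the role of one work-item. */
void split_reference(uint32_t keys[GROUP_SIZE], uint32_t cur_bit)
{
    uint32_t e[GROUP_SIZE], f[GROUP_SIZE], d[GROUP_SIZE], out[GROUP_SIZE];
    uint32_t running = 0;
    size_t i;

    assert(cur_bit < 32);

    /* e = "bit is 0" flag, f = exclusive prefix sum of e */
    for (i = 0; i < GROUP_SIZE; ++i) {
        e[i] = !(keys[i] & (UINT32_C(1) << cur_bit));
        f[i] = running;
        running += e[i];
    }
    /* Same value as e[255] + f[255] in the kernel version */
    uint32_t total_falses = running;

    /* d = destination index: falses go to f, trues go after all falses */
    for (i = 0; i < GROUP_SIZE; ++i) {
        uint32_t t = (uint32_t)i - f[i] + total_falses;
        d[i] = e[i] ? f[i] : t;
    }

    /* Scatter: element i moves to position d[i]. A temporary buffer is
     * used so that no key is overwritten before it has been read. */
    for (i = 0; i < GROUP_SIZE; ++i)
        out[d[i]] = keys[i];
    for (i = 0; i < GROUP_SIZE; ++i)
        keys[i] = out[i];
}
```

Since d is a permutation of 0..255, every destination is written exactly once, and calling split_reference for bits 0 through 7 in turn should sort 8-bit keys, which gives a simple way to check a kernel's output against this reference.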
