OpenCL Cookbook: Parallelise your host loops using OpenCL

Continuing on in our series, this time we look at possibly the most important topic in OpenCL. It is the reason we use OpenCL in the first place and its most compelling benefit: parallelism. But how do we exploit the vast parallelism that GPUs offer? At the simplest level we can do so by exploiting latent areas of parallelism in our host code, the simplest of which are loops. In other words, if we can port a loop in our host code to the GPU, its iterations run in parallel rather than one after another, which for large iteration counts can mean a dramatic speedup. I demonstrate using a small example.

Host loop

[cpp]
void cpu_3d_loop (int x, int y, int z) {
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%d\n", i, j, k);
            }
        }
    }
}
[/cpp]

Imagine the loop above in our C++ host code. It is not one loop but three nested loops; in other words, it has three dimensions. The total number of iterations is x*y*z. If x=4, y=3 and z=2 the total number of iterations is 4*3*2 = 24. On the CPU these iterations execute serially, which is fine for a small number of iterations but becomes a fundamental bottleneck for large numbers. If this set of loops were ported to the GPU, each iteration would run in parallel, and for the previous example the total number of threads in use would be 24.

A small scale example may not seem impressive at first. You could argue that you could just as well run 24 threads on the CPU. But consider this: what happens when the same set of loops in your host code performs thousands or even millions of iterations? You cannot achieve that degree of hardware parallelism on the CPU. GPUs have hundreds of cores and offer a far greater degree of parallelism, so loops with large iteration counts become easy work for the GPU, which can schedule thousands or even millions of threads effectively. Below I demonstrate how to port such a loop to OpenCL.

Host binding code

[cpp]
#define __NO_STD_VECTOR
#define __CL_ENABLE_EXCEPTIONS

#include <cstdio>
#include <fstream>
#include <iostream>
#include <iterator>
#include <CL/cl.hpp>

using namespace cl;

void cpu_3d_loop (int x, int y, int z) {
    for (int i = 0; i < x; i++) {
        for (int j = 0; j < y; j++) {
            for (int k = 0; k < z; k++) {
                printf("CPU %d,%d,%d\n", i, j, k);
            }
        }
    }
}

int main () {

    // CPU 3d loop
    int x = 4;
    int y = 3;
    int z = 2;
    cpu_3d_loop(x, y, z);
    std::cout << std::endl;

    // GPU 3d loop
    vector<Platform> platforms;
    vector<Device> devices;

    try {

        // create platform, context and command queue
        Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        Context context(devices);
        CommandQueue queue(context, devices[0]);

        // load opencl source
        std::ifstream cl_file("kernels.cl");
        std::string cl_string(std::istreambuf_iterator<char>(cl_file),
                              (std::istreambuf_iterator<char>()));
        Program::Sources source(1, std::make_pair(cl_string.c_str(),
                                                  cl_string.length() + 1));

        // create program and kernel
        Program program(context, source);
        program.build(devices);
        Kernel kernel(program, "ndrange_parallelism");

        // execute kernel over a 3d global work size and wait for completion
        NDRange global_work_size(x, y, z);
        queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);
        queue.finish();

    } catch (const Error& e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }

    return 0;
}
[/cpp]
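One practical note: with __CL_ENABLE_EXCEPTIONS defined, a kernel compilation failure surfaces as a cl::Error thrown from program.build() carrying only a numeric error code. If you want to see the compiler's actual message you can ask for the build log. Here is a minimal sketch, assuming the same program and devices variables as above; where exactly you place this catch is up to you:

[cpp]
try {
    program.build(devices);
} catch (const Error& e) {
    // on a build failure, print the compiler output for the first device
    std::string log = program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]);
    std::cout << "build failed (" << e.err() << "):" << std::endl << log << std::endl;
    throw; // rethrow so the outer handler still sees the error
}
[/cpp]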

The host binding code runs the CPU loop first and then runs the equivalent logic on the GPU. Both the CPU and GPU runs print which iteration they are processing. The key lines that port the loop are below.

[cpp]
NDRange global_work_size(x, y, z);
queue.enqueueNDRangeKernel(kernel, NullRange, global_work_size, NullRange);
[/cpp]

Here we set three upper bounds, one for each loop; together these are known as the global work size. The two NullRange arguments say that we want no global offset and that the runtime should choose the local work group size itself. The kernel can then retrieve the indices of the currently executing iteration from within the kernel itself, as shown below, and use them to do whatever work was inside the loop. In this case we just print the indices for illustration.

Kernel code

The kernel you see below is executed x*y*z times with different values for i, j and k. See? No loops! 🙂

[cpp]
__kernel void ndrange_parallelism () {

    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);

    printf("GPU %d,%d,%d\n", i, j, k);

}
[/cpp]

The output of running the above host code is as follows.

CPU 0,0,0
CPU 0,0,1
CPU 0,1,0
CPU 0,1,1
CPU 0,2,0
CPU 0,2,1
CPU 1,0,0
CPU 1,0,1
CPU 1,1,0
CPU 1,1,1
CPU 1,2,0
CPU 1,2,1
CPU 2,0,0
CPU 2,0,1
CPU 2,1,0
CPU 2,1,1
CPU 2,2,0
CPU 2,2,1
CPU 3,0,0
CPU 3,0,1
CPU 3,1,0
CPU 3,1,1
CPU 3,2,0
CPU 3,2,1

GPU 0,0,0
GPU 1,0,0
GPU 2,0,0
GPU 3,0,0
GPU 0,1,0
GPU 1,1,0
GPU 2,1,0
GPU 3,1,0
GPU 0,2,0
GPU 1,2,0
GPU 2,2,0
GPU 3,2,0
GPU 0,0,1
GPU 1,0,1
GPU 2,0,1
GPU 3,0,1
GPU 0,1,1
GPU 1,1,1
GPU 2,1,1
GPU 3,1,1
GPU 0,2,1
GPU 1,2,1
GPU 2,2,1
GPU 3,2,1

NOTE: Although the GPU output above may appear to follow a sequence, that ordering is incidental to the use of printf(). In reality the order in which the GPU processes the iterations is arbitrary, so you must not rely on iteration order when porting loops to the GPU. If your loop genuinely needs to run in a certain order you can either keep it on the host or port only those parts of it that do not need to be sequential, as sketched below.
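To make that concrete, here is a minimal sketch of an order-independent port. Instead of printing, each work-item writes its result into its own slot of an output buffer, so it does not matter which iteration finishes first. The kernel name fill_indices, the flattened index scheme and the host-side buffer handling are illustrative assumptions, not part of the program above:

[cpp]
// kernels.cl -- hypothetical kernel: one output slot per work-item
__kernel void fill_indices (__global int* out, int y, int z) {
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    // each work-item writes only its own element, so execution order is irrelevant
    out[(i * y + j) * z + k] = i * 100 + j * 10 + k;
}
[/cpp]

On the host side, using the same C++ bindings as in the program above, the buffer is created, passed as a kernel argument and read back after the kernel completes:

[cpp]
Buffer out(context, CL_MEM_WRITE_ONLY, sizeof(int) * x * y * z);
Kernel kernel(program, "fill_indices");
kernel.setArg(0, out);
kernel.setArg(1, y);
kernel.setArg(2, z);
queue.enqueueNDRangeKernel(kernel, NullRange, NDRange(x, y, z), NullRange);

// blocking read: results come back in index order regardless of execution order
int* results = new int[x * y * z];
queue.enqueueReadBuffer(out, CL_TRUE, 0, sizeof(int) * x * y * z, results);
// ... use results ...
delete[] results;
[/cpp]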

Why use GPU computing?

Although this example is fairly simple, it illustrates the most important value that GPU computing and OpenCL offer. Hardware parallelism is the essence of GPU computing and the most compelling reason to use it. If you imagine a legacy codebase and all the latent areas of parallelism currently running sequentially, you can imagine the vast untapped power of GPGPU. Later in the series we will look at techniques for porting existing host code to the GPU. That process can be difficult, but it can provide dramatic gains in performance far beyond the limits of CPU computing. Till next time.
