OpenCL Cookbook: How to leverage multiple devices in OpenCL

So far, in the OpenCL Cookbook series, we’ve only looked at utilising a single device for computation. But what happens when you install more than one card in your host machine? How do you scale your computation across multiple GPUs? Will your code automatically scale to multiple devices or does it require you to consciously think about how to distribute the load of the computation across all available devices and change your code to apply that strategy? Here I look at answers to these questions.

Decide on how you want to use the host binding to support multiple devices

There are two ways in which a given host binding can support multiple devices.

  • A single context shared across all devices, with one command queue per device.
  • One context and one command queue per device.

Let’s look at these in more detail with skeletal implementations in C++ against the OpenCL C API.

Creating a single context across all devices and one command queue per device

In this approach we create only one context and share it across one command queue per device. So if we have, say, two devices we’ll have one context and two command queues, each of which shares that one context.

[c]
#include <CL/opencl.h>

int main() {

    cl_int err;

    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);

    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);

    // get all devices
    cl_device_id* devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);

    // create a single context spanning all devices
    cl_context context = clCreateContext(NULL, deviceCount, devices, NULL, NULL, &err);

    // for each device create a separate queue sharing the one context
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (cl_uint i = 0; i < deviceCount; i++) {
        queues[i] = clCreateCommandQueue(context, devices[i], 0, &err);
    }

    /*
     * Here you have one context across all devices and one command queue per device.
     * You can choose to send your tasks to any of these queues depending on which
     * device you want to execute the task on.
     */

    // cleanup
    for (cl_uint i = 0; i < deviceCount; i++) {
        clReleaseCommandQueue(queues[i]);
        clReleaseDevice(devices[i]);
    }

    clReleaseContext(context);

    delete[] devices;
    delete[] queues;

    return 0;
}
[/c]
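One caveat before moving on: the skeletons here assign every call’s return code to err but never inspect it. In real code you should check each return code before proceeding. Here is a minimal sketch of one way to do so, using a CHECK macro of my own invention (not part of OpenCL):

[c]
#include <CL/opencl.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical convenience macro -- aborts with the failing error code.
#define CHECK(err)                                                            \
    if ((err) != CL_SUCCESS) {                                                \
        fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); \
        exit(EXIT_FAILURE);                                                   \
    }

// usage:
//   err = clGetPlatformIDs(1, &platform, NULL);
//   CHECK(err);
[/c]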

Creating one context and one command queue per device

Here I create one context and one command queue per device, so that each queue has its own context rather than sharing one.

[c]
#include <CL/opencl.h>

int main() {

    cl_int err;

    // get first platform
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);

    // get device count
    cl_uint deviceCount;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &deviceCount);

    // get all devices
    cl_device_id* devices = new cl_device_id[deviceCount];
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, deviceCount, devices, NULL);

    // for each device create a separate context AND queue
    cl_context* contexts = new cl_context[deviceCount];
    cl_command_queue* queues = new cl_command_queue[deviceCount];
    for (cl_uint i = 0; i < deviceCount; i++) {
        // note: each context contains only its own device, not all of them
        contexts[i] = clCreateContext(NULL, 1, &devices[i], NULL, NULL, &err);
        queues[i] = clCreateCommandQueue(contexts[i], devices[i], 0, &err);
    }

    /*
     * Here you have one context and one command queue per device.
     * You can choose to send your tasks to any of these queues.
     */

    // cleanup
    for (cl_uint i = 0; i < deviceCount; i++) {
        clReleaseCommandQueue(queues[i]);
        clReleaseContext(contexts[i]);
        clReleaseDevice(devices[i]);
    }

    delete[] devices;
    delete[] contexts;
    delete[] queues;

    return 0;
}
[/c]

How do you scale your computation across multiple devices?

Sadly, the process of utilising multiple devices for your computation is not done automatically by the binding when new devices are detected, nor is it possible for it to do so. It requires active thought from the host programmer. When using a single device you send all your kernel invocations to the command queue associated with that device. In order to use multiple devices you must have one command queue per device, either sharing a context or each with its own, and you must then decide how to distribute your kernel calls across all available queues. That strategy may be as simple as round robin across all queues, as sketched below, or it may be considerably more complex.
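For illustration, here is a minimal sketch of such a round robin strategy, assuming the queues array and deviceCount from the skeletons above and a kernel that has already been built and had its arguments set; dispatchRoundRobin, globalSize and taskCount are names of my own invention:

[c]
#include <CL/opencl.h>

// Hypothetical dispatcher: distributes taskCount identical kernel launches
// round robin across the per-device queues built in the skeletons above.
void dispatchRoundRobin(cl_command_queue* queues, cl_uint deviceCount,
                        cl_kernel kernel, size_t globalSize, int taskCount) {
    for (int task = 0; task < taskCount; task++) {
        // pick the next queue in rotation
        cl_command_queue q = queues[task % deviceCount];
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
    }
    // wait for every device to drain its queue
    for (cl_uint i = 0; i < deviceCount; i++) {
        clFinish(queues[i]);
    }
}
[/c]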

Bear in mind that if your computation entails reading back a result synchronously then a round robin strategy across queues won’t work on its own. Each blocking call will complete before you send work to the next queue, which effectively makes the distribution across queues serial and defeats the whole purpose of having multiple devices operating in parallel. What you really need is one host thread per device, each sending computations to its own command queue. That way each queue receives and processes computations in parallel with the other queues and you achieve true hardware parallelism. A sketch of this pattern follows.
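Here is a minimal sketch of that one-thread-per-device pattern using C++11 std::thread, assuming per-device contexts and queues built as in the second skeleton; runTasksOnDevice and runAll are placeholders of my own, standing in for whatever enqueue-and-read-back loop your application performs:

[c]
#include <CL/opencl.h>
#include <thread>
#include <vector>

// Placeholder for your real work: enqueue kernels and blocking reads on
// this device's own queue without ever touching another device's resources.
void runTasksOnDevice(cl_context context, cl_command_queue queue) {
    // ... clEnqueueNDRangeKernel / clEnqueueReadBuffer on `queue` ...
}

void runAll(cl_context* contexts, cl_command_queue* queues, cl_uint deviceCount) {
    std::vector<std::thread> workers;
    for (cl_uint i = 0; i < deviceCount; i++) {
        // one host thread per device; each thread uses only its own context
        // and queue, keeping all OpenCL structures thread confined
        workers.emplace_back(runTasksOnDevice, contexts[i], queues[i]);
    }
    for (auto& w : workers) {
        w.join();
    }
}
[/c]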

Which of the two ways should you use?

It depends. I would try the single-context option first as it’s likely to use less memory and be faster. If you encounter instability or problems, I would switch to the multiple-context method. That’s the general rule. There is, however, another reason you may opt for multiple contexts. If you are using multiple host threads that all require access to a context, it is preferable for each thread to have its own, as the OpenCL host binding is not guaranteed to be thread safe. If you access a single context from multiple threads you may get serious system crashes and reboots, so always keep your OpenCL structures thread confined.

Using a single context across multiple host threads

You may want to use one thread per device to send tasks to the command queue associated with each device. In this case you will have multiple host threads, but here you have to be careful. In my experience it has not been safe to use a single context across multiple host threads. The last time I tried this was in C# using the Cloo host binding: sharing one context across threads resulted in a Windows 7 blue screen, Windows dumping memory to a file and rebooting, after which Windows failed to come back up until the machine was physically rebooted once more. The solution is to use the multi-context option outlined above. Keep your OpenCL resources thread confined and you’ll be fine.

4 thoughts on “OpenCL Cookbook: How to leverage multiple devices in OpenCL”

  1. Dhruba is focused on using multiple GPUs from one process here but there is a third option, which is to run multiple copies of your program and use one GPU each. This can be a good match with legacy single-threaded libraries and/or single-thread oriented grid compute frameworks (e.g. migrating a DataSynapse one-engine-per-CPU setup to one-engine-per-GPU).

    Note that with older motherboards it doesn’t matter what CPU your host-side threads run on, but the newer Sandy Bridge Xeons have a dedicated link from one of the CPU sockets to each PCIe slot (at least the good ones do: the EVGA SR-3 does not, but that’s one of the reasons why that motherboard is overpriced junk). This means that for optimum performance under the one-thread-per-GPU model, you should set core affinity on your host-side thread to a core on the socket that’s directly connected to the GPU, and allocate the OpenCL buffers yourself from thread-local memory (assuming you have the NUMA memory model enabled in OS + BIOS). That said, this will only make a difference if CPU/GPU and/or inter-CPU aggregate bandwidth is a bottleneck in your app; most workloads do not max out the 100 gigabits+ of internal bandwidth you get on a modern workstation board.

  2. Thanks Michael. Very insightful account of the hardware side of things. Although I covered core affinity and device selection indirectly in my independent posts on those subjects, it would perhaps be helpful to the reader to cover them in another post in relation to utilising multiple GPUs in legacy mode, to illustrate the above point.
