CSC400 Page (2010)
General Programming on Graphics Processing Units
- 1 Introduction
- 2 Graphics Processing Units
- 3 Distance Transform Algorithm
- 4 Coding
Introduction
The general goal of this special studies course is to gain a greater understanding of the architecture and programming practices of the GPU.
Our specific goal this semester was to create an implementation of the Distance-Transform algorithm that runs on the GPU. Given a 2D array as input, the GPU returns a 2D array of the same size holding the distance-transform values.
- Janet Ghuo
- Millie Walsh
- Julia Burch
February 4th -- project presentation
Graphics Processing Units
A Graphics Processing Unit (GPU for short) is a many-core processor on a graphics card that is in charge of performing all the complex mathematics necessary to render an image to a computer screen -- essentially it is a highly specialized number cruncher. Because of its parallelization capabilities, it often performs these calculations faster than the CPU.
Although GPUs were designed specifically for graphics rendering, they have recently become more general-purpose parallel processors that can significantly speed up computation-heavy programs.
In the past, the desire for faster processing drove developments and improvements in the design of the central processing unit (CPU). Increasing clock frequency was the primary way that designers improved performance, but this increase came with problems: heat and energy consumption. These limitations have since slowed improvements in clock speed and caused a shift toward multicore designs. Multicore refers to multiple processing cores on one chip and is a way to divide work between processors. This trend places new emphasis on software developers to write applications that take advantage of the new architecture. While these developments were occurring for the CPU, others were developing solutions that would later significantly impact the world of scientific computing.
There is high demand for technical developments in certain sectors of the computer science industry. The video game industry led the rapid development of a new processor called the Graphics Processing Unit, or GPU. The GPU was designed to calculate in parallel the many floating-point operations that are required to create 2D and 3D graphics. These calculations are largely done with matrix multiplication, in which the many calculations that compose the result can be done independently of each other. Computing them in parallel therefore does not change the result, and it greatly reduces the processing time.
Although earlier vendors saw the value of parallel processing in scientific applications, they remained separate from the graphics card market and tried to develop chips that would meet the needs of scientific applications. Companies like Cray and Silicon Graphics were the leaders in this movement because of their interest in the high-performance computing industry.
Researchers who needed parallel processing power at a more affordable price turned to the GPU and ran their scientific computations through the OpenGL API. This demand became apparent to NVIDIA, a prominent graphics card vendor, which developed its own parallel computing architecture: Compute Unified Device Architecture, or CUDA. Competitors joined the market, including AMD and Microsoft with DirectCompute. Most recently, development has focused on OpenCL, an open standard for parallel programming.
The kernel is the portion of the program that contains the data-parallel functions that will be executed on the GPU. A CUDA or OpenCL program consists of non-data-parallel portions that are executed on the host (CPU) and data-parallel portions that are written in CUDA C or the OpenCL language and are executed on the device (GPU).
In scientific computing today, large amounts of data are processed. When processing this data, we look to the looping structures to find instances of data parallelism. If the operations performed inside a loop do not depend on the results of previous iterations, then these operations can be performed independently, that is, in parallel.
The Wikipedia page for parallel computing provides a good example of a loop that cannot be made parallel because of “loop-carried dependency”. http://en.wikipedia.org/wiki/Parallel_computing#Data_parallelism
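To make the distinction concrete, here is a minimal Python sketch. The function names are made up for illustration and are not from the project code. The first loop has independent iterations, so visiting the indices in any order gives the same answer. The second loop carries a running total from one iteration to the next, so its iterations cannot simply be handed out to separate threads.

```python
def square_each(values, order=None):
    """Independent iterations: out[i] depends only on values[i]."""
    if order is None:
        order = range(len(values))
    out = [0] * len(values)
    for i in order:  # any visiting order produces the same result
        out[i] = values[i] * values[i]
    return out


def running_sum(values):
    """Loop-carried dependency: iteration i needs the total from i - 1."""
    out = [0] * len(values)
    total = 0
    for i in range(len(values)):
        total += values[i]  # reads the value written by the previous iteration
        out[i] = total
    return out


data = [1, 2, 3, 4]
print(square_each(data))                      # [1, 4, 9, 16]
print(square_each(data, order=[3, 1, 0, 2]))  # [1, 4, 9, 16]
print(running_sum(data))                      # [1, 3, 6, 10]
```

Only the first kind of loop maps directly onto one GPU thread per iteration.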
As mentioned above, a thread is a piece of code that is mapped over the hardware. An iteration of a data-independent loop would be a single thread when computed on a GPU.
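The idea of one thread per loop iteration can be imitated on the CPU. The sketch below is only an analogy in Python, not CUDA or OpenCL code: `add_kernel` and `launch` are hypothetical names, and a small thread pool stands in for the GPU. The "kernel" holds the body of a single iteration and receives a thread id in place of the loop index, while the "host" code decides how many threads to start.

```python
from concurrent.futures import ThreadPoolExecutor


def add_kernel(thread_id, a, b, out):
    """Body of one loop iteration; thread_id replaces the loop index."""
    out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel, n_threads, *args):
    """Hypothetical host-side helper: run the kernel once per thread id."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() waits for every call to finish and re-raises any exception
        list(pool.map(lambda tid: kernel(tid, *args), range(n_threads)))


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
out = [0] * len(a)
launch(add_kernel, len(a), a, b, out)
print(out)  # [11, 22, 33, 44]
```

Because each call writes to a different element of `out`, the calls do not interfere with each other, which is the property a real GPU kernel relies on.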
Many-core v. Multi-core
GPUs are many-core devices because they are composed of more than 128 processors, or cores. For example, the latest NVIDIA GPU (Fermi) is composed of 448 cores. This term is not to be confused with the more commonly heard term: multi-core. Multi-core refers to CPUs with several processors on one chip, often dual-core or quad-core.
Distance Transform Algorithm
These papers provide some background on the target algorithm.
From Pedro Felzenszwalb:
currPara = 0
paraLocations[0] = 0
paraBoundaries[0] = negative infinity
paraBoundaries[1] = infinity
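The fragment above looks like the initialization step of the one-dimensional distance transform described by Felzenszwalb and Huttenlocher, which tracks the lower envelope of a set of parabolas. Under that assumption, a minimal Python sketch of the whole procedure follows. It is not the project's GPU code: `curr_para`, `para_locations` and `para_boundaries` mirror the names in the fragment, while `LARGE`, `distance_transform_1d` and `distance_transform_2d` are names chosen here for illustration. The result is the squared Euclidean distance, and `LARGE` is an assumed stand-in for infinity that keeps the arithmetic finite.

```python
import math

LARGE = 1e20  # assumed stand-in for "no feature at this cell"


def distance_transform_1d(f):
    """Squared-distance transform of a sampled 1D function f."""
    n = len(f)
    d = [0.0] * n
    para_locations = [0] * n             # positions of envelope parabolas
    para_boundaries = [0.0] * (n + 1)    # boundaries between parabolas
    curr_para = 0                        # index of the rightmost parabola
    para_boundaries[0] = -math.inf
    para_boundaries[1] = math.inf
    for q in range(1, n):
        while True:
            p = para_locations[curr_para]
            # intersection of the parabola rooted at q with the one at p
            s = ((f[q] + q * q) - (f[p] + p * p)) / (2 * q - 2 * p)
            if s <= para_boundaries[curr_para]:
                curr_para -= 1           # parabola p is hidden, drop it
            else:
                break
        curr_para += 1
        para_locations[curr_para] = q
        para_boundaries[curr_para] = s
        para_boundaries[curr_para + 1] = math.inf
    curr_para = 0
    for q in range(n):                   # read values off the envelope
        while para_boundaries[curr_para + 1] < q:
            curr_para += 1
        p = para_locations[curr_para]
        d[q] = (q - p) ** 2 + f[p]
    return d


def distance_transform_2d(grid):
    """Apply the 1D transform to every column, then to every row."""
    rows, cols = len(grid), len(grid[0])
    out = [list(row) for row in grid]
    for c in range(cols):
        col = distance_transform_1d([out[r][c] for r in range(rows)])
        for r in range(rows):
            out[r][c] = col[r]
    for r in range(rows):
        out[r] = distance_transform_1d(out[r])
    return out


grid = [[LARGE, LARGE, LARGE],
        [LARGE, 0, LARGE],
        [LARGE, LARGE, LARGE]]
for row in distance_transform_2d(grid):
    print(row)
```

In the 2D version, each column pass is independent of the other columns, and each row pass is independent of the other rows, so each pass is a natural candidate for one GPU thread per column or row.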
Coding
While CUDA is a framework specific to NVIDIA cards, OpenCL is a more general framework that works with graphics cards from multiple vendors. There are several tradeoffs to using OpenCL over CUDA:
- The CUDA API makes it extremely easy for a user to figure out how to program for the GPU. Because OpenCL targets hardware from many vendors, communication between the CPU and the GPU is largely left to the user to handle.
- In order to maximize performance with OpenCL, you really do have to program for whichever specific GPU you are trying to use. Because CUDA targets only NVIDIA GPUs, it is already set up and configured to get the best performance from them.
- Using CUDA
- I was able to install CUDA on a machine in the Seelye basement lab (SB8-MAC03). CUDA should remain on the computer as long as nobody needs to re-image the Mac-side of the computer (re-imaging just the Windows-side should have no effect).--Jguo 17:20, 12 October 2010 (UTC)
- Matlab and CUDA
Currently, OpenCL code can be compiled and run on Athena in the research lab on the 3rd floor of Ford. In theory, our work should compile and run on any computer with an appropriate graphics card, the ATI SDK, and Visual Studio 2010 Express C++ Edition. --Jburch 19:58, 14 October 2010 (UTC)