CSC400: Parallel Processing on a GPU


Jennifer Sadler, Advising professor: Nick Howe

See Project Progress for a detailed journal of the research process.

Introduction

Rand-a-4.png Rand-gdta-4.png
The scaled image representations in MATLAB of a randomly generated 500x500 matrix and its GDT, computed in 3 ms.

Goals

There are several components to this project, which has the ultimate goal of producing a parallelized implementation of a Generalized Distance Transform (GDT) algorithm in CUDA. This is in some ways a continuation of the work of Janet Guo, Millie Walsh, and Julia Burch with Nick in the Fall 2010 semester, and their work has proven to be a valuable resource.

Brief Timeline

There are two main components to this project: conceptual understanding (the GDT and its applications) and programming (the implementation in CUDA).

  • Conceptual Understanding
    • At the start of the semester, I spent most of my time familiarizing myself with the GDT and the algorithm for it, as outlined in the paper by Felzenszwalb and Huttenlocher.
    • In October, I noticed a connection between Voronoi diagrams and a possible application of the GDT. Distance transforms can be used to calculate the medial axis of polygons, which is closely related to Voronoi diagrams. This piqued my interest, and I was curious to see what relevant applications might arise.
    • In implementing the integral image in CUDA, I gained a much more solid understanding of how the parallelized GDT works.
    • With the parallelization implemented, it was easy to see how the transform changes images.
    • Near the end of the semester, attempts at designing a reconciliation method for a block-wise parallelization raised hard questions and proved to be a much more difficult problem than expected.
  • Programming
    • In the first stage of this project, I had to get CUDA, MATLAB, and Visual C++ all working together so that we could properly compile .cu (CUDA) files for our use. There were many delays in this process due to conflicting software versions and a mysterious bug that was eventually resolved by simply renaming a batch file; getting the compiler running smoothly likely took up most of the project's time.
    • Once the compiler was up and running, we moved on to writing CUDA programs.
    • Before diving into the GDT, we first developed a parallelization of the integral image on a matrix, which served as a good model for the parallelization of the GDT (a sketch of its row pass appears after this list).
    • I completed a parallelization of the GDT with the simple approach (i.e. all columns at once, and then all rows at once) in early December.
    • We hit conceptual roadblocks in developing the more advanced "block" parallelization, which involved dividing a matrix into sub-matrices, applying the GDT to each, and then reconciling the pieces into a final result.
    • The last weeks of the semester were spent running timing tests with MATLAB's tic/toc functions, comparing the run time of the parallelized GDT on the GPU against a non-parallel GDT on the CPU, and observing how matrix dimensions affect the efficiency of the parallelized version.
    • In the very last hours of the semester, I discovered a serious bug likely rooted in memory allocation problems.
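
To make the integral image step concrete, here is a minimal sketch of the kind of row-pass kernel involved; the name and the naive one-thread-per-row scan are my own assumptions rather than the project's actual code.

 #include <cuda_runtime.h>

 // Row pass of an integral image on an h-by-w row-major matrix: each
 // thread forms the running sum of one row. A second, identical pass over
 // the columns completes the integral image, and this row/column split is
 // the same structure the parallelized GDT follows.
 __global__ void scan_rows(const float* in, float* out, int w, int h)
 {
     int row = blockIdx.x * blockDim.x + threadIdx.x;
     if (row >= h) return;

     float sum = 0.0f;
     for (int col = 0; col < w; col++) {
         sum += in[row * w + col];    // running total along the row
         out[row * w + col] = sum;
     }
 }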

Background Information

Generalized Distance Transform

Distance transforms are operations performed on binary images that can produce a wide variety of results depending on the distance measure used. They have many uses, supporting image processing, image comparison, and pattern detection techniques, and they are a powerful tool in computer vision; but their usefulness is limited by the restriction to binary images.

The Generalized Distance Transform (GDT) is an extension of the regular distance transform to multi-valued images, and it has at least as many valuable uses in mathematics and computer science as the regular distance transform, including more advanced image processing, energy minimization, and object recognition. The GDT also has untapped potential: to our knowledge it has yet to be implemented with parallel processing, and the algorithm for it developed by Felzenszwalb and Huttenlocher lends itself readily to parallelization. Like the regular distance transform, the GDT can be implemented with a number of different measures that yield different results; in this implementation we used the squared Euclidean distance, not least because it allows the transform to be computed one dimension at a time.
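
Concretely, following Felzenszwalb and Huttenlocher's formulation, if the image is viewed as a function f over the grid of pixels G, then with the squared Euclidean measure the GDT is

 D_f(p) = \min_{q \in G} \left( \lVert p - q \rVert^2 + f(q) \right)

The regular binary distance transform is recovered as the special case where f(q) = 0 at feature pixels and f(q) = \infty everywhere else.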

To understand how the GDT of an image is computed, it is easiest to think of the image as a two-dimensional array of values. The algorithm I implemented processes each row of the array by computing the lower envelope of the parabolas rooted at the row's values, then uses that envelope and the locations of its parabolas' roots to compute the squared Euclidean distance transform of the row. We then apply the same computation to each column of the array produced by the row-wise pass, which yields the final result. Here are three matrices, represented as images in MATLAB, alongside their GDTs under the squared Euclidean distance:

Rand-a-1.png Rand-gdta-1.png
Rand-a-2.png Rand-gdta-2.png
Rand-a-3.png Rand-gdta-3.png
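
Since the one-dimensional pass is where all the work happens, here is a minimal sketch of it as a CUDA kernel, with one thread per row. The kernel name, the launch scheme, and the per-row scratch layout are assumptions of mine for illustration, not the project's actual code.

 #include <cuda_runtime.h>
 #include <math_constants.h>  // for CUDART_INF_F

 // One-dimensional squared-Euclidean GDT of every row of an h-by-w
 // row-major matrix, one thread per row, using Felzenszwalb and
 // Huttenlocher's lower-envelope algorithm. v must hold h*w ints and
 // z must hold h*(w+1) floats of scratch space.
 __global__ void gdt_rows(const float* f, float* d, int* v, float* z,
                          int w, int h)
 {
     int row = blockIdx.x * blockDim.x + threadIdx.x;
     if (row >= h) return;

     const float* fr = f + row * w;   // this thread's input row
     float* dr = d + row * w;         // this thread's output row
     int*   vr = v + row * w;         // roots of the envelope's parabolas
     float* zr = z + row * (w + 1);   // boundaries between parabolas

     int k = 0;                       // rightmost parabola in the envelope
     vr[0] = 0;
     zr[0] = -CUDART_INF_F;
     zr[1] =  CUDART_INF_F;

     // Build the lower envelope of the parabolas y = (x - q)^2 + fr[q].
     for (int q = 1; q < w; q++) {
         float s = ((fr[q] + q * q) - (fr[vr[k]] + vr[k] * vr[k]))
                   / (2.0f * (q - vr[k]));
         while (s <= zr[k]) {         // the new parabola hides the last one
             k--;
             s = ((fr[q] + q * q) - (fr[vr[k]] + vr[k] * vr[k]))
                 / (2.0f * (q - vr[k]));
         }
         k++;
         vr[k] = q;
         zr[k] = s;
         zr[k + 1] = CUDART_INF_F;
     }

     // Read off the envelope: dr[q] is the lowest parabola's height at q.
     k = 0;
     for (int q = 0; q < w; q++) {
         while (zr[k + 1] < q) k++;
         float diff = (float)(q - vr[k]);
         dr[q] = diff * diff + fr[vr[k]];
     }
 }

Because each row (and later each column) is independent, this maps directly onto the "simple approach" parallelization mentioned in the timeline; the column pass is the same computation with elements strided w apart in memory.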

CUDA and the GPU

Compute Unified Device Architecture (CUDA) is the parallel computing architecture created by NVIDIA for use with its graphics processing units (GPUs). It lets developers harness the parallel processing capabilities of the GPU to obtain greater computational speed and efficiency than the CPU alone can provide.

The language used with CUDA is an extension of C, and to compile we used MATLAB 2011a with Visual C++ 2010 (v10.0), Microsoft SDK 7.1, and the CUDA Toolkit 4.0. Compiling a CUDA program for our use was a two-step process: first the .cu CUDA file is translated to a .cpp C++ file, and then the .cpp file is compiled to a .mexw64 MEX file that can run in MATLAB.
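
For concreteness, a .cu file in this pipeline might take roughly the following shape: a MEX gateway that ships a matrix to the GPU, launches a one-dimensional pass over every MATLAB column, and copies the result back. The gateway name, the kernel (the gdt_rows sketch above), and the single-precision assumption are illustrative, not the project's actual code.

 #include "mex.h"
 #include <cuda_runtime.h>

 // Row-pass kernel as sketched in the background section.
 __global__ void gdt_rows(const float* f, float* d, int* v, float* z,
                          int w, int h);

 // MEX gateway: once compiled to a .mexw64 file, MATLAB calls this like
 // any function, e.g. B = gdtgpu(single(A)). Assumes a real single matrix.
 void mexFunction(int nlhs, mxArray* plhs[], int nrhs, const mxArray* prhs[])
 {
     int h = (int)mxGetM(prhs[0]);    // rows of the input matrix
     int w = (int)mxGetN(prhs[0]);    // columns
     size_t bytes = (size_t)w * h * sizeof(float);

     float *d_in, *d_out, *d_z;
     int *d_v;
     cudaMalloc(&d_in, bytes);
     cudaMalloc(&d_out, bytes);
     cudaMalloc(&d_v, (size_t)w * h * sizeof(int));          // scratch
     cudaMalloc(&d_z, (size_t)w * (h + 1) * sizeof(float));  // scratch
     cudaMemcpy(d_in, mxGetData(prhs[0]), bytes, cudaMemcpyHostToDevice);

     // MATLAB stores matrices column-major, so each of the w columns is a
     // contiguous run of h values; we hand those runs to the kernel as its
     // "rows". The second (row-direction) pass would stride by h instead.
     gdt_rows<<<(w + 255) / 256, 256>>>(d_in, d_out, d_v, d_z, h, w);

     plhs[0] = mxCreateNumericMatrix(h, w, mxSINGLE_CLASS, mxREAL);
     cudaMemcpy(mxGetData(plhs[0]), d_out, bytes, cudaMemcpyDeviceToHost);
     cudaFree(d_in); cudaFree(d_out); cudaFree(d_v); cudaFree(d_z);
 }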

Results

For small matrices, the parallelized GDT running on the GPU is actually slower than the non-parallel version on the CPU; we suspect this is due to the overhead of GPU memory allocation. For larger matrices, the implementation appeared to be faster. However, at the very end of the semester I found a bug, likely some kind of memory-related problem, that caused the parallelized GDT to stop working for large matrices and then, once it had failed, to stop working altogether. For square matrices, the tipping point was around 500x500, though the memory error seemed to disturb the timing tests for much smaller square matrices. Repeated calls to the GDT function led to errors much more quickly, regardless of matrix size, and once the program failed on one matrix, it no longer produced accurate transforms at all. This is certainly a problem to investigate going forward.
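
One pattern that would help corner a bug like this (an assumption about standard CUDA practice, not something our code did) is to wrap every CUDA call in an error check, since a faulting kernel leaves the context in a "sticky" error state that silently poisons every later call, which matches the behavior we observed.

 #include <cstdio>
 #include <cstdlib>
 #include <cuda_runtime.h>

 // Report the first failing CUDA call with file and line, instead of
 // letting the error state silently corrupt every subsequent transform.
 #define CUDA_CHECK(call)                                               \
     do {                                                               \
         cudaError_t err = (call);                                      \
         if (err != cudaSuccess) {                                      \
             fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                     cudaGetErrorString(err), __FILE__, __LINE__);      \
             exit(EXIT_FAILURE);                                        \
         }                                                              \
     } while (0)

 // Usage sketch:
 //   CUDA_CHECK(cudaMalloc(&d_in, bytes));
 //   kernel<<<blocks, threads>>>(...);
 //   CUDA_CHECK(cudaGetLastError());        // catches bad launches
 //   CUDA_CHECK(cudaDeviceSynchronize());   // catches in-kernel faults

(In a MEX file one would report through mexErrMsgIdAndTxt rather than exit, but the idea is the same.)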

The memory bug is illustrated below in the graphs of two timing tests run consecutively:

Timetestrun1.png Timetestrun2.png

At around 175x175 in the first timing test, we see a significant drop. This is where the program faults; every subsequent call to the GPU's GDT yields false results.

This late-in-the-game discovery invalidated all previous timing tests. Dang!