CSC400: Parallel Processing on a GPU/Project Progress

Back to the main page.

September

This month was spent mostly familiarizing myself with the GDT (generalized distance transform) and with background information about CUDA. There were some delays in getting a computer set up: a new power supply had to be ordered for the GPU, and the GPU itself was too large for the computer I had originally planned to use.

Week 1: 9.4-9.10

I had my first meeting with Nick before classes started to discuss the basic ideas behind this project and the kind of work it would entail.

Week 2: 9.11-9.17

I registered for the Special Studies course CSC 400 and began reading the GDT paper. It took several reads to reach a moderate understanding of the concepts. I also spent time exploring NVIDIA's online resources for CUDA developers.

Week 3: 9.18-9.24

Nick was away at a conference, so I continued to wrap my head around the GDT. No computers were yet available for me to use.

Week 4: 9.25-10.1

We were still waiting on a power supply for the GPU, so we had no computer to use. We did, however, step through the GDT algorithm together on a small example to see how it works. The concepts behind the GDT were much clearer to me afterward; while my understanding still wasn't 100% there, the algorithm itself made perfect sense.

October

Now, with a working computer and the GPU installed, I could begin setting up CUDA with MATLAB and Visual C++.

Week 5: 10.2-10.8

Finally, a computer! Unfortunately, the one Nick had intended for me to use was too small for the GPU to fit inside. So, as he'd gotten a new computer, he gave me his old one, with versions of CUDA, MATLAB, and Visual Studio already installed. I set up the computer and my account on it.

Week 6: 10.9-10.15

I began installing all of the necessary programs and setting up MATLAB so that a CUDA program could be compiled properly, using this webpage as a guide for setting up the system before compiling. I encountered identical compile errors for two different .cu files:

 >> system(sprintf('nvcc -I"%s/extern/include" --cuda "square_me.cu" --output-file "square_me.cpp"',matlabroot));
 square_me.cu 
 tmpxft_0000102c_00000000-3_square_me.cudafe1.gpu 
 tmpxft_0000102c_00000000-8_square_me.cudafe2.gpu 
 square_me.cu 
 tmpxft_0000102c_00000000-3_square_me.cudafe1.cpp 
 >> mex -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\x64" -lcufft -lcudart square_me.cpp
 square_me.cpp 
 c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions.h(3200) : 
   error C2169: 'llabs' : intrinsic function, cannot be defined 

 C:\PROGRA~1\MATLAB\R2011A\BIN\MEX.PL: Error: Compile of 'square_me.cpp' failed. 

 ??? Error using ==> mex at 208
 Unable to complete successfully.

 >> system(sprintf('nvcc -I"%s/extern/include" --cuda "Szeta.cu" --output-file "Szeta.cpp"',matlabroot));
 Szeta.cu 
 tmpxft_000012d0_00000000-3_Szeta.cudafe1.gpu 
 tmpxft_000012d0_00000000-8_Szeta.cudafe2.gpu 
 Szeta.cu 
 tmpxft_000012d0_00000000-3_Szeta.cudafe1.cpp 
 >> mex -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\x64" -lcufft -lcudart Szeta.cpp
 Szeta.cpp 
 c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions.h(3200) : 
    error C2169: 'llabs' : intrinsic function, cannot be defined 

 C:\PROGRA~1\MATLAB\R2011A\BIN\MEX.PL: Error: Compile of 'Szeta.cpp' failed. 

 ??? Error using ==> mex at 208
 Unable to complete successfully.

We suspected the error was an issue between CUDA and MATLAB, possibly the result of differing versions. I checked the compatibility of the versions we were using and found possible inconsistencies.

Background information: the system(sprintf('nvcc ...')); step compiles the .cu file into a .cpp file. The mex command then compiles the .cpp file into a .mexw64 file, a MEX-file that MATLAB can call and whose CUDA kernels run on the GPU. I found this document, which explains what nvcc does and how it works.
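
To make the pipeline concrete, here is a minimal sketch of what a file like square_me.cu might contain. This is my own illustration rather than the actual file, and it assumes the program simply squares each element of a double-precision input array:

 // Hypothetical sketch of a square_me.cu MEX-file; the real file used in
 // this project may differ. Squares each element of the input on the GPU.
 #include "mex.h"
 #include <cuda_runtime.h>
 
 __global__ void square_kernel(double *d, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         d[i] *= d[i];                    // each thread squares one element
 }
 
 void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
 {
     int n = (int)mxGetNumberOfElements(prhs[0]);
 
     // Copy the input from MATLAB (host) memory to the GPU (device).
     double *d;
     cudaMalloc((void **)&d, n * sizeof(double));
     cudaMemcpy(d, mxGetPr(prhs[0]), n * sizeof(double), cudaMemcpyHostToDevice);
 
     // Launch one thread per element, in blocks of 256 threads.
     int threads = 256;
     int blocks = (n + threads - 1) / threads;
     square_kernel<<<blocks, threads>>>(d, n);
 
     // Copy the result back into a MATLAB output array.
     plhs[0] = mxDuplicateArray(prhs[0]);
     cudaMemcpy(mxGetPr(plhs[0]), d, n * sizeof(double), cudaMemcpyDeviceToHost);
     cudaFree(d);
 }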

Week 7: 10.16-10.22

I encountered a different type of error when trying to compile this time, even though I hadn't changed anything:

 >> system(sprintf('nvcc -I"%s/extern/include" --cuda "square_me.cu" --output-file "square_me.cpp"',matlabroot));
 square_me.cu 
 tmpxft_000010e4_00000000-3_square_me.cudafe1.gpu 
 tmpxft_000010e4_00000000-8_square_me.cudafe2.gpu 
 square_me.cu 
 tmpxft_000010e4_00000000-3_square_me.cudafe1.cpp 
 >> mex -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\x64" -lcufft -lcudart square_me.cpp
  Creating library C:\Users\JSADLE~1.ITS\AppData\Local\Temp\mex_Xf9VWE\templib.x and object C:\Users
    \JSADLE~1.ITS\AppData\Local\Temp\mex_Xf9VWE\templib.exp 
 square_me.obj : error LNK2001: unresolved external symbol _fltused 
 square_me.obj : error LNK2019: unresolved external symbol pow referenced in function exp10 
 square_me.obj : error LNK2019: unresolved external symbol sin referenced in function sincos 
 square_me.obj : error LNK2019: unresolved external symbol cos referenced in function sincos 
 square_me.obj : error LNK2019: unresolved external symbol ceil referenced in function trunc 
 square_me.obj : error LNK2019: unresolved external symbol floor referenced in function trunc 
 square_me.obj : error LNK2019: unresolved external symbol ldexp referenced in function scalbn 
 square_me.obj : error LNK2019: unresolved external symbol log referenced in function log2 
 square_me.obj : error LNK2019: unresolved external symbol exp referenced in function expm1 
 square_me.obj : error LNK2019: unresolved external symbol sqrt referenced in function acosh 
 square_me.obj : error LNK2019: unresolved external symbol hypot referenced in function hypotf 
 square_me.obj : error LNK2019: unresolved external symbol atexit referenced in function "void 
   __cdecl __sti____cudaRegisterAll_44_tmpxft_000010e4_00000000_6_square_me_cpp1_ii_e53e6457
  (void)" (?__sti____cudaRegisterAll_44_tmpxft_000010e4_00000000_6_square_me_cpp1_ii_e53e6457
   @@YAXXZ) 
 LINK : error LNK2001: unresolved external symbol _DllMainCRTStartup 
 square_me.mexw64 : fatal error LNK1120: 13 unresolved externals 

 C:\PROGRA~1\MATLAB\R2011A\BIN\MEX.PL: Error: Link of 'square_me.mexw64' failed. 

 ??? Error using ==> mex at 208
 Unable to complete successfully.

I found we were running two versions of nearly every program involved in compiling: the computer had CUDA Toolkit v3.2 and v4.0, Visual C++ 2008 [v9.0] and 2010 [v10.0], and Microsoft SDK 7.0 and 7.1. We decided that, before trying to debug the compilation issues, it would be best to simplify to one version of each, just to eliminate conflicting versions as a variable in the debugging process. All of the most recent versions are compatible with the version of MATLAB we're using, 2011a (see this table). I uninstalled all of the old versions of these programs and made sure the installations of the newer versions were up to date and complete.

I also spent a significant amount of time making my first round of updates to this wiki page. All previous entries were reconstructed from my notebook and from memory; from now on, weekly updates should be more detailed and lengthy, since they'll be written in real time and there should be a lot more happening in the coming weeks.

Week 8: 10.23-10.29

This has been a strenuous week in my other classes, so unfortunately I've had less time to work on this than I wanted. Where I last left off, I was finalizing the installation of everything we need to compile a .cu file to a .mexw64 file in MATLAB. In reading up on the Windows SDK, the .NET Framework, and Visual C++, I found this and thought it should be noted. I made sure to get the proper, bug-free service pack for Visual C++ here.

Altogether, here is what we're running: MATLAB 2011a with the Visual C++ 2010 Express compiler, Microsoft SDK 7.1, and the .NET Framework 4.0 SP1 Compiler Update. MATLAB uses NVIDIA's nvcc and Visual C++ to compile a given CUDA program myprog.cu into a C++ file, at which point a call to mex compiles myprog.cpp into myprog.mexw64, which we can run on the GPU.

With the current arrangement, I call nvcc and then mex from within MATLAB as follows:

 >> system(sprintf('nvcc -I"%s/extern/include" --cuda "myprog.cu" --output-file "myprog.cpp"',matlabroot));
 >> mex -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\lib\x64" -lcufft -lcudart myprog.cpp

After going through all of the setup procedures again to get a bug-free nvcc, I think I may have gone back two steps. When I try to run nvcc, I get nvcc fatal : Cannot find compiler 'cl.exe' in PATH. I've made sure that the directory containing cl.exe ('C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin\amd64') is in the Path variable, but still no change.

I read several forums for possible solutions, including here and here. Some of the cases seemed to be resolved by actions I'd already performed, to no avail. I will have to spend more time on this; I'm starting to get frustrated by these errors.

After meeting with Nick, I realized my understanding of the Path variable was sorely lacking. I learned about the role it plays in Unix and Windows, as well as about MATLAB's own path variable for MATLAB programs. Though I was running nvcc from within MATLAB, I was calling it through a system function, so I needed to add the proper directories to the system Path, not MATLAB's.

Once I got that sorted, I moved on to a new error, which revealed an insufficient compiler installation: we did not actually have a 64-bit C++ compiler. Eventually I found (with Nick's direction) an option to install the 64-bit version of the Visual C++ compiler by changing the features available in SDK 7.1. The installation failed on the first two attempts: first due to a security error with a folder, which disappeared after restarting the computer, and second due to a nondescript "Fatal error during installation." I'm learning quite a bit about every single facet of all of these programs and compilers.

After a successful installation of the C++ compiler via SDK 7.1, we were still missing the ever-important vcvars64.bat file necessary for 64-bit compiling. This thread on the Microsoft forums was an interesting read in trying to identify why that might be.

SO THE COMPILER WORKS NOW. What did we do to make it work? Nick renamed vcvars32.bat to vcvars64.bat. That was all it took. I am amazed that the solution to the problem I had been toiling over for an hour was as simple as renaming a file. WOW. Now that we're compiling, we can FINALLY move on! This was a big step in the project... and it was as simple as renaming a file.

November

Week 9: 10.30-11.5

We worked on developing a plan of action for the remainder of the semester, since we had faced long delays getting the compiler up and running. We decided that in the following week I will (try to) code up a parallel integral image in CUDA. The integral image calculation is relevant to the generalized distance transform calculation, so it makes a valuable warm-up exercise for CUDA. We hope to have the integral image coded by the end of week 10, a simple distance transform implementation by the end of week 11, and then perhaps a "block" distance transform implementation by the end of the first week of December, though this is very hopeful thinking.

Week 10: 11.6-11.12

We developed a CUDA program together to compute the integral image of a matrix, e.g.

 >> a 
 a =
    0     1     1     1     1     1
    2     2     2     2     2     2
    3     4     5     6     7     8
    0     0     0     1     0     1
 >> a1 = integral_image(a)
 a1 =
    0     1     2     3     4     5
    2     5     8    11    14    17
    5    12    20    29    39    50
    5    12    20    30    40    52

in a parallelization that employs two kernels: one that computes the cumulative sums of the columns into an intermediate matrix, and one that computes the cumulative sums of the rows of that intermediate matrix. This sets up a great framework for the parallelization of the GDT, and I will start on that immediately.
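
For the record, here is a sketch in the spirit of the two-kernel scheme just described. It's my reconstruction rather than the exact program we wrote, and it assumes row-major storage of double-precision data:

 // Kernel 1: one thread per column, accumulating a running sum down the column.
 __global__ void cumsum_cols(const double *in, double *mid, int rows, int cols)
 {
     int c = blockIdx.x * blockDim.x + threadIdx.x;
     if (c >= cols) return;
     double s = 0.0;
     for (int r = 0; r < rows; r++) {
         s += in[r * cols + c];           // row-major layout assumed
         mid[r * cols + c] = s;
     }
 }
 
 // Kernel 2: one thread per row of the intermediate matrix, accumulating
 // a running sum across the row to produce the integral image.
 __global__ void cumsum_rows(const double *mid, double *out, int rows, int cols)
 {
     int r = blockIdx.x * blockDim.x + threadIdx.x;
     if (r >= rows) return;
     double s = 0.0;
     for (int c = 0; c < cols; c++) {
         s += mid[r * cols + c];
         out[r * cols + c] = s;
     }
 }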

Also, the process of developing the program first by myself and then together with Nick gave me a much stronger conceptual understanding of how a parallelized GDT will work. Not only do I now have a strong programming framework to build on, but also a much stronger mental framework for the problem!

Week 11: 11.13-11.19

I successfully implemented a row-wise kernel for the GDT algorithm. Next week I'll write the column-wise kernel using the row-wise kernel as a model, and then the basic parallelized GDT will be complete!
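
To give a sense of the shape this takes, here is a hypothetical sketch of a row-wise kernel in the style of the lower-envelope algorithm from the GDT paper; my actual kernel may differ in details. Each thread computes the 1-D squared-distance transform of one row of f into d, with per-row scratch arrays v (grid locations of the parabolas in the lower envelope) and z (boundaries between them):

 #define GDT_INF 1e20
 
 // Hypothetical row-wise GDT kernel: one thread per row.
 __global__ void gdt_rows(const double *f, double *d, int *v, double *z,
                          int rows, int cols)
 {
     int r = blockIdx.x * blockDim.x + threadIdx.x;
     if (r >= rows) return;
     const double *fr = f + r * cols;     // this thread's row (row-major assumed)
     int    *vr = v + r * cols;
     double *zr = z + r * (cols + 1);
 
     int k = 0;                           // index of the rightmost parabola
     vr[0] = 0;
     zr[0] = -GDT_INF;
     zr[1] =  GDT_INF;
     for (int q = 1; q < cols; q++) {
         // Intersection of the parabola rooted at q with the rightmost one.
         double s = ((fr[q] + (double)q * q) - (fr[vr[k]] + (double)vr[k] * vr[k]))
                    / (2.0 * q - 2.0 * vr[k]);
         while (s <= zr[k]) {             // pop parabolas hidden by the new one
             k--;
             s = ((fr[q] + (double)q * q) - (fr[vr[k]] + (double)vr[k] * vr[k]))
                 / (2.0 * q - 2.0 * vr[k]);
         }
         k++;
         vr[k] = q;
         zr[k] = s;
         zr[k + 1] = GDT_INF;
     }
 
     k = 0;                               // read the lower envelope back out
     for (int q = 0; q < cols; q++) {
         while (zr[k + 1] < q) k++;
         d[r * cols + q] = (double)(q - vr[k]) * (q - vr[k]) + fr[vr[k]];
     }
 }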

Some thoughts Nick and I had on how we could wrap up the semester:

  • testing efficiency with tic and toc on large matrices to see the effects of
    • memory orientation/allocation methods for certain arrays (e.g. v, z matrices in the algorithm)
    • precomputing values (e.g. computing a=i*n once and using v(a+k) everywhere instead of v(i*n+k) everywhere; see the sketch below)
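
A tiny CUDA illustration of that last idea, with hypothetical names, just to show the pattern:

 // Compute the row offset i*n once per thread instead of recomputing
 // i*n at every array reference inside the loop.
 __global__ void copy_row(const double *v, double *out, int n, int rows)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per row
     if (i >= rows) return;
     int a = i * n;                                   // precomputed offset
     for (int k = 0; k < n; k++)
         out[a + k] = v[a + k];       // instead of v[i * n + k] at every use
 }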

Week 12: 11.20-11.26

I think I did it! I have now successfully implemented the GDT algorithm as a simple parallelized CUDA program!!!

Here's a sample test run below, with the example matrix taken from the work of last year's students:

 >> a = [10 0 10 10; 10 10 10 10; 0 10 1 10]
 a =
   10     0    10    10
   10    10    10    10
    0    10     1    10
 >> a1 = gdt(a)
 a1 =
    1     0     1     4
    1     1     2     3
    0     1     1     2

where a1 is the squared distance transform of a: each entry a1(p) is the minimum over all grid positions q of ||p - q||^2 + a(q), i.e. the squared Euclidean distance to q plus the cost at q.

I still have to go through the program thoroughly and make sure it really does work.

Week 13: 11.27-12.3

Made sure the program does indeed work!

- Thanksgiving break -

December

Week 14: 12.4-12.10

We analyzed and discussed potential issues and roadblocks that could come up in an implementation of a blocked parallelization of the GDT.

Began timing tests of the parallel GDT on the GPU versus the normal GDT on the CPU.
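
We time at the MATLAB level with tic and toc. As a side note, the kernel alone could also be timed with CUDA events; a minimal sketch of that alternative (not what we actually ran):

 // Time a single kernel launch with CUDA events; returns milliseconds.
 #include <cuda_runtime.h>
 
 float time_kernel_ms(void (*launch)(void))
 {
     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     cudaEventRecord(start);
     launch();                                 // the kernel launch under test
     cudaEventRecord(stop);
     cudaEventSynchronize(stop);               // wait for the kernel to finish
     float ms = 0.0f;
     cudaEventElapsedTime(&ms, start, stop);   // elapsed GPU time in ms
     cudaEventDestroy(start);
     cudaEventDestroy(stop);
     return ms;
 }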

Week 15: 12.11-12.17

Continued timing tests. One ran for over 36 hours!!

Week 16: 12.18-12.24

Final work: constructing a poster for Collaborations, writing a self-assessment, and filling out the wiki page.