Wednesday, February 2, 2011

Exploring CUDA via the Euclidean Distance

So this post is really inspired after reading Chapter 1 of Toby Segaran's book "Programming Collective Intelligence". In that chapter, Toby talked about making recommendations based on the ideas behind the Euclidean Distance and Pearson Coefficient.

I thought of doing a simple exercise and see stuff works out in CUDA. Here's a sample implementation of the Euclidean Distance and conducted benchmarking both on a Nvidia GTX 480 and GT 330M.

The implementation is pretty straight forward since the block that is suitable for massively parallelism is easily identifiable and here's how. Using the wikipedia's article on Euclidean distance, we focus the attention on this equation:
Reading the equation from left-to-right or from right-to-left, you can tell that its basically performing computations on two 1-dimensional arrays, namely P and Q aka Euclidean 2-space. From there, you can easily tell that CUDA's technology is made for these forms of computations and best of all - you'll probably never encounter the problem of divergent computation and you can tap into the raw computing power of your GPU since the GPU's constantly busy executing as occupancy's close to 1! that's pretty cool in my dictionary :)

One implementation that's pretty straightforward to implement is to model these 2 spaces into 2 different input arrays i've named P and Q, coincidentally. If you have read the documentation or books in CUDA, you would know the next few steps is to allocate memory on the device and design your GPU kernel to run these computations. The pseudo algorithm for the GPU kernel is something like this:
1. Load input arrays, 'P' & 'Q', into shared memory
2. For each input element in 'P' & 'Q' in the shared memory, load them and  
   perform the computation of raising the result (i.e. taking the difference
   of 'P' & 'Q') to the power of 2
3. Store the previous result to the device memory of the GPU
. You'll notice that this GPU kernel's not taking the square-root of the result in this kernel - it'll be done in the typical C setting i.e. computing in the host and not on the GPU and the reason is due to the mathematical nature of performing square-roots since sqrt( a + b ) != sqrt(a) + sqrt(b)

One gotcha you need to be aware of is that on devices such as those with compute capability 1.2 and lower (e.g. yours truly's mac book pro running GT 330M) the double-precision FP computation is downgraded to single-precision FP. You'll notice a warning message like "warning : Double is not supported. Demoting to float"

Next question: How do i/you do all this with Thrust?

Thrust is the library you should probably use when it comes to developing stuff with CUDA. It just makes things easier. Here's one implementation with the CUDA Thrust that i've made - one thing you need to take note is that if you conduct this test on a machine that has compute capability of 1.x then you'll get the warning message i've highlighted above; otherwise it should work well on device's with capability of 2.x
1 #include <thrust/host_vector.h>
  2 #if __CUDA_ARCH__ == 200 
  3  #include <thrust/device_vector.h>
  4 #elif __CUDA_ARCH__ == 100
  5  #include <thrust/device_ptr.h>
  6 #endif
  7 #include <thrust/generate.h>
  8 #include <thrust/reduce.h>
  9 #include <thrust/transform.h>
 10 #include <thrust/functional.h>
 11 #include <cstdlib>
 12 #include <iostream>
 13 
 14 struct powfunctor
 15 {
 16     __host__ __device__
 17     float operator()(const float& p, const float& q) const {
 18         return pow( p - q, 2);
 19     }
 20 };
 21 
 22 template <typename inputiterator1,typename inputiterator2>
 23 double computeGold(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, InputIterator2 last2)
 24 {
 25     float sum = 0.0;
 26     for(; (first1 != last1) && (first2 != last2); ++first1, ++first2)
 27         sum += pow(*first1 - *first2, 2);
 28 
 29     float s1 = sqrt(sum);
 30     std::cout << "Gold=" << s1 << std::endl;
 31 
 32     return s1;
 33 }
 34 
 35 int main(void)
 36 {
 37 #if __CUDA_ARCH__ == 200
 38     thrust::device_vector<float> p_vec(1 << 20);
 39     thrust::device_vector<float> q_vec(1 << 20);
 40     thrust::device_vector<float> r_vec(1 << 20);
 41     thrust::generate(p_vec.begin(), p_vec.end(), rand);
 42     thrust::generate(q_vec.begin(), q_vec.end(), rand);
 43     // Current Thrust's transformations supports 2 input vectors, so we use it
 44     thrust::transform(p_vec.begin(), p_vec.end(), q_vec.begin(), r_vec.begin(), powfunctor());
 45 
 46     int sum = thrust::reduce(r_vec.begin(), r_vec.end(), (int)0, thrust::plus<float>());
 47     std::cout << "sqrt(" << sum  << ")=" << sqrt(sum) << std::endl;
 48 #else
 49
50 
 51     unsigned int  N = 1 << 20;
 52     thrust::host_vector<float> p_vec(N);
 53     thrust::host_vector<float> q_vec(N);
 54     thrust::host_vector<float> r_vec(N);
 55     srand(0);
56     thrust::generate(p_vec.begin(), p_vec.end(), rand);
 57     thrust::generate(q_vec.begin(), q_vec.end(), rand);
 58 
 59     double referenceSoln = computeGold(p_vec.begin(), p_vec.end(), q_vec.begin(), q_vec.end());
 60 
 61     // device memory 'raw' pointers
 62     float* raw_ptr_P;
 63     float* raw_ptr_Q;
 64     float* raw_ptr_R;
 65 
 66     cudaMalloc( (void**)&raw_ptr_P, (N)*sizeof(float));
 67     cudaMalloc( (void**)&raw_ptr_Q, (N)*sizeof(float));
 68     cudaMalloc( (void**)&raw_ptr_R, (N)*sizeof(float));
 69 
 70     thrust::device_ptr<float> dev_ptr_P(raw_ptr_P);
 71     thrust::device_ptr<float> dev_ptr_Q(raw_ptr_Q);
 72     thrust::device_ptr<float> dev_ptr_R(raw_ptr_R);
 73 
 74     thrust::copy(p_vec.begin(), p_vec.end(), dev_ptr_P);
 75     thrust::copy(q_vec.begin(), q_vec.end(), dev_ptr_Q);
 76 
 77     // uncommenting the following will produce errors for 1.x devices 
 78     // complaining that CUDA doesn't support function pointers and function 
 79     // templates. reason is because a host function like 'rand' cannot be 
 80     // executed in the device i.e. GPU
 81     //thrust::generate(dev_ptr_P, dev_ptr_Q + N, rand);
 82     //thrust::generate(dev_ptr_Q, dev_ptr_Q + N, rand);
 83 
 84     thrust::transform(dev_ptr_P, dev_ptr_P + N, dev_ptr_Q, dev_ptr_R, powfunctor());
 85 
 86     float sum = thrust::reduce(dev_ptr_R, dev_ptr_R + N, (float)0, thrust::plus<float>());
 87     std::cout << "1. GPU " << sqrt(sum) << std::endl;
 88     std::cout << "2. CPU " << referenceSoln << std::endl;
 89 #endif
 90 
 91     std::cout << "END" << std::endl;
 92     return 0;
 93 }
 94 
Before i go off, i leave with you Rich Hickey's latest talk on Hammock-drive Development. A really cool video and i strongly recommend that you catch all 40 mins of it. This is a great video to start thinking about improving self-thinking in 2011 
Would this make computing say recommendation algorithms faster? By the pure fact that you're leveraging the GPU the answer is a quick "Yes" but the actual thing i wanted to show in this post is that there are always at least more than 1 way to get something done which is simply "freedom".

Have fun!


2 comments:

Bay said...

Wow! Very interesting blog you had here! Wish to connect with you on Google Friend Connect! Cheers! =) (http://whimsicalss.blogspot.com/)

Raymond Tay said...

Thanks Bay! Appreciate that!