I thought of doing a simple exercise and see stuff works out in CUDA. Here's a sample implementation of the Euclidean Distance and conducted benchmarking both on a Nvidia GTX 480 and GT 330M.
The implementation is pretty straight forward since the block that is suitable for massively parallelism is easily identifiable and here's how. Using the wikipedia's article on Euclidean distance, we focus the attention on this equation:
Reading the equation from left-to-right or from right-to-left, you can tell that its basically performing computations on two 1-dimensional arrays, namely P and Q aka Euclidean 2-space. From there, you can easily tell that CUDA's technology is made for these forms of computations and best of all - you'll probably never encounter the problem of divergent computation and you can tap into the raw computing power of your GPU since the GPU's constantly busy executing as occupancy's close to 1! that's pretty cool in my dictionary :)
One implementation that's pretty straightforward to implement is to model these 2 spaces into 2 different input arrays i've named P and Q, coincidentally. If you have read the documentation or books in CUDA, you would know the next few steps is to allocate memory on the device and design your GPU kernel to run these computations. The pseudo algorithm for the GPU kernel is something like this:
1. Load input arrays, 'P' & 'Q', into shared memory 2. For each input element in 'P' & 'Q' in the shared memory, load them and perform the computation of raising the result (i.e. taking the difference of 'P' & 'Q') to the power of 2 3. Store the previous result to the device memory of the GPU. You'll notice that this GPU kernel's not taking the square-root of the result in this kernel - it'll be done in the typical C setting i.e. computing in the host and not on the GPU and the reason is due to the mathematical nature of performing square-roots since sqrt( a + b ) != sqrt(a) + sqrt(b)
One gotcha you need to be aware of is that on devices such as those with compute capability 1.2 and lower (e.g. yours truly's mac book pro running GT 330M) the double-precision FP computation is downgraded to single-precision FP. You'll notice a warning message like "warning : Double is not supported. Demoting to float"
Next question: How do i/you do all this with Thrust?
Thrust is the library you should probably use when it comes to developing stuff with CUDA. It just makes things easier. Here's one implementation with the CUDA Thrust that i've made - one thing you need to take note is that if you conduct this test on a machine that has compute capability of 1.x then you'll get the warning message i've highlighted above; otherwise it should work well on device's with capability of 2.x
1 #include <thrust/host_vector.h>
2 #if __CUDA_ARCH__ == 200
3 #include <thrust/device_vector.h>
4 #elif __CUDA_ARCH__ == 100
5 #include <thrust/device_ptr.h>
6 #endif
7 #include <thrust/generate.h>
8 #include <thrust/reduce.h>
9 #include <thrust/transform.h>
10 #include <thrust/functional.h>
11 #include <cstdlib>
12 #include <iostream>
13
14 struct powfunctor
15 {
16 __host__ __device__
17 float operator()(const float& p, const float& q) const {
18 return pow( p - q, 2);
19 }
20 };
21
22 template <typename inputiterator1,typename inputiterator2>
23 double computeGold(InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, InputIterator2 last2)
24 {
25 float sum = 0.0;
26 for(; (first1 != last1) && (first2 != last2); ++first1, ++first2)
27 sum += pow(*first1 - *first2, 2);
28
29 float s1 = sqrt(sum);
30 std::cout << "Gold=" << s1 << std::endl;
31
32 return s1;
33 }
34
35 int main(void)
36 {
37 #if __CUDA_ARCH__ == 200
38 thrust::device_vector<float> p_vec(1 << 20);
39 thrust::device_vector<float> q_vec(1 << 20);
40 thrust::device_vector<float> r_vec(1 << 20);
41 thrust::generate(p_vec.begin(), p_vec.end(), rand);
42 thrust::generate(q_vec.begin(), q_vec.end(), rand);
43 // Current Thrust's transformations supports 2 input vectors, so we use it
44 thrust::transform(p_vec.begin(), p_vec.end(), q_vec.begin(), r_vec.begin(), powfunctor());
45
46 int sum = thrust::reduce(r_vec.begin(), r_vec.end(), (int)0, thrust::plus<float>());
47 std::cout << "sqrt(" << sum << ")=" << sqrt(sum) << std::endl;
48 #else
49
50
51 unsigned int N = 1 << 20;
52 thrust::host_vector<float> p_vec(N);
53 thrust::host_vector<float> q_vec(N);
54 thrust::host_vector<float> r_vec(N);
55 srand(0);56 thrust::generate(p_vec.begin(), p_vec.end(), rand); 57 thrust::generate(q_vec.begin(), q_vec.end(), rand); 58 59 double referenceSoln = computeGold(p_vec.begin(), p_vec.end(), q_vec.begin(), q_vec.end()); 60 61 // device memory 'raw' pointers 62 float* raw_ptr_P; 63 float* raw_ptr_Q; 64 float* raw_ptr_R; 65 66 cudaMalloc( (void**)&raw_ptr_P, (N)*sizeof(float)); 67 cudaMalloc( (void**)&raw_ptr_Q, (N)*sizeof(float)); 68 cudaMalloc( (void**)&raw_ptr_R, (N)*sizeof(float)); 69 70 thrust::device_ptr<float> dev_ptr_P(raw_ptr_P); 71 thrust::device_ptr<float> dev_ptr_Q(raw_ptr_Q); 72 thrust::device_ptr<float> dev_ptr_R(raw_ptr_R); 73 74 thrust::copy(p_vec.begin(), p_vec.end(), dev_ptr_P); 75 thrust::copy(q_vec.begin(), q_vec.end(), dev_ptr_Q); 76 77 // uncommenting the following will produce errors for 1.x devices 78 // complaining that CUDA doesn't support function pointers and function 79 // templates. reason is because a host function like 'rand' cannot be 80 // executed in the device i.e. GPU 81 //thrust::generate(dev_ptr_P, dev_ptr_Q + N, rand); 82 //thrust::generate(dev_ptr_Q, dev_ptr_Q + N, rand); 83 84 thrust::transform(dev_ptr_P, dev_ptr_P + N, dev_ptr_Q, dev_ptr_R, powfunctor()); 85 86 float sum = thrust::reduce(dev_ptr_R, dev_ptr_R + N, (float)0, thrust::plus<float>()); 87 std::cout << "1. GPU " << sqrt(sum) << std::endl; 88 std::cout << "2. CPU " << referenceSoln << std::endl; 89 #endif 90 91 std::cout << "END" << std::endl; 92 return 0; 93 } 94Before i go off, i leave with you Rich Hickey's latest talk on Hammock-drive Development. A really cool video and i strongly recommend that you catch all 40 mins of it. This is a great video to start thinking about improving self-thinking in 2011
Would this make computing say recommendation algorithms faster? By the pure fact that you're leveraging the GPU the answer is a quick "Yes" but the actual thing i wanted to show in this post is that there are always at least more than 1 way to get something done which is simply "freedom".
Have fun!
Tweet
2 comments:
Wow! Very interesting blog you had here! Wish to connect with you on Google Friend Connect! Cheers! =) (http://whimsicalss.blogspot.com/)
Thanks Bay! Appreciate that!
Post a Comment