So here's the sample run with runtimes and the source code.
The one below is run under simple multi-threading implementation
ray:src ray$ time ./multiDownload 1>/dev/null 2>&1
real 0m3.261s
user 0m0.014s
sys 0m0.028s
This one below is run under CUDA-device emulation mode
ray:src ray$ time ./cudaDownload 1>/dev/null 2>&1
real 0m31.893s
user 0m0.022s
sys 0m0.038s
The performance was disappointing while running under device emulation mode under CUDA and I looked into the runtimes to find an explanation as to why. Here are 2 notes i've made ...
1. Under device emulation mode, the CUDA still launched as many threads as implied by the source code(s) which is cool stuff. See gdb output of runtime below. From the output below, you can see that the threads are in the semaphore_wait_trap() which probably explains why the runtimes suck that much.
2. The semaphore_wait_trap() resolves into launching a system call into the Mac OS X kernel which is not surprising as that's how CUDA implement kernels in device emulation mode and that causes most of the latencies since under this mode of execution, threads are executed on the CPUs and not on the GPUs.
(gdb) info threads
33 process 28855 thread 0x5303 0x945b32c2 in semaphore_wait_trap ()
32 process 28855 thread 0x5103 0x945b32c2 in semaphore_wait_trap ()
31 process 28855 thread 0x4f03 0x945b32c2 in semaphore_wait_trap ()
30 process 28855 thread 0x4d03 0x945b32c2 in semaphore_wait_trap ()
29 process 28855 thread 0x4b03 0x945b32c2 in semaphore_wait_trap ()
28 process 28855 thread 0x4903 0x945b32c2 in semaphore_wait_trap ()
27 process 28855 thread 0x4703 0x945b32c2 in semaphore_wait_trap ()
26 process 28855 thread 0x4503 0x945b32c2 in semaphore_wait_trap ()
25 process 28855 thread 0x4303 0x945b32c2 in semaphore_wait_trap ()
24 process 28855 thread 0x4103 0x945b32c2 in semaphore_wait_trap ()
23 process 28855 thread 0x3f03 0x945b32c2 in semaphore_wait_trap ()
22 process 28855 thread 0x3d03 0x945b32c2 in semaphore_wait_trap ()
21 process 28855 thread 0x3b03 0x945b32c2 in semaphore_wait_trap ()
20 process 28855 thread 0x3903 0x945b32c2 in semaphore_wait_trap ()
19 process 28855 thread 0x3703 0x945b32c2 in semaphore_wait_trap ()
18 process 28855 thread 0x3503 0x945b32c2 in semaphore_wait_trap ()
17 process 28855 thread 0x3303 0x945b32c2 in semaphore_wait_trap ()
16 process 28855 thread 0x3103 0x945b32c2 in semaphore_wait_trap ()
15 process 28855 thread 0x2f03 0x945b32c2 in semaphore_wait_trap ()
14 process 28855 thread 0x2d03 0x945b32c2 in semaphore_wait_trap ()
13 process 28855 thread 0x2b03 0x945b32c2 in semaphore_wait_trap ()
12 process 28855 thread 0x2903 0x945b32c2 in semaphore_wait_trap ()
11 process 28855 thread 0x2703 0x945b32c2 in semaphore_wait_trap ()
10 process 28855 thread 0x2503 0x945b32c2 in semaphore_wait_trap ()
9 process 28855 thread 0x2303 0x945b32c2 in semaphore_wait_trap ()
8 process 28855 thread 0x2103 0x945b32c2 in semaphore_wait_trap ()
7 process 28855 thread 0x1f03 0x945b32c2 in semaphore_wait_trap ()
6 process 28855 thread 0x1d03 0x945b32c2 in semaphore_wait_trap ()
5 process 28855 thread 0x1b03 0x945b32c2 in semaphore_wait_trap ()
4 process 28855 thread 0x1903 0x945b32c2 in semaphore_wait_trap ()
3 process 28855 thread 0x1703 0x946026fa in select$DARWIN_EXTSN ()
2 process 28855 thread 0x1503 0x945b32c2 in semaphore_wait_trap ()
* 1 process 28855 local thread 0x2d03 0x945ba46e in __semwait_signal ()
(gdb) disassemble semaphore_wait_trap
Dump of assembler code for function semaphore_wait_trap:
0x945b32b8: mov $0xffffffdc,%eax
0x945b32bd: call 0x945b3ad4 <_sysenter_trap>
0x945b32c2: ret
0x945b32c3: nop
End of assembler dump.
Its expected that i could not cuda my sample application since its not possible to call host function from within kernel function (to borrow CUDA's terminology) so i had to compile and build it under device emulation but what this experiment demonstrated was that CUDA's device emulation mode may not be the answer that i was looking for but it raises my question "Wouldn't it be great if Nvidia could provide the software library to allow kernel functions to call host functions in the CUDA manner? " Perhaps its a work in progress.
A likely candidate for this sort of computing could be in OpenCL (Open Computing Language) and it'll be in the next Mac OS (Snow Leopard) Yay! Read the press release here.
Here are the sources codes i used (this multi-threaded program was lifted from the libcurl website's example code and i merely modified some stuff to fit my experiment)
#include <stdio.h>
#include <pthread.h>
#include <curl/curl.h>
#define NUMT 32
const char* const urls[NUMT] = {
"http://www.yahoo.com",
"http://www.cnn.com",
"http://www.hotmail.com",
"http://www.gmail.com",
"http://www.hp.com",
"http://www.microsoft.com",
"http://www.sun.com",
"http://blogs.sun.com/",
"http://www.acm.org",
"http://blogs.sun.com/d/",
"http://blogs.sun.com/jonathan",
"http://blogs.sun.com/jimgris",
"http://blogs.sun.com/theaquarium",
"http://blogs.sun.com/arungupta",
"http://blogs.sun.com/katakai",
"http://blogs.sun.com/webmink",
"http://blogs.sun.com/startups",
"http://blogs.sun.com/geertjan",
"http://blogs.sun.com/eclectic",
"http://blogs.sun.com/theplanetarium",
"http://blogs.sun.com/SDNProgramNews",
"http://blogs.sun.com/GullFOSS",
"http://blogs.sun.com/richb",
"http://blogs.sun.com/chrisg",
"http://blogs.sun.com/ontherecord",
"http://blogs.sun.com/HPC",
"http://blogs.sun.com/bblfish",
"http://blogs.sun.com/enterprisetechtips",
"http://blogs.sun.com/ahl",
"http://blogs.sun.com/jag",
"http://blogs.sun.com/bigadmin",
"http://blogs.sun.com/brendan"
};
static void *pull_one_url(void* url) {
CURL* curl;
curl = curl_easy_init();
curl_easy_setopt(curl, CURLOPT_URL, url);
curl_easy_perform(curl);
curl_easy_cleanup(curl);
return NULL;
}
int main(int argc, char** argv) {
pthread_t tid[NUMT];
int i;
int error;
curl_global_init(CURL_GLOBAL_ALL);
for( i = 0; i < NUMT; i++)
pthread_create(&tid[i], NULL, pull_one_url, (void*)urls[i]);
for( i = 0; i < NUMT; i++)
pthread_join(tid[i], NULL);
return 0;
}
Here's the portion in CUDA-style (I've shown only the portion where its different)
__global__ void pull_one_url(char** url) {
int tid = threadIdx.x;
CURL* curl;
curl = curl_easy_init();
curl_easy_setopt(curl, CURLOPT_URL, url[tid]);
curl_easy_perform(curl);
curl_easy_cleanup(curl);
printf("%d finished@%s\n", tid, url[tid]);
return ;
}
int main(int argc, char** argv) {
char** d_a;
int memSize=0;
for( int i = 0; i < NUMT; i++)
memSize += strlen(urls[i]);
printf("size=%d\n", memSize);
cudaMalloc((void**)&d_a, memSize);
cudaMemcpy( d_a, urls, memSize, cudaMemcpyHostToDevice );
for( int i = 0; i < NUMT; i++)
printf("%s\n", d_a[i]);
pull_one_url<<<1, NUMT>>>(d_a);
cudaFree(d_a);
return 0;
}
3 comments:
That's a very interesting article. I used the cURL command line all the time. curl -w allows you to break down the download time into various portion.
What were you expecting to gain from using CUDA in this situation? curl's functionality is not computationally intensive at all, and I'm not sure what you were expecting to offload to the GPU.
Hi Ray,
Thanks for the feedback. It really was just my rambling on a experiment of mine and i knew i wasn't going to get any benefit out of it.
Post a Comment