PageRank is new to me. In fact, i've only known about it some time in mid of 2010; talk about coming to a scene pretty late. For those of you whom don't know what PageRank is about, i'll let Wikipedia tell you all about it or as Google tells it "... a technology that determined the 'importance' of a webpage by looking at what other pages link to it, as well as other data." Talk about 'peer' appraisal on the exascale. I do apologize for overloading the terms to the point of abuse.
What i do like about the PageRank is that it takes into account two aspects of human beings when it comes to sharing info:
1) People just love to share information through the internet; thus enabling others to receive that information rapidly;
2) People will eventually tire of sharing probably at the end of a day...before lunch, before a nap etc
I just love the sort of things people do with math, here's another one where mathematicians can describe a city using equations. Ok, that should be enough digression.
Why i did this?
Reason is because i asked myself : Can i model the PageRank algorithm into the GPU so that i can use the super computer under my desk to compute how popular a web page really is?
First of all, take some time to understand the mathematics. Frankly, i'm no math major but it seems pretty alright since the equation is
Trusting the wikipedia *makes the sign of the cross* and confirming that equation with my favourite algorithm texts, i was assured the math is correct. Whew. From the documentation, i understood that this equation needs to be iterated a number of times so that the results can be more reliable. Those folks in HPC shouldn't have much problem I imagine but how much a novice like myself go about doing this?
Modeling the problem ... some ideas ...
Few considerations i can think of the top of my mind:
Results of performance tests and optimizations applied
There are a couple of scenarios in the tests i've ran. To give you an idea what the following sub-sections is going to be about, i'll like to show you the benchmark results off the GT 330M Nvidia processor and GTX 480. One key thing i wanted to point out is that the two devices of different compute capabilities DO make a difference.
Results of test runs on GT 330M Nvidia processor (Compute Capability 1.2)
Having to cut back on the size of my data as it crashed much to my horror.
Moving forward, the tests are modelled based on 4 million web links, the average run times is taken from 10 test runs discarding 1 or 2 runs depending whether the deviation is too wide. The CPU code was compiled to level 2 optimization and GPU code was compiled using the production release of CUDA 4.0 SDK.
Version 1 (straight forward):
The average GPU run time (over ten runs discarding 1) = 7850.5 ms
The average CPU run time (over ten runs discarding 1) = 22187.89 ms
The speed up over the CPU version is = 2.83 times.
Version 2 (using shared memory, loop optimization etc):
The average GPU run time (over ten runs discarding 1) = 3033.8 ms
The average CPU run time (over ten runs discarding 1) = 20412.2 ms
The speed up over the two GPU versions is approximately 2.4 times. The key improvement was to make use of the device's shared memory so that it can be as close as possible to the GPU considering the latency of global memory access is between 400 - 600 cycles.
The discrepancy in the CPU run times is probably due to a Mac OS X system process running in the background and also i was listening to coding music.
Here's a glimpse of the device's configuration i used calculated by CUDA's occupancy calculator (an invaluable tool in your arsenal); see below for the diagram for GTX 480 (CC 2.0)
Results of test runs on GTX 480 Nvidia GPU (Compute Capability 2.0)
Consume all of shared memory on device (no L1), compiled for sm_20, use fast math
The same code used in Version 2 previously is tested here compiled for -Xptxas -dlcm=cg -use_fast_math, -arch=sm_20 and the run times shown here was fine tuned with the help of the visual profiler.
The average GPU run time (over ten runs discarding 1) = 1176.9 ms
The average CPU run time (over ten runs discarding 1) = 20362.6 ms
The speed up over the CPU version is = 17 times
One last ditch effort came in the form of using texture memory, specifically speaking linear texture memories, on the GTX 480. Reason for poor run is because i've only managed to achieve a 3% hit rate in the texture memory. That's really because of scatter reads and writes; hence the REAL problem is actually the kernel has significant scatter read / write - so it doesn't really matter if i used texture memory or not because the problem's still there.
The average GPU run time (over ten runs discarding 1) = 1142.1 ms
The average CPU run time (over ten runs discarding 1) = 20092.3 ms
The speed up over the CPU version is = 17 times
Conclusion
I can't really conclude this at this point in time since the kernel hasn't reached its maximum potential but a couple of things are useful to take note of and possibly worth sharing:
Roadmap ahead
This kernel isn't optimal definitely (see diagrams above which implies low occupancy). I've plans to rewrite the kernel (after hours) and run some more tests - i'll update it in a separate post when i can.
Modeling the problem ... some ideas ...
Few considerations i can think of the top of my mind:
- Mine the Internet (that's crawling the web, really and for ease of programming i've used Python)
- Represent the mined data into a graph (i use the adjacency matrix for starters)
- Decide a damping factor (that's the d in the equation and i found that 0.85 seems to be the accepted value from here and what 0.85 says is that the random surfer has 85% chance of clicking on ANY link and you are free to test different models by adjusting the damping factor)
- Decide a number of iterations you are going to use to compute the probability distribution. From this fella's description in his/her post it appears that 100 iterations should suffice (Did i mention there would be hundreds of millions of pages?? That'll give your CPU a good run for the money you've paid)
Those are just some of the stuff i've thought about, definitely not exhaustive.
I'm assuming you know how to do a web crawling in multi-threading/processing form (note that the liberal use of 'thread' and 'process' differs from 1 programming language to another)
Now, to the implementation and benchmarks!!! Btw, the simulations i have is in the order of hundreds of millions of web links (some cyclic, some not) and running it on my mac book pro didn't work well as it crashed and burned as i didn't have enough RAM and for best results i've used by GTX 480 FERMI 2.0
Results of performance tests and optimizations applied
There are a couple of scenarios in the tests i've ran. To give you an idea what the following sub-sections is going to be about, i'll like to show you the benchmark results off the GT 330M Nvidia processor and GTX 480. One key thing i wanted to point out is that the two devices of different compute capabilities DO make a difference.
Results of test runs on GT 330M Nvidia processor (Compute Capability 1.2)
Having to cut back on the size of my data as it crashed much to my horror.
Moving forward, the tests are modelled based on 4 million web links, the average run times is taken from 10 test runs discarding 1 or 2 runs depending whether the deviation is too wide. The CPU code was compiled to level 2 optimization and GPU code was compiled using the production release of CUDA 4.0 SDK.
Version 1 (straight forward):
The average GPU run time (over ten runs discarding 1) = 7850.5 ms
The average CPU run time (over ten runs discarding 1) = 22187.89 ms
The speed up over the CPU version is = 2.83 times.
Version 2 (using shared memory, loop optimization etc):
The average GPU run time (over ten runs discarding 1) = 3033.8 ms
The average CPU run time (over ten runs discarding 1) = 20412.2 ms
The speed up over the CPU version is = 6.7 times
The speed up over the two GPU versions is approximately 2.4 times. The key improvement was to make use of the device's shared memory so that it can be as close as possible to the GPU considering the latency of global memory access is between 400 - 600 cycles.
The discrepancy in the CPU run times is probably due to a Mac OS X system process running in the background and also i was listening to coding music.
Here's a glimpse of the device's configuration i used calculated by CUDA's occupancy calculator (an invaluable tool in your arsenal); see below for the diagram for GTX 480 (CC 2.0)
Results of test runs on GTX 480 Nvidia GPU (Compute Capability 2.0)
Consume all of shared memory on device (no L1), compiled for sm_20, use fast math
The same code used in Version 2 previously is tested here compiled for -Xptxas -dlcm=cg -use_fast_math, -arch=sm_20 and the run times shown here was fine tuned with the help of the visual profiler.
The average GPU run time (over ten runs discarding 1) = 1176.9 ms
The average CPU run time (over ten runs discarding 1) = 20362.6 ms
The speed up over the CPU version is = 17 times
Consume all of shared memory on device (no L1), compiled for sm_20, use fast math, texture memory
One last ditch effort came in the form of using texture memory, specifically speaking linear texture memories, on the GTX 480. Reason for poor run is because i've only managed to achieve a 3% hit rate in the texture memory. That's really because of scatter reads and writes; hence the REAL problem is actually the kernel has significant scatter read / write - so it doesn't really matter if i used texture memory or not because the problem's still there.
The average GPU run time (over ten runs discarding 1) = 1142.1 ms
The average CPU run time (over ten runs discarding 1) = 20092.3 ms
The speed up over the CPU version is = 17 times
Conclusion
I can't really conclude this at this point in time since the kernel hasn't reached its maximum potential but a couple of things are useful to take note of and possibly worth sharing:
- Choosing a graphics card that supports Fermi or later should be a good bet i.e. compute capability 2.x
- Attempt to reduce global memory bandwidth by placing data as close as possible to the processing units e.g. shared memory, L1/L2 caches, Texture memories, constant memories.
- Remember to compile your CUDA kernel code for architectures that best match your graphics card e.g. -arch=sm_12 -arch=sm_13 -arch=sm_20 since the default would be sm_10
- Experiment with different ways to implement your kernel. Always think about improving memory access.
- If you are determined to get the best performing code, then i'd recommend you get really familiar with the visual profiler and understand the metrics that's being measured and how each strategy you applied worked best.
Roadmap ahead
This kernel isn't optimal definitely (see diagrams above which implies low occupancy). I've plans to rewrite the kernel (after hours) and run some more tests - i'll update it in a separate post when i can.



2 comments:
How many nodes did your graph had?
Anon - ran the test with 4 million nodes. it isn't a lot but it was the lowest denominator i can get on both GT330M and GTX480 with my implementation. It should slightly more but i had a couple of 1D arrays storing graph nodes info, results.
Post a Comment