Monday, November 2, 2009

Tsung - Load Testing Tool in Erlang

You have probably heard of Tsung, a load testing tool. I've had the opportunity to learn and use it in my current job, and it's really versatile for load testing web-based applications. It has many features that I've come to appreciate after spending a couple of years with Mercury Interactive's tools (the company is defunct now, after being bought by HP in 2006). Admittedly, Mercury's solutions were more versatile than what Tsung can offer, but considering that the web is the common platform on which many applications are hosted, I think it's a safe bet :)

So, IMO:
  1. Tsung is good for defining load testing scenarios and executing them locally or in a distributed manner.
  2. Load testing can be relatively lightweight (Erlang processes are hitting the target app), so more concurrent users can be simulated on one machine and there is less need for heavyweight load-generating machines, which should reduce cost. (I'll see whether this assumption is correct.)

Good starting points (URLs of interest):

  1. http://tsung.erlang-projects.org/
  2. http://www.erlang.org
A load testing tool would be useless if it couldn't capture server response data (e.g. check whether an expected string is returned), reuse it later (e.g. a session id returned by the server that you then put into a URL-rewritten HTTP GET/POST), generate dynamic data, and read data from an external file (commonly used for storing <username, password> pairs).

So my example illustrates logging in to a website (e.g. http://projecteuler.net) and logging out again - simple enough to make my point.
First I recorded the series of requests made while logging in to and out of the website; this is captured in the tsung_recorder_timestamp.xml file.
I then took that XML file and added a few other things so that it looks like what I have for you below (this is a basic load test scenario).

<?xml version="1.0"?>
<!DOCTYPE tsung SYSTEM "/usr/local/share/tsung/tsung-1.0.dtd">

<tsung loglevel="info" dumptraffic="false" version="1.0">

<clients>
<client host="localhost" use_controller_vm="true"/>
</clients>
<servers>
<server host="78.110.165.8" port="80" type="tcp"></server>
</servers>

<load>
<arrivalphase phase="1" duration="2" unit="minute">
<users interarrival="1" unit="minute"></users>
</arrivalphase>
<user session="rec20091102-01:30" start_time="0" unit="second"></user>
</load>

<options>
<option name="file_server" value="/tmp/userlist.csv"></option>
</options>

<sessions>
<session name='rec20091102-01:30' probability='100' type='ts_http'>
<request><http url='http://projecteuler.net/' version='1.1' method='GET'></http></request>
<request><http url='/style_main.css' version='1.1' if_modified_since='Sat, 29 Nov 2008 20:04:00 GMT' method='GET'></http></request>
<request><http url='/images/logo.jpg' version='1.1' if_modified_since='Thu, 28 Dec 2006 14:12:43 GMT' method='GET'></http></request>
<request><http url='/images/icon_register.png' version='1.1' if_modified_since='Fri, 31 Dec 2004 12:48:16 GMT' method='GET'></http></request>
<request><http url='/images/icon_about.png' version='1.1' if_modified_since='Fri, 31 Dec 2004 12:48:28 GMT' method='GET'></http></request>
<request><http url='/images/icon_problems.png' version='1.1' if_modified_since='Fri, 31 Dec 2004 12:48:26 GMT' method='GET'></http></request>
<request><http url='/images/icon_login.png' version='1.1' if_modified_since='Fri, 31 Dec 2004 12:48:32 GMT' method='GET'></http></request>
<request><http url='http://projecteuler.net/images/corner_tr.gif' version='1.1' if_modified_since='Thu, 10 Apr 2008 19:35:02 GMT' method='GET'></http></request>
<request><http url='/images/corner_tl.gif' version='1.1' if_modified_since='Thu, 10 Apr 2008 19:34:41 GMT' method='GET'></http></request>
<request><http url='/images/corner_br.gif' version='1.1' if_modified_since='Thu, 10 Apr 2008 19:35:10 GMT' method='GET'></http></request>
<request><http url='/images/corner_bl.gif' version='1.1' if_modified_since='Thu, 10 Apr 2008 19:34:55 GMT' method='GET'></http></request>
<request><http url='/images/euler_main.jpg' version='1.1' if_modified_since='Mon, 21 Jan 2002 19:18:20 GMT' method='GET'></http></request>

<thinktime random='true' value='2'/>

<request><http url='http://projecteuler.net/index.php?section=login' version='1.1' method='GET'></http></request>

<thinktime random='true' value='6'/>

<request subst="true">
<match do="continue" when="match">Logged in as %%readcsv:getUsername%%</match>
<http url='/index.php' version='1.1' contents='%%readcsv:getUserString%%' content_type='application/x-www-form-urlencoded' method='POST'></http>
</request>

<thinktime random='true' value='4'/>

<request><http url='/images/icon_tick.png' version='1.1' method='GET'></http></request>

<thinktime random='true' value='4'/>

<thinktime random='true' value='3'/>

<request><http url='http://projecteuler.net/index.php?section=logout' version='1.1' method='GET'></http></request>
</session>
</sessions>

</tsung>


The main thing to note is the use of dynamic substitution (e.g. %%readcsv:getUsername%%): I wrote a simple Erlang module that reads usernames and passwords from a file (see the file_server option in the XML above), so that each simulated user logs in with a valid user id and password.

Next, I check that the server response contains the string "Logged in as XXX", where XXX is generated dynamically by the getUsername function (check out the Erlang code for the function; it's simple stuff).

The erlang program is shown below.

-module(readcsv).
-export([getUserString/1, getUsername/1]).

%% Build the form-encoded login body from the next "username,password" line.
getUserString({_Pid, _DynVar}) ->
    {ok, Line} = ts_file_server:get_next_line(),
    [Uid, Pwd] = string:tokens(Line, ","),
    "username=" ++ Uid ++ "&password=" ++ Pwd ++ "&login=Login".

%% Return only the username from the next "username,password" line.
getUsername({_Pid, _DynVar}) ->
    {ok, Line} = ts_file_server:get_next_line(),
    [Uid, _] = string:tokens(Line, ","),
    Uid.
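
For readers who don't read Erlang, here is a rough Python 3 sketch of the same string handling. It is only an illustration and not something Tsung can load: the function names and the in-memory sample lines are my own assumptions, standing in for the file served by ts_file_server.

```python
# Hypothetical stand-in for the readcsv module: plain Python, not usable by Tsung.

def parse_user_line(line):
    """Split one 'username,password' line into its two fields."""
    uid, pwd = line.strip().split(",")
    return uid, pwd

def login_body(uid, pwd):
    """Form-encoded POST body, built the same way as getUserString."""
    return "username=" + uid + "&password=" + pwd + "&login=Login"

def expected_banner(uid):
    """Text the scenario expects to find in the response after logging in."""
    return "Logged in as " + uid

if __name__ == "__main__":
    # Made-up sample data; the real test reads /tmp/userlist.csv.
    for line in ["alice,secret1", "bob,secret2"]:
        uid, pwd = parse_user_line(line)
        print(login_body(uid, pwd), "->", expected_banner(uid))
```

Note that this sketch parses each line once and derives both strings from it.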

The Erlang code above must be compiled with erlc, and the resulting .beam file placed in Tsung's ebin directory with a command like this:

sudo mv readcsv.beam /usr/local/lib/erlang/lib/tsung-1.3.1/ebin/

Now, running the load test should be alright.

Note: In this load test, I defined a duration of 2 minutes with 2 users, since load testing with 800 gazillion users is considered a chargeable offense, so DON'T DO IT.



Have fun!

Saturday, October 24, 2009

Simple problem solving using Scala

If you like solving mathematical problems, I suggest you sign up at Project Euler (thanks to my good friend Chi Hung for getting me hooked!) and use your favourite programming language or languages to solve them (there is more than one way to accomplish a goal).

So here's my take on it using Scala. As an example, I'm using problem 4:

scala> var largest = 1
largest: Int = 1

scala> for(i <- 1 until 1000) for( j <- i until 1000) {
| val v = i*j
| if ((v.toString.reverse.mkString) == (v.toString)) {
| if (v > largest) largest = v
| }
| }

scala> largest
res5: Int = 906609

This is not efficient: since I'm looking for the largest palindrome made from the product of two 3-digit numbers, I should search from the top down and could probably cut the ranges to about half. Anyway, I'd like to show you how this simple loop can be refactored into a more Scala-like style.


scala> var largest = 1
largest:Int = 1

scala> def isPalindrome(s:Any):(Boolean,Any) = {
| if (s.toString.reverse.mkString == s.toString) (true,s)
| else (false,s)
| }
isPalindrome: (Any)(Boolean, Any)
scala> (1 to 100).map(i => (1 to 100).map(j => if ( isPalindrome(i*j)._1 && (i*j) > largest) largest = (i*j)))
scala> largest
res12: Int = 9009
scala> (1 to 1000).map(i => (1 to 1000).map(j => if ( isPalindrome(i*j)._1 && (i*j) > largest) largest = (i*j)))
res13: RandomAccessSeq.Projection[RandomAccessSeq.Projection[Unit]] = RangeM(Ran
geM((), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (),
(), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (),
(), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (),
(), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (), (),...
scala> largest
res14: Int = 906609

Ignore the output for res13 (it is just the nested sequence of Unit values that map returns). This version ran pretty fast for me too, and I like it much better than the former. It could be faster, but for now I just want to show you how to solve the problem.
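
As a sketch of the "it can be faster" idea, here is one way to search from the top down and stop early. It is written in Python 3 rather than Scala, and the function names and the pruning strategy are my own assumptions, not part of the REPL session above.

```python
def is_palindrome(n):
    """True if the decimal digits of n read the same in both directions."""
    s = str(n)
    return s == s[::-1]

def largest_palindrome_product(lo, hi):
    """Largest palindrome i*j with lo <= i <= j <= hi, or 0 if there is none."""
    best = 0
    for i in range(hi, lo - 1, -1):
        if i * hi <= best:
            break  # no product involving this i (or any smaller i) can beat best
        for j in range(hi, i - 1, -1):
            product = i * j
            if product <= best:
                break  # products only get smaller as j decreases
            if is_palindrome(product):
                best = product
    return best

if __name__ == "__main__":
    print(largest_palindrome_product(10, 99))    # 9009
    print(largest_palindrome_product(100, 999))  # 906609
```

Because both loops count downwards, the inner loop can stop as soon as the product falls below the best palindrome found so far, which skips most of the one million pairs the map-based version visits.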

Feedback is appreciated!

Thursday, September 17, 2009

Scala - I like it!


Have you heard of Scala? It's a new programming language designed by Martin Odersky and I'm exploring it as I write this.


The motivation is this:
I'm working in an environment where Java is not the mainstream language, and I don't want to lose my accumulated Java programming experience. I want to find a new way (think functional + imperative) to use it and create apps that scale efficiently on multi-core machines.

Those who have followed my other blog on Erlang can pretty much tell that I like functional programming, and here's a language that combines both OOP + FP. It's a relatively young language compared to Ruby, Python, Erlang ... but I think it's got great potential. Here are two organizations that are using it *drum rolls*









Here's a class hierarchy diagram in Scala:

Monday, August 3, 2009

Job openings at Linden Lab

The company I'm working for (Linden Lab) has a great list of open positions, and I would like to invite individuals who are interested in and passionate about Virtual Worlds and Second Life to go check it out :D ---> http://lindenlab.com/employment

If you are good at software development (that means programming) and/or have QA experience, feel free to drop me a mail at tay_boon_leong@yahoo.com.sg

Wednesday, July 15, 2009

Should free software depend on Mono or C#?

Here's a fantastic read on why free software should never depend on Mono or C#, authored by one of the favourite authors of operating systems, Mr. Richard Stallman. Click here for the full story.

Debian's decision to include Mono in its principal way of installing GNOME, for the sake of Tomboy which is an application written in C#, leads the community in a risky direction. It is dangerous to depend on C#, so we need to discourage its use.
...
We should systematically arrange to depend on the free C# implementations as little as possible. In other words, we should discourage people from writing programs in C#. Therefore, we should not include C# implementations in the default installation of GNU/Linux distributions or in their principal ways of installing GNOME, and we should distribute and recommend non-C# applications rather than comparable C# applications whenever possible.

Draw your own conclusions on this, but personally I've never been a fan of C# ... not now and, to be quite frank, possibly never.

Sunday, July 12, 2009

OpenMP - 'firstprivate' and 'lastprivate' caveat

I've just started experimenting with OpenMP. It's another super cool multicore development paradigm; compared to NVIDIA's CUDA, OpenMP concentrates on utilizing the actual CPU cores, while CUDA uses the large number of cores on NVIDIA's GPUs.

To begin, let me demonstrate what 'firstprivate' and 'lastprivate' do.

#include <stdio.h>
#include <omp.h>

extern int n; // defined elsewhere in the program

void demo_firstprivate(void) {
    int i, indx, TID;
    int a[n];
    for (i = 0; i < n; i++)
        a[i] = -i - 1;
    indx = 4;
    int n1 = 1;
    // Each thread gets its own copy of indx, initialized to the master's value (4).
    #pragma omp parallel default(none) firstprivate(indx) private(i, TID) shared(n1, a)
    {
        TID = omp_get_thread_num();
        indx += n1 * TID;
        for (i = indx; i < indx + n1; i++)
            a[i] = TID + 1;
    } // end of parallel region

    printf("After the parallel region:\n");
    for (i = 0; i < n; i++)
        printf("a[%d] = %d\n", i, a[i]);
}

The output is

Demo 'firstprivate' clause begin..
After the parallel region:
a[0] = -1
a[1] = -2
a[2] = -3
a[3] = -4
a[4] = 1
a[5] = 2
a[6] = -7
a[7] = -8
a[8] = -9
a[9] = -10
Demo 'firstprivate' clause end..

There are two things you need to realize when using firstprivate:
(1) Each thread gets its own copy of the firstprivate variable, initialized once per thread from the value the variable had before the parallel region.
(2) In C++, each thread's firstprivate object is constructed by calling its copy constructor with the master thread's copy of the variable as the argument.
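
If you don't have an OpenMP compiler at hand, the idea can be imitated with plain threads. The following Python 3 sketch is only an analogy under my own assumptions (two threads, n = 10, and a function argument playing the role of the per-thread copy); it is not OpenMP.

```python
import threading

def demo_firstprivate(n=10, num_threads=2, n1=1, indx=4):
    """Imitate the firstprivate demo: every thread starts from indx == 4."""
    a = [-i - 1 for i in range(n)]  # shared by all threads, like shared(a)

    def region(tid, indx):
        # indx is this thread's own copy, initialized from the caller's value,
        # which is the role firstprivate(indx) plays in the C code.
        indx += n1 * tid
        for i in range(indx, indx + n1):
            a[i] = tid + 1

    threads = [threading.Thread(target=region, args=(tid, indx))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a

if __name__ == "__main__":
    print(demo_firstprivate())  # [-1, -2, -3, -4, 1, 2, -7, -8, -9, -10]
```

With two threads, thread 0 writes a[4] and thread 1 writes a[5], which matches the output of the C demo.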

The following code demonstrates lastprivate:

#include <stdio.h>
#include <omp.h>

extern int n;

void demo_lastprivate(void) {
    int a = 0;
    int i;
    // After the loop, a holds the value assigned in the sequentially last iteration.
    #pragma omp parallel for private(i) lastprivate(a)
    for (i = 0; i < n; i++) {
        a = i + 1;
        printf("Thread:%d got value: %d, iteration: %d\n", omp_get_thread_num(), a, i);
    }

    // End of parallel region
    printf("value of 'lastprivate' variable 'a' is %d\n", a);
}

The output is

Demo 'lastprivate' clause begin..
Thread:1 got value: 6, iteration: 5
Thread:0 got value: 1, iteration: 0
Thread:1 got value: 7, iteration: 6
Thread:0 got value: 2, iteration: 1
Thread:1 got value: 8, iteration: 7
Thread:0 got value: 3, iteration: 2
Thread:1 got value: 9, iteration: 8
Thread:1 got value: 10, iteration: 9
Thread:0 got value: 4, iteration: 3
Thread:0 got value: 5, iteration: 4
value of 'lastprivate' variable 'a' is 10
Demo 'lastprivate' clause end..

The caveats in using the lastprivate clause are:
(1) If the lastprivate variable is an array or structure and only some elements or fields are assigned in the last iteration, then after the parallel region the elements or fields that were not assigned in that final iteration are undefined.
(2) In C++, the value from the sequentially last iteration is copied back by invoking the copy assignment operator on the master thread's copy of the object, with that last value as the argument.

That is, both the copy assignment operator and the copy constructor must be publicly accessible, otherwise you'll find yourself in quite a fix: basically you will get a compiler error, and depending on your compiler (e.g. Xcode, VS, gcc) you might be able to figure out why the error is there in the first place.

The last thing I wanted to share is that under OpenMP, heap storage is shared by all threads in the program, even when the pointers to it are private. So you need to be careful when allocating and freeing memory, otherwise you will get runtime errors like
Non-aligned pointer being freed ... or double free



Sunday, June 28, 2009

Netflix has a winner! Well not if someone beats them to it in 30 days

Caught this in the New York Times while trying to catch up with the rest of the world. See the link here for more details. I've got a snapshot of that story here in case you just want to get the gist of it.

It turns out that they're not revealing the secret of their success yet (since they hope there are no challengers within the next 30 days), which is understandable, but I sure would like to find out how that team engineered the solution.


Sunday, May 31, 2009

CUDA and libcurl - a simple demo

Recently, I got to know cURL and libcurl and wondered whether I could get sample apps to run under CUDA. However, I could only get them to run in CUDA's device emulation mode, for the simple reason that libcurl is not CUDA-enabled, if I may use this term.

So here's a sample run with runtimes and the source code.
The one below is run with a simple multi-threaded implementation:
ray:src ray$ time ./multiDownload 1>/dev/null 2>&1

real 0m3.261s
user 0m0.014s
sys 0m0.028s

This one below is run under CUDA-device emulation mode
ray:src ray$ time ./cudaDownload 1>/dev/null 2>&1

real 0m31.893s
user 0m0.022s
sys 0m0.038s

The performance under CUDA's device emulation mode was disappointing, so I looked into the running process to find an explanation. Here are 2 notes I've made ...

1. Under device emulation mode, CUDA still launched as many threads as the source code implies, which is cool stuff. See the gdb output below: almost all of the threads are sitting in semaphore_wait_trap(), which probably explains why the runtimes suck that much.

(gdb) info threads
33 process 28855 thread 0x5303 0x945b32c2 in semaphore_wait_trap ()
32 process 28855 thread 0x5103 0x945b32c2 in semaphore_wait_trap ()
31 process 28855 thread 0x4f03 0x945b32c2 in semaphore_wait_trap ()
30 process 28855 thread 0x4d03 0x945b32c2 in semaphore_wait_trap ()
29 process 28855 thread 0x4b03 0x945b32c2 in semaphore_wait_trap ()
28 process 28855 thread 0x4903 0x945b32c2 in semaphore_wait_trap ()
27 process 28855 thread 0x4703 0x945b32c2 in semaphore_wait_trap ()
26 process 28855 thread 0x4503 0x945b32c2 in semaphore_wait_trap ()
25 process 28855 thread 0x4303 0x945b32c2 in semaphore_wait_trap ()
24 process 28855 thread 0x4103 0x945b32c2 in semaphore_wait_trap ()
23 process 28855 thread 0x3f03 0x945b32c2 in semaphore_wait_trap ()
22 process 28855 thread 0x3d03 0x945b32c2 in semaphore_wait_trap ()
21 process 28855 thread 0x3b03 0x945b32c2 in semaphore_wait_trap ()
20 process 28855 thread 0x3903 0x945b32c2 in semaphore_wait_trap ()
19 process 28855 thread 0x3703 0x945b32c2 in semaphore_wait_trap ()
18 process 28855 thread 0x3503 0x945b32c2 in semaphore_wait_trap ()
17 process 28855 thread 0x3303 0x945b32c2 in semaphore_wait_trap ()
16 process 28855 thread 0x3103 0x945b32c2 in semaphore_wait_trap ()
15 process 28855 thread 0x2f03 0x945b32c2 in semaphore_wait_trap ()
14 process 28855 thread 0x2d03 0x945b32c2 in semaphore_wait_trap ()
13 process 28855 thread 0x2b03 0x945b32c2 in semaphore_wait_trap ()
12 process 28855 thread 0x2903 0x945b32c2 in semaphore_wait_trap ()
11 process 28855 thread 0x2703 0x945b32c2 in semaphore_wait_trap ()
10 process 28855 thread 0x2503 0x945b32c2 in semaphore_wait_trap ()
9 process 28855 thread 0x2303 0x945b32c2 in semaphore_wait_trap ()
8 process 28855 thread 0x2103 0x945b32c2 in semaphore_wait_trap ()
7 process 28855 thread 0x1f03 0x945b32c2 in semaphore_wait_trap ()
6 process 28855 thread 0x1d03 0x945b32c2 in semaphore_wait_trap ()
5 process 28855 thread 0x1b03 0x945b32c2 in semaphore_wait_trap ()
4 process 28855 thread 0x1903 0x945b32c2 in semaphore_wait_trap ()
3 process 28855 thread 0x1703 0x946026fa in select$DARWIN_EXTSN ()
2 process 28855 thread 0x1503 0x945b32c2 in semaphore_wait_trap ()
* 1 process 28855 local thread 0x2d03 0x945ba46e in __semwait_signal ()
(gdb) disassemble semaphore_wait_trap
Dump of assembler code for function semaphore_wait_trap:
0x945b32b8 : mov $0xffffffdc,%eax
0x945b32bd : call 0x945b3ad4 <_sysenter_trap>
0x945b32c2 : ret
0x945b32c3 : nop
End of assembler dump.
2. semaphore_wait_trap() resolves into a system call into the Mac OS X kernel. That is not surprising, since in device emulation mode CUDA kernels are executed by threads on the CPU rather than on the GPU, and that causes most of the latency.

It was expected that I could not run my sample application on the GPU, since it's not possible to call a host function from within a kernel function (to borrow CUDA's terminology), so I had to build it under device emulation. What this experiment demonstrated is that CUDA's device emulation mode may not be the answer I was looking for, but it raises the question: "Wouldn't it be great if Nvidia could provide a software library that allows kernel functions to call host functions in the CUDA manner?" Perhaps it's a work in progress.

A likely candidate for this sort of computing could be OpenCL (Open Computing Language), and it'll be in the next Mac OS (Snow Leopard). Yay! Read the press release here.

Here is the source code I used (the multi-threaded program was lifted from the example code on the libcurl website, and I merely modified a few things to fit my experiment):
#include <stdio.h>
#include <pthread.h>
#include <curl/curl.h>

#define NUMT 32

const char* const urls[NUMT] = {
"http://www.yahoo.com",
"http://www.cnn.com",
"http://www.hotmail.com",
"http://www.gmail.com",
"http://www.hp.com",
"http://www.microsoft.com",
"http://www.sun.com",
"http://blogs.sun.com/",
"http://www.acm.org",
"http://blogs.sun.com/d/",
"http://blogs.sun.com/jonathan",
"http://blogs.sun.com/jimgris",
"http://blogs.sun.com/theaquarium",
"http://blogs.sun.com/arungupta",
"http://blogs.sun.com/katakai",
"http://blogs.sun.com/webmink",
"http://blogs.sun.com/startups",
"http://blogs.sun.com/geertjan",
"http://blogs.sun.com/eclectic",
"http://blogs.sun.com/theplanetarium",
"http://blogs.sun.com/SDNProgramNews",
"http://blogs.sun.com/GullFOSS",
"http://blogs.sun.com/richb",
"http://blogs.sun.com/chrisg",
"http://blogs.sun.com/ontherecord",
"http://blogs.sun.com/HPC",
"http://blogs.sun.com/bblfish",
"http://blogs.sun.com/enterprisetechtips",
"http://blogs.sun.com/ahl",
"http://blogs.sun.com/jag",
"http://blogs.sun.com/bigadmin",
"http://blogs.sun.com/brendan"
};

static void *pull_one_url(void* url) {
    CURL* curl;
    curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, url);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    return NULL;
}

int main(int argc, char** argv) {
    pthread_t tid[NUMT];
    int i;
    int error;

    curl_global_init(CURL_GLOBAL_ALL);
    // Start one thread per URL.
    for (i = 0; i < NUMT; i++)
        pthread_create(&tid[i], NULL, pull_one_url, (void*)urls[i]);

    // Wait for all downloads to finish.
    for (i = 0; i < NUMT; i++)
        pthread_join(tid[i], NULL);

    return 0;
}
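
For comparison, the same start-all-then-join pattern can be sketched in Python 3 with a thread pool. This is not part of my experiment: the function names are my own, and the fetch function is passed in as a parameter (an assumption I made so the sketch can be tried without network access).

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch_length(url, timeout=10):
    """Download one URL and return the number of bytes received."""
    with urlopen(url, timeout=timeout) as response:
        return len(response.read())

def pull_all(urls, fetch=fetch_length, max_workers=32):
    """Run fetch on every URL concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))

if __name__ == "__main__":
    # Offline demonstration with a stand-in fetch function (len of the string).
    print(pull_all(["http://www.acm.org", "http://www.hp.com"], fetch=len))
```

Leaving the `with` block waits for all worker threads, which plays the role of the pthread_join loop above.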

Here's the CUDA-style version (I've shown only the portion where it's different):

__global__ void pull_one_url(char** url) {
    int tid = threadIdx.x;
    CURL* curl;
    curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, url[tid]);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    printf("%d finished@%s\n", tid, url[tid]);
    return;
}

int main(int argc, char** argv) {
    char** d_a;
    // Copy the table of NUMT string pointers, not the string bytes. The
    // pointers still refer to host memory, so this only works in device
    // emulation mode, where the "device" code runs on the CPU.
    size_t memSize = NUMT * sizeof(char*);
    printf("size=%lu\n", (unsigned long)memSize);
    cudaMalloc((void**)&d_a, memSize);
    cudaMemcpy(d_a, urls, memSize, cudaMemcpyHostToDevice);
    for (int i = 0; i < NUMT; i++)
        printf("%s\n", d_a[i]);

    pull_one_url<<<1, NUMT>>>(d_a);

    cudaFree(d_a);
    return 0;
}


Saturday, May 9, 2009

RunSnakeRun - View performance profile in Python

In my previous post on Greenlet/Eventlet, I tested and profiled the apps I wrote just to get a feel for them. Admittedly, I find reading the raw statistics gathered through profiling laborious, and at times quite tiring.

So I've found a nice application that can read a Python application's profiling data. It's called RunSnakeRun, and it's pretty cool, as it helps to visualize the data in a much more interesting and readable manner. The caveat is that you will need to download the appropriate wxPython binaries, or the source code to build them, for the platform you're using. In this case, I'm using Ubuntu Linux to run the profiling.

Besides the proper wxPython, you will need to follow the instructions on the site to begin profiling and viewing your profiled data.

On a side note, if you are curious about profiling Python applications but have little idea what it's all about, here's a link that gives you an idea why profiling is important in Python code.


The screenshot I have for you is a caller-callee graph with runtimes (cumulative and exclusive), which allows a Python developer to quickly isolate poorly performing code and also gives an idea of how the code is being executed. E.g. this screenshot illustrates the concurrent execution of the Greenlets.


This application lets a developer quickly isolate problematic code using a square map: the largest consumer of time is shown as the largest square, and the squares are highlighted in color as you hover your mouse over them. Neat!


Python's Greenlet/Linden Lab's Eventlet

Have you guys used Python's greenlet or eventlet before? They're pretty cool and I thought I'd write something about them. So here's a little experiment I did using a few Python modules, i.e. optparse, greenlet and eventlet. I also did some simple measurements to see what was causing latencies etc.

What I did was basically collect statistics on running a highly parallel computation; I call it matrix multiplication, though in the scripts at the end of this post it is an element-wise product of two vectors followed by a sum. I created 3 programs: one using the normal iterative approach (i.e. a for-loop), one using greenlet and lastly one using eventlet.

Overall, I find greenlet much more suitable than eventlet (w.r.t. time) and than the iterative approach (w.r.t. flexibility and elegance). My scripts are included at the end of this post.

Here's how i executed and profiled my script:

ray:~ ray$ python -m cProfile -o eventletprof_10 ./myEventletDemo.py
Sum of matrix multiplication: 285

ray:~ ray$ python -m cProfile -o eventletprof_100 ./myEventletDemo.py --i 100
Sum of matrix multiplication: 328350

ray:~ ray$ python -m cProfile -o eventletprof_1000 ./myEventletDemo.py --i 1000
Sum of matrix multiplication: 332833500

ray:~ ray$ python -m cProfile -o eventletprof_10000 ./myEventletDemo.py --i 10000
Sum of matrix multiplication: 333283335000

ray:~ ray$ python -m cProfile -o eventletprof_100000 ./myEventletDemo.py --i 100000
Sum of matrix multiplication: 333328333350000
...
...
ray:~ ray$ python -m cProfile -o noneventletprof_10 ./myNoneventletDemo.py
iteration version of matrix add: 285

ray:~ ray$ python -m cProfile -o noneventletprof_100 ./myNoneventletDemo.py --i 100
iteration version of matrix add: 328350

ray:~ ray$ python -m cProfile -o noneventletprof_1000 ./myNoneventletDemo.py --i 1000
iteration version of matrix add: 332833500

ray:~ ray$ python -m cProfile -o noneventletprof_10000 ./myNoneventletDemo.py --i 10000
iteration version of matrix add: 333283335000

ray:~ ray$ python -m cProfile -o noneventletprof_100000 ./myNoneventletDemo.py --i 100000
iteration version of matrix add: 333328333350000
...
...
ray:~ ray$ python -m cProfile -o greenletprof_10 ./myGreenletDemo.py
iteration version of matrix add: 285

ray:~ ray$ python -m cProfile -o greenletprof_100 ./myGreenletDemo.py --i 100
iteration version of matrix add: 328350

ray:~ ray$ python -m cProfile -o greenletprof_1000 ./myGreenletDemo.py --i 1000
iteration version of matrix add: 332833500

ray:~ ray$ python -m cProfile -o greenletprof_10000 ./myGreenletDemo.py --i 10000
iteration version of matrix add: 333283335000

ray:~ ray$ python -m cProfile -o greenletprof_100000 ./myGreenletDemo.py --i 100000
iteration version of matrix add: 333328333350000


What profiling does is write the results to a file (e.g. greenletprof_100000), and you can view the statistics in the Python interpreter using the following:

Python 2.5.1 (r251:54863, Jan 13 2009, 10:26:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pstats
>>> profiledData = pstats.Stats('greenletprof_100000')
>>> profiledData.print_stats()
And you'll see the statistics printed in much detail. However, this is rather tedious, so I would suggest a graphical UI to view the profiled data instead; one such option is RunSnakeRun.
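
If you'd rather stay in a script, cProfile and pstats can also be driven programmatically. The sketch below is a minimal Python 3 example under my own assumptions: dot_self and profile_call are hypothetical helper names, and the report is limited to the five most expensive entries by cumulative time.

```python
import cProfile
import io
import pstats

def dot_self(n):
    """Sum of i*i for i in range(n), the same computation as the demo scripts."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
    return result, report.getvalue()

if __name__ == "__main__":
    result, text = profile_call(dot_self, 100000)
    print(result)
    print(text)
```

Sorting by cumulative time puts the functions that dominate the run at the top, which is usually the first thing to look at before opening a graphical viewer.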

Comparing the statistics between the different runs in detail reveals that the iterative approach is the most efficient, followed by Greenlet and lastly Eventlet. Here are the statistics in detail:

>>> import pstats
>>> stats10 = pstats.Stats('nongreenleteventlet100000.profile')
>>> stats10.print_stats()
Sun May 10 22:30:16 2009 nongreenleteventlet100000.profile

6 function calls in 0.069 CPU seconds

Random listing order was used

ncalls tottime percall cumtime percall filename:lineno(function)
1 0.060 0.060 0.069 0.069 /home/tayboonl/Desktop/Greenlet_Eventlet/nongreenleteventletdemo.py:8(execMatrixAdd_Iter)
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
1 0.000 0.000 0.069 0.069 :1()
3 0.009 0.003 0.009 0.003 {range}

...
>>> stats10 = pstats.Stats('greenlet100000.profile')
>>> stats10.print_stats()
Sun May 10 22:16:45 2009 greenlet100000.profile

500011 function calls in 1.511 CPU seconds

....
>>> stats10 = pstats.Stats('eventlet100000.profile')
>>> stats10.print_stats()
Sun May 10 22:20:17 2009 eventlet100000.profile

6500027 function calls (5800031 primitive calls) in 34.901 CPU seconds

You might have noticed the remarkable number of function calls made in the scenario using Eventlet. It suffices at this point to say that, compared with Greenlet, Eventlet was not designed for high performance computing, since it was designed to be an asynchronous networking library. To me, it makes a lot of sense to always profile an application or framework before using it on a wide basis.
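
To get a feel for where the extra function calls come from without installing anything, here is a Python 3 sketch that uses plain generators as a stand-in for greenlets (that substitution is my assumption; generators are not greenlets, and the function names are hypothetical). Both functions compute the same sum, but the second creates and resumes one task per element.

```python
def dot_iterative(n):
    """One loop, no extra calls per element."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def dot_with_tasks(n):
    """One cooperative task per element, resumed by a tiny scheduler loop."""
    accum = []

    def task(a, b):
        accum.append(a * b)
        yield  # hand control back to the scheduler, like switching to the parent

    tasks = [task(i, i) for i in range(n)]
    for t in tasks:
        next(t)  # run each task up to its yield
    return sum(accum)

if __name__ == "__main__":
    print(dot_iterative(10), dot_with_tasks(10))  # 285 285
```

Profiling the two functions with cProfile should show the task-based version making far more calls for the same result, which is the same effect as in the numbers above, only on a smaller scale.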

In case you are interested, here are the scripts I used.
There are 3 in total, the first being the one implemented using Greenlet (cool stuff!)

#!/usr/bin/python

import greenlet
import optparse

tasks = []
accum = list()
sum = 0
vecA = vecB = None

def mulNAccum(idx, val1, val2) :
    global accum
    accum.insert(idx, val1*val2)
    #print idx,val1*val2,accum

def createTasks(HowMany):
    global vecA, vecB
    vecA = range(0, HowMany)
    vecB = range(0, HowMany)
    for i in range(0,HowMany):
        tasks.append(greenlet.greenlet(run=mulNAccum,parent=greenlet.getcurrent()))

def executeTasks(HowMany):
    global tasks
    global accum
    global sum

    for i in range(0,HowMany):
        tasks[i].switch(i, vecA[i],vecB[i])
    #print "tasks finished executing ...\n"
    # print accum
    for i in range(0, len(accum)):
        sum = sum + accum[i]

    print "Sum of matrix multiplication: %d\n" % sum

if __name__ == "__main__":
    parser = optparse.OptionParser()
    parser.add_option("--items", "-i", default=10, action="store", type="int", dest="HowMany", help="number of elements in matrix array to process")
    (options, args) = parser.parse_args()

    createTasks(options.HowMany)
    executeTasks(options.HowMany)

Similar program using the for-loop a.k.a iterative approach

#!/usr/bin/python

import optparse

sum = 0
vecA = vecB = None

def execMatrixAdd_Iter(HowMany):
    global vecA, vecB, sum
    vecA = range(0, HowMany)
    vecB = range(0, HowMany)
    for i in range(0, HowMany):
        temp = vecA[i] * vecB[i]
        sum = temp + sum
    print "iteration version of matrix add: %d\n" % sum


if __name__ == "__main__":
    parser = optparse.OptionParser()
    parser.add_option("--items", "-i", default=10, action="store", type="int", dest="HowMany", help="number of elements in matrix array to process")
    (options, args) = parser.parse_args()
    execMatrixAdd_Iter(options.HowMany)

Last script using Linden Lab's eventlet

#!/usr/bin/python

import eventlet
import eventlet.api
import optparse

tasks = []
accum = list()
sum = 0
vecA = vecB = None

def mulNAccum(idx, val1, val2) :
    global accum
    accum.insert(idx, val1*val2)
    #print idx,val1*val2,accum

def createTasks(HowMany):
    global vecA, vecB
    vecA = range(0, HowMany)
    vecB = range(0, HowMany)
    for i in range(0,HowMany):
        tasks.append(eventlet.api.spawn(mulNAccum, i, vecA[i-1], vecB[i-1]))

def executeTasks(HowMany):
    global tasks
    global accum
    global sum

    for i in range(0,HowMany):
        tasks[i].switch()
    #print "tasks finished executing ...\n"
    # print accum
    for i in range(0, len(accum)):
        sum = sum + accum[i]

    print "Sum of matrix multiplication: %d\n" % sum

if __name__ == "__main__":
    parser = optparse.OptionParser()
    parser.add_option("--items", "-i", default=10, action="store", type="int", dest="HowMany", help="number of elements in matrix array to process")
    (options, args) = parser.parse_args()

    createTasks(options.HowMany)
    executeTasks(options.HowMany)