Finally, the tuning project came to a close (at least for now), and there were many takeaways from this experience with my customer which I'd like to share. Overall, I felt that the root cause of these issues was typical of many J2EE software development projects: not understanding the technology well enough (not to mention project costing and time schedules which compromised software quality... after all these years, you would think that people got it, but apparently they don't).
Understanding EJBs, Transactions and SQL
That application had more than 3,000 bean instances in active transactions, but the average transaction rate was less than 5 per second, which highlighted an important fact: it was heavily clogged. The real clog came from poorly performing SQL statements, injudicious use of CMT and BMT (Container-Managed and Bean-Managed Transactions), mixing CMT and BMT in the same bean, and "chatty" EJBs. The EJB specification expects a bean to use either CMT or BMT, not both, as mixing them complicates transaction management. Also, never EVER place global variables of the kind that represent database connections, user preferences etc. in stateless beans; only stateful beans may hold that kind of state. Since concurrency is pretty low right now, these concurrency-related issues won't surface yet, but you can be sure they will once performance improves, and you don't want that: these kinds of bugs are the HARDEST to locate.
It was so slow that WebLogic actually emitted "STUCK THREAD" messages, which occur when execution exceeds 10 minutes.
The next thing to do is to revisit the design of those beans, remove the CMT and BMT mixtures, and convert globals into locals.
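To show why globals in a stateless component bite once concurrency goes up, here is a minimal, language-neutral sketch in Python. It is not EJB code: the class names are made up, and a single shared object stands in for a pooled stateless bean instance that serves more than one caller.

```python
class LeakyStatelessService:
    """Hypothetical pooled component that wrongly keeps per-request state in a field."""

    def __init__(self):
        self.current_user = None  # the "global": it survives from one caller to the next

    def begin(self, user):
        self.current_user = user

    def finish(self):
        return "report for " + self.current_user


class CleanStatelessService:
    """Same work, but the per-request value travels as a parameter (a local)."""

    def report(self, user):
        return "report for " + user


def interleave(service):
    """Two callers share one pooled instance and their calls interleave."""
    service.begin("alice")    # first caller starts
    service.begin("bob")      # second caller overwrites the shared field
    return service.finish()   # first caller finishes and receives bob's data
```

With low concurrency the two `begin` calls rarely overlap, which is why this kind of bug stays hidden until the system gets faster.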
Understanding Clustering Technology
Clustering can be so easily abused by developers, and lack of design control can worsen the entire situation. This project's implementation created a single cluster group of 12 machines, and it was observed that minor and major GCs were happening so frequently, with worsening pause times, that they totally wrecked the concurrency of the application. They then changed it to 3 cluster groups, which significantly reduced GC (however, it still wasn't good enough). I suggested that they reduce it further to several cluster groups of 2 servers each, which should significantly reduce overall heap usage and the frequency of GCs (and GC pause times), since with less replication activity there is less pressure on the Java memory system.
Developers placed huge objects (> 500KB) into the session via attributes, and combined with session timeouts that ranged in hours (nothing can be done about that since it's a business requirement), this meant the memory couldn't be freed during GCs. But I suspect it's really an application design issue, since these objects tend to reflect database data displayed on the GUI and all those nifty controls that give users flexibility in the business functions.
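To get a feel for what a session full of database rows costs every time it is replicated, here is a rough sketch that compares the serialized size of two session layouts. Python's pickle only stands in for Java serialization, so the absolute numbers will differ from a real WebLogic session; the row shape, the row count and the key string are all made up.

```python
import pickle


def serialized_size(obj):
    """Number of bytes the object occupies once serialized for the wire."""
    return len(pickle.dumps(obj))


def build_sessions(row_count=5000):
    """Return (heavy, light): one session holds fetched rows, the other only a lookup key."""
    rows = [{"id": i, "name": "customer-%d" % i, "balance": i * 1.5}
            for i in range(row_count)]
    heavy = {"user": "alice", "search_results": rows}          # database fetch kept in session
    light = {"user": "alice", "search_key": "customers:page=1"}  # re-query when needed
    return heavy, light


if __name__ == "__main__":
    heavy, light = build_sessions()
    print("heavy session bytes:", serialized_size(heavy))
    print("light session bytes:", serialized_size(light))
```

Multiply the heavy figure by the number of live sessions and by the hours-long timeout, and the pressure on both the heap and the replication channel becomes easier to picture.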
Understanding Garbage Collection
GC is quite a killer of typical J2EE applications, and it's important to keep pause times as low as possible. One major improvement came from using the Concurrent Low Pause Collector, i.e. -XX:+UseConcMarkSweepGC, which lets most of the Tenured Generation collection happen concurrently with the application (Young Generation collections still pause the application, though those pauses are usually short). Before this change, the major GC took 10 - 17 secs on average, occurring once every 3 hours on a typical day, and 15 - 20 secs on average, occurring once every 30 - 45 minutes on a busy day. However, I am skeptical whether this will hold up, because the average object creation rate is 800 MB per minute (minor GC occurs twice per minute and each time frees approximately 400 MB of memory), so there's still a chance that a stop-the-world GC will happen; to understand why, do read Jon Masamitsu's blog.
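The 800 MB per minute figure is simple arithmetic on the GC log numbers, sketched below. It assumes that the memory freed by each minor GC is a fair estimate of what was allocated since the previous one, which ignores whatever survives and gets promoted; the function names are made up.

```python
def allocation_rate_mb_per_min(minor_gcs_per_min, freed_mb_per_gc):
    """Estimate allocation as (minor GCs per minute) x (MB freed by each one)."""
    return minor_gcs_per_min * freed_mb_per_gc


def allocation_rate_gb_per_hour(rate_mb_per_min):
    """Scale the per-minute estimate up to an hour, in GB (1 GB = 1024 MB)."""
    return rate_mb_per_min * 60 / 1024


if __name__ == "__main__":
    rate = allocation_rate_mb_per_min(2, 400)  # 2 collections/min x 400 MB = 800 MB/min
    print(rate, "MB per minute")
    print(allocation_rate_gb_per_hour(rate), "GB per hour")
```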
Class Loading
The developers of the application did something that is very commonly seen in development: dumping their classes into the application server's classpath. Not only did this prevent the classes from unloading, but worse, it caused a type of memory leak known as a class loader leak, and GC doesn't reclaim that memory. I admit I used to do this, but I don't now. The effect was OutOfMemoryError: PermGen space, and of course increasing the PermGen size would stamp out this error (at least until the leak fills the larger space).
Of course, now my customer is pursuing the queue overflow exception with BEA Systems, so let's see what happens next... perhaps a patch, or BEA/Oracle will say "Upgrade to WebLogic 10, it solves that problem!" Psst, let me tell you a secret: the problem is not resolved there either...
Sunday, October 12, 2008
Thursday, October 2, 2008
J2EE/WebLogic Performance Tuning Project...continued
In my previous article, my tuning project with a customer ran into some trouble with WebLogic's Work Manager, in particular the Java exception weblogic.utils.UnsyncCircularQueue$FullQueueException, with which the WebLogic server indicated that the queue holding submitted requests was full. I checked the WebLogic docs and, according to them, the server automatically resizes the queue to fit the requests. What the docs didn't mention is that the resize fails once the size of the queue reaches the maximum capacity, which happens to be 65536, and that's why it threw the error message "Queue exceeded the maximum capacity of '65536' elements".
However, checking the code reveals something quite peculiar: the constructor suggests that only queues of sizes exceeding 1 GB will throw this error, yet the default capacity is always 256 and grows to a maximum of 65536 elements. By the way, this is WebLogic 9.2, so my guess is that the source code needs to be cleaned up? If anybody has any idea why, do drop me a comment. Thanks in advance.
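The behaviour described above (start at 256 slots, grow on demand, refuse to grow past 65536) can be sketched as a small circular queue. This is my own illustration in Python and not WebLogic's implementation: the class names, the doubling strategy and the method names are all assumptions; only the two capacities and the wording of the message come from the post.

```python
class FullQueueError(Exception):
    """Raised when the queue is full and already at its maximum capacity."""


class BoundedCircularQueue:
    def __init__(self, initial_capacity=256, max_capacity=65536):
        self._items = [None] * initial_capacity
        self._head = 0          # index of the oldest element
        self._size = 0          # number of stored elements
        self._max_capacity = max_capacity

    def capacity(self):
        return len(self._items)

    def __len__(self):
        return self._size

    def put(self, item):
        if self._size == len(self._items):
            self._grow()        # raises FullQueueError once the ceiling is reached
        tail = (self._head + self._size) % len(self._items)
        self._items[tail] = item
        self._size += 1

    def get(self):
        if self._size == 0:
            raise IndexError("queue is empty")
        item = self._items[self._head]
        self._items[self._head] = None
        self._head = (self._head + 1) % len(self._items)
        self._size -= 1
        return item

    def _grow(self):
        old_capacity = len(self._items)
        if old_capacity >= self._max_capacity:
            raise FullQueueError(
                "Queue exceeded the maximum capacity of '%d' elements" % self._max_capacity)
        new_capacity = min(old_capacity * 2, self._max_capacity)
        # Unwrap the ring into FIFO order before extending it.
        ordered = [self._items[(self._head + i) % old_capacity] for i in range(self._size)]
        self._items = ordered + [None] * (new_capacity - self._size)
        self._head = 0
```

In a design like this the error is not a resize bug as such: it simply means requests are arriving faster than they are drained for long enough to fill the largest allowed queue.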
Labels:
BEA WebLogic,
J2EE,
Java
Tuesday, September 30, 2008
J2EE/WebLogic Performance Tuning Project
This article covers a series of investigations for a customer of mine whose environment runs a WebLogic cluster of 20 machines in round-robin on HP-UX to service a global J2EE application, which performed slowly during peak periods and occasionally hung. The application has a typical 3-tier architecture whereby the web tier delegates requests to the middle tier (EJBs, MQs, MDBs), and this middleware goes to the database tier (SQL inserts, updates, deletes, stored procedures etc). The application was found to be experiencing heavy load during peak periods every day.
There were a couple of issues related to poorly performing SQL, poorly designed middleware apps, WebLogic cluster design and runtime issues, JVM memory consumption and frequent garbage collections. Let me try to detail them a bit without giving away too much customer information. Hopefully, it can help you with investigations in your own environment.
During the peak period, the major contributing factors to the apps' slowness were:
- The heap size was 1.5GB (min and max), with 512MB for Eden, and the PermGen was 192MB. The minor GC kicked in frequently, releasing approximately 60MB on average; the major GC kicked in twice every minute (3-5s on average, 40s max), releasing 400 - 500 MB each time, and working backwards from these figures shows that the object creation rate was roughly 800 MB - 1.0 GB per minute. GC is primarily a CPU-intensive operation (saving state, freeing memory, compacting the heap etc). The large object creation rate combined with the relatively long GC pauses suggests that the application was creating objects inefficiently, and that created problems with the cluster's session replication mechanism: users of the system would see stale data because, due to the long GC pauses, the data in the session was not replicated *properly* across to the other servers.
- Applications were attached to the WebLogic system classpath, which meant that the Java classes were never unloaded from memory. Combined with the fact that there are ALWAYS classloader leaks, this meant that whenever the operations team redeployed (a.k.a. "hot"-redeployed) the apps, the memory footprint worsened, since the previous memory was never released due to this leakage. If you keep hot-deploying this stuff you will almost certainly get an OutOfMemoryError: PermGen space.
- EJBs (4,000+ deployed, in my opinion too many) were using remote interfaces when there was no need, as those apps were not doing cross-VM operations. Based on my previous experimentation, you would get a 3-fold runtime improvement when you convert the EJBs to local interfaces. The improvement comes from avoiding object marshalling/unmarshalling via RMI, since everything is on the same JVM heap, and from consuming fewer system resources like file descriptors/sockets and memory, since a local interface implies a local/normal Java call.
- As I mentioned previously, the apps were deployed in the cluster, and that meant that all persistent objects (e.g. session data, user preferences etc) must be serializable (i.e. they need to implement java.io.Serializable), since there is session replication across the servers in the cluster, which further degraded performance as the cluster needs to maintain state across all 20 machines. Source code analysis found that the apps were keeping the results of database fetches in session data! You can imagine the pressure faced by the JVM memory subsystem + WebLogic cluster replication.
- The WebLogic cluster was also malfunctioning during peak periods, throwing an exception message like <WorkManager> <BEA-002911> <WorkManager weblogic.kernel.System failed to schedule a request due to weblogic.utils.UnsyncCircularQueue$FullQueueException: Queue exceed maximum capacity of: '65536' elements. This is a critical error thrown from the Work Manager, which replaced BEA's traditional thread pool. What this meant was that the WebLogic cluster could no longer handle users' requests and hung. *I plan to unravel this mystery in a while to understand why this is happening*
- The hardware load balancer was in "sticky" mode even though the WebLogic cluster was in round-robin mode, which negated the round-robin-ness and resulted in certain servers encountering more stress than others, and this was made worse by the long session timeout of 20+ hrs. That's the cost of doing business....
- After tracing the SQL statements' execution times, it was found that they were causing a lot of problems: missing indexes, lack of functional indexes, improper SQL statements which caused large database table joins, and many "select count(*)..." statements over large table joins, all of which contributed to this object creation rate.
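As a small illustration of the missing-index point in the last item, the sketch below asks a database how it plans to run a "select count(*)" before and after an index exists. SQLite (via Python's standard library) only stands in for the customer's database, which this post does not name, and the table, column and index names are made up.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)  # last column holds the detail text


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
        [(i % 100, i * 1.0) for i in range(1000)])
    sql = "SELECT count(*) FROM orders WHERE customer_id = ?"

    before = query_plan(conn, sql, (42,))   # no index yet: the whole table is scanned
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    after = query_plan(conn, sql, (42,))    # the plan now names the new index

    count = conn.execute(sql, (42,)).fetchone()[0]
    conn.close()
    return before, after, count


if __name__ == "__main__":
    before, after, count = demo()
    print("before:", before)
    print("after: ", after)
    print("rows counted:", count)
```

The same before/after habit works on any database that can explain its plans; only the syntax of the explain command changes.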
When I looked at these issues, the first couple of items I advised my customer to do were the following:
(1) Convert the EJBs to use local interfaces i.e. call-by-reference
(2) Tune the SQL statements via SQL reordering, indexes etc
(3) Tune the JVM heap to use more aggressive + parallel collectors, e.g. -XX:+UseConcMarkSweepGC together with -XX:+UseParNewGC (note that -XX:+UseParallelGC cannot be combined with -XX:+UseConcMarkSweepGC; we are still experimenting with this portion)
(4) Do not use system classpath to load application classes
(5) Review source code to remove known classloader leaks
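Item (1) changes more than speed, so here is a minimal sketch of the difference between a remote-style call (arguments and results are copied through serialization) and a local-style call (the callee works on the caller's own object). Python's pickle stands in for RMI marshalling, and all function names are made up.

```python
import pickle


def remote_style_call(func, argument):
    """Copy the argument and the result through serialization, as a remote call must."""
    copied_argument = pickle.loads(pickle.dumps(argument))
    result = func(copied_argument)
    return pickle.loads(pickle.dumps(result))


def local_style_call(func, argument):
    """Plain in-process call: no copies, the callee receives the caller's object."""
    return func(argument)


def add_item(order):
    """Example business method that modifies its argument."""
    order["items"].append("widget")
    return order
```

The local call skips two serialization round trips, which is where the speed-up comes from, but it also means that changes made by the callee become visible to the caller. Code that silently relied on getting a private copy needs a review before the switch.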
The customer and I are still in the process of implementing/reviewing this, so I hope to have an update for you in a couple of weeks' time. Meanwhile, do visit Jon Masamitsu's blog for an understanding of some JVM tuning parameters.
Labels:
BEA WebLogic,
J2EE,
Java,
JVM
"man" crashes on my Lenovo T61 OpenSolaris snv_95
I didn't expect "man" to crash on me today. It's an interesting problem, because when I tried the same command that crashed on my OpenSolaris snv_95 on another machine running Solaris (SunOS raymond 5.10 Generic_118855-36 i86pc i386 i86pc), it didn't crash but instead gave me the error message "No manual entry for make", which is good since it's friendlier to the user, and I found similar behavior on Ubuntu Linux.
After investigating for a while, I believe it's a bug. The command I attempted was
man -M /usr/bin/man make
You will notice that I made a mistake, and it's intentional. "-M" is supposed to take a path, but in this case I gave the absolute path to a file (which is executable, btw), and when I ran it on my OpenSolaris it generated a coredump.
After examining the code + coredump, it appears to me that the crash happens because the program attempted to release memory via "free" that was never allocated. I filed a bug report with OpenSolaris and will keep you updated when I have news.
This is an update on this post. Here is the bug id issued, click here for more details.
Friday, September 12, 2008
Building OpenJDK 7
I guess a lot of people already know that Sun Microsystems has released the source code for Java. Well, I tried my hand at compiling OpenJDK 7, and it's important that you read the build instructions before you attempt to compile it. I did it because I wanted to try out the new features of the JVM and also to understand the build-release process. Of course, I am still learning, and this post is about my experience compiling it; I hope it gives you useful information on building your own, customizing it, fixing bugs if you find them, etc.
What I did first was to read the build instructions. Read them a couple of times to understand exactly what you need to do. From my experience, here's what I did:
(1) Alter my .bashrc file to include the environment variables needed
ALT_BINARY_PLUGS_PATH=/export/home/tayboonl/build_jdk/openjdk-binary-plugs
ANT_HOME=/export/home/tayboonl/apache-ant-1.7.1
ALT_COMPILER_PATH=/opt/SunStudioExpress/bin
ALT_GCC_COMPILER_PATH=/usr/bin/
ALT_CUPS_HEADERS_PATH=/usr/include
ALT_JDK_IMPORT_PATH=/usr/jdk/jdk1.6.0_06
LANG=C
export ALT_BINARY_PLUGS_PATH
export ALT_CUPS_HEADERS_PATH
export ALT_COMPILER_PATH
export ALT_GCC_COMPILER_PATH
export ALT_JDK_IMPORT_PATH
export ANT_HOME
export LANG
(2) Invoke the sanity check as defined in the build instructions
gmake sanity ARCH_DATA_MODEL=32
Note: It's important to fix all the warnings and errors before proceeding. If in doubt, check the OpenJDK forums. Remember to set the ALT_* variables; they are pretty critical to the success of the build. Also remember to install FindBugs.
(3) Re-run the sanity checks till everything is fixed, then start the build (from the command below, you can tell I am building the JDK for 32-bit instead of 64-bit)
gmake ARCH_DATA_MODEL=32
(4) Wait a while...grab a coffee...grab a bagel...sleep a little
If you have reached this stage, you will probably find your terminal screen scrolling away, building, compiling etc., and you might hit this error
"Error: ia_nice is not a member of iaparms."
This is bug 6712505, and you can comment out the offending line. I wonder why the folks at Sun didn't take this out...
The next thing I encountered was a build error complaining that it could not find the files "sys/audio.h", "sys/audioio.h" and "sys/mixer.h". I subsequently found the 3 files on src.opensolaris.org and placed them into their proper directories.
After that, re-running the build took another hit: it could not locate the file "X11/Intrinsic.h" and some other header files from the X11 package, which on OpenSolaris is known as SUNWxwinc. Here's the irritating part: the OpenSolaris package manager tells you that the package is installed, but when I visited the folder(s) I simply couldn't find the files, so I had to re-install the package SUNWxwinc, and then the header files were there.
(5) Finally, the entire build process kicked off without any further disappointments, and then I saw IT, the message
gmake[2]: Leaving directory `/export/home/tayboonl/build_jdk/openjdk/jdk/make'
gmake[1]: Leaving directory `/export/home/tayboonl/build_jdk/openjdk'
Control solaris i586 1.7.0-internal build_product_image build finished:
Control solaris i586 1.7.0-internal all_product_build build finished:
Control solaris i586 1.7.0-internal all build finished:
tayboonl@opensolaris:~/build_jdk/openjdk$
Next, invoke the freshly built java from the command line, and when you see something like
tayboonl@opensolaris:~/build_jdk/openjdk/build/solaris-i586$ ./j2sdk-image/bin/java -version
openjdk version "1.7.0-internal"
OpenJDK Runtime Environment (build 1.7.0-internal-tayboonl_2008_09_12_11_16-b00)
OpenJDK Server VM (build 14.0-b04, mixed mode)
You know you are DONE. Woo hoo! What a relief. I hope the next build is not going to make me weep even more, but it's best to test the build with the demo apps found in it. Here's a snapshot of the ArcTest demo from the completed build

The last thing I did was to take a thread dump of the JVM, and from the output it appears to be fine
2008-09-12 14:00:35
Full thread dump OpenJDK Server VM (14.0-b04 mixed mode):
"TimerQueue" daemon prio=3 tid=0x08320c00 nid=0x11 waiting on condition [0xb6d7e000..0xb6d7ebf0]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0xf3d65360> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1974)
at java.util.concurrent.DelayQueue.take(DelayQueue.java:209)
at javax.swing.TimerQueue.run(TimerQueue.java:170)
at java.lang.Thread.run(Thread.java:674)
...
Hopefully, what I have done goes to show that you can do it too. Now it's time for lunch.
Tuesday, September 2, 2008
Rules for Writing 64-bit Clean Code
I lifted this from the book "Solaris Systems Programming" by Rich Teer. This is good advice, so no surprises there. If you know that Rich Teer would feel uncomfortable about this, let me know and I will gladly take it down.
The following rules should be observed to ensure the 64-bit cleanliness of our code. Following these rules will also make it easier to port 32-bit code to the 64-bit model.
- Do not assume a pointer and an "int" are the same size. Unfortunately, a lot of code relies on this assumption, because it is true in the ILP32 model. Pointers are sometimes cast to "int"s or "unsigned int"s to perform address arithmetic. Instead, they should be cast to "long" (or "unsigned long") because pointers and "long"s are the same size in both the ILP32 and LP64 data type models. Even better is to cast pointers to "uintptr_t", because it expresses the intent more clearly and makes the code more portable.
- Do not make assumptions about the relative sizes of variable types. A classic example of this is to assume that the size of an "int" is the same as the size of a "long" and use them indiscriminately while implicitly or explicitly assuming they are interchangeable. Although this is typically true for 32-bit processes, it is not true for 64-bit ones.
- Be wary of sign extension problems. This is quite a common problem when converting code to 64 bits, and is hard to detect before it actually occurs because "lint" doesn't warn us about it. Also, the type conversion and promotion rules are quite obscure. Hence, we should use explicit casting to fix sign extension problems.
- Use pointer arithmetic rather than address arithmetic. As well as leading to cleaner code, pointer arithmetic is independent of the data model...
- By default, external variables and functions are assumed by the compiler to be, or to return, an "int" unless we declare them otherwise... In the ILP32 model this is OK, but in the LP64 model we will lose information, which can cause the program to crash because of an illegal memory reference.
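Two of these rules (pointer versus "int" size, and sign extension) can be watched in action without a C compiler. The sketch below is mine, not from the book: it uses Python's ctypes fixed-width types to model the LP64 sizes, with c_int32 playing "int" and c_uint64 playing "unsigned long"; the function names and the sample address are made up.

```python
import ctypes


def pointer_as_int(address):
    """Squeeze a 64-bit address into a 32-bit signed int, as a cast to "int" would."""
    return ctypes.c_int32(address).value      # only the low 32 bits survive


def pointer_as_unsigned_long(address):
    """Keep the address in a 64-bit unsigned integer (LP64 "unsigned long")."""
    return ctypes.c_uint64(address).value


def widen_int_to_unsigned_long(value):
    """Widen a 32-bit signed int to 64 bits; negative values sign-extend."""
    as_int = ctypes.c_int32(value).value
    return ctypes.c_uint64(as_int).value


def host_sizes():
    """Sizes in bytes of int, long and a pointer on the machine running this script."""
    return {
        "int": ctypes.sizeof(ctypes.c_int),
        "long": ctypes.sizeof(ctypes.c_long),
        "pointer": ctypes.sizeof(ctypes.c_void_p),
    }


if __name__ == "__main__":
    address = 0x00007FFF12345678
    print(hex(pointer_as_int(address)))                  # upper half of the address is gone
    print(hex(pointer_as_unsigned_long(address)))        # address preserved
    print(hex(widen_int_to_unsigned_long(0x80000000)))   # high bits filled with ones
    print(host_sizes())
```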
There's a lot more where that came from, so I suggest you buy a copy (new or used doesn't matter).
Saturday, August 30, 2008
Update OpenSolaris using Boot Environment Administration tool
I was updating my OpenSolaris snv_93 to snv_95 when all of a sudden my wireless connection dropped and the update was left incomplete, so the next thing to do was to remove it. I removed it by first unmounting the volumes and then invoking beadm as follows; otherwise it's going to be pretty irritating to see it whenever I reboot my laptop.
-bash-3.2# umount rpool/ROOT/opensolaris-5/opt
-bash-3.2# umount rpool/ROOT/opensolaris-5
....
-bash-3.2# beadm destroy opensolaris-5
Are you sure you want to destroy opensolaris-5? This action cannot be undone (y/[n]):
y
The BE that was just destroyed was the 'active on boot' BE. opensolaris-4 is now the 'active on boot' BE. Use 'beadm activate' to change it.
...
-bash-3.2# beadm list
BE            Active Active on Mountpoint Space
Name                 reboot               Used
----          ------ --------- ---------- -----
opensolaris   no     no        -          66.89M
opensolaris-4 yes    yes       /          12.54G
...
If you don't unmount it first, you are going to get error messages like "Unable to destroy XXX".
Labels:
Administration,
OpenSolaris,
Solaris
Friday, August 22, 2008
Quitting my day job
I was hanging around a Starbucks cafe today doing my usual research, browsing the internet and reading my O'Reilly "Learning Python" book, when I was approached by a Caucasian male and something really funny happened.
Caucasian: I am sorry to bother you but I need to ask you a favour
Me: Errr...sure, what can I do for you?
Caucasian: I bought a USB hard drive from "Funan Centre" (an IT super mall) and wanted to test it out
Me: Errr...*puzzled look* (which translates to: didn't the store test it out for you?)
...
Subsequently, we tested it out and chatted about his travels and what he liked/disliked about Singapore (my country)
...
... 5 mins later (yeah no doubt! all that happened in 5 mins)
Caucasian: Ahhh...yes I can see the drive and there's a song
Me: (I looked at the song, recognized it as Rihanna's "Umbrella", and immediately volunteered to hum a couple of lines)
Caucasian: *giggle* and finally broke out into a ROAR!
Caucasian: Please...Please...DONT QUIT YOUR DAY JOB
...
And we both broke out into a good laugh!
Saturday, August 16, 2008
OpenSolaris Pydoc browser cannot be launched
I got inspired by my friend to learn Python, and I am beginning to appreciate its power and flexibility; it has elements of another favourite of mine, Erlang. So, following his example, I got a book called "Learning Python - 3rd Edition" and began exploring the language. When I tried to launch pydoc (FYI, it's a tool for displaying documentation for the available APIs), the script told me that it couldn't find a browser.
Here is the stack trace. As in any other language, the "root" exception/cause is found on the last line:
tayboonl@opensolaris:~/Desktop$ Exception in Tkinter callback
Traceback (most recent call last):
  File "/usr/lib/python2.4/lib-tk/Tkinter.py", line 1345, in __call__
    return self.func(*args)
  File "/usr/lib/python2.4/pydoc.py", line 2086, in open
    webbrowser.open(url)
  File "/usr/lib/python2.4/webbrowser.py", line 43, in open
    get().open(url, new, autoraise)
  File "/usr/lib/python2.4/webbrowser.py", line 38, in get
    raise Error("could not locate runnable browser")
Error: could not locate runnable browser
So I looked at the file /usr/lib/python2.4/webbrowser.py and discovered that it tries a list of known browser commands in order. In my case I simply added the string "firefox" to the list and it worked. *A quick hack*. Here is the code snippet that I edited in the file:
# X browsers have more in the way of options
if os.environ.get("DISPLAY"):
    # "firefox" is the entry added by hand
    _tryorder = ["galeon", "skipstone", "firefox",
                 "mozilla-firefox", "mozilla-firebird", "mozilla", "netscape",
                 "kfm", "grail"] + _tryorder
Friday, August 8, 2008
DTrace and Tail-Call Optimization
Many C compilers today perform an optimization known as tail-call optimization, and you need to pay careful attention to it when using DTrace's Function Boundary Tracing (fbt) provider. A common case where it applies is tail recursion.
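To make the term concrete, here is a minimal sketch of what a call in tail position looks like. It is written in Java purely for illustration (the class and method names are hypothetical, and the JVM generally does not eliminate tail calls); the point is the shape of the code that a C compiler may turn from a call into a jump, so that the function never executes a return instruction of its own on that path.

```java
public class TailCallShape {
    /** Not a tail call: the multiplication still has to run after the recursive call returns. */
    public static long factorial(int n) {
        if (n <= 1) {
            return 1;
        }
        return n * factorial(n - 1);
    }

    /** Tail call: the recursive call is the very last action and its result is returned unchanged. */
    public static long factorialAcc(int n, long acc) {
        if (n <= 1) {
            return acc;
        }
        return factorialAcc(n - 1, acc * n);
    }

    public static void main(String[] args) {
        System.out.println(factorial(10));
        System.out.println(factorialAcc(10, 1));
    }
}
```

Both methods compute the same value; only the second has the shape that is eligible for tail-call optimization in a compiler that performs it.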
Having said that, you should realize that it occurs more often on SPARC systems than on x86 Solaris systems. How can you tell that you have run into a tail-call-optimized routine or function?
In an fbt return probe, the variable arg0 holds the offset, within the function, of the returning instruction. So in your scripts you need an extra statement that traces this variable, and then you check whether the instruction at that offset is a plain return or something else. Otherwise you might get a nasty shock when you evaluate a value thinking it was the result of the function, and in some cases you will produce misleading information.
# dtrace -n 'fbt::squeue*:return { printf("%s, 0x%x\n", probefunc, arg0); }'
and you should see something like
squeue_enter_chain,0x3a5
squeue_enter,0x47a
squeue_enter,0x47a
squeue_fire,0x85
^C
Next, use mdb or your favourite disassembler to look at the instruction at that offset in each function. In my case I didn't find any tail-call-optimized functions, since I saw a ret instruction each time, but in your case you might. Take care over this as well when you are tracing Tcl/Python code, since there are providers for them now.
Wednesday, August 6, 2008
Solaris 10 Network Stack
The new network stack implementation in the latest Solaris 10 and OpenSolaris is different and faster. The details are available online, and there is also a PDF version to download.
Here's a picture of the latest architecture

The new implementation is built by re-using the Solaris STREAMS framework found in the pre-Solaris 10 OS. The picture below shows how it has evolved through the versions, and after reading the PDF I believe you will have a better idea of this network implementation.

DTrace currently supports tracing this, and here's sample output from tracing it on my 2-CPU OpenSolaris machine:
tayboonl@opensolaris:~/SunStudioProjects/TimeServer# dtrace -s ./trace.d -c ./dist/Debug/Sun12-Solaris-x86/timeserver
dtrace: script './trace.d' matched 2 probes
CPU FUNCTION
1 -> squeue_getprivate Bound to CPU(1),Squeue Name:ip_squeue_cpu_1/1/0, Thread ID:380
ip`tcp_get_conn+0x22
ip`tcp_open+0x1c6
ip`tcp_openv4+0x24
genunix`qattach+0x160
genunix`stropen+0x490
sockfs`socktpi_open+0xac
sockfs`sotpi_create+0x11e
sockfs`so_socket+0x146
unix`_sys_sysenter_post_swapgs+0x14b
OS Thread ID:63619
0 -> squeue_getprivate Bound to CPU(0),Squeue Name:ip_squeue_cpu_0/0/0, Thread ID:290
ip`tcp_time_wait_collector+0x20
genunix`callout_execute+0xbf
genunix`taskq_thread+0x1a7
unix`thread_start+0x8
OS Thread ID:34
1 -> squeue_getprivate Bound to CPU(1),Squeue Name:ip_squeue_cpu_1/1/0, Thread ID:380
ip`tcp_time_wait_collector+0x20
genunix`callout_execute+0xbf
genunix`taskq_thread+0x1a7
unix`thread_start+0x8
OS Thread ID:35
...
What you are seeing is a 1:1:1 ratio of thread:CPU:squeue. Each entry reads something like
"the squeue named ip_squeue_cpu_1/1/0 is bound to CPU 1, and its bound thread ID is 380". You can also see that threads 34, 35 and 63619 are interacting with the 2 squeues.
Here's another method you can try, using mdb's ::squeue and ::thread dcmds (these inspect kernel state, so this assumes a kernel session such as one started with mdb -k):
> ::squeue
ADDR STATE CPU FIRST LAST WORKER
ffffff01ca044dc0 02060 1 0000000000000000 0000000000000000 ffffff0007ec2c80
ffffff01ca044e80 02060 0 0000000000000000 0000000000000000 ffffff0007c25c80
> ffffff0007ec2c80::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR
ffffff0007ec2c80 sleep 8 0 3 60 0 0 n/a
> ffffff0007c25c80::thread
ADDR STATE FLG PFLG SFLG PRI EPRI PIL INTR
ffffff0007c25c80 sleep 8 0 3 60 0 0 n/a
Visit the OpenSolaris website and browse through its source code, but most importantly, have fun!
Sunday, July 13, 2008
Java Generics - exploration ....
Coding with Java generics takes some time and experimentation to get right (the generics tutorial is a good way to gain an understanding). One of the pitfalls I discovered is how easily generic methods and class type parameters get mixed up. Here's what I did; it took a Java bytecode viewer like jclasslib to understand what was happening.
----------- WeirdBox.java --------------------------
package generics.box;

public class WeirdBox<T, X, Y, E> extends PaperBox<T, X, Y> implements WeirdBoxProp {
    public E weird;

    WeirdBox(T id, X name, Y manu, E weird) {
        super.changeId(id);
        super.setManufacturer(manu);
        super.setName(name);
        this.weird = weird;
    }

    // isGlob(E t) & isTwistable(E t) are declared in the interface WeirdBoxProp
    public <E extends Boolean> E isGlob(E t) {
        // this.weird = t;
        // return this.weird;
        return t;
    }

    public <E extends Boolean> E isTwistable(E t) {
        // this.weird = t;
        // return this.weird;
        return t;
    }
}
----------- WeirdBoxProp.java --------------------------
package generics.box;

public interface WeirdBoxProp {
    <E extends Boolean> E isGlob(E t);      // supposed to be a glob, so a subclass must return true
    <E extends Boolean> E isTwistable(E t); // supposed to be twistable, so a subclass must return true
}
The thing to realize from this experiment is that the E in the field declaration "public E weird" in WeirdBox.java is the class's type parameter, and it is not related to the E declared by the generic methods isGlob(...) and isTwistable(...). Each method's own <E extends Boolean> hides the class-level E. This is evident when you inspect the compiled class(es), and here's what I saw:
"public E weird" is translated to a field of type "java.lang.Object"
isGlob(...) is translated to a method of signature "<(Ljava/lang/Boolean;)Ljava/lang/Boolean;>", and similarly for isTwistable(...). That explains why I had to comment out the statements "this.weird = t; return this.weird;": they weren't type-compatible.
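If you don't have a bytecode viewer at hand, reflection shows the same erasure. The sketch below is a stand-alone approximation, not the original classes: Holder is a hypothetical stand-in for WeirdBox with the same mix of a class-level E and a method-level E.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

public class ErasureDemo {
    /** Hypothetical stand-in for WeirdBox: a class-level E plus a method-level E. */
    static class Holder<E> {
        public E weird;

        // This <E extends Boolean> is a new type variable that hides the class-level E.
        public <E extends Boolean> E isGlob(E t) {
            return t;
        }
    }

    /** Runtime (erased) type of the field declared with the unbounded class-level E. */
    public static String erasedFieldType() {
        try {
            Field f = Holder.class.getField("weird");
            return f.getType().getName();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Runtime (erased) return type of the method declared with E extends Boolean. */
    public static String erasedReturnType() {
        try {
            // The erased parameter type is the bound, so the lookup uses Boolean.class.
            Method m = Holder.class.getMethod("isGlob", Boolean.class);
            return m.getReturnType().getName();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("field weird   : " + erasedFieldType());
        System.out.println("isGlob returns: " + erasedReturnType());
    }
}
```

The unbounded class-level E erases to java.lang.Object, while the method-level E erases to its bound, java.lang.Boolean, which is why the two cannot be assigned to each other without a cast.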
One way I found (please drop me an email if you find another) was to make the following changes in WeirdBox.java:
package generics.box;

public class WeirdBox<T, X, Y, E1 extends Boolean> extends PaperBox<T, X, Y> implements WeirdBoxProp {
    public E1 weird;

    WeirdBox(T id, X name, Y manu, E1 weird) {
        super.changeId(id);
        super.setManufacturer(manu);
        super.setName(name);
        this.weird = weird;
    }

    public <E extends Boolean> E isGlob(E t) {
        this.weird = (E1)t;
        return (E)this.weird;
    }

    public <E extends Boolean> E isTwistable(E t) {
        this.weird = (E1)t;
        return (E)this.weird;
    }
}
You have to give the class and the methods distinct type parameter names (E1 and E here) so that they no longer hide one another, otherwise it won't work. You can see that I had to perform some type casts to get it going, and if you compile it with "-Xlint" you will find that the compiler emits the following warning messages:
found : E
required: E1
this.weird = (E1)t;
found : E1
required: E
return (E)this.weird;
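which is again weird, because I assumed that E and E1 are now both of type java.lang.Boolean. They are in fact two separate type variables that merely share the bound Boolean, so the compiler cannot verify the casts. The unchecked cast warning makes me a little jittery, but perhaps I used generics in the wrong way.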
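One other way that avoids the casts altogether, sketched here under the assumption that you are free to change the interface, is to move the type parameter onto the interface so that the field and the methods share a single E. WeirdProp and SimpleWeirdBox are hypothetical, simplified stand-ins; PaperBox and the T, X, Y parameters are left out to keep the sketch self-contained.

```java
public class GenericBoxSketch {
    /** Hypothetical generic variant of WeirdBoxProp: the type parameter lives on the interface. */
    interface WeirdProp<E extends Boolean> {
        E isGlob(E t);
        E isTwistable(E t);
    }

    /** Hypothetical simplified box that binds the interface's E to its own E. */
    static class SimpleWeirdBox<E extends Boolean> implements WeirdProp<E> {
        E weird;

        SimpleWeirdBox(E weird) {
            this.weird = weird;
        }

        @Override
        public E isGlob(E t) {
            this.weird = t;      // same E as the field, so no cast is needed
            return this.weird;
        }

        @Override
        public E isTwistable(E t) {
            this.weird = t;
            return this.weird;
        }
    }

    public static void main(String[] args) {
        SimpleWeirdBox<Boolean> box = new SimpleWeirdBox<>(Boolean.FALSE);
        System.out.println(box.isGlob(Boolean.TRUE));
        System.out.println(box.weird);
    }
}
```

The trade-off is that callers now fix E once per box instead of once per method call, which seems to be what the field assignment wanted in the first place.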
--- An update on this post ---
I found an alternative resource on Java Generics (there is also a PDF version), written by Angelika Langer.
Labels:
Java
Tuesday, July 8, 2008
Detecting Endian-ness in Java
Got a question on this from someone and the exact question was "how to check Endianness in Java coding"
One reason Java is so prevalent in the industry and the IT world is that you rarely have to worry about such stuff. The Java Virtual Machine implementation takes care of it for you: multi-byte values written through the standard data streams always use big-endian (network) byte order, regardless of the hardware. There is no API named like "isEndian()" or "BigEndian()", but java.nio.ByteOrder.nativeOrder() does report the byte order of the underlying platform.
In general, the only area I can think of at the moment where this matters is when Java runs on 2 machines of different endianness, like x86 <--> SPARC. In these cases Java takes care of the byte-order conversion to and fro in the JVM implementation. One good resource I can think of is the OpenJDK project, where you can browse through the source code to discover the mechanism.
Hope this helps clear the air.
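For anyone who does need to check, the standard library exposes the platform byte order through java.nio.ByteOrder, and ByteBuffer lets you pick the order explicitly when encoding. A minimal sketch follows; the class and method names are my own, only the java.nio calls are standard.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianCheck {
    /** Returns true when the underlying platform stores multi-byte values big-endian. */
    public static boolean isNativeBigEndian() {
        return ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN;
    }

    /** Encodes an int using the given byte order and returns the 4 raw bytes. */
    public static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        System.out.println("Native order: " + ByteOrder.nativeOrder());
        byte[] big = encode(0x0A0B0C0D, ByteOrder.BIG_ENDIAN);
        byte[] little = encode(0x0A0B0C0D, ByteOrder.LITTLE_ENDIAN);
        System.out.printf("big-endian first byte:    0x%02X%n", big[0]);
        System.out.printf("little-endian first byte: 0x%02X%n", little[0]);
    }
}
```

The native order line depends on the machine you run it on; the two encoded arrays are the same everywhere, which is the portability the JVM gives you.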
BTrace glitches ?
Some glitches here on tracing Kind.NEW. FYI, it's a feature that is used to track object creation.
// UDO == User Defined Object
class UDO {
    UDO() {
        System.out.println("UDO");
    }
}

public class ObjAlloc {
    Object id;

    ObjAlloc() {
        // Does not appear to work for Kind.NEW
        //id = new String("ME");
        //id = new String();
        id = new Object();
        //id = new UDO();
        // Works for Kind.NEWARRAY
        //id = new int[10];
    }

    private void A(String... str) {
        // Works for Kind.NEW & Kind.NEWARRAY
        //id = new int[10];
        //id = new Object();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Here");
        ObjAlloc oa = new ObjAlloc();
        while (true) {
            new ObjAlloc().A();
            Thread.sleep(500);
            System.out.println(".");
        }
    }
}
It didn't work for Kind.NEW but worked for Kind.NEWARRAY, so I filed an issue on BTrace's website and will post here when I have updates. This is not bashing BTrace: I still think it's a cool tool with the potential to challenge the commercial implementations out there, and like all software there will be glitches here and there. Give them a break, right?
--- This is an update on this article ---
It turns out that this wasn't a glitch at all but was due to my lack of understanding of the subject matter, and I decided not to remove this post, as a reminder. Having said that, let me show you how to do it correctly.
Below is the correct way to do it in a BTrace script:
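import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class test {
    // clazz="/.+/" on the Location matches the creation of objects of any class
    @OnMethod(clazz="Main", method="/.+/",
              location=@Location(value=Kind.NEW, clazz="/.+/"))
    public static void p() {
        println("Object Created");
    }
}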
Or
import com.sun.btrace.annotations.*;
import static com.sun.btrace.BTraceUtils.*;

@BTrace
public class test {
    // clazz="java.lang.Object" restricts the probe to the creation of java.lang.Object instances
    @OnMethod(clazz="Main", method="/.+/",
              location=@Location(value=Kind.NEW, clazz="java.lang.Object"))
    public static void p() {
        println("Object Created");
    }
}
What I failed to do previously was to investigate the possibility of adding the "clazz" attribute to the "Location", which was why I wasn't able to instrument my code and detect object creation.
Labels:
btrace,
Java,
Java Byte Codes,
JVM
Sunday, June 29, 2008
Using BTrace in Java 6
Got up to speed with BTrace recently, thanks to Sundararajan's clarifications on syntax. This tool is simply TOO cool to give up, since in my opinion anyone who has developed in Java and/or J2EE would pick this technology up in no time :-)
However, beware of some caveats, as I found out while attempting to monitor certain Java classes in Java 6 (coincidentally, I was attempting to monitor the EJB lifecycle):
btrace DEBUG: java.lang.RuntimeException: JSR/RET are not supported with computeFrames option
java.lang.RuntimeException: JSR/RET are not supported with computeFrames option
at org.objectweb.asm.Frame.a(Unknown Source)
at org.objectweb.asm.MethodWriter.visitJumpInsn(Unknown Source)
at org.objectweb.asm.MethodAdapter.visitJumpInsn(Unknown Source)
at org.objectweb.asm.ClassReader.accept(Unknown Source)
at org.objectweb.asm.ClassReader.accept(Unknown Source)
at com.sun.btrace.runtime.InstrumentUtils.accept(InstrumentUtils.java:66)
at com.sun.btrace.runtime.InstrumentUtils.accept(InstrumentUtils.java:62)
at com.sun.btrace.agent.Client.instrument(Client.java:261)
at com.sun.btrace.agent.Client.transform(Client.java:101)
at sun.instrument.TransformerManager.transform(TransformerManager.java:169)
at sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:365)
at sun.instrument.InstrumentationImpl.retransformClasses0(Native Method)
at sun.instrument.InstrumentationImpl.retransformClasses(InstrumentationImpl.java:124)
at com.sun.btrace.agent.Main.handleNewClient(Main.java:278)
at com.sun.btrace.agent.Main.startServer(Main.java:245)
at com.sun.btrace.agent.Main.access$000(Main.java:53)
at com.sun.btrace.agent.Main$1.run(Main.java:127)
at java.lang.Thread.run(Thread.java:619)
If you attempt to instrument certain code in your EJBs, like the ejbXXX() methods, you may find something similar to the above. That would certainly kill your interest, but I would advise you not to be too hasty, as it still works on the J2SE 1.5.x platform.
But this certainly sparked my interest in the algorithm that is causing this problem, so I went browsing through the fundamental technology on which BTrace is built: ASM. According to the ASM 3.0 documentation and the Java Virtual Machine Specification, 2nd Edition, it turns out that ASM has a hard time running its "execute()" algorithm, which essentially simulates the effect of each Java bytecode instruction on the output stack frame, when it meets the JSR/RET subroutine instructions. So the current version of this software throws a Java unchecked exception, i.e. java.lang.RuntimeException, to inform the user of the tool (in this case, myself) that this is not supported .... yet.
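For context, JSR and RET are the subroutine instructions that the Java Virtual Machine Specification, 2nd Edition describes for compiling finally clauses, so classes built by older compilers can contain them wherever a try/finally appears. The sketch below only shows the source-level shape of such a method; the class name is hypothetical, and whether JSR/RET actually ends up in the class file depends on the compiler and target version used, which is an assumption here rather than something I verified against the EJB classes.

```java
public class FinallyShape {
    /** Records the order in which the blocks run. */
    static final StringBuilder LOG = new StringBuilder();

    /**
     * A method with several exits that all have to run the same finally block.
     * Older compilers could share that block as a bytecode subroutine (JSR/RET)
     * instead of copying it into every exit path.
     */
    public static int work(boolean fail) {
        try {
            if (fail) {
                throw new IllegalStateException("boom");
            }
            LOG.append("body;");
            return 1;
        } catch (IllegalStateException e) {
            LOG.append("catch;");
            return -1;
        } finally {
            LOG.append("finally;");
        }
    }

    public static void main(String[] args) {
        System.out.println(work(false));
        System.out.println(work(true));
        System.out.println(LOG);
    }
}
```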
Labels:
btrace,
Java,
Java Byte Codes,
JVM