OK, now that you've had your dinner and eaten your meat, it's time for the pudding. This conclusion to Stack Wars just ties up a few loose ends and walks through a "real-world" example of analyzing stack traces. Have fun...
Making thread dumps easier to view
Seeing as how most people's idea of a good time isn't sifting through pages upon pages of JVM stack traces, several good people have developed tools that make it easier to view and analyze complex thread dumps. Personally, I like and use the IBM Alphaworks Thread and Monitor Dump Analyzer. It's pretty straightforward to use. Just save your thread dump off to a file and open it up in the Analyzer for a pretty view of all your threads.
I've found this tool useful when I'm trying to look for patterns across many threads in a thread dump. Give it a try, you might like it.
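If you don't have a tool handy, the same kind of pattern hunting can be roughed out in a few lines of script. What follows is only a minimal sketch, not how the Analyzer works: it assumes a HotSpot-style text dump where each thread starts with a quoted name and each frame starts with "at ". The sample dump, the thread names, and the com.example classes are all made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a thread's header line starts with its name in double quotes.
THREAD_HEADER = re.compile(r'^"([^"]+)"')


def parse_threads(dump_text):
    """Return a list of (thread_name, [frames]) in the order they appear."""
    threads = []
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        stripped = line.strip()
        if header:
            threads.append((header.group(1), []))
        elif threads and stripped.startswith("at "):
            # Drop the leading "at " and keep the frame text.
            threads[-1][1].append(stripped[3:])
    return threads


def group_by_stack(threads):
    """Group thread names by identical stack, largest group first."""
    groups = defaultdict(list)
    for name, frames in threads:
        groups[tuple(frames)].append(name)
    return sorted(groups.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical dump, invented for this example.
SAMPLE_DUMP = '''
"worker-1" daemon prio=5 tid=0x01 nid=0x11 runnable
    at com.example.portal.NativeBridge.call(Native Method)
    at com.example.portal.Request.handle(Request.java:42)

"worker-2" daemon prio=5 tid=0x02 nid=0x12 runnable
    at com.example.portal.NativeBridge.call(Native Method)
    at com.example.portal.Request.handle(Request.java:42)

"gc-helper" daemon prio=5 tid=0x03 nid=0x13 waiting on condition
    at java.lang.Thread.sleep(Native Method)
'''

threads = parse_threads(SAMPLE_DUMP)
groups = group_by_stack(threads)
for frames, names in groups:
    top = frames[0] if frames else "(no frames)"
    print(f"{len(names)} thread(s) with top frame {top}: {', '.join(names)}")
```

If one group swallows nearly every thread, that shared stack is the first place I'd look.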
A real world example of how all the rambling I've just gone through came in handy
Long ago, in a galaxy far, far away, I was working on a Plumtree consulting project with a large company. These folks had been having problems with their portal deployment. Periodically, the portal would start eating up 100% of the server CPU, and would never release it until the portal application server was bounced. During the time that the server CPU was pegged, users couldn't access the portal...the customer was not happy. It was one of those big deals where VP-level people at the customer were yelling at VP/Executive-level people at Plumtree on a daily basis.
So here I am in this high-stress situation. I arrive onsite with the customer and, for three days, nothing happens. The application doesn't freak out, and all is fine. At this point, I'm starting to get cautiously optimistic that maybe we won't see any problems during my remaining two days onsite, and somebody else will have to deal with the problem later :) But, alas, it wasn't to be.
About halfway through my second-to-last day with the customer, everything went to hell. All of a sudden nobody could access the portal, so we start debugging. Sure enough, the first thing we see is that the portal process is eating up 100% of the server CPU. These folks were running on a Unix platform, so I ask them to try:
kill -QUIT <portal pid>
After I'd spent 20 minutes explaining that, no, this command won't kill the process, the server ops folks generated a thread dump and sent it over for review. After about 5 minutes of looking at the portal thread dump, it was pretty obvious that something was amiss with the portal code. The stack trace for EVERY SINGLE THREAD (all 100 or so of them) was exactly the same. They were all stuck running the same native method. Just to be sure this wasn't some freak anomaly, we generated another thread dump 15 minutes later, and, yep, all the stack traces looked the same. So this tells us that there's a problem with some native C code being used by the application...a good start.
Unfortunately, since the problem was in native code, we couldn't get a full stack trace from the JVM...it only goes so far as to let you know that it's making a native call. So we dig one level deeper and trace the portal process. Running a trace/truss on a Unix process spits out all the system-level calls that are being made. Wouldn't you know it, when we looked at the trace output, 99% of the calls being made were:
poll(0)
Now, I wasn't an expert at interpreting this level of data, but I knew enough to know that "poll" had something to do with sockets, and that there definitely shouldn't be so many of those calls. Long story a little bit shorter, after a bunch of conference calls with Plumtree support and engineering, where we shared the data we'd gathered from the thread dumps and process traces, the engineering team found a very low-level bug in the software and shipped a fix out to the customer...tragedy narrowly averted.
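For what it's worth, you don't have to eyeball a trace to spot a lopsided mix of calls like that one. Here's a rough sketch of a tally script. It assumes each trace line looks like name(args) = result, optionally prefixed with a pid or thread id, which is my simplification of truss/strace-style output rather than a faithful parser for either. The sample trace is invented, not the customer's data.

```python
import re
from collections import Counter

# Assumption: an optional "[pid 123]" or "123/4:" style prefix, then the
# system call name, then an opening parenthesis.
SYSCALL = re.compile(r'^\s*(?:\[pid\s+\d+\]\s+|[\d/]+:\s+)?([A-Za-z_]\w*)\(')


def tally_syscalls(trace_text):
    """Count how many times each system call name appears in the trace."""
    counts = Counter()
    for line in trace_text.splitlines():
        match = SYSCALL.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical trace output, invented for this example.
SAMPLE_TRACE = """\
poll(0xFFBFE8A8, 0, 0) = 0
poll(0xFFBFE8A8, 0, 0) = 0
poll(0xFFBFE8A8, 0, 0) = 0
read(5, 0x0804A000, 1024) = 12
"""

counts = tally_syscalls(SAMPLE_TRACE)
total = sum(counts.values())
for name, count in counts.most_common():
    print(f"{name}: {count} of {total} calls ({100 * count / total:.0f}%)")
```

One call name hogging the tally doesn't tell you what the bug is, but it gives you something concrete to hand to engineering.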
Conclusion
Did you really make it this far? If so, thanks for sticking with me, and I hope you found something of use in this long, winding post. If you're ever stuck trying to make sense of a thread dump, feel free to drop us a line or post a comment here...we'll do our best to help.
Take care, and remember, when dealing with production troubleshooting, "Do...or do not, there is no try". (This doesn't really have much to do with the post, but I'm kind of tired and grasping at straws for good ways to tie the Star Wars metaphor back in). Anyhow, as always, thanks for reading...see you next time.
