Stack Wars III (Or VII, depending on how you count): Return of the Stack

Comments (0)

READER BEWARE: This got long and geeky, so make sure you're really trying to avoid doing "real work" before you sit down to read.

Welcome back all, and thanks for joining me for the final installment of a three part series on decompiling Java code and analyzing stack traces. If you're interested in the back story, you can read about decompiling Java here, and analyzing basic stack traces here.

When we last left off, our hero was frozen in a block of Carbonite and left to his doom. Err...sorry, wrong story, let me start again.

When we last left off, we walked through analyzing a simple stack trace to run down a bug in a standalone Java program. This is all fine and good, but what happens when you're running a Java application server (like Tomcat or Weblogic) that has hundreds of concurrent threads running tens of different web apps? And you're just trying to figure out why your particular web application is hung?

Well, if you're lucky, the developer of your web application was a good boy/girl, and they're logging stack traces for you in a log file somewhere. If this is the case, then you can open up the log and analyze the trace like we did the last go round.

Often times though, you're not so lucky, and you have to dig a little deeper to figure out what's going on. Say, for instance, your web application just starts responding very slowly. You see from access logs that responses are being served back to users, but they're about 5 times slower than normal...WTF? Or, what if your application server starts pegging the box at 100% CPU...what to do? Or, out of nowhere, your app server starts throwing Java.lang.OutOfMemory exceptions...Gah!!! Sadly, these nebulous problems seem to happen in a production environment more often than most of us would like to admit. And when they do occur, it's usually a high stress situation because there's probably a production outage and nobody really knows why. So, how do we find a better fix to the problem than the traditional, "Let's just bounce it and see what happens" response? Why, we use "The Force", of course. Except in this case, "The Force", is just a set of debugging tips that I'm getting ready to share with you as follows:

  1. Take a deep breath and don't freak out when a bunch of people start yelling.
  2. Understand the severity of the situation. Figure out how much downtime you can tolerate for debugging before things really hit the fan.
  3. Set expectations. Let people know when they can expect the issue to be fixed. If you know a bounce will temporarily fix the problem, then set a drop-dead time for debugging and schedule a bounce. Let folks know that if you don't find a root-cause of the problem, that they should expect to see the issue again.
  4. Gather as much information about the problem as you can. What, specifically, are users experiencing? Can you reproduce? What are the symptoms? What log messages do you have? What does the server environment look like? etc.
  5. Try to relate this issue to something you've seen in the past. Does this look like a problem you saw last week or last month? What did you do to fix it then? Why is it popping up again now?
  6. Eliminate possible external causes. Is this actually a network problem in disguise? Is the database acting up? Is some other process on the server eating up CPU? Is the server constantly swapping because it doesn't have enough RAM to handle all its business?
  7. Eliminate environment changes as possible causes. Has the server environment changed recently? Could this be causing your problem?
  8. Look in earnest at your application server process. Are we bumping up against the Java max. heap size? Are we seeing lots of garbage collection? Is our thread pool exhausted? Is our JDBC pool maxed out?
  9. Take a look inside of the application server JVM. Generate a thread dump to get a snapshot of how your application server is behaving. Analyze the stack traces in the thread dump, and look for points of interest, like...race conditions, stuck threads, deadlock, etc.
  10. Open a ticket with the software vendor (If applicable)
  11. Pray the issue fixes itself and doesn't come back
  12. Find a new job so you don't have to deal with these problems anymore

Read on for more fascinating details about The Stack Trace...

At the risk of getting back on topic, the rest of this post will focus on point 9: Take a look inside of the application server JVM. Did you know that you can generate a snapshot of all the JVM's threads? Doing so, in practice, creates a big stack trace that tells you what your application server is doing at a point in time. For instance, here's part of a thread dump/stack trace from a Weblogic Server running on my laptop:

 

wl_thread_dump.png 

 

Looks a lot more complicated than that simple stack trace we looked at last time, huh?

stack_trace_read_info.png 

 

Before we start trying to make sense out of all this information, let's first step back and talk about how we generate a thread dump in the first place. Basically you have a few options. If you're running on a Unix platform, you can send SIGQUIT to the JVM and it will push a thread dump to the standard error stream. You send SIGQUIT to the process as follows:

kill -QUIT <pid>

or

kill -3 <pid>

The above commands are identical, and don't worry...even though we're using the "kill" command, they won't actually shutdown/kill your process.

What's that you say? You're running your app server on Windows and this Unix example is totally useless? Let me digress for a minute and say...if you're going to be running on Windows, you might as well be running IIS/.NET, it's just a better fit. But I assume it will take you a while to change your OS and/or development platform, so while we're waiting, here's how you can generate a thread dump for your app server on Windows.

1) Start your Application Server from a DOS prompt

2) Type <Ctrl><Break> in DOS window running your application to send a thread dump to the DOS terminal

Additionally, if you happen to be running Weblogic 9+, you can generate a stack trace through the Admin console by going to Servers-><Your Server>->Monitoring->Threads->Dump Thread Stats

wl_admin_console.png 

Great, so now we have about 2500 lines of output...how do we do something useful with all this data? Let's start by discussing what we're looking at from a high-level. As I mentioned earlier, our thread dump is a snapshot of the state of the JVM at a point in time. Since we know our application server is a multi-threaded application, and that each thread has its own call stack, the thread dump is actually just "n" stack traces, where "n" is equal to the number of threads running inside of your application server. So let's first look at each stack in isolation. For instance, you might have one thread that looks like:

waiting_thread_overview.png

The first line just gives us some information about the thread itself. We see that this thread looks like it's an execution thread that services user requests. Furthermore, it's the sixth member of a thread pool that's used to handle all incoming requests.

waiting_thread_first_line.png 

 

Now, following the same principles discussed in my last post, let's look at the trace from the bottom up:

waiting_thread_stack.png 

Reading bottom to top, we see that the thread was spawned by the application server code, and not through a main method, and that it's sitting in Object Wait state. We can interpret this to mean that the thread is sitting idle, just waiting to service requests. In short, not much of interest is going on here. So let's look at another example like this:

gc_thread.png 

By the name, we can assume that this thread is being used internally by the application server to monitor garbage collection. Once you start looking at a few application server thread dumps, you'll notice that there are a bunch of threads like this one. That is, the app server spawns a bunch of threads to do internal housekeeping. Do yourself a favor, just ignore these stacks. For our purposes, they're just noise. So either skim over them, or delete them altogether from your thread dump, they're only taking up space and causing confusion.

OK, so we've looked over all the individual stack traces that are actually serving user requests and don't have anything that's particularly compelling...now what? Now we look for trends. Did you happen to notice that the stack traces for all your threads look the same? Maybe they're all waiting for a JDBC call to be returned...this implies that perhaps we have a problem at the database level. Let's go look at that database again. Are all the threads in your execution pool busy without any waiting to service requests? If so, maybe you're over-loading your application server and need to add more threads to the execution pool. Still nothing interesting? All the threads look like they're behaving normally? Well, let's complicate things a bit more by taking another thread dump. By taking a second (or third or fourth) thread dump, we now have a chronology of how the JVM is behaving at distinct points in time. So compare and contrast what specific threads are doing across each of your thread dumps. If, for instance, you see that all the threads show the same stack trace in all of your thread dumps, you can logically conclude that the JVM/application server isn't servicing requests the way it should be, and you have "stuck threads". So what's causing these stuck threads? Two likely candidates:

  1. Something external to the JVM is acting up. Is some other process on the server eating up all the CPU and starving your application server process? Is the server out of RAM?
  2. Your web application is dead-locked, or in a race condition. This is the result of poor programming in multi-threaded applications, and is often very difficult to run down. Basically, thread A is holding a resource that thread B is waiting for, and thread B is holding a resource that thread A is waiting for. The two threads will never move forward in their execution because they're in a stalemate over who will give up their resources first. This is probably the worst explanation ever of dead-lock, so if you're interested in more details, Google "Dining Philosphers".

For the sake of argument, let's assume that there's nothing wrong with the server that's running your process, so you think you might have a deadlock. How can you be sure? Conveniently, stack traces in JDK 1.5+ actually give you a little deadlock summary at the bottom like so:

jvm_deadlock.png

If you see a deadlock message in the stack trace, you can be about 100% sure there's something funky going on with the code. At this point, you need to either dive deep into the code and figure out where the threading logic is flawed, or send all the information you've collected over to someone who can.

I know I told you this was the final installment at the start of this post...but it got longer than expected.  So...join us next week for  some cool stack trace tools and a real world example.  Until then....may the force be with you.

Leave a comment

Recent Entries

Oh ALBPM, why you gotta go and load images like 'dat?
Howdy all.  Hope the nice weather is finding happy, healthy, and allergy free. As we were doing spring ALBPM house-keeping…
Import Content into Publisher
These are exciting times with Aqualogic (ALI) and its direction as a Portal technology after Oracle's acquisition of BEA Systems. …
PTSpy log messages linked to ... hair loss?
For the past few years, researchers have been perplexed about the causes behind the increased rate of hair loss in ALUI portal administrators.  So, a team…